Blackfin arch: Convert Blackfin GPIO driver to use common gpiolib/gpiochip infrastructure
- This patch adds support for ARCH_WANT_OPTIONAL_GPIOLIB.
- It may be changed in future to ARCH_REQUIRE_GPIOLIB.
- Change GPIO_BANK_NUM use DIV_ROUND_UP( , ) macro
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Bryan Wu <cooloney@kernel.org>
Blackfin arch: Cleanup and unify Blackfin IRQ and GPIO IRQ handling
- Remove SSYNC()
- Use irq_to_gpio where applicable
- Remove gpio_edge_triggered bitfield, check irq_desc fields instead.
- Remove gpio_both_edge_triggeredb bitfield, check irq_desc fields
instead.
- Use BITMAP and bitops on gpio_enabled
- Preferably use 32-bit
- Looking at the disassembly this indeed saves quite a few instructions.
Signed-off-by: Michael Hennerich <michael.hennerich@analog.com> Signed-off-by: Bryan Wu <cooloney@kernel.org>
Mike Frysinger [Tue, 18 Nov 2008 09:48:22 +0000 (17:48 +0800)]
Blackfin arch: remove useless SSYNC() in irq priority code
- remove SSYNC() left over from irq init split
- do not force SSYNC() when masking/unmasking IRQs in the SIC
as any order enforced by the hardware should already be enforced
by software
Signed-off-by: Mike Frysinger <vapier.adi@gmail.com> Signed-off-by: Bryan Wu <cooloney@kernel.org>
Bryan Wu [Tue, 18 Nov 2008 09:48:22 +0000 (17:48 +0800)]
Blackfin arch: fix bug - gpio_bank() macros messed up bank number caculating with positioning a gpio
The whole story:
Before BF51x merged, all the MAX_BLACKFIN_GPIOS are integral multiple of GPIO_BANKSIZE (= 16).
But BF51x provides MAX_BLACKFIN_GPIOS = 40 which includes 3 banks and the 3rd bank has only 8
GPIO pins.
Therefore, gpio_bank() macros is correct when you try to find a GPIO in which bank (GPIO_35 is
in bank 2). But on BF51x gpio_bank(MAX_BLACKFIN_GPIOS) only gives out 2 banks instead of 3
banks for some static array initialization.
This patch add a new macros gpio_bank_n() and GPIO_BANK_NUM to do bank number caculating and
remain the gpio_bank() macros for positioning a gpio in which bank.
Mike Frysinger [Tue, 18 Nov 2008 09:48:22 +0000 (17:48 +0800)]
Blackfin arch: update defconfig file for all boards
- do not bother generating deprecated /sys files by default now since
mdev does not need it
- Don't built-in char sport driver and build it as a module in defconfig
- disable CONFIG_DEVKMEM by default
- enable spi flash driver on boards that have one
- switch config to the NAND platfrom driver rather than the bfin async one
- do not make BFIN_DMA_5XX optional since a large portion of our code relies
on dma functions existing
Signed-off-by: Mike Frysinger <vapier.adi@gmail.com> Signed-off-by: Bryan Wu <cooloney@kernel.org>
Takashi Iwai [Tue, 18 Nov 2008 09:38:56 +0000 (10:38 +0100)]
ALSA: hda - Fix restore of pin configs at resume for STAC/IDT codecs
Fixed the restore of pin configs at resume for some STAC/IDT codec
models. These models set explicitly the pin configs after the default
init configs, and these aren't restored properly at resume.
This patch introduces two changes:
- Allocate always pin_configs array in stac_spec so that the driver
can overwrite the value freely
- Introduce stac_change_pin_config() to change the pin config value
Takashi Iwai [Tue, 18 Nov 2008 08:32:42 +0000 (09:32 +0100)]
ALSA: hda - Create jack detection elements in build_controls
The jack detection input elements should be created in build_controls
callback instead of init callback because init can be called multiple
times by suspend/resume and power-saving.
Rakib Mullick [Tue, 18 Nov 2008 04:15:24 +0000 (10:15 +0600)]
kernel/profile.c: fix section mismatch warning
Impact: fix section mismatch warning in kernel/profile.c
Here, profile_nop function has been called from a non-init function
create_hash_tables(void). Which generetes a section mismatch warning.
Previously, create_hash_tables(void) was a init function. So, removing
__init from create_hash_tables(void) requires profile_nop to be
non-init.
This patch makes profile_nop function inline and fixes the
following warning:
WARNING: vmlinux.o(.text+0x6ebb6): Section mismatch in reference from
the function create_hash_tables() to the function
.init.text:profile_nop()
The function create_hash_tables() references
the function __init profile_nop().
This is often because create_hash_tables lacks a __init
annotation or the annotation of profile_nop is wrong.
Li Zefan [Tue, 18 Nov 2008 06:02:03 +0000 (14:02 +0800)]
cpuset: fix regression when failed to generate sched domains
Impact: properly rebuild sched-domains on kmalloc() failure
When cpuset failed to generate sched domains due to kmalloc()
failure, the scheduler should fallback to the single partition
'fallback_doms' and rebuild sched domains, but now it only
destroys but not rebuilds sched domains.
The regression was introduced by:
| commit dfb512ec4834116124da61d6c1ee10fd0aa32bd6
| Author: Max Krasnyansky <maxk@qualcomm.com>
| Date: Fri Aug 29 13:11:41 2008 -0700
|
| sched: arch_reinit_sched_domains() must destroy domains to force rebuild
After the above commit, partition_sched_domains(0, NULL, NULL) will
only destroy sched domains and partition_sched_domains(1, NULL, NULL)
will create the default sched domain.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: <stable@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Wu Fengguang [Tue, 18 Nov 2008 03:47:53 +0000 (11:47 +0800)]
ALSA: ELD proc interface for HDMI sinks
Create /proc/asound/card<card_no>/eld#<codec_no> to reflect the audio
configurations and capabilities of the attached HDMI sink.
Some notes:
- Shall we show an empty file if the ELD content is not valid?
Well it's not that simple. There could be partially populated ELD,
and there may be malformed ELD provided by buggy drivers/monitors.
So expose ELD as it is.
- The ELD retrieval routines rely on the Intel HDA interface,
others are/could be universal and independent ones.
- How do we name the proc file?
If there are going to be two HDMI pins per codec, then the current naming
scheme (eld#<codec no>) will fail. Luckily the user space dependencies should
be minimal, so it would be trivial to do the rename if that happens.
- The ELD proc file content is designed to be easy for scripts and human reading.
Its lines all have the pattern:
<item_name>\t[\t]*<item_value>
where <item_name> is a keyword in c language, while <item_value> could be any
contents, including white spaces. <item_value> could also be a null value.
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
prevent cifs_writepages() from skipping unwritten pages
Fixed parsing of mount options when doing DFS submount
[CIFS] Fix check for tcon seal setting and fix oops on failed mount from earlier patch
[CIFS] Fix build break
cifs: reinstate sharing of tree connections
[CIFS] minor cleanup to cifs_mount
cifs: reinstate sharing of SMB sessions sans races
cifs: disable sharing session and tcon and add new TCP sharing code
[CIFS] clean up server protocol handling
[CIFS] remove unused list, add new cifs sock list to prepare for mount/umount fix
[CIFS] Fix cifs reconnection flags
[CIFS] Can't rely on iov length and base when kernel_recvmsg returns error
Dave Kleikamp [Tue, 18 Nov 2008 03:49:05 +0000 (03:49 +0000)]
prevent cifs_writepages() from skipping unwritten pages
Fixes a data corruption under heavy stress in which pages could be left
dirty after all open instances of a inode have been closed.
In order to write contiguous pages whenever possible, cifs_writepages()
asks pagevec_lookup_tag() for more pages than it may write at one time.
Normally, it then resets index just past the last page written before calling
pagevec_lookup_tag() again.
If cifs_writepages() can't write the first page returned, it wasn't resetting
index, and the next call to pagevec_lookup_tag() resulted in skipping all of
the pages it previously returned, even though cifs_writepages() did nothing
with them. This can result in data loss when the file descriptor is about
to be closed.
This patch ensures that index gets set back to the next returned page so
that none get skipped.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Acked-by: Jeff Layton <jlayton@redhat.com> Cc: Shirish S Pargaonkar <shirishp@us.ibm.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
Igor Mammedov [Thu, 23 Oct 2008 09:58:42 +0000 (13:58 +0400)]
Fixed parsing of mount options when doing DFS submount
Since these hit the same routines, and are relatively small, it is easier to review
them as one patch.
Fixed incorrect handling of the last option in some cases
Fixed prefixpath handling convert path_consumed into host depended string length (in bytes)
Use non default separator if it is provided in the original mount options
Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Igor Mammedov <niallain@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
Chris Mason [Tue, 18 Nov 2008 02:14:24 +0000 (21:14 -0500)]
Btrfs: prevent loops in the directory tree when creating snapshots
For a directory tree:
/mnt/subvolA/subvolB
btrfsctl -s /mnt/subvolA/subvolB /mnt
Will create a directory loop with subvolA under subvolB. This
commit uses the forward refs for each subvol and snapshot to error out
before creating the loop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 18 Nov 2008 01:37:39 +0000 (20:37 -0500)]
Btrfs: Add backrefs and forward refs for subvols and snapshots
Subvols and snapshots can now be referenced from any point in the directory
tree. We need to maintain back refs for them so we can find lost
subvols.
Forward refs are added so that we know all of the subvols and
snapshots referenced anywhere in the directory tree of a single subvol. This
can be used to do recursive snapshotting (but they aren't yet) and it is
also used to detect and prevent directory loops when creating new snapshots.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 18 Nov 2008 01:42:26 +0000 (20:42 -0500)]
Btrfs: Give each subvol and snapshot their own anonymous devid
Each subvolume has its own private inode number space, and so we need
to fill in different device numbers for each subvolume to avoid confusing
applications.
This commit puts a struct super_block into struct btrfs_root so it can
call set_anon_super() and get a different device number generated for
each root.
btrfs_rename is changed to prevent renames across subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 18 Nov 2008 02:02:50 +0000 (21:02 -0500)]
Btrfs: Allow subvolumes and snapshots anywhere in the directory tree
Before, all snapshots and subvolumes lived in a single flat directory. This
was awkward and confusing because the single flat directory was only writable
with the ioctls.
This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree. This requires making separate ioctls for
snapshot and subvol creation instead of a combining them into one.
The subvol ioctl does:
btrfsctl -S subvol_name parent_dir
After the ioctl is done subvol_name lives inside parent_dir.
The snapshot ioctl does:
btrfsctl -s path_for_snapshot root_to_snapshot
path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up
into directory and basename components.
root_to_snapshot can be any file or directory in the FS. The snapshot
is taken of the entire root where that file lives.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Tue, 18 Nov 2008 02:11:49 +0000 (21:11 -0500)]
Btrfs: fix free space leak
In my batch delete/update/insert patch I introduced a free space leak. The
extent that we do the original search on in free_extents is never pinned, so we
always update the block saying that it has free space, but the free space never
actually gets added to the free space tree, since op->del will always be 0 and
it's never actually added to the pinned extents tree.
This patch fixes this problem by making sure we call pin_down_bytes on the
pending extent op and set op->del to the return value of pin_down_bytes so
update_block_group is called with the right value. This seems to fix the case
where we were getting ENOSPC when there was plenty of space available.
Kumar Gala [Sat, 15 Nov 2008 18:02:34 +0000 (12:02 -0600)]
Remove -mno-spe flags as they dont belong
For some unknown reason at Steven Rostedt added in disabling of the SPE
instruction generation for e500 based PPC cores in commit 6ec562328fda585be2d7f472cfac99d3b44d362a.
We are removing it because:
1. It generates e500 kernels that don't work
2. its not the correct set of flags to do this
3. we handle this in the arch/powerpc/Makefile already
4. its unknown in talking to Steven why he did this
Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Tested-and-Acked-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yinghai Lu [Sun, 16 Nov 2008 11:12:49 +0000 (03:12 -0800)]
x86: fix wakeup_cpu with numaq/es7000, v2
Impact: fix secondary-CPU wakeup/init path with numaq and es7000
While looking at wakeup_secondary_cpu for WAKE_SECONDARY_VIA_NMI:
|#ifdef WAKE_SECONDARY_VIA_NMI
|/*
| * Poke the other CPU in the eye via NMI to wake it up. Remember that the normal
| * INIT, INIT, STARTUP sequence will reset the chip hard for us, and this
| * won't ... remember to clear down the APIC, etc later.
| */
|static int __devinit
|wakeup_secondary_cpu(int logical_apicid, unsigned long start_eip)
|{
| unsigned long send_status, accept_status = 0;
| int maxlvt;
|...
| if (APIC_INTEGRATED(apic_version[phys_apicid])) {
| maxlvt = lapic_get_maxlvt();
I noticed that there is no warning about undefined phys_apicid...
because WAKE_SECONDARY_VIA_NMI and WAKE_SECONDARY_VIA_INIT can not be
defined at the same time. So NUMAQ is using wrong wakeup_secondary_cpu.
WAKE_SECONDARY_VIA_NMI, WAKE_SECONDARY_VIA_INIT and
WAKE_SECONDARY_VIA_MIP are variants of a weird and fragile
preprocessor-driven "HAL" mechanisms to specify the kind of secondary-CPU
wakeup strategy a given x86 kernel will use.
The vast majority of systems want to use INIT for secondary wakeup - NUMAQ
uses an NMI, (old-style-) ES7000 uses 'MIP' (a firmware driven in-memory
flag to let secondaries continue).
So convert these mechanisms to x86_quirks and add a
->wakeup_secondary_cpu() method to specify the rare exception
to the sane default.
Extend genapic accordingly as well, for 32-bit.
While looking further, I noticed that functions in wakecup.h for numaq
and es7000 are different to the default in mach_wakecpu.h - but smpboot.c
will only use default mach_wakecpu.h with smphook.h.
So we need to add mach_wakecpu.h for mach_generic, to properly support
numaq and es7000, and vectorize the following SMP init methods:
int trampoline_phys_low;
int trampoline_phys_high;
void (*wait_for_init_deassert)(atomic_t *deassert);
void (*smp_callin_clear_local_apic)(void);
void (*store_NMI_vector)(unsigned short *high, unsigned short *low);
void (*restore_NMI_vector)(unsigned short *high, unsigned short *low);
void (*inquire_remote_apic)(int apicid);
Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Oleg Nesterov [Mon, 17 Nov 2008 14:40:01 +0000 (15:40 +0100)]
thread_group_cputime: kill the bogus ->signal != NULL check
Impact: simplify the code
thread_group_cputime() is called by current when it must have the valid
->signal, or under ->siglock, or under tasklist_lock after the ->signal
check, or the caller is wait_task_zombie() which reaps the child. In any
case ->signal can't be NULL.
But the point of this patch is not optimization. If it is possible to call
thread_group_cputime() when ->signal == NULL we are doing something wrong,
and we should not mask the problem. thread_group_cputime() fills *times
and the caller will use it, if we silently use task_struct->*times* we
report the wrong values.
Oleg Nesterov [Mon, 17 Nov 2008 14:39:52 +0000 (15:39 +0100)]
account_steal_time: kill the unneeded account_group_system_time()
Impact: remove unnecessary accounting call
I don't actually understand account_steal_time() and I failed to find the
commit which added account_group_system_time(), but this looks bogus.
In any case rq->idle must be single-threaded, so it can't have ->totals.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
rtnetlink: propagate error from dev_change_flags in do_setlink()
isdn: remove extra byteswap in isdn_net_ciscohdlck_slarp_send_reply
Phonet: refuse to send bigger than MTU packets
e1000e: fix IPMI traffic
e1000e: fix warn_on reload after phy_id error
phy: fix phy address bug
e100: fix dma error in direction for mapping
igb: use dev_printk instead of printk
qla3xxx: Cleanup: Fix link print statements.
igb: Use device_set_wakeup_enable
e1000: Use device_set_wakeup_enable
e1000e: Use device_set_wakeup_enable
via-velocity: enable perfect filtering for multicast packets
phy: Add support for Marvell 88E1118 PHY
mlx4_en: Pause parameters per port
phylib: fix premature freeing of struct mii_bus
atl1: Do not enumerate options unsupported by chip
atl1e: fix broken multicast by removing unnecessary crc inversion
gianfar: Fix DMA unmap invocations
net/ucc_geth: Fix oops in uec_get_ethtool_stats()
...
Oleg Nesterov [Mon, 17 Nov 2008 14:39:47 +0000 (15:39 +0100)]
sched, signals: fix the racy usage of ->signal in account_group_xxx/run_posix_cpu_timers
Impact: fix potential NULL dereference
Contrary to ad474caca3e2a0550b7ce0706527ad5ab389a4d4 changelog, other
acct_group_xxx() helpers can be called after exit_notify() by timer tick.
Thanks to Roland for pointing out this. Somehow I missed this simple fact
when I read the original patch, and I am afraid I confused Frank during
the discussion. Sorry.
Fortunately, these helpers work with current, we can check ->exit_state
to ensure that ->signal can't go away under us.
Also, add the comment and compiler barrier to account_group_exec_runtime(),
to make sure we load ->signal only once.
netfilter: ctnetlink: get rid of module refcounting in ctnetlink
This patch replaces the unnecessary module refcounting with
the read-side locks. With this patch, all the dump and fill_info
function are called under the RCU read lock.
Based on a patch from Fabian Hugelshofer.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>
netfilter: ctnetlink: use EOPNOTSUPP instead of EINVAL if the conntrack has no helper
This patch changes the return value if the conntrack has no helper assigned.
Instead of EINVAL, which is reserved for malformed messages, it returns
EOPNOTSUPP.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>
Jaya Kumar [Tue, 11 Nov 2008 11:17:05 +0000 (12:17 +0100)]
[ARM] 5330/1: mach-pxa: Fixup reset for systems using reboot=cold or other strings
This patch makes do_hw_reset the default reboot behavior when nothing
else matches. This restores reboot functionality on gumstix basix
devices where reboot=cold is the default boot argument.
Signed-off-by: Jaya Kumar <jayakumar.lkml@gmail.com> Acked-by: Eric Miao <eric.miao@marvell.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
tracing: branch tracer, fix writing to trace/trace_options
Impact: fix trace_options behavior
writing to trace/trace_options use the index of the array
to find the value of the flag. With branch tracer flag
defined conditionally, this breaks writing to trace_options
with branch tracer disabled.
Peter Ujfalusi [Fri, 14 Nov 2008 06:57:28 +0000 (08:57 +0200)]
ASoC: Fix for master playback/capture volume range for TWL4030 codec
FGAIN for playback is in range of 0-0x3f, while for capture GAIN it
is in the range of 0-0x1f.
The original value of 128 (0x7f) would modify the CGAIN also for
playback.
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@nokia.com> Acked-by: Steve Sakoman <steve@sakoman.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Eric Dumazet [Mon, 17 Nov 2008 10:41:00 +0000 (02:41 -0800)]
net: sctp should update its inuse counter
This patch is a preparation to namespace conversion of /proc/net/protocols
In order to have relevant information for SCTP protocols, we should use
sock_prot_inuse_add() to update a (percpu and pernamespace) counter of
inuse sockets.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 17 Nov 2008 10:38:49 +0000 (02:38 -0800)]
net: af_unix should update its inuse counter
This patch is a preparation to namespace conversion of /proc/net/protocols
In order to have relevant information for UNIX protocol, we should use
sock_prot_inuse_add() to update a (percpu and pernamespace) counter of
inuse sockets.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ingo Molnar [Mon, 17 Nov 2008 08:36:22 +0000 (09:36 +0100)]
Merge branches 'tracing/branch-tracer', 'tracing/ftrace', 'tracing/function-return-tracer', 'tracing/tracepoints' and 'tracing/urgent' into tracing/core
FUJITA Tomonori [Mon, 17 Nov 2008 07:24:34 +0000 (16:24 +0900)]
swiotlb: use coherent_dma_mask in alloc_coherent
Impact: fix DMA buffer allocation coherency bug in certain configs
This patch fixes swiotlb to use dev->coherent_dma_mask in
swiotlb_alloc_coherent().
coherent_dma_mask is a subset of dma_mask (equal to it most of
the time), enumerating the address range that a given device
is able to DMA to/from in a cache-coherent way.
But currently, swiotlb uses dev->dma_mask in alloc_coherent()
implicitly via address_needs_mapping(), but alloc_coherent is really
supposed to use coherent_dma_mask.
This bug could break drivers that uses smaller coherent_dma_mask than
dma_mask (though the current code works for the majority that use the
same mask for coherent_dma_mask and dma_mask).
Johannes Berg [Mon, 17 Nov 2008 07:20:31 +0000 (23:20 -0800)]
rtnetlink: propagate error from dev_change_flags in do_setlink()
Unlike ifconfig, iproute doesn't report an error when setting
an interface up fails:
(example: put wireless network mac80211 interface into repeater mode
with iwconfig but do not set a peer MAC address, it should fail with
-ENOLINK)
without patch:
# ip link set wlan0 up ; echo $?
0
#
with patch:
# ip link set wlan0 up ; echo $?
RTNETLINK answers: Link has been severed
2
#
Propagate the return value from dev_change_flags() to fix this.
Signed-off-by: Patrick McHardy <kaber@trash.net> Tested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Mon, 17 Nov 2008 06:56:55 +0000 (22:56 -0800)]
dccp: Tidy up setsockopt calls
This splits the setsockopt calls into two groups, depending on whether an
integer argument (val) is required and whether routines being called do
their own locking.
Some options (such as setting the CCID) use u8 rather than int, so that for
these the test with regard to integer-sizeof can not be used.
The second switch-case statement now only has those statements which need
locking and which make use of `val'.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Reviewed-by: Eugene Teo <eugeneteo@kernel.sg> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Mon, 17 Nov 2008 06:55:08 +0000 (22:55 -0800)]
dccp: Deprecate Ack Ratio sysctl
This patch deprecates the Ack Ratio sysctl, since
* Ack Ratio is entirely ignored by CCID-3 and CCID-4,
* Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
* even if it would work in CCID-2, there is no point for a user to change it:
- Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
- if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
(since waiting for Acks which will never arrive in this window),
- cwnd is not a user-configurable value.
The only reasonable place for Ack Ratio is to print it for debugging. It is
planned to do this later on, as part of e.g. dccp_probe.
With this patch Ack Ratio is now under full control of feature negotiation:
* Ack Ratio is resolved as a dependency of the selected CCID;
* if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
Ratio 2 for both endpoints";
* what happens then is part of another patch set, since it concerns the
dynamic update of Ack Ratio while the connection is in full flight.
Thanks to Tomasz Grobelny for discussion leading up to this patch.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Mon, 17 Nov 2008 06:53:48 +0000 (22:53 -0800)]
dccp: Feature negotiation for minimum-checksum-coverage
This provides feature negotiation for server minimum checksum coverage
which so far has been missing.
Since sender/receiver coverage values range only from 0...15, their
type has also been reduced in size from u16 to u4.
Feature-negotiation options are now generated for both sender and receiver
coverage, i.e. when the peer has `forgotten' to enable partial coverage
then feature negotiation will automatically enable (negotiate) the partial
coverage value for this connection.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Mon, 17 Nov 2008 06:51:23 +0000 (22:51 -0800)]
dccp: Deprecate old setsockopt framework
The previous setsockopt interface, which passed socket options via struct
dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to
ugly code since the old approach did not distinguish between NN and SP values.
This patch removes the old setsockopt interface and replaces it with two new
functions to register NN/SP values for feature negotiation.
These are essentially wrappers around the internal __feat_register functions,
with checking added to avoid
* wrong usage (type);
* changing values while the connection is in progress.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Gerrit Renker [Mon, 17 Nov 2008 06:49:52 +0000 (22:49 -0800)]
dccp: Mechanism to resolve CCID dependencies
This adds a hook to resolve features whose value depends on the choice of
CCID. It is done at the server since it can only be done after the CCID
values have been negotiated; i.e. the client will add its CCID preference
list on the Change options sent in the Request, which will be reconciled
with the local preference list of the server.
The concept is documented on
http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
implementation_notes.html#ccid_dependencies
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk> Acked-by: Ian McDonald <ian.mcdonald@jandi.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
If segmentation offload is enabled by the host, we currently allocate
maximum sized packet buffers and pass them to the host. This uses up
20 ring entries, allowing us to supply only 20 packet buffers to the
host with a 256 entry ring. This is a huge overhead when receiving
small packets, and is most keenly felt when receiving MTU sized
packets from off-host.
The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
using receive buffers which are smaller than the maximum packet size.
In order to transfer large packets to the guest, the host merges
together multiple receive buffers to form a larger logical buffer.
The number of merged buffers is returned to the guest via a field in
the virtio_net_hdr.
Make use of this support by supplying single page receive buffers to
the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
the payload to the skb's linear data buffer and adjust the fragment
offset to point to the remaining data. This ensures proper alignment
and allows us to not use any paged data for small packets. If the
payload occupies multiple pages, we simply append those pages as
fragments and free the associated skbs.
This scheme allows us to be efficient in our use of ring entries
while still supporting large packets. Benchmarking using netperf from
an external machine to a guest over a 10Gb/s network shows a 100%
improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
with GSO disabled on the host side, throughput was seen to increase
from 700Mb/s to 1.7Gb/s.
Based on a patch from Herbert Xu.
Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv) Signed-off-by: David S. Miller <davem@davemloft.net>
Mark McLoughlin [Mon, 17 Nov 2008 06:40:36 +0000 (22:40 -0800)]
virtio_net: hook up the set-tso ethtool op
Seems like an oversight that we have set-tx-csum and set-sg hooked
up, but not set-tso.
Also leads to the strange situation that if you e.g. disable tx-csum,
then tso doesn't get disabled.
Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Mark McLoughlin [Mon, 17 Nov 2008 06:39:18 +0000 (22:39 -0800)]
virtio_net: Recycle some more rx buffer pages
Each time we re-fill the recv queue with buffers, we allocate
one too many skbs and free it again when adding fails. We should
recycle the pages allocated in this case.
A previous version of this patch made trim_pages() trim trailing
unused pages from skbs with some paged data, but this actually
caused a barely measurable slowdown.
Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv) Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Chinner [Mon, 17 Nov 2008 06:37:10 +0000 (17:37 +1100)]
[XFS] Fix double free of log tickets
When an I/O error occurs during an intermediate commit on a rolling
transaction, xfs_trans_commit() will free the transaction structure
and the related ticket. However, the duplicate transaction that
gets used as the transaction continues still contains a pointer
to the ticket. Hence when the duplicate transaction is cancelled
and freed, we free the ticket a second time.
Add reference counting to the ticket so that we hold an extra
reference to the ticket over the transaction commit. We drop the
extra reference once we have checked that the transaction commit
did not return an error, thus avoiding a double free on commit
error.
Credit to Nick Piggin for tripping over the problem.
SGI-PV: 989741
Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Eric Dumazet [Mon, 17 Nov 2008 03:46:36 +0000 (19:46 -0800)]
net: make sure struct dst_entry refcount is aligned on 64 bytes
As found in the past (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17
[NET]: Fix tbench regression in 2.6.25-rc1), it is really
important that struct dst_entry refcount is aligned on a cache line.
We cannot use __atribute((aligned)), so manually pad the structure
for 32 and 64 bit arches.
for 32bit : offsetof(truct dst_entry, __refcnt) is 0x80
for 64bit : offsetof(truct dst_entry, __refcnt) is 0xc0
As it is not possible to guess at compile time cache line size,
we use a generic value of 64 bytes, that satisfies many current arches.
(Using 128 bytes alignment on 64bit arches would waste 64 bytes)
Add a BUILD_BUG_ON to catch future updates to "struct dst_entry" dont
break this alignment.
"tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
(2350 MB/s instead of 2250 MB/s)
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>