Stefan Richter [Sun, 4 Jan 2009 15:23:29 +0000 (16:23 +0100)]
firewire: cdev: unify names of struct types and of their instances
to indicate that they are specializations of struct event or of struct
client_resource, respectively.
struct response was both an event and a client_resource; it is now split
into struct outbound_transaction_resource and ~_event in order to
document more explicitly which types of client resources exist.
struct request and struct_request_event are renamed to struct
inbound_transaction_resource and ~_event because requests and responses
occur in outbound and in inbound transactions.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Stefan Richter [Sun, 4 Jan 2009 15:23:29 +0000 (16:23 +0100)]
firewire: cdev: reference-count client instances
The lifetime of struct client instances must be longer than the lifetime
of any client resource.
This fixes a possible race between fw_device_op_release and transaction
completions. It also prepares for new ioctls for isochronous resource
management which will involve delayed processing of client resources.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Reviewed-by: David Moore <dcm@acm.org>
Stefan Richter [Fri, 2 Jan 2009 11:47:13 +0000 (12:47 +0100)]
firewire: cdev: fix documentation of FW_CDEV_IOC_GET_INFO
The FW_CDEV_IOC_GET_INFO ioctl looks at client->device->config_rom, not
at the local node's config ROM.
We could fix the implementation or the documentation. I believe the way
how it is currently implemented is more useful than the way how it is
currently documented. In fact, libdc1394 uses the ABI already as
implemented, not as documented. Hence let's change the documentation.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Stefan Richter [Sun, 21 Dec 2008 15:39:46 +0000 (16:39 +0100)]
firewire: prevent creation of multiple IR DMA contexts for the same channel
OHCI-1394 1.1 clause 10.4.3 says: "If more than one IR DMA context
specifies receives for packets from the same isochronous channel, the
context destination for that channel's packets is undefined."
Any userspace client and in the future also kernelspace clients can
allocate IR DMA contexts for any channel. We don't want them to
interfere with each other, hence it is preferable to return -EBUSY if
allocation of a second context for a channel is attempted.
Notes:
- This limitation is OHCI-1394 specific, therefore its proper place of
implementation is down in the low-level driver.
- Since the <linux/firewire-cdev.h> ABI simply maps one userspace iso
client context to one hardware iso context, this OHCI-1394
limitation alas requires userspace to implement its own multiplexing
of iso reception from the same channel and card to multiple clients
when needed.
- The limitation is independent of channel allocation at the IRM; the
latter is really only important for the initiation of iso
transmission but not of iso reception.
- We don't need to do the same for IT DMA because OHCI-1394 does not
have any ties between IT contexts and channels. Only the voluntary
channel allocation protocol via the IRM, globally to the FireWire
bus, can ensure proper isochronous transmit behaviour anyway.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Stefan Richter [Sun, 14 Dec 2008 18:21:01 +0000 (19:21 +0100)]
firewire: cdev: address handler input validation
Like before my commit 1415d9189e8c59aa9c77a3bba419dcea062c145f,
fw_core_add_address_handler() does not align the address region now.
Instead the caller is required to pass valid parameters.
Since one of the callers of fw_core_add_address_handler() is the cdev
userspace interface, we now check for valid input. If the client is
buggy, we give it a hint with -EINVAL.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Jay Fenlason [Sun, 21 Dec 2008 15:47:17 +0000 (16:47 +0100)]
firewire: cdev: use an idr rather than a linked list for resources
The current code uses a linked list and a counter for storing
resources and the corresponding handle numbers. By changing to an idr
we can be safe from counter wrap-around giving two resources the same
handle.
Furthermore, the deallocation ioctls now check whether the resource to
be freed is of the intended type.
Signed-off-by: Jay Fenlason <fenlason@redhat.com>
Some rework by Stefan R:
- The idr API documentation says we get an ID within 0...0x7fffffff.
Hence we can rest assured that idr handles fit into cdev handles.
- Fix some races. Add a client->in_shutdown flag for this purpose.
- Add allocation retry to add_client_resource().
- It is possible to use idr_for_each() in fw_device_op_release().
- Fix ioctl_send_response() regression.
- Small style changes.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Stefan Richter [Fri, 5 Dec 2008 21:44:42 +0000 (22:44 +0100)]
firewire: cdev: tcodes input validation
The behaviour of fw-transaction.c::fw_send_request is ill-defined for
any other tcodes than read/ write/ lock request tcodes. Therefore
prevent requests with wrong tcodes from entering the transaction layer.
Maybe fw_send_request should check them itself, but I am not inclined to
change it and fw_fill_request from void-valued functions to ones which
return error codes and pass those up. Besides, maybe fw_send_request is
going to support one more tcode than ioctl_send_request in the future
(TCODE_STREAM_DATA).
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Jay Fenlason [Fri, 3 Oct 2008 15:19:09 +0000 (11:19 -0400)]
firewire: add a client_list_lock
This adds a client_list_lock, which only protects the device's
client_list, so that future versions of the driver can call code that
takes the card->lock while holding the client_list_lock. Adding this
lock is much simpler than adding __ versions of all the functions that
the future version may need. The one ordering issue is to make sure
code never takes the client_list_lock with card->lock held. Since
client_list_lock is only used in three places, that isn't hard.
Signed-off-by: Jay Fenlason <fenlason@redhat.com>
Update fill_bus_reset_event() accordingly. Include linux/spinlock.h.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
David Moore [Wed, 23 Jul 2008 06:23:40 +0000 (23:23 -0700)]
firewire: Include iso timestamp in headers when header_size > 4
Previously, when an iso context had header_size > 4, the iso header
(len/tag/channel/tcode/sy) was passed to userspace followed by quadlets
stripped from the payload. This patch changes the behavior:
header_size = 8 now passes the header quadlet followed by the timestamp
quadlet. When header_size > 8, quadlets are stripped from the payload.
The header_size = 4 case remains identical.
Since this alters the semantics of the API, the firewire API version
needs to be bumped concurrently with this change.
This change also refactors the header copying code slightly to be much
easier to read.
Signed-off-by: David Moore <dcm@acm.org> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Thomas Gleixner [Tue, 24 Mar 2009 19:27:39 +0000 (20:27 +0100)]
genirq: provide old request_irq() for CONFIG_GENERIC_HARDIRQ=n
Impact: Undo compile breakage for archs with CONFIG_GENERIC_HARDIRQ=n
The threaded interrupt handler patches changed request_irq from extern
to inline. Architectures which do not use the generic irq code still
have request_irq() as a global function and therefor fail to compile.
Keep the extern declaration for CONFIG_GENERIC_HARDIRQ=n
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Anton Vorontsov [Tue, 24 Mar 2009 19:06:46 +0000 (12:06 -0700)]
ucc_geth: Fix build breakage caused by a merge
This patch fixes following build error:
CC ucc_geth.o
ucc_geth.c: In function 'ucc_geth_probe':
ucc_geth.c:3644: error: implicit declaration of function 'uec_mdio_bus_name'
make[2]: *** [ucc_geth.o] Error 1
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Steven Rostedt [Tue, 24 Mar 2009 15:06:24 +0000 (11:06 -0400)]
function-graph: add option for include sleep times
Impact: give user a choice to show times spent while sleeping
The user may want to see the time a function spent sleeping.
This patch adds the trace option "sleep-time" to allow that.
The "sleep-time" option is default on.
Kumar Gala [Tue, 24 Mar 2009 13:23:13 +0000 (08:23 -0500)]
powerpc/83xx: Update ranges in gianfar node to match other dts
The gianfar@25000 node was missing its ranges prop for the mdio bus
and provided an explicit ranges property on gianfar@24000 to match
change from commit:
Benjamin Krill [Fri, 23 Jan 2009 16:18:05 +0000 (17:18 +0100)]
[MTD] ofpart: Check name property to determine partition nodes.
SLOF has a further node which could not be evaluated
by the current routine. The current routine returns
because the node hasn't the required reg property. As
fix this patch adds a check to determine the partition
child nodes. If the node is not a partition the number
of total partitions will be decreased and loop continues
with the next nodes.
Signed-off-by: Benjamin Krill <ben@codiert.org> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Anton Vorontsov [Thu, 19 Mar 2009 18:01:51 +0000 (21:01 +0300)]
powerpc/86xx: Move gianfar mdio nodes under the ethernet nodes
Currently it doesn't matter where the mdio nodes are placed, but with
power management support (i.e. when sleep = <> properties will take
effect), mdio nodes placement will become important: mdio controller
is a part of the ethernet block, so the mdio nodes should be placed
correctly. Otherwise we may wrongly assume that MDIO controllers are
available during sleep.
Suggested-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Anton Vorontsov [Thu, 19 Mar 2009 18:01:48 +0000 (21:01 +0300)]
powerpc/85xx: Move gianfar mdio nodes under the ethernet nodes
Currently it doesn't matter where the mdio nodes are placed, but with
power management support (i.e. when sleep = <> properties will take
effect), mdio nodes placement will become important: mdio controller
is a part of the ethernet block, so the mdio nodes should be placed
correctly. Otherwise we may wrongly assume that MDIO controllers are
available during sleep.
Suggested-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Anton Vorontsov [Thu, 19 Mar 2009 18:01:45 +0000 (21:01 +0300)]
powerpc/83xx: Move gianfar mdio nodes under the ethernet nodes
Currently it doesn't matter where the mdio nodes are placed, but with
power management support (i.e. when sleep = <> properties will take
effect), mdio nodes placement will become important: mdio controller
is a part of the ethernet block, so the mdio nodes should be placed
correctly. Otherwise we may wrongly assume that MDIO controllers are
available during sleep.
Suggested-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Anton Vorontsov [Thu, 19 Mar 2009 18:01:42 +0000 (21:01 +0300)]
powerpc/83xx: Add power management support for MPC837x boards
This patch adds pmc nodes to the device tree files so that the boards
will able to use standby capability of MPC837x processors. The MPC837x
PMC controllers are compatible with MPC8349 ones (i.e. no deep sleep).
sleep = <> properties are used to specify SCCR masks as described
in "Specifying Device Power Management Information (sleep property)"
chapter in Documentation/powerpc/booting-without-of.txt.
Since I2C1 and eSDHC controllers share the same clock source, they
are now placed under sleep-nexus nodes.
A processor is able to wakeup the boards on LAN events (Wake-On-Lan),
console events (with no_console_suspend kernel command line), GPIO
events and external IRQs (IRQ1 and IRQ2).
The processor can also wakeup the boards by the fourth general purpose
timer in GTM1 block, but the GTM wakeup support isn't yet implemented
(it's tested to work, but it's unclear how can we use the quite short
GTM timers, and how do we want to expose the GTM to userspace).
Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Steven Rostedt [Tue, 24 Mar 2009 05:10:15 +0000 (01:10 -0400)]
function-graph: ignore times across schedule
Impact: more accurate timings
The current method of function graph tracing does not take into
account the time spent when a task is not running. This shows functions
that call schedule have increased costs:
Steven Rostedt [Tue, 24 Mar 2009 04:18:31 +0000 (00:18 -0400)]
function-graph: prevent more than one tracer registering
Impact: prevent crash due to multiple function graph tracers
The function graph tracer can currently only handle a single tracer
being registered. If another tracer registers with the function
graph tracer it can crash the system.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Tue, 24 Mar 2009 03:38:49 +0000 (23:38 -0400)]
function-graph: moved the timestamp from arch to generic code
This patch move the timestamp from happening in the arch specific
code into the general code. This allows for better control by the tracer
to time manipulation.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Eric Dumazet [Tue, 24 Mar 2009 13:26:50 +0000 (14:26 +0100)]
netfilter: nf_conntrack: Reduce conntrack count in nf_conntrack_free()
We use RCU to defer freeing of conntrack structures. In DOS situation, RCU might
accumulate about 10.000 elements per CPU in its internal queues. To get accurate
conntrack counts (at the expense of slightly more RAM used), we might consider
conntrack counter not taking into account "about to be freed elements, waiting
in RCU queues". We thus decrement it in nf_conntrack_free(), not in the RCU
callback.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Tested-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Patrick McHardy <kaber@trash.net>
Steven Rostedt [Sat, 21 Mar 2009 06:44:50 +0000 (02:44 -0400)]
tracing: fix memory leak in trace_stat
If the function profiler does not have any items recorded and one were
to cat the function stat file, the kernel would take a BUG with a NULL
pointer dereference.
Looking further into this, I found that returning NULL from stat_start
did not stop the stat logic, and would later call stat_next. This breaks
from the way seq_file works, so I looked into fixing the stat code.
This is where I noticed that the last next_entry is never freed.
It is allocated, and if the stat_next returns NULL, the code breaks out
of the loop, unlocks the mutex and exits. We never link the next_entry
nor do we free it. Thus it is a real memory leak.
This patch rearranges the code a bit to not only fix the memory leak,
but also to act more like seq_file where nothing is printed if there
is nothing to print. That is, stat_start returns NULL.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Also:
- make act_mask accept "ahead", "meta", "discard" and "drv_data"
- use strsep() instead of strchr() to parse user input
- return -EINVAL if a token is not found in the mask map
- fix a bug that 'value' is unsigned, so it can < 0
- propagate error value of blk_trace_mask2str() to userspace, but not
always return -ENXIO.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <49C8AB42.1000802@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Honour barrier requests in the loop back block device driver.
In case of barrier bios, flush the backing file once before processing the
barrier and once after to guarantee ordering. In case of filesystems that
does not support fsync, barrier bios would be failed with -EOPNOTSUPP.
Boaz Harrosh [Tue, 24 Mar 2009 11:23:40 +0000 (12:23 +0100)]
bsg: add support for tail queuing
Currently inherited from sg.c bsg will submit asynchronous request
at the head-of-the-queue, (using "at_head" set in the call to
blk_execute_rq_nowait()). This is bad in situation where the queues
are full, requests will execute out of order, and can cause
starvation of the first submitted requests.
The sg_io_v4->flags member is used and a bit is allocated to denote the
Q_AT_TAIL. Zero is to queue at_head as before, to be compatible with old
code at the write/read path. SG_IO code path behavior was changed so to
be the same as write/read behavior. SG_IO was very rarely used and breaking
compatibility with it is OK at this stage.
sg_io_hdr at sg.h also has a flags member and uses 3 bits from the first
nibble and one bit from the last nibble. Even though none of these bits
are supported by bsg, The second nibble is allocated for use by bsg. Just
in case.
Petros Koutoupis [Wed, 11 Mar 2009 09:49:35 +0000 (10:49 +0100)]
block: genhd.h cleanup patch
In include/linux/genhd.h: Line 335 has a comment that needs to be updated from: /* drivers/block/ll_rw_blk.c */ to /* block/blk-core.c */. Also as of kernel 2.6.16, the function definition for get_blkdev_list was removed from block/genhd.c but the function declaration is still present on line 339. This patch addresses both those fixes, by updating the comment and removing the declaration.
Petros Koutoupis [Tue, 10 Mar 2009 07:25:54 +0000 (08:25 +0100)]
block: genhd.h comment needs updating
The include/linux/genhd.h file, on line 338-352 declares some function
prototypes in which the comment on line 338 states that the definition of
these prototypes are to be found at drivers/block/genhd.c. The problem is
that genhd.c has been relocated to block/genhd.c. See attached patch to
correct this minor cosmetic typo.
Jens Axboe [Fri, 27 Feb 2009 19:14:20 +0000 (20:14 +0100)]
cciss: add BUILD_BUG_ON() for catching bad CommandList_struct alignment
The hardware requires 64-bit alignment of commands, so add a build bug
check for that. The recent commit 8a3173de4ab4cdacc43675dc5c077f9a5bf17f5f
didn't change the size of the command, but other additions/changes may and
thus break badly at runtime.
Jens Axboe [Fri, 5 Dec 2008 15:10:29 +0000 (16:10 +0100)]
block: don't create bio_vec slabs of less than the inline number
If we don't have CONFIG_BLK_DEV_INTEGRITY set, then we don't have
any external dependencies on the bio_vec slabs. So don't create
the ones that we will inline anyway.
This removes some old code that was causing issues during
filesystem freeze.
Reported-by: Andrew Price <andy@andrewprice.me.uk> Tested-by: Andrew Price <andy@andrewprice.me.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The logic requires that we mark the glock dirty in page_mkwrite
otherwise we might not flush correctly in the case that no
allocation was required in the process of dirying the page.
Also we need to set the shared write flag early for the same
reason.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This cleans up a number of bits of code mostly based in glops.c.
A couple of simple functions have been merged into the callers
to make it more obvious what is going on, the mysterious raising
of i_writecount around the truncate_inode_pages() call has been
removed. The meta_go_* operations have been renamed rgrp_go_*
since that is the only lock type that they are used with.
The unused argument of gfs2_read_sb has been removed. Also
a bug has been fixed where a check for the rindex inode was
in the wrong callback. More comments are added, and the
debugging code is improved too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
GFS2: Fix locking bug in failed shared to exclusive conversion
After calling out to the dlm, GFS2 sets the new state of a glock to
gl_target in gdlm_ast(). However, gl_target is not always the lock
state that was requested. If a conversion from shared to exclusive
fails, finish_xmote() will call do_xmote() with LM_ST_UNLOCKED, instead
of gl->gl_target, so that it can reacquire the lock in exlusive the next
time around. In this case, setting the lock to gl_target in gdlm_ast()
will make GFS2 think that it has the glock in exclusive mode, when
really, it doesn't have the glock locked at all. This patch adds a new
field to the gfs2_glock structure, gl_req, to track the mode that was
requested.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Hisashi Hifumi [Tue, 3 Mar 2009 02:45:20 +0000 (11:45 +0900)]
GFS2: Pagecache usage optimization on GFS2
I introduced "is_partially_uptodate" aops for GFS2.
A page can have multiple buffers and even if a page is not uptodate, some buffers
can be uptodate on pagesize != blocksize environment.
This aops checks that all buffers which correspond to a part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to HDD even if a page is not uptodate because the portion we
want to read are uptodate.
"block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch random read/write mixed workloads or random read after
random write workloads can be optimized and we can get performance improvement.
-2.6.29-rc6
Test execution summary:
total time: 202.6389s
total number of events: 200000
total time taken by event execution: 2580.0480
per-request statistics:
min: 0.0000s
avg: 0.0129s
max: 49.5852s
approx. 95 percentile: 0.0462s
-2.6.29-rc6-patched
Test execution summary:
total time: 177.8639s
total number of events: 200000
total time taken by event execution: 2419.0199
per-request statistics:
min: 0.0000s
avg: 0.0121s
max: 52.4306s
approx. 95 percentile: 0.0444s
arch: ia64
pagesize: 16k
blocksize: 4k
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Hannes Eder [Sat, 21 Feb 2009 01:11:42 +0000 (02:11 +0100)]
GFS2: fix sparse warnings: constant is so big it is ...
Fix this sparse warnings:
fs/gfs2/rgrp.c:156:23: warning: constant 0xffffffffffffffff is so big it is unsigned long long
fs/gfs2/rgrp.c:157:23: warning: constant 0xaaaaaaaaaaaaaaaa is so big it is unsigned long long
fs/gfs2/rgrp.c:158:23: warning: constant 0x5555555555555555 is so big it is long long
fs/gfs2/rgrp.c:194:20: warning: constant 0x5555555555555555 is so big it is long long
fs/gfs2/rgrp.c:204:44: warning: constant 0x5555555555555555 is so big it is long long
Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds support for "quota" and "noquota" mount options in addition to the
existing "quota=on/off/account" so that we are compatible with the names by
which these options are more generally known.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
An alignment issue with the existing bitfit algorithm was reported
on IA64. This patch attempts to fix that, and also to tidy up the
code a bit. There is now more documentation about how this works
and it has survived a number of different tests.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds a sysfs file called demote_rq to GFS2's
per filesystem directory. Its possible to use this
file to demote arbitrary glocks in exactly the same
way as if a request had come in from a remote node.
This is intended for testing issues relating to caching
of data under glocks. Despite that, the interface is
generic enough to send requests to any type of glock,
but be careful as its not always safe to send an
arbitrary message to an arbitrary glock. For that reason
and to prevent DoS, this interface is restricted to root
only.
Which means "please demote inode glock (type 2) number 13324 so that
I can get an EX (exclusive) lock". The lock modes are those which
would normally be sent by a remote node in its callback so if you
want to unlock a glock, you use EX, to demote to shared, use SH or PR
(depending on whether you like GFS2 or DLM lock modes better!).
If the glock doesn't exist, you'll get -ENOENT returned. If the
arguments don't make sense, you'll get -EINVAL returned.
The plan is that this interface will be used in combination with
the blktrace patch which I recently posted for comments although
it is, of course, still useful in its own right.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since we have a UUID, we ought to expose it to the user via sysfs
and uevents. We already have the fs name in both of these places
(a combination of the lock proto and lock table name) so if we add
the UUID as well, we have a full set.
For older filesystems (i.e. those created before mkfs.gfs2 was writing
UUIDs by default) the sysfs file will appear zero length, and no UUID
env var will be added to the uevents.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch allows GFS2 to generate discard requests for blocks which are
no longer useful to the filesystem (i.e. those which have been freed as
the result of an unlink operation). The requests are generated at the
time which those blocks become available for reuse in the filesystem.
In order to use this new feature, you have to specify the "discard"
mount option. The code coalesces adjacent blocks into a single extent
when generating the discard requests, thus generating the minimum
number.
If an error occurs when the request has been sent to the block device,
then it will print a message and turn off the requests for that
filesystem. If the problem is temporary, then you can use remount to
turn the option back on again. There is also a nodiscard mount option
so that you can use remount to turn discard requests off, if required.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a deadlock when the journal is flushed and there
are dirty inodes other than the one which caused the journal flush.
Originally the journal flushing code was trying to obtain the
transaction glock while running the flush code for an inode glock.
We no longer require the transaction glock at this point in time
since we know that any attempt to get the transaction glock from
another node will result in a journal flush. So if we are flushing
the journal, we can be sure that the transaction lock is still
cached from when the transaction was started.
By inlining a version of gfs2_trans_begin() (minus the bit which
gets the transaction glock) we can avoid the deadlock problems
caused if there is a demote request queued up on the transaction
glock.
In addition I've also moved the umount rwsem so that it covers
the glock workqueue, since it all demotions are done by this
workqueue now. That fixes a bug on umount which I came across
while fixing the original problem.
Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The time stamp field is unused in the glock now that we are
using a shrinker, so that we can remove it and save sizeof(unsigned long)
bytes in each glock.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is the big patch that I've been working on for some time
now. There are many reasons for wanting to make this change
such as:
o Reducing overhead by eliminating duplicated fields between structures
o Simplifcation of the code (reduces the code size by a fair bit)
o The locking interface is now the DLM interface itself as proposed
some time ago.
o Fewer lookups of glocks when processing replies from the DLM
o Fewer memory allocations/deallocations for each glock
o Scope to do further optimisations in the future (but this patch is
more than big enough for now!)
Please note that (a) this patch relates to the lock_dlm module and
not the DLM itself, that is still a separate module; and (b) that
we retain the ability to build GFS2 as a standalone single node
filesystem with out requiring the DLM.
This patch needs a lot of testing, hence my keeping it I restarted
my -git tree after the last merge window. That way, this has the maximum
exposure before its merged. This is (modulo a few minor bug fixes) the
same patch that I've been posting on and off the the last three months
and its passed a number of different tests so far.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Abhijith Das [Wed, 7 Jan 2009 16:21:34 +0000 (10:21 -0600)]
GFS2: Bring back lvb-related stuff to lock_nolock to support quotas
The quota code uses lvbs and this is currently not implemented in
lock_nolock, thereby causing panics when quota is enabled with
lock_nolock. This patch adds the relevant bits.
Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The following patch fixes an issue relating to remount and argument
parsing. After this fix is applied, remount becomes atomic in that
it either succeeds changing the mount to the new state, or it fails
and leaves it in the old state. Previously it was possible for the
parsing of options to fail part way though and for the fs to be left
in a state where some of the new arguments had been applied, but some
had not.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Thomas Gleixner [Mon, 23 Mar 2009 17:28:15 +0000 (18:28 +0100)]
genirq: add threaded interrupt handler support
Add support for threaded interrupt handlers:
A device driver can request that its main interrupt handler runs in a
thread. To achive this the device driver requests the interrupt with
request_threaded_irq() and provides additionally to the handler a
thread function. The handler function is called in hard interrupt
context and needs to check whether the interrupt originated from the
device. If the interrupt originated from the device then the handler
can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is
returned when no further action is required. IRQ_WAKE_THREAD causes
the genirq code to invoke the threaded (main) handler. When
IRQ_WAKE_THREAD is returned handler must have disabled the interrupt
on the device level. This is mandatory for shared interrupt handlers,
but we need to do it as well for obscure x86 hardware where disabling
an interrupt on the IO_APIC level redirects the interrupt to the
legacy PIC interrupt lines.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>
Sheng Yang [Wed, 18 Mar 2009 07:33:06 +0000 (15:33 +0800)]
iommu: Add domain_has_cap iommu_ops
This iommu_op can tell if domain have a specific capability, like snooping
control for Intel IOMMU, which can be used by other components of kernel to
adjust the behaviour.
Signed-off-by: Sheng Yang <sheng@linux.intel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
noticed a bug in pci PAT code and memory type setting.
PCI mmap code did not set the proper protection in vma, when it
inherited protection in reserve_memtype. This bug only affects
the case where there exists a WC mapping before X does an mmap
with /proc or /sys pci interface. This will cause X userlevel
mmap from /proc or /sysfs to fail on fork.
Avi Kivity [Mon, 23 Mar 2009 20:13:44 +0000 (22:13 +0200)]
KVM: VMX: Don't allow uninhibited access to EFER on i386
vmx_set_msr() does not allow i386 guests to touch EFER, but they can still
do so through the default: label in the switch. If they set EFER_LME, they
can oops the host.
Fix by having EFER access through the normal channel (which will check for
EFER_LME) even on i386.
Reported-and-tested-by: Benjamin Gilbert <bgilbert@cs.cmu.edu> Cc: stable@kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>
Andrea Arcangeli [Thu, 12 Mar 2009 17:18:43 +0000 (18:18 +0100)]
KVM: Fix missing smp tlb flush in invlpg
When kvm emulates an invlpg instruction, it can drop a shadow pte, but
leaves the guest tlbs intact. This can cause memory corruption when
swapping out.
Without this the other cpu can still write to a freed host physical page.
tlb smp flush must happen if rmap_remove is called always before mmu_lock
is released because the VM will take the mmu_lock before it can finally add
the page to the freelist after swapout. mmu notifier makes it safe to flush
the tlb after freeing the page (otherwise it would never be safe) so we can do
a single flush for multiple sptes invalidated.
Cc: stable@kernel.org Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Hannes Eder [Sat, 21 Feb 2009 01:19:13 +0000 (02:19 +0100)]
KVM: fix sparse warnings: Should it be static?
Impact: Make symbols static.
Fix this sparse warnings:
arch/x86/kvm/mmu.c:992:5: warning: symbol 'mmu_pages_add' was not declared. Should it be static?
arch/x86/kvm/mmu.c:1124:5: warning: symbol 'mmu_pages_next' was not declared. Should it be static?
arch/x86/kvm/mmu.c:1144:6: warning: symbol 'mmu_pages_clear_parents' was not declared. Should it be static?
arch/x86/kvm/x86.c:2037:5: warning: symbol 'kvm_read_guest_virt' was not declared. Should it be static?
arch/x86/kvm/x86.c:2067:5: warning: symbol 'kvm_write_guest_virt' was not declared. Should it be static?
virt/kvm/irq_comm.c:220:5: warning: symbol 'setup_routing_entry' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Avi Kivity <avi@redhat.com>
Amit Shah [Thu, 28 Feb 2008 10:36:15 +0000 (16:06 +0530)]
KVM: is_long_mode() should check for EFER.LMA
is_long_mode currently checks the LongModeEnable bit in
EFER instead of the LongModeActive bit. This is wrong, but
we survived this till now since it wasn't triggered. This
breaks guests that go from long mode to compatibility mode.
This is noticed on a solaris guest and fixes bug #1842160
Signed-off-by: Amit Shah <amit.shah@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com>
Amit Shah [Fri, 20 Feb 2009 17:23:37 +0000 (22:53 +0530)]
KVM: VMX: Update necessary state when guest enters long mode
setup_msrs() should be called when entering long mode to save the
shadow state for the 64-bit guest state.
Using vmx_set_efer() in enter_lmode() removes some duplicated code
and also ensures we call setup_msrs(). We can safely pass the value
of shadow_efer to vmx_set_efer() as no other bits in the efer change
while enabling long mode (guest first sets EFER.LME, then sets CR0.PG
which causes a vmexit where we activate long mode).
Joerg Roedel [Thu, 19 Feb 2009 11:18:56 +0000 (12:18 +0100)]
KVM: MMU: Fix another largepage memory leak
In the paging_fetch function rmap_remove is called after setting a large
pte to non-present. This causes rmap_remove to not drop the reference to
the large page. The result is a memory leak of that page.