Merge branch 'for_2.6.29' of git://git.kernel.org/pub/scm/linux/kernel/git/kkeil...
author Linus Torvalds <torvalds@linux-foundation.org>
Fri, 9 Jan 2009 23:27:39 +0000 (15:27 -0800)
committer Linus Torvalds <torvalds@linux-foundation.org>
Fri, 9 Jan 2009 23:27:39 +0000 (15:27 -0800)
* 'for_2.6.29' of git://git.kernel.org/pub/scm/linux/kernel/git/kkeil/ISDN-2.6: (28 commits)
  mISDN: Add HFC USB driver
  mISDN: Add layer1 prim MPH_INFORMATION_REQ
  mISDN: Fix kernel crash when doing hardware conference with more than two members
  mISDN: Added missing create_l1() call
  mISDN: Add MODULE_DEVICE_TABLE() to hfcpci
  mISDN: Minor cleanups
  mISDN: Create /sys/class/mISDN
  mISDN: Add missing release functions
  mISDN: Add different timer settings for hfc-pci
  mISDN: Minor fixes
  mISDN: Correct busy device detection
  mISDN: Fix deactivation, if peer IP is removed from l1oip instance.
  mISDN: Add ISDN_P_TE_UP0 / ISDN_P_NT_UP0
  mISDN: Fix irq detection
  mISDN: Add ISDN sample clock API to mISDN core
  mISDN: Return error on E-channel access
  mISDN: Add E-Channel logging features
  mISDN: Use protocol to detect D-channel
  mISDN: Fixed more indexing bugs
  mISDN: Make debug output a little bit more verbose
  ...

192 files changed:
Documentation/filesystems/btrfs.txt [new file with mode: 0644]
Documentation/filesystems/squashfs.txt [new file with mode: 0644]
Documentation/kernel-parameters.txt
Documentation/nommu-mmap.txt
Documentation/sysctl/vm.txt
MAINTAINERS
arch/arm/include/asm/mmu.h
arch/arm/mm/dma-mapping.c
arch/blackfin/include/asm/mmu.h
arch/blackfin/kernel/ptrace.c
arch/blackfin/kernel/traps.c
arch/frv/kernel/ptrace.c
arch/h8300/include/asm/mmu.h
arch/m68knommu/include/asm/mmu.h
arch/parisc/include/asm/Kbuild
arch/parisc/include/asm/byteorder.h
arch/parisc/include/asm/swab.h [new file with mode: 0644]
arch/powerpc/include/asm/cell-pmu.h
arch/powerpc/include/asm/oprofile_impl.h
arch/powerpc/oprofile/cell/pr_util.h
arch/powerpc/oprofile/cell/spu_profiler.c
arch/powerpc/oprofile/common.c
arch/powerpc/oprofile/op_model_cell.c
arch/s390/include/asm/chpid.h
arch/s390/include/asm/chsc.h
arch/s390/include/asm/cmb.h
arch/s390/include/asm/dasd.h
arch/s390/include/asm/kvm.h
arch/s390/include/asm/posix_types.h
arch/s390/include/asm/ptrace.h
arch/s390/include/asm/qeth.h
arch/s390/include/asm/schid.h
arch/s390/include/asm/swab.h
arch/s390/include/asm/types.h
arch/s390/kernel/entry.h
arch/s390/kernel/smp.c
arch/s390/kernel/sys_s390.c
arch/s390/kernel/vdso.c
arch/s390/kernel/vdso32/gettimeofday.S
arch/s390/kvm/diag.c
arch/s390/kvm/interrupt.c
arch/s390/kvm/priv.c
arch/sh/include/asm/mmu.h
arch/x86/include/asm/bitops.h
arch/x86/kernel/early-quirks.c
arch/x86/oprofile/op_model_amd.c
drivers/char/hvc_iucv.c
drivers/leds/Kconfig
drivers/leds/Makefile
drivers/leds/led-class.c
drivers/leds/leds-alix2.c [new file with mode: 0644]
drivers/leds/leds-ams-delta.c
drivers/leds/leds-clevo-mail.c
drivers/leds/leds-fsg.c
drivers/leds/leds-gpio.c
drivers/leds/leds-hp-disk.c
drivers/leds/leds-hp6xx.c
drivers/leds/leds-net48xx.c
drivers/leds/leds-pca9532.c
drivers/leds/leds-s3c24xx.c
drivers/leds/leds-wm8350.c [new file with mode: 0644]
drivers/leds/leds-wrap.c
drivers/leds/ledtrig-timer.c
drivers/mfd/wm8350-core.c
drivers/oprofile/buffer_sync.c
drivers/oprofile/cpu_buffer.c
drivers/oprofile/cpu_buffer.h
drivers/oprofile/event_buffer.c
drivers/oprofile/oprof.c
drivers/oprofile/oprof.h
drivers/oprofile/oprofile_files.c
drivers/regulator/wm8350-regulator.c
drivers/s390/block/dasd.c
drivers/s390/block/dasd_devmap.c
drivers/s390/block/dasd_diag.c
drivers/s390/block/dasd_eckd.c
drivers/s390/block/dasd_fba.c
drivers/s390/char/Kconfig
drivers/s390/cio/qdio_debug.c
drivers/video/backlight/Kconfig
drivers/video/backlight/Makefile
drivers/video/backlight/backlight.c
drivers/video/backlight/corgi_bl.c [deleted file]
drivers/video/backlight/cr_bllcd.c
drivers/video/backlight/generic_bl.c [new file with mode: 0644]
drivers/video/backlight/hp680_bl.c
drivers/video/backlight/mbp_nvidia_bl.c
drivers/video/backlight/progear_bl.c
drivers/video/backlight/tdo24m.c
drivers/video/backlight/tosa_lcd.c
drivers/video/backlight/vgg2432a4.c
fs/Kconfig
fs/Makefile
fs/binfmt_elf_fdpic.c
fs/binfmt_flat.c
fs/btrfs/Makefile [new file with mode: 0644]
fs/btrfs/acl.c [new file with mode: 0644]
fs/btrfs/async-thread.c [new file with mode: 0644]
fs/btrfs/async-thread.h [new file with mode: 0644]
fs/btrfs/btrfs_inode.h [new file with mode: 0644]
fs/btrfs/compat.h [new file with mode: 0644]
fs/btrfs/compression.c [new file with mode: 0644]
fs/btrfs/compression.h [new file with mode: 0644]
fs/btrfs/crc32c.h [new file with mode: 0644]
fs/btrfs/ctree.c [new file with mode: 0644]
fs/btrfs/ctree.h [new file with mode: 0644]
fs/btrfs/dir-item.c [new file with mode: 0644]
fs/btrfs/disk-io.c [new file with mode: 0644]
fs/btrfs/disk-io.h [new file with mode: 0644]
fs/btrfs/export.c [new file with mode: 0644]
fs/btrfs/export.h [new file with mode: 0644]
fs/btrfs/extent-tree.c [new file with mode: 0644]
fs/btrfs/extent_io.c [new file with mode: 0644]
fs/btrfs/extent_io.h [new file with mode: 0644]
fs/btrfs/extent_map.c [new file with mode: 0644]
fs/btrfs/extent_map.h [new file with mode: 0644]
fs/btrfs/file-item.c [new file with mode: 0644]
fs/btrfs/file.c [new file with mode: 0644]
fs/btrfs/free-space-cache.c [new file with mode: 0644]
fs/btrfs/hash.h [new file with mode: 0644]
fs/btrfs/inode-item.c [new file with mode: 0644]
fs/btrfs/inode-map.c [new file with mode: 0644]
fs/btrfs/inode.c [new file with mode: 0644]
fs/btrfs/ioctl.c [new file with mode: 0644]
fs/btrfs/ioctl.h [new file with mode: 0644]
fs/btrfs/locking.c [new file with mode: 0644]
fs/btrfs/locking.h [new file with mode: 0644]
fs/btrfs/ordered-data.c [new file with mode: 0644]
fs/btrfs/ordered-data.h [new file with mode: 0644]
fs/btrfs/orphan.c [new file with mode: 0644]
fs/btrfs/print-tree.c [new file with mode: 0644]
fs/btrfs/print-tree.h [new file with mode: 0644]
fs/btrfs/ref-cache.c [new file with mode: 0644]
fs/btrfs/ref-cache.h [new file with mode: 0644]
fs/btrfs/root-tree.c [new file with mode: 0644]
fs/btrfs/struct-funcs.c [new file with mode: 0644]
fs/btrfs/super.c [new file with mode: 0644]
fs/btrfs/sysfs.c [new file with mode: 0644]
fs/btrfs/transaction.c [new file with mode: 0644]
fs/btrfs/transaction.h [new file with mode: 0644]
fs/btrfs/tree-defrag.c [new file with mode: 0644]
fs/btrfs/tree-log.c [new file with mode: 0644]
fs/btrfs/tree-log.h [new file with mode: 0644]
fs/btrfs/version.h [new file with mode: 0644]
fs/btrfs/version.sh [new file with mode: 0644]
fs/btrfs/volumes.c [new file with mode: 0644]
fs/btrfs/volumes.h [new file with mode: 0644]
fs/btrfs/xattr.c [new file with mode: 0644]
fs/btrfs/xattr.h [new file with mode: 0644]
fs/btrfs/zlib.c [new file with mode: 0644]
fs/partitions/check.c
fs/proc/internal.h
fs/proc/meminfo.c
fs/proc/nommu.c
fs/proc/task_nommu.c
fs/ramfs/file-nommu.c
fs/squashfs/Makefile [new file with mode: 0644]
fs/squashfs/block.c [new file with mode: 0644]
fs/squashfs/cache.c [new file with mode: 0644]
fs/squashfs/dir.c [new file with mode: 0644]
fs/squashfs/export.c [new file with mode: 0644]
fs/squashfs/file.c [new file with mode: 0644]
fs/squashfs/fragment.c [new file with mode: 0644]
fs/squashfs/id.c [new file with mode: 0644]
fs/squashfs/inode.c [new file with mode: 0644]
fs/squashfs/namei.c [new file with mode: 0644]
fs/squashfs/squashfs.h [new file with mode: 0644]
fs/squashfs/squashfs_fs.h [new file with mode: 0644]
fs/squashfs/squashfs_fs_i.h [new file with mode: 0644]
fs/squashfs/squashfs_fs_sb.h [new file with mode: 0644]
fs/squashfs/super.c [new file with mode: 0644]
fs/squashfs/symlink.c [new file with mode: 0644]
include/asm-frv/mmu.h
include/asm-m32r/mmu.h
include/linux/backlight.h
include/linux/leds-pca9532.h
include/linux/leds.h
include/linux/mfd/wm8350/pmic.h
include/linux/mm.h
include/linux/mm_types.h
include/linux/oprofile.h
include/linux/spi/tdo24m.h [new file with mode: 0644]
init/do_mounts_rd.c
init/initramfs.c
ipc/shm.c
kernel/cred.c
kernel/fork.c
kernel/sysctl.c
kernel/trace/ring_buffer.c
lib/Kconfig.debug
mm/mmap.c
mm/nommu.c

diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt
new file mode 100644 (file)
index 0000000..64087c3
--- /dev/null
@@ -0,0 +1,91 @@
+
+       BTRFS
+       =====
+
+Btrfs is a new copy on write filesystem for Linux aimed at
+implementing advanced features while focusing on fault tolerance,
+repair and easy administration. Initially developed by Oracle, Btrfs
+is licensed under the GPL and open for contribution from anyone.
+
+Linux has a wealth of filesystems to choose from, but we are facing a
+number of challenges with scaling to the large storage subsystems that
+are becoming common in today's data centers. Filesystems need to scale
+in their ability to address and manage large storage, and also in
+their ability to detect, repair and tolerate errors in the data stored
+on disk.  Btrfs is under heavy development, and is not suitable for
+any uses other than benchmarking and review. The Btrfs disk format is
+not yet finalized.
+
+The main Btrfs features include:
+
+    * Extent based file storage (2^64 max file size)
+    * Space efficient packing of small files
+    * Space efficient indexed directories
+    * Dynamic inode allocation
+    * Writable snapshots
+    * Subvolumes (separate internal filesystem roots)
+    * Object level mirroring and striping
+    * Checksums on data and metadata (multiple algorithms available)
+    * Compression
+    * Integrated multiple device support, with several raid algorithms
+    * Online filesystem check (not yet implemented)
+    * Very fast offline filesystem check
+    * Efficient incremental backup and FS mirroring (not yet implemented)
+    * Online filesystem defragmentation
+
+
+
+       MAILING LIST
+       ============
+
+There is a Btrfs mailing list hosted on vger.kernel.org. You can
+find details on how to subscribe here:
+
+http://vger.kernel.org/vger-lists.html#linux-btrfs
+
+Mailing list archives are available from gmane:
+
+http://dir.gmane.org/gmane.comp.file-systems.btrfs
+
+
+
+       IRC
+       ===
+
+Discussion of Btrfs also occurs on the #btrfs channel of the Freenode
+IRC network.
+
+
+
+       UTILITIES
+       =========
+
+Userspace tools for creating and manipulating Btrfs file systems are
+available from the git repository at the following location:
+
+ http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git
+ git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git
+
+These include the following tools:
+
+mkfs.btrfs: create a filesystem
+
+btrfsctl: control program to create snapshots and subvolumes:
+
+       mount /dev/sda2 /mnt
+       btrfsctl -s new_subvol_name /mnt
+       btrfsctl -s snapshot_of_default /mnt/default
+       btrfsctl -s snapshot_of_new_subvol /mnt/new_subvol_name
+       btrfsctl -s snapshot_of_a_snapshot /mnt/snapshot_of_new_subvol
+       ls /mnt
+       default snapshot_of_a_snapshot snapshot_of_new_subvol
+       new_subvol_name snapshot_of_default
+
+       Snapshots and subvolumes cannot be deleted right now, but you can
+       rm -rf all the files and directories inside them.
+
+btrfsck: do a limited check of the FS extent trees.
+
+btrfs-debug-tree: print all of the FS metadata in text form.  Example:
+
+       btrfs-debug-tree /dev/sda2 >& big_output_file
diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt
new file mode 100644 (file)
index 0000000..3e79e4a
--- /dev/null
@@ -0,0 +1,225 @@
+SQUASHFS 4.0 FILESYSTEM
+=======================
+
+Squashfs is a compressed read-only filesystem for Linux.
+It uses zlib compression to compress files, inodes and directories.
+Inodes in the system are very small and all blocks are packed to minimise
+data overhead. Block sizes greater than 4K are supported up to a maximum
+of 1 Mbyte (default block size 128K).
+
+Squashfs is intended for general read-only filesystem use, for archival
+use (i.e. in cases where a .tar.gz file may be used), and in constrained
+block device/memory systems (e.g. embedded systems) where low overhead is
+needed.
+
+Mailing list: squashfs-devel@lists.sourceforge.net
+Web site: www.squashfs.org
+
+1. FILESYSTEM FEATURES
+----------------------
+
+Squashfs filesystem features versus Cramfs:
+
+                               Squashfs                Cramfs
+
+Max filesystem size:           2^64                    16 MiB
+Max file size:                 ~ 2 TiB                 16 MiB
+Max files:                     unlimited               unlimited
+Max directories:               unlimited               unlimited
+Max entries per directory:     unlimited               unlimited
+Max block size:                1 MiB                   4 KiB
+Metadata compression:          yes                     no
+Directory indexes:             yes                     no
+Sparse file support:           yes                     no
+Tail-end packing (fragments):  yes                     no
+Exportable (NFS etc.):         yes                     no
+Hard link support:             yes                     no
+"." and ".." in readdir:       yes                     no
+Real inode numbers:            yes                     no
+32-bit uids/gids:              yes                     no
+File creation time:            yes                     no
+Xattr and ACL support:         no                      no
+
+Squashfs compresses data, inodes and directories.  In addition, inode and
+directory data are highly compacted, and packed on byte boundaries.  Each
+compressed inode is on average 8 bytes in length (the exact length varies with
+file type, i.e. regular file, directory, symbolic link, and block/char device
+inodes have different sizes).
+
+2. USING SQUASHFS
+-----------------
+
+As squashfs is a read-only filesystem, the mksquashfs program must be used to
+create populated squashfs filesystems.  This and other squashfs utilities
+can be obtained from http://www.squashfs.org.  Usage instructions can be
+obtained from this site also.
+
+
+3. SQUASHFS FILESYSTEM DESIGN
+-----------------------------
+
+A squashfs filesystem consists of seven parts, packed together on a byte
+alignment:
+
+        ---------------
+       |  superblock   |
+       |---------------|
+       |  datablocks   |
+       |  & fragments  |
+       |---------------|
+       |  inode table  |
+       |---------------|
+       |   directory   |
+       |     table     |
+       |---------------|
+       |   fragment    |
+       |    table      |
+       |---------------|
+       |    export     |
+       |    table      |
+       |---------------|
+       |    uid/gid    |
+       |  lookup table |
+        ---------------
+
+Compressed data blocks are written to the filesystem as files are read from
+the source directory, and checked for duplicates.  Once all file data has been
+written, the completed inode, directory, fragment, export and uid/gid lookup
+tables are written.
+
+3.1 Inodes
+----------
+
+Metadata (inodes and directories) are compressed in 8Kbyte blocks.  Each
+compressed block is prefixed by a two-byte length; the top bit is set if the
+block is uncompressed.  A block will be uncompressed if the -noI option is set,
+or if the compressed block was larger than the uncompressed block.
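
As an illustration of the header format described above, a minimal sketch of
decoding the two-byte length field (helper names invented here, not the
kernel's own macros):

	#include <stdint.h>

	#define METABLOCK_UNCOMPRESSED (1u << 15)   /* top bit of the header */

	/* the on-disk length of the block is the low 15 bits */
	static inline uint16_t metablock_length(uint16_t header)
	{
		return header & ~METABLOCK_UNCOMPRESSED;
	}

	/* per the text above, a set top bit means "stored uncompressed" */
	static inline int metablock_is_compressed(uint16_t header)
	{
		return !(header & METABLOCK_UNCOMPRESSED);
	}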
+
+Inodes are packed into the metadata blocks, and are not aligned to block
+boundaries; inodes therefore overlap compressed blocks.  Inodes are identified
+by a 48-bit number which encodes the location of the compressed metadata block
+containing the inode, and the byte offset into that block where the inode is
+placed (<block, offset>).
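
To make the <block, offset> encoding concrete, here is a small sketch
assuming the start-block address occupies the upper bits and the byte offset
the low 16 bits (an illustrative layout; names invented):

	#include <stdint.h>

	/* pack a <block, offset> pair into a single 48-bit reference */
	static inline uint64_t inode_ref(uint32_t block_start, uint16_t offset)
	{
		return ((uint64_t)block_start << 16) | offset;
	}

	/* recover the metadata block's location on disk */
	static inline uint32_t inode_ref_block(uint64_t ref)
	{
		return (uint32_t)(ref >> 16);
	}

	/* recover the byte offset into the decompressed block */
	static inline uint16_t inode_ref_offset(uint64_t ref)
	{
		return (uint16_t)(ref & 0xffff);
	}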
+
+To maximise compression there are different inodes for each file type
+(regular file, directory, device, etc.), the inode contents and length
+varying with the type.
+
+To further maximise compression, two types of regular file inode and
+directory inode are defined: inodes optimised for frequently occurring
+regular files and directories, and extended types where extra
+information has to be stored.
+
+3.2 Directories
+---------------
+
+Like inodes, directories are packed into compressed metadata blocks, stored
+in a directory table.  Directories are accessed using the start address of
+the metablock containing the directory and the offset into the
+decompressed block (<block, offset>).
+
+Directories are organised in a slightly complex way, and are not simply
+a list of file names.  The organisation takes advantage of the
+fact that (in most cases) the inodes of the files will be in the same
+compressed metadata block, and therefore, can share the start block.
+Directories are therefore organised as a two-level list: a directory
+header containing the shared start block value, followed by a sequence of
+directory entries, each of which shares that start block.  A new directory
+header is written once/if the inode start block changes.  The directory
+header/directory entry list is repeated as many times as necessary.
+
+Directories are sorted, and can contain a directory index to speed up
+file lookup.  Directory indexes store one entry per metablock, each entry
+storing the index/filename mapping to the first directory header
+in each metadata block.  Directories are sorted in alphabetical order,
+and at lookup the index is scanned linearly looking for the first filename
+alphabetically larger than the filename being looked up.  At this point the
+location of the metadata block the filename is in has been found.
+The general idea of the index is to ensure that only one metadata block needs
+to be decompressed to do a lookup, irrespective of the length of the directory.
+This scheme has the advantage that it doesn't require extra memory overhead
+and doesn't require much extra storage on disk.
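
The scan described above might be sketched as follows, with invented types
standing in for the on-disk structures:

	#include <string.h>

	/* one index entry per metadata block: the first indexed filename
	 * in that block, and the block's start address */
	struct dir_index_entry {
		const char *name;
		unsigned long start_block;
	};

	/* return the single metadata block that must be decompressed to
	 * look up 'target'; entries are in alphabetical order */
	static unsigned long index_lookup(const struct dir_index_entry *idx,
					  int count, const char *target,
					  unsigned long first_block)
	{
		unsigned long block = first_block;
		int i;

		for (i = 0; i < count; i++) {
			/* stop at the first name larger than the target */
			if (strcmp(idx[i].name, target) > 0)
				break;
			block = idx[i].start_block;
		}
		return block;
	}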
+
+3.3 File data
+-------------
+
+Regular files consist of a sequence of contiguous compressed blocks, and/or a
+compressed fragment block (tail-end packed block).   The compressed size
+of each datablock is stored in a block list contained within the
+file inode.
+
+To speed up access to datablocks when reading 'large' files (256 Mbytes or
+larger), the code implements an index cache that caches the mapping from
+block index to datablock location on disk.
+
+The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
+retaining a simple and space-efficient block list on disk.  The cache
+is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
+Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
+The index cache is designed to be memory efficient, and by default uses
+16 KiB.
+
+3.4 Fragment lookup table
+-------------------------
+
+Regular files can contain a fragment index, which a fragment lookup table
+maps to the fragment's location on disk and its compressed size.  This
+fragment lookup table is itself stored compressed into metadata blocks.
+A second index table is used to locate these.  For speed of access (and
+because it is small), this second index table is read at mount time and
+cached in memory.
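
A sketch of this two-level lookup, assuming (purely for illustration) a
16-byte fragment entry; the uid/gid and export tables described below follow
the same pattern:

	#include <stdint.h>

	#define METADATA_BLOCK_SIZE 8192	/* metadata is packed in 8K blocks */

	struct fragment_entry {			/* assumed on-disk layout */
		uint64_t start_block;		/* fragment location on disk */
		uint32_t size;			/* compressed size */
		uint32_t unused;
	};

	#define FRAGS_PER_BLOCK \
		(METADATA_BLOCK_SIZE / sizeof(struct fragment_entry))

	/* second_level[] was read at mount time and is kept in memory; it
	 * locates the compressed metadata block holding the wanted entry,
	 * which is then decompressed to yield the fragment_entry itself */
	static uint64_t metablock_for_fragment(const uint64_t *second_level,
					       unsigned int fragment)
	{
		return second_level[fragment / FRAGS_PER_BLOCK];
	}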
+
+3.5 Uid/gid lookup table
+------------------------
+
+For space efficiency regular files store uid and gid indexes, which are
+converted to 32-bit uids/gids using an id lookup table.  This table is
+stored compressed into metadata blocks.  A second index table is used to
+locate these.  For speed of access (and because it is small), this second
+index table is read at mount time and cached in memory.
+
+3.6 Export table
+----------------
+
+To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
+can optionally (disabled with the -no-exports Mksquashfs option) contain
+an inode number to inode disk location lookup table.  This is required to
+enable Squashfs to map inode numbers passed in filehandles to the inode
+location on disk, which is necessary when the export code reinstantiates
+expired/flushed inodes.
+
+This table is stored compressed into metadata blocks.  A second index table is
+used to locate these.  For speed of access (and because it is small), this
+second index table is read at mount time and cached in memory.
+
+
+4. TODOS AND OUTSTANDING ISSUES
+-------------------------------
+
+4.1 Todo list
+-------------
+
+Implement Xattr and ACL support.  The Squashfs 4.0 filesystem layout has hooks
+for these but the code has not been written.  Once the code has been written
+the existing layout should not require modification.
+
+4.2 Squashfs internal cache
+---------------------------
+
+Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
+recently accessed data, Squashfs uses two small metadata and fragment caches.
+
+The cache is not used for file datablocks; these are decompressed and cached in
+the page-cache in the normal way.  The cache is used to temporarily cache
+fragment and metadata blocks which have been read as a result of a metadata
+(i.e. inode or directory) or fragment access.  Because metadata and fragments
+are packed together into blocks (to gain greater compression), the read of a
+particular piece of metadata or fragment will retrieve other metadata/fragments
+which have been packed with it; because of locality of reference, these may be
+read in the near future.  Temporarily caching them ensures they are available
+for near-future access without requiring an additional read and decompress.
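
To illustrate the shape of such a cache, a toy sketch with invented names
(the real cache must additionally handle locking and waiting for in-flight
reads):

	#define CACHE_SLOTS 8

	struct cache_slot {
		long long block;    /* disk address of cached block, -1 if free */
		int refcount;       /* readers currently using this slot */
		void *data;         /* decompressed contents */
	};

	/* hypothetical helper: read 'block' from disk and decompress it */
	extern void read_and_decompress(struct cache_slot *slot, long long block);

	static void *cache_get(struct cache_slot *slots, long long block)
	{
		struct cache_slot *victim = NULL;
		int i;

		for (i = 0; i < CACHE_SLOTS; i++) {
			if (slots[i].block == block) {	/* already cached */
				slots[i].refcount++;
				return slots[i].data;
			}
			if (slots[i].refcount == 0)	/* candidate for reuse */
				victim = &slots[i];
		}
		if (!victim)
			return NULL;	/* all slots busy; real code would wait */

		read_and_decompress(victim, block);
		victim->block = block;
		victim->refcount = 1;
		return victim->data;
	}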
+
+In the future this internal cache may be replaced with an implementation which
+uses the kernel page cache.  Because the page cache operates on page sized
+units this may introduce additional complexity in terms of locking and
+associated race conditions.
index fcc48bf722a8f635768912ee421156b37aab44fe..8511d3532c27743ef1f4f46bc9d0fe94e59bdeec 100644 (file)
@@ -834,8 +834,8 @@ and is between 256 and 4096 characters. It is defined in the file
 
        hlt             [BUGS=ARM,SH]
 
-       hvc_iucv=       [S390] Number of z/VM IUCV Hypervisor console (HVC)
-                              back-ends. Valid parameters: 0..8
+       hvc_iucv=       [S390] Number of z/VM IUCV hypervisor console (HVC)
+                              terminal devices. Valid values: 0..8
 
        i8042.debug     [HW] Toggle i8042 debug mode
        i8042.direct    [HW] Put keyboard port into non-translated mode
index 7714f57caad5b4312e7289761c0561c68b822049..b565e8279d133f969fbd6cc2642592c3d44058ed 100644 (file)
@@ -109,12 +109,18 @@ and it's also much more restricted in the latter case:
 FURTHER NOTES ON NO-MMU MMAP
 ============================
 
- (*) A request for a private mapping of less than a page in size may not return
-     a page-aligned buffer. This is because the kernel calls kmalloc() to
-     allocate the buffer, not get_free_page().
+ (*) A request for a private mapping of a file may return a buffer that is not
+     page-aligned.  This is because XIP may take place, and the data may not be
+     page aligned in the backing store.
 
- (*) A list of all the mappings on the system is visible through /proc/maps in
-     no-MMU mode.
+ (*) A request for an anonymous mapping will always be page aligned.  If
+     possible, the size of the request should be a power of two; otherwise
+     some of the space may be wasted, as the kernel must allocate a power-of-2
+     granule but will only discard the excess if appropriately configured,
+     since this has an effect on fragmentation.
+
+ (*) A list of all the private copy and anonymous mappings on the system is
+     visible through /proc/maps in no-MMU mode.
 
  (*) A list of all the mappings in use by a process is visible through
      /proc/<pid>/maps in no-MMU mode.
@@ -242,3 +248,18 @@ PROVIDING SHAREABLE BLOCK DEVICE SUPPORT
 Provision of shared mappings on block device files is exactly the same as for
 character devices. If there isn't a real device underneath, then the driver
 should allocate sufficient contiguous memory to honour any supported mapping.
+
+
+=================================
+ADJUSTING PAGE TRIMMING BEHAVIOUR
+=================================
+
+NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
+when performing an allocation.  This can have adverse effects on memory
+fragmentation, and as such, is left configurable.  The default behaviour is to
+aggressively trim allocations and discard any excess pages back into the page
+allocator.  In order to retain finer-grained control over fragmentation, this
+behaviour can either be disabled completely, or bumped up to a higher page
+watermark where trimming begins.
+
+Page trimming behaviour is configurable via the sysctl `vm.nr_trim_pages'.
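
For example, assuming the standard sysctl tool:

	sysctl vm.nr_trim_pages		# show the current watermark
	sysctl -w vm.nr_trim_pages=0	# never trim excess pages
	sysctl -w vm.nr_trim_pages=64	# trim only once the excess reaches 64 pages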
index cd05994a49e69795db29f75675180371e1fe3ae4..a3415070bcac1fe6e16c8eafebccea30e3c55cb7 100644 (file)
@@ -38,6 +38,7 @@ Currently, these files are in /proc/sys/vm:
 - numa_zonelist_order
 - nr_hugepages
 - nr_overcommit_hugepages
+- nr_trim_pages                (only if CONFIG_MMU=n)
 
 ==============================================================
 
@@ -348,3 +349,20 @@ Change the maximum size of the hugepage pool. The maximum is
 nr_hugepages + nr_overcommit_hugepages.
 
 See Documentation/vm/hugetlbpage.txt
+
+==============================================================
+
+nr_trim_pages
+
+This is available only on NOMMU kernels.
+
+This value adjusts the excess page trimming behaviour of power-of-2 aligned
+NOMMU mmap allocations.
+
+A value of 0 disables trimming of allocations entirely, while a value of 1
+trims excess pages aggressively. Any value >= 1 acts as the watermark where
+trimming of allocations is initiated.
+
+The default value is 1.
+
+See Documentation/nommu-mmap.txt for more information.
index 57e0309243cc5c8c3ef1accd4c52987ef129ccc1..6f65a269cb17f5d767ee1e62d67b4eb54b1dafad 100644 (file)
@@ -4081,6 +4081,13 @@ L:       cbe-oss-dev@ozlabs.org
 W:     http://www.ibm.com/developerworks/power/cell/
 S:     Supported
 
+SQUASHFS FILE SYSTEM
+P:     Phillip Lougher
+M:     phillip@lougher.demon.co.uk
+L:     squashfs-devel@lists.sourceforge.net (subscribers-only)
+W:     http://squashfs.org.uk
+S:     Maintained
+
 SRM (Alpha) environment access
 P:     Jan-Benedict Glaw
 M:     jbglaw@lug-owl.de
index 53099d4ee4211d072741841a6513aebf74161a54..b561584d04a18ef6c631d1e05d69f6ff4b22d194 100644 (file)
@@ -24,7 +24,6 @@ typedef struct {
  *  modified for 2.6 by Hyok S. Choi <hyok.choi@samsung.com>
  */
 typedef struct {
-       struct vm_list_struct   *vmlist;
        unsigned long           end_brk;
 } mm_context_t;
 
index 67960017dc8f1f79c40bd083205f3409cdf0fbd1..310e479309efbfa038b8bbc411ef4a991f942310 100644 (file)
@@ -71,7 +71,7 @@ static DEFINE_SPINLOCK(consistent_lock);
  * the amount of RAM found at boot time.)  I would imagine that get_vm_area()
  * would have to initialise this each time prior to calling vm_region_alloc().
  */
-struct vm_region {
+struct arm_vm_region {
        struct list_head        vm_list;
        unsigned long           vm_start;
        unsigned long           vm_end;
@@ -79,20 +79,20 @@ struct vm_region {
        int                     vm_active;
 };
 
-static struct vm_region consistent_head = {
+static struct arm_vm_region consistent_head = {
        .vm_list        = LIST_HEAD_INIT(consistent_head.vm_list),
        .vm_start       = CONSISTENT_BASE,
        .vm_end         = CONSISTENT_END,
 };
 
-static struct vm_region *
-vm_region_alloc(struct vm_region *head, size_t size, gfp_t gfp)
+static struct arm_vm_region *
+arm_vm_region_alloc(struct arm_vm_region *head, size_t size, gfp_t gfp)
 {
        unsigned long addr = head->vm_start, end = head->vm_end - size;
        unsigned long flags;
-       struct vm_region *c, *new;
+       struct arm_vm_region *c, *new;
 
-       new = kmalloc(sizeof(struct vm_region), gfp);
+       new = kmalloc(sizeof(struct arm_vm_region), gfp);
        if (!new)
                goto out;
 
@@ -127,9 +127,9 @@ vm_region_alloc(struct vm_region *head, size_t size, gfp_t gfp)
        return NULL;
 }
 
-static struct vm_region *vm_region_find(struct vm_region *head, unsigned long addr)
+static struct arm_vm_region *arm_vm_region_find(struct arm_vm_region *head, unsigned long addr)
 {
-       struct vm_region *c;
+       struct arm_vm_region *c;
        
        list_for_each_entry(c, &head->vm_list, vm_list) {
                if (c->vm_active && c->vm_start == addr)
@@ -149,7 +149,7 @@ __dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp,
            pgprot_t prot)
 {
        struct page *page;
-       struct vm_region *c;
+       struct arm_vm_region *c;
        unsigned long order;
        u64 mask = ISA_DMA_THRESHOLD, limit;
 
@@ -214,7 +214,7 @@ __dma_alloc(struct device *dev, size_t size, dma_addr_t *handle, gfp_t gfp,
        /*
         * Allocate a virtual address in the consistent mapping region.
         */
-       c = vm_region_alloc(&consistent_head, size,
+       c = arm_vm_region_alloc(&consistent_head, size,
                            gfp & ~(__GFP_DMA | __GFP_HIGHMEM));
        if (c) {
                pte_t *pte;
@@ -311,13 +311,13 @@ static int dma_mmap(struct device *dev, struct vm_area_struct *vma,
                    void *cpu_addr, dma_addr_t dma_addr, size_t size)
 {
        unsigned long flags, user_size, kern_size;
-       struct vm_region *c;
+       struct arm_vm_region *c;
        int ret = -ENXIO;
 
        user_size = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 
        spin_lock_irqsave(&consistent_lock, flags);
-       c = vm_region_find(&consistent_head, (unsigned long)cpu_addr);
+       c = arm_vm_region_find(&consistent_head, (unsigned long)cpu_addr);
        spin_unlock_irqrestore(&consistent_lock, flags);
 
        if (c) {
@@ -359,7 +359,7 @@ EXPORT_SYMBOL(dma_mmap_writecombine);
  */
 void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr_t handle)
 {
-       struct vm_region *c;
+       struct arm_vm_region *c;
        unsigned long flags, addr;
        pte_t *ptep;
        int idx;
@@ -378,7 +378,7 @@ void dma_free_coherent(struct device *dev, size_t size, void *cpu_addr, dma_addr
        size = PAGE_ALIGN(size);
 
        spin_lock_irqsave(&consistent_lock, flags);
-       c = vm_region_find(&consistent_head, (unsigned long)cpu_addr);
+       c = arm_vm_region_find(&consistent_head, (unsigned long)cpu_addr);
        if (!c)
                goto no_area;
 
index 757e43906ed4b45b5b3bf410ecb24db3f946ef0b..dbfd686360e6bf65dd0aa0f5a7dc2f178b84f377 100644 (file)
@@ -10,7 +10,6 @@ struct sram_list_struct {
 };
 
 typedef struct {
-       struct vm_list_struct *vmlist;
        unsigned long end_brk;
        unsigned long stack_start;
 
index d2d3885366304ea38826c0c2897c0a4008184c20..594e325b40e4fa078ce1c7aee07041a20a0ac4d0 100644 (file)
@@ -160,15 +160,15 @@ put_reg(struct task_struct *task, int regno, unsigned long data)
 static inline int is_user_addr_valid(struct task_struct *child,
                                     unsigned long start, unsigned long len)
 {
-       struct vm_list_struct *vml;
+       struct vm_area_struct *vma;
        struct sram_list_struct *sraml;
 
        /* overflow */
        if (start + len < start)
                return -EIO;
 
-       for (vml = child->mm->context.vmlist; vml; vml = vml->next)
-               if (start >= vml->vma->vm_start && start + len < vml->vma->vm_end)
+       vma = find_vma(child->mm, start);
+       if (vma && start >= vma->vm_start && start + len <= vma->vm_end)
                        return 0;
 
        for (sraml = child->mm->context.sram_list; sraml; sraml = sraml->next)
index 17d8e4172896c6dabfe3e48eb122b6e517d0435f..5b0667da8d05ef3256c3b403111075028ef31bb7 100644 (file)
@@ -32,6 +32,7 @@
 #include <linux/module.h>
 #include <linux/kallsyms.h>
 #include <linux/fs.h>
+#include <linux/rbtree.h>
 #include <asm/traps.h>
 #include <asm/cacheflush.h>
 #include <asm/cplb.h>
@@ -83,6 +84,7 @@ static void decode_address(char *buf, unsigned long address)
        struct mm_struct *mm;
        unsigned long flags, offset;
        unsigned char in_atomic = (bfin_read_IPEND() & 0x10) || in_atomic();
+       struct rb_node *n;
 
 #ifdef CONFIG_KALLSYMS
        unsigned long symsize;
@@ -128,9 +130,10 @@ static void decode_address(char *buf, unsigned long address)
                if (!mm)
                        continue;
 
-               vml = mm->context.vmlist;
-               while (vml) {
-                       struct vm_area_struct *vma = vml->vma;
+               for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) {
+                       struct vm_area_struct *vma;
+
+                       vma = rb_entry(n, struct vm_area_struct, vm_rb);
 
                        if (address >= vma->vm_start && address < vma->vm_end) {
                                char _tmpbuf[256];
@@ -176,8 +179,6 @@ static void decode_address(char *buf, unsigned long address)
 
                                goto done;
                        }
-
-                       vml = vml->next;
                }
                if (!in_atomic)
                        mmput(mm);
index 709e9bdc6126de06417624a0edd8297dca38dc22..5e7d401d21e7d6455883f118213a040e97754fbf 100644 (file)
@@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno,
 }
 
 /*
- * check that an address falls within the bounds of the target process's memory mappings
+ * check that an address falls within the bounds of the target process's memory
+ * mappings
  */
 static inline int is_user_addr_valid(struct task_struct *child,
                                     unsigned long start, unsigned long len)
@@ -79,11 +80,11 @@ static inline int is_user_addr_valid(struct task_struct *child,
                return -EIO;
        return 0;
 #else
-       struct vm_list_struct *vml;
+       struct vm_area_struct *vma;
 
-       for (vml = child->mm->context.vmlist; vml; vml = vml->next)
-               if (start >= vml->vma->vm_start && start + len <= vml->vma->vm_end)
-                       return 0;
+       vma = find_vma(child->mm, start);
+       if (vma && start >= vma->vm_start && start + len <= vma->vm_end)
+               return 0;
 
        return -EIO;
 #endif
index 2ce06ea46104c36cb8f59e9c65abc6d4f437c9bd..31309969df705ce5c2a80b8638cb6ceac8e0ef9a 100644 (file)
@@ -4,7 +4,6 @@
 /* Copyright (C) 2002, David McCullough <davidm@snapgear.com> */
 
 typedef struct {
-       struct vm_list_struct   *vmlist;
        unsigned long           end_brk;
 } mm_context_t;
 
index 5fa6b68353bab9403e1f9bbd543880427516c5d7..e2da1e6f09fe4c16a1576a407ca168d296edd141 100644 (file)
@@ -4,7 +4,6 @@
 /* Copyright (C) 2002, David McCullough <davidm@snapgear.com> */
 
 typedef struct {
-       struct vm_list_struct   *vmlist;
        unsigned long           end_brk;
 } mm_context_t;
 
index f88b252e419cb504a3c0938e3d99d541e2113372..2121d99f83648191266df9cad5c293ac9017b0e1 100644 (file)
@@ -1,3 +1,4 @@
 include include/asm-generic/Kbuild.asm
 
 unifdef-y += pdc.h
+unifdef-y += swab.h
index 83095c5bb379b7e685454e09ea30b926a82c60db..da66029c4cb2572a44731485180909cc1dad8208 100644 (file)
@@ -1,69 +1,7 @@
 #ifndef _PARISC_BYTEORDER_H
 #define _PARISC_BYTEORDER_H
 
-#include <asm/types.h>
-#include <linux/compiler.h>
-
-#define __BIG_ENDIAN
-#define __SWAB_64_THRU_32__
-
-static inline __attribute_const__ __u16 __arch_swab16(__u16 x)
-{
-       __asm__("dep %0, 15, 8, %0\n\t"         /* deposit 00ab -> 0bab */
-               "shd %%r0, %0, 8, %0"           /* shift 000000ab -> 00ba */
-               : "=r" (x)
-               : "0" (x));
-       return x;
-}
-#define __arch_swab16 __arch_swab16
-
-static inline __attribute_const__ __u32 __arch_swab24(__u32 x)
-{
-       __asm__("shd %0, %0, 8, %0\n\t"         /* shift xabcxabc -> cxab */
-               "dep %0, 15, 8, %0\n\t"         /* deposit cxab -> cbab */
-               "shd %%r0, %0, 8, %0"           /* shift 0000cbab -> 0cba */
-               : "=r" (x)
-               : "0" (x));
-       return x;
-}
-
-static inline __attribute_const__ __u32 __arch_swab32(__u32 x)
-{
-       unsigned int temp;
-       __asm__("shd %0, %0, 16, %1\n\t"        /* shift abcdabcd -> cdab */
-               "dep %1, 15, 8, %1\n\t"         /* deposit cdab -> cbab */
-               "shd %0, %1, 8, %0"             /* shift abcdcbab -> dcba */
-               : "=r" (x), "=&r" (temp)
-               : "0" (x));
-       return x;
-}
-#define __arch_swab32 __arch_swab32
-
-#if BITS_PER_LONG > 32
-/*
-** From "PA-RISC 2.0 Architecture", HP Professional Books.
-** See Appendix I page 8 , "Endian Byte Swapping".
-**
-** Pretty cool algorithm: (* == zero'd bits)
-**      PERMH   01234567 -> 67452301 into %0
-**      HSHL    67452301 -> 7*5*3*1* into %1
-**      HSHR    67452301 -> *6*4*2*0 into %0
-**      OR      %0 | %1  -> 76543210 into %0 (all done!)
-*/
-static inline __attribute_const__ __u64 __arch_swab64(__u64 x)
-{
-       __u64 temp;
-       __asm__("permh,3210 %0, %0\n\t"
-               "hshl %0, 8, %1\n\t"
-               "hshr,u %0, 8, %0\n\t"
-               "or %1, %0, %0"
-               : "=r" (x), "=&r" (temp)
-               : "0" (x));
-       return x;
-}
-#define __arch_swab64 __arch_swab64
-#endif /* BITS_PER_LONG > 32 */
-
-#include <linux/byteorder.h>
+#include <asm/swab.h>
+#include <linux/byteorder/big_endian.h>
 
 #endif /* _PARISC_BYTEORDER_H */
diff --git a/arch/parisc/include/asm/swab.h b/arch/parisc/include/asm/swab.h
new file mode 100644 (file)
index 0000000..3ff16c5
--- /dev/null
@@ -0,0 +1,66 @@
+#ifndef _PARISC_SWAB_H
+#define _PARISC_SWAB_H
+
+#include <asm/types.h>
+#include <linux/compiler.h>
+
+#define __SWAB_64_THRU_32__
+
+static inline __attribute_const__ __u16 __arch_swab16(__u16 x)
+{
+       __asm__("dep %0, 15, 8, %0\n\t"         /* deposit 00ab -> 0bab */
+               "shd %%r0, %0, 8, %0"           /* shift 000000ab -> 00ba */
+               : "=r" (x)
+               : "0" (x));
+       return x;
+}
+#define __arch_swab16 __arch_swab16
+
+static inline __attribute_const__ __u32 __arch_swab24(__u32 x)
+{
+       __asm__("shd %0, %0, 8, %0\n\t"         /* shift xabcxabc -> cxab */
+               "dep %0, 15, 8, %0\n\t"         /* deposit cxab -> cbab */
+               "shd %%r0, %0, 8, %0"           /* shift 0000cbab -> 0cba */
+               : "=r" (x)
+               : "0" (x));
+       return x;
+}
+
+static inline __attribute_const__ __u32 __arch_swab32(__u32 x)
+{
+       unsigned int temp;
+       __asm__("shd %0, %0, 16, %1\n\t"        /* shift abcdabcd -> cdab */
+               "dep %1, 15, 8, %1\n\t"         /* deposit cdab -> cbab */
+               "shd %0, %1, 8, %0"             /* shift abcdcbab -> dcba */
+               : "=r" (x), "=&r" (temp)
+               : "0" (x));
+       return x;
+}
+#define __arch_swab32 __arch_swab32
+
+#if BITS_PER_LONG > 32
+/*
+** From "PA-RISC 2.0 Architecture", HP Professional Books.
+** See Appendix I page 8 , "Endian Byte Swapping".
+**
+** Pretty cool algorithm: (* == zero'd bits)
+**      PERMH   01234567 -> 67452301 into %0
+**      HSHL    67452301 -> 7*5*3*1* into %1
+**      HSHR    67452301 -> *6*4*2*0 into %0
+**      OR      %0 | %1  -> 76543210 into %0 (all done!)
+*/
+static inline __attribute_const__ __u64 __arch_swab64(__u64 x)
+{
+       __u64 temp;
+       __asm__("permh,3210 %0, %0\n\t"
+               "hshl %0, 8, %1\n\t"
+               "hshr,u %0, 8, %0\n\t"
+               "or %1, %0, %0"
+               : "=r" (x), "=&r" (temp)
+               : "0" (x));
+       return x;
+}
+#define __arch_swab64 __arch_swab64
+#endif /* BITS_PER_LONG > 32 */
+
+#endif /* _PARISC_SWAB_H */
index 8066eede3a0c9d8bb7b0e63d7c64955ea8b5efa8..b4b7338ad79e34e6a666b801315870ab5774cb03 100644 (file)
 #define CBE_PM_STOP_AT_MAX                 0x40000000
 #define CBE_PM_TRACE_MODE_GET(pm_control)  (((pm_control) >> 28) & 0x3)
 #define CBE_PM_TRACE_MODE_SET(mode)        (((mode)  & 0x3) << 28)
+#define CBE_PM_TRACE_BUF_OVFLW(bit)        (((bit) & 0x1) << 17)
 #define CBE_PM_COUNT_MODE_SET(count)       (((count) & 0x3) << 18)
 #define CBE_PM_FREEZE_ALL_CTRS             0x00100000
 #define CBE_PM_ENABLE_EXT_TRACE            0x00008000
+#define CBE_PM_SPU_ADDR_TRACE_SET(msk)     (((msk) & 0x3) << 9)
 
 /* Macros for the trace_address register. */
 #define CBE_PM_TRACE_BUF_FULL              0x00000800
index 95035c602ba625b505738d81e0b957a417ce048a..639dc96077ab815addab3a43bc894e5654fe6bd9 100644 (file)
@@ -32,6 +32,12 @@ struct op_system_config {
        unsigned long mmcr0;
        unsigned long mmcr1;
        unsigned long mmcra;
+#ifdef CONFIG_OPROFILE_CELL
+       /* Register for oprofile user tool to check cell kernel profiling
+        * support.
+        */
+       unsigned long cell_support;
+#endif
 #endif
        unsigned long enable_kernel;
        unsigned long enable_user;
index dfdbffa068182b8383590f67d789449ca4f73171..964b93974d893003a0054109acfddb10c7a4a49b 100644 (file)
 extern struct delayed_work spu_work;
 extern int spu_prof_running;
 
+#define TRACE_ARRAY_SIZE 1024
+
+extern spinlock_t oprof_spu_smpl_arry_lck;
+
 struct spu_overlay_info {      /* map of sections within an SPU overlay */
        unsigned int vma;       /* SPU virtual memory address from elf */
        unsigned int size;      /* size of section from elf */
@@ -89,10 +93,11 @@ void vma_map_free(struct vma_to_fileoffset_map *map);
  * Entry point for SPU profiling.
  * cycles_reset is the SPU_CYCLES count value specified by the user.
  */
-int start_spu_profiling(unsigned int cycles_reset);
-
-void stop_spu_profiling(void);
+int start_spu_profiling_cycles(unsigned int cycles_reset);
+void start_spu_profiling_events(void);
 
+void stop_spu_profiling_cycles(void);
+void stop_spu_profiling_events(void);
 
 /* add the necessary profiling hooks */
 int spu_sync_start(void);
index 83faa958b9d4d3d353596b55c6a51eee5452ea52..9305ddaac5125d941772dc45552264215d1adf8d 100644 (file)
 #include <asm/cell-pmu.h>
 #include "pr_util.h"
 
-#define TRACE_ARRAY_SIZE 1024
 #define SCALE_SHIFT 14
 
 static u32 *samples;
 
+/* spu_prof_running is a flag used to indicate if spu profiling is enabled
+ * or not.  It is set by the routines start_spu_profiling_cycles() and
+ * start_spu_profiling_events().  The flag is cleared by the routines
+ * stop_spu_profiling_cycles() and stop_spu_profiling_events().  These
+ * routines are called via global_start() and global_stop() which are called in
+ * op_powerpc_start() and op_powerpc_stop().  These routines are called once
+ * per system as a result of the user starting/stopping oprofile.  Hence, only
+ * one CPU per user at a time will be changing the value of spu_prof_running.
+ * In general, OProfile does not protect against multiple users trying to run
+ * OProfile at a time.
+ */
 int spu_prof_running;
 static unsigned int profiling_interval;
 
@@ -31,8 +41,8 @@ static unsigned int profiling_interval;
 
 #define SPU_PC_MASK         0xFFFF
 
-static DEFINE_SPINLOCK(sample_array_lock);
-unsigned long sample_array_lock_flags;
+DEFINE_SPINLOCK(oprof_spu_smpl_arry_lck);
+unsigned long oprof_spu_smpl_arry_lck_flags;
 
 void set_spu_profiling_frequency(unsigned int freq_khz, unsigned int cycles_reset)
 {
@@ -145,13 +155,13 @@ static enum hrtimer_restart profile_spus(struct hrtimer *timer)
                 * sample array must be loaded and then processed for a given
                 * cpu.  The sample array is not per cpu.
                 */
-               spin_lock_irqsave(&sample_array_lock,
-                                 sample_array_lock_flags);
+               spin_lock_irqsave(&oprof_spu_smpl_arry_lck,
+                                 oprof_spu_smpl_arry_lck_flags);
                num_samples = cell_spu_pc_collection(cpu);
 
                if (num_samples == 0) {
-                       spin_unlock_irqrestore(&sample_array_lock,
-                                              sample_array_lock_flags);
+                       spin_unlock_irqrestore(&oprof_spu_smpl_arry_lck,
+                                              oprof_spu_smpl_arry_lck_flags);
                        continue;
                }
 
@@ -162,8 +172,8 @@ static enum hrtimer_restart profile_spus(struct hrtimer *timer)
                                        num_samples);
                }
 
-               spin_unlock_irqrestore(&sample_array_lock,
-                                      sample_array_lock_flags);
+               spin_unlock_irqrestore(&oprof_spu_smpl_arry_lck,
+                                      oprof_spu_smpl_arry_lck_flags);
 
        }
        smp_wmb();      /* insure spu event buffer updates are written */
@@ -182,13 +192,13 @@ static enum hrtimer_restart profile_spus(struct hrtimer *timer)
 
 static struct hrtimer timer;
 /*
- * Entry point for SPU profiling.
+ * Entry point for SPU cycle profiling.
  * NOTE:  SPU profiling is done system-wide, not per-CPU.
  *
  * cycles_reset is the count value specified by the user when
  * setting up OProfile to count SPU_CYCLES.
  */
-int start_spu_profiling(unsigned int cycles_reset)
+int start_spu_profiling_cycles(unsigned int cycles_reset)
 {
        ktime_t kt;
 
@@ -212,10 +222,30 @@ int start_spu_profiling(unsigned int cycles_reset)
        return 0;
 }
 
-void stop_spu_profiling(void)
+/*
+ * Entry point for SPU event profiling.
+ * NOTE:  SPU profiling is done system-wide, not per-CPU.
+ *
+ * cycles_reset is the count value specified by the user when
+ * setting up OProfile to count SPU_CYCLES.
+ */
+void start_spu_profiling_events(void)
+{
+       spu_prof_running = 1;
+       schedule_delayed_work(&spu_work, DEFAULT_TIMER_EXPIRE);
+
+       return;
+}
+
+void stop_spu_profiling_cycles(void)
 {
        spu_prof_running = 0;
        hrtimer_cancel(&timer);
        kfree(samples);
-       pr_debug("SPU_PROF: stop_spu_profiling issued\n");
+       pr_debug("SPU_PROF: stop_spu_profiling_cycles issued\n");
+}
+
+void stop_spu_profiling_events(void)
+{
+       spu_prof_running = 0;
 }
index 17807acb05d998da3fdf6e9af0f459b00bb0d8f0..21f16edf6c8db48065a7f3019bc08ed53cb18f26 100644 (file)
@@ -132,6 +132,28 @@ static int op_powerpc_create_files(struct super_block *sb, struct dentry *root)
        oprofilefs_create_ulong(sb, root, "mmcr0", &sys.mmcr0);
        oprofilefs_create_ulong(sb, root, "mmcr1", &sys.mmcr1);
        oprofilefs_create_ulong(sb, root, "mmcra", &sys.mmcra);
+#ifdef CONFIG_OPROFILE_CELL
+       /* create a file the user tool can check to see what level of profiling
+        * support exits with this kernel. Initialize bit mask to indicate
+        * support exists with this kernel. Initialize bit mask to indicate
+        * bit 0      -  Supports SPU event profiling in addition to PPU
+        *               event and cycles; and SPU cycle profiling
+        * bits 1-31  -  Currently unused.
+        *
+        * If the file does not exist, then the kernel only supports SPU
+        * cycle profiling, PPU event and cycle profiling.
+        */
+       oprofilefs_create_ulong(sb, root, "cell_support", &sys.cell_support);
+       sys.cell_support = 0x1; /* Note, the user OProfile tool must check
+                                * that this bit is set before attempting to
+                                * user SPU event profiling.  Older kernels
+                                * use SPU event profiling.  Older kernels
+                                * tool is not allowed to do SPU event
+                                * profiling on older kernels.  Older kernels
+                                * will accept SPU events but collected data
+                                * is garbage.
+                                */
+#endif
 #endif
 
        for (i = 0; i < model->num_counters; ++i) {
index 25a4ec2514a3e0a9fb5bdc61167d2b000740ab70..ae06c6236d9c396402040be44d25257283141abd 100644 (file)
 #include "../platforms/cell/interrupt.h"
 #include "cell/pr_util.h"
 
-static void cell_global_stop_spu(void);
+#define PPU_PROFILING            0
+#define SPU_PROFILING_CYCLES     1
+#define SPU_PROFILING_EVENTS     2
 
-/*
- * spu_cycle_reset is the number of cycles between samples.
- * This variable is used for SPU profiling and should ONLY be set
- * at the beginning of cell_reg_setup; otherwise, it's read-only.
- */
-static unsigned int spu_cycle_reset;
+#define SPU_EVENT_NUM_START      4100
+#define SPU_EVENT_NUM_STOP       4399
+#define SPU_PROFILE_EVENT_ADDR          4363  /* spu, address trace, decimal */
+#define SPU_PROFILE_EVENT_ADDR_MASK_A   0x146 /* sub unit set to zero */
+#define SPU_PROFILE_EVENT_ADDR_MASK_B   0x186 /* sub unit set to zero */
 
 #define NUM_SPUS_PER_NODE    8
 #define SPU_CYCLES_EVENT_NUM 2 /*  event number for SPU_CYCLES */
@@ -66,6 +67,21 @@ static unsigned int spu_cycle_reset;
 
 #define MAX_SPU_COUNT 0xFFFFFF /* maximum 24 bit LFSR value */
 
+/* Minimum HW interval timer setting to send a value to the trace buffer is
+ * 10 cycles.  To configure the counter to send a value every N cycles, set
+ * the counter to 2^32 - 1 - N.
+ */
+#define NUM_INTERVAL_CYC  0xFFFFFFFF - 10
+
+/*
+ * spu_cycle_reset is the number of cycles between samples.
+ * This variable is used for SPU profiling and should ONLY be set
+ * at the beginning of cell_reg_setup; otherwise, it's read-only.
+ */
+static unsigned int spu_cycle_reset;
+static unsigned int profiling_mode;
+static int spu_evnt_phys_spu_indx;
+
 struct pmc_cntrl_data {
        unsigned long vcntr;
        unsigned long evnts;
@@ -105,6 +121,8 @@ struct pm_cntrl {
        u16 trace_mode;
        u16 freeze;
        u16 count_mode;
+       u16 spu_addr_trace;
+       u8  trace_buf_ovflw;
 };
 
 static struct {
@@ -122,7 +140,7 @@ static struct {
 #define GET_INPUT_CONTROL(x) ((x & 0x00000004) >> 2)
 
 static DEFINE_PER_CPU(unsigned long[NR_PHYS_CTRS], pmc_values);
-
+static unsigned long spu_pm_cnt[MAX_NUMNODES * NUM_SPUS_PER_NODE];
 static struct pmc_cntrl_data pmc_cntrl[NUM_THREADS][NR_PHYS_CTRS];
 
 /*
@@ -152,6 +170,7 @@ static u32 hdw_thread;
 
 static u32 virt_cntr_inter_mask;
 static struct timer_list timer_virt_cntr;
+static struct timer_list timer_spu_event_swap;
 
 /*
  * pm_signal needs to be global since it is initialized in
@@ -165,7 +184,7 @@ static int spu_rtas_token;   /* token for SPU cycle profiling */
 static u32 reset_value[NR_PHYS_CTRS];
 static int num_counters;
 static int oprofile_running;
-static DEFINE_SPINLOCK(virt_cntr_lock);
+static DEFINE_SPINLOCK(cntr_lock);
 
 static u32 ctr_enabled;
 
@@ -336,13 +355,13 @@ static void set_pm_event(u32 ctr, int event, u32 unit_mask)
        for (i = 0; i < NUM_DEBUG_BUS_WORDS; i++) {
                if (bus_word & (1 << i)) {
                        pm_regs.debug_bus_control |=
-                           (bus_type << (30 - (2 * i)));
+                               (bus_type << (30 - (2 * i)));
 
                        for (j = 0; j < NUM_INPUT_BUS_WORDS; j++) {
                                if (input_bus[j] == 0xff) {
                                        input_bus[j] = i;
                                        pm_regs.group_control |=
-                                           (i << (30 - (2 * j)));
+                                               (i << (30 - (2 * j)));
 
                                        break;
                                }
@@ -367,12 +386,16 @@ static void write_pm_cntrl(int cpu)
        if (pm_regs.pm_cntrl.stop_at_max == 1)
                val |= CBE_PM_STOP_AT_MAX;
 
-       if (pm_regs.pm_cntrl.trace_mode == 1)
+       if (pm_regs.pm_cntrl.trace_mode != 0)
                val |= CBE_PM_TRACE_MODE_SET(pm_regs.pm_cntrl.trace_mode);
 
+       if (pm_regs.pm_cntrl.trace_buf_ovflw == 1)
+               val |= CBE_PM_TRACE_BUF_OVFLW(pm_regs.pm_cntrl.trace_buf_ovflw);
        if (pm_regs.pm_cntrl.freeze == 1)
                val |= CBE_PM_FREEZE_ALL_CTRS;
 
+       val |= CBE_PM_SPU_ADDR_TRACE_SET(pm_regs.pm_cntrl.spu_addr_trace);
+
        /*
         * Routine set_count_mode must be called previously to set
         * the count mode based on the user selection of user and kernel.
@@ -441,7 +464,7 @@ static void cell_virtual_cntr(unsigned long data)
         * not both playing with the counters on the same node.
         */
 
-       spin_lock_irqsave(&virt_cntr_lock, flags);
+       spin_lock_irqsave(&cntr_lock, flags);
 
        prev_hdw_thread = hdw_thread;
 
@@ -480,7 +503,7 @@ static void cell_virtual_cntr(unsigned long data)
                cbe_disable_pm_interrupts(cpu);
                for (i = 0; i < num_counters; i++) {
                        per_cpu(pmc_values, cpu + prev_hdw_thread)[i]
-                           = cbe_read_ctr(cpu, i);
+                               = cbe_read_ctr(cpu, i);
 
                        if (per_cpu(pmc_values, cpu + next_hdw_thread)[i]
                            == 0xFFFFFFFF)
@@ -527,7 +550,7 @@ static void cell_virtual_cntr(unsigned long data)
                cbe_enable_pm(cpu);
        }
 
-       spin_unlock_irqrestore(&virt_cntr_lock, flags);
+       spin_unlock_irqrestore(&cntr_lock, flags);
 
        mod_timer(&timer_virt_cntr, jiffies + HZ / 10);
 }
@@ -541,38 +564,146 @@ static void start_virt_cntrs(void)
        add_timer(&timer_virt_cntr);
 }
 
-/* This function is called once for all cpus combined */
-static int cell_reg_setup(struct op_counter_config *ctr,
+static int cell_reg_setup_spu_cycles(struct op_counter_config *ctr,
                        struct op_system_config *sys, int num_ctrs)
 {
-       int i, j, cpu;
-       spu_cycle_reset = 0;
+       spu_cycle_reset = ctr[0].count;
 
-       if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
-               spu_cycle_reset = ctr[0].count;
+       /*
+        * Each node will need to make the rtas call to start
+        * and stop SPU profiling.  Get the token once and store it.
+        */
+       spu_rtas_token = rtas_token("ibm,cbe-spu-perftools");
+
+       if (unlikely(spu_rtas_token == RTAS_UNKNOWN_SERVICE)) {
+               printk(KERN_ERR
+                      "%s: rtas token ibm,cbe-spu-perftools unknown\n",
+                      __func__);
+               return -EIO;
+       }
+       return 0;
+}
+
+/* Unfortunately, the hardware will only support event profiling
+ * on one SPU per node at a time.  Therefore, we must time slice
+ * the profiling across all SPUs in the node.  Note, we do this
+ * in parallel for each node.  The following routine is called
+ * periodically, based on a kernel timer, to switch which SPU is
+ * being monitored in a round-robin fashion.
+ */
+static void spu_evnt_swap(unsigned long data)
+{
+       int node;
+       int cur_phys_spu, nxt_phys_spu, cur_spu_evnt_phys_spu_indx;
+       unsigned long flags;
+       int cpu;
+       int ret;
+       u32 interrupt_mask;
+
+
+       /* enable interrupts on cntr 0 */
+       interrupt_mask = CBE_PM_CTR_OVERFLOW_INTR(0);
+
+       hdw_thread = 0;
+
+       /* Make sure spu event interrupt handler and spu event swap
+        * don't access the counters simultaneously.
+        */
+       spin_lock_irqsave(&cntr_lock, flags);
+
+       cur_spu_evnt_phys_spu_indx = spu_evnt_phys_spu_indx;
+
+       if (++(spu_evnt_phys_spu_indx) == NUM_SPUS_PER_NODE)
+               spu_evnt_phys_spu_indx = 0;
+
+       pm_signal[0].sub_unit = spu_evnt_phys_spu_indx;
+       pm_signal[1].sub_unit = spu_evnt_phys_spu_indx;
+       pm_signal[2].sub_unit = spu_evnt_phys_spu_indx;
+
+       /* switch the SPU being profiled on each node */
+       for_each_online_cpu(cpu) {
+               if (cbe_get_hw_thread_id(cpu))
+                       continue;
+
+               node = cbe_cpu_to_node(cpu);
+               cur_phys_spu = (node * NUM_SPUS_PER_NODE)
+                       + cur_spu_evnt_phys_spu_indx;
+               nxt_phys_spu = (node * NUM_SPUS_PER_NODE)
+                       + spu_evnt_phys_spu_indx;
 
                /*
-                * Each node will need to make the rtas call to start
-                * and stop SPU profiling.  Get the token once and store it.
+                * stop counters, save counter values, restore counts
+                * for previous physical SPU
                 */
-               spu_rtas_token = rtas_token("ibm,cbe-spu-perftools");
+               cbe_disable_pm(cpu);
+               cbe_disable_pm_interrupts(cpu);
 
-               if (unlikely(spu_rtas_token == RTAS_UNKNOWN_SERVICE)) {
-                       printk(KERN_ERR
-                              "%s: rtas token ibm,cbe-spu-perftools unknown\n",
-                              __func__);
-                       return -EIO;
-               }
+               spu_pm_cnt[cur_phys_spu]
+                       = cbe_read_ctr(cpu, 0);
+
+               /* Restore the previous count for the next SPU to sample.
+                * NOTE: hardware issue, the counter will not start if the
+                * counter value is at max (0xFFFFFFFF).
+                */
+               if (spu_pm_cnt[nxt_phys_spu] >= 0xFFFFFFFF)
+                       cbe_write_ctr(cpu, 0, 0xFFFFFFF0);
+               else
+                       cbe_write_ctr(cpu, 0, spu_pm_cnt[nxt_phys_spu]);
+
+               pm_rtas_reset_signals(cbe_cpu_to_node(cpu));
+
+               /* set up the debug bus to measure the one event and
+                * the two events that route the next SPU's PC onto
+                * the debug bus
+                */
+               ret = pm_rtas_activate_signals(cbe_cpu_to_node(cpu), 3);
+               if (ret)
+                       printk(KERN_ERR "%s: pm_rtas_activate_signals failed, "
+                              "SPU event swap\n", __func__);
+
+               /* clear the trace buffer; don't want to take a PC
+                * for the previous SPU */
+               cbe_write_pm(cpu, trace_address, 0);
+
+               enable_ctr(cpu, 0, pm_regs.pm07_cntrl);
+
+               /* Enable interrupts on the CPU thread that is starting */
+               cbe_enable_pm_interrupts(cpu, hdw_thread,
+                                        interrupt_mask);
+               cbe_enable_pm(cpu);
        }
 
-       pm_rtas_token = rtas_token("ibm,cbe-perftools");
+       spin_unlock_irqrestore(&cntr_lock, flags);
 
+       /* swap every HZ/25 jiffies, i.e. roughly every 40 ms */
+       mod_timer(&timer_spu_event_swap, jiffies + HZ / 25);
+}
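
The swap above advances a single global index and converts it to per-node physical SPU numbers. A minimal userspace sketch of that arithmetic, using the eight-SPUs-per-node figure implied by the 1/8 duty cycle mentioned later in this patch (all names below are illustrative, not the driver's):

    #include <stdio.h>

    #define NUM_SPUS_PER_NODE 8     /* eight SPUs per Cell node */

    static int spu_indx;            /* SPU currently being profiled */

    /* advance every node to its next SPU, wrapping at 8 */
    static void swap_once(int num_nodes)
    {
            int node, cur, nxt, prev = spu_indx;

            if (++spu_indx == NUM_SPUS_PER_NODE)
                    spu_indx = 0;

            for (node = 0; node < num_nodes; node++) {
                    cur = node * NUM_SPUS_PER_NODE + prev;     /* save count */
                    nxt = node * NUM_SPUS_PER_NODE + spu_indx; /* restore count */
                    printf("node %d: stop SPU %d, start SPU %d\n",
                           node, cur, nxt);
            }
    }

    int main(void)
    {
            int i;

            for (i = 0; i < 3; i++) /* three timer ticks */
                    swap_once(2);   /* two nodes swapped in parallel */
            return 0;
    }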
+
+static void start_spu_event_swap(void)
+{
+       init_timer(&timer_spu_event_swap);
+       timer_spu_event_swap.function = spu_evnt_swap;
+       timer_spu_event_swap.data = 0UL;
+       timer_spu_event_swap.expires = jiffies + HZ / 25;
+       add_timer(&timer_spu_event_swap);
+}
+
+static int cell_reg_setup_spu_events(struct op_counter_config *ctr,
+                       struct op_system_config *sys, int num_ctrs)
+{
+       int i;
+
+       /* routine is called once for all nodes */
+
+       spu_evnt_phys_spu_indx = 0;
        /*
-        * For all events excetp PPU CYCLEs, each node will need to make
+        * For all events except PPU CYCLEs, each node will need to make
         * the rtas cbe-perftools call to setup and reset the debug bus.
         * Make the token lookup call once and store it in the global
         * variable pm_rtas_token.
         */
+       pm_rtas_token = rtas_token("ibm,cbe-perftools");
+
        if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) {
                printk(KERN_ERR
                       "%s: rtas token ibm,cbe-perftools unknown\n",
@@ -580,6 +711,58 @@ static int cell_reg_setup(struct op_counter_config *ctr,
                return -EIO;
        }
 
+       /* set up the pm_control register settings; they
+        * will be written per node by the cell_cpu_setup()
+        * function.
+        */
+       pm_regs.pm_cntrl.trace_buf_ovflw = 1;
+
+       /* Use the occurrence trace mode to have SPU PC saved
+        * to the trace buffer.  Occurrence data in trace buffer
+        * is not used.  Bit 2 must be set to store SPU addresses.
+        */
+       pm_regs.pm_cntrl.trace_mode = 2;
+
+       pm_regs.pm_cntrl.spu_addr_trace = 0x1;  /* using debug bus
+                                                  event 2 & 3 */
+
+       /* setup the debug bus event array with the SPU PC routing events.
+        * Note, pm_signal[0] will be filled in by the set_pm_event() call below.
+        */
+       pm_signal[1].signal_group = SPU_PROFILE_EVENT_ADDR / 100;
+       pm_signal[1].bus_word = GET_BUS_WORD(SPU_PROFILE_EVENT_ADDR_MASK_A);
+       pm_signal[1].bit = SPU_PROFILE_EVENT_ADDR % 100;
+       pm_signal[1].sub_unit = spu_evnt_phys_spu_indx;
+
+       pm_signal[2].signal_group = SPU_PROFILE_EVENT_ADDR / 100;
+       pm_signal[2].bus_word = GET_BUS_WORD(SPU_PROFILE_EVENT_ADDR_MASK_B);
+       pm_signal[2].bit = SPU_PROFILE_EVENT_ADDR % 100;
+       pm_signal[2].sub_unit = spu_evnt_phys_spu_indx;
+
+       /* Set the user-selected SPU event to profile on.  Note,
+        * only one SPU profiling event is supported at a time.
+        */
+       num_counters = 1;  /* Only support one SPU event at a time */
+       set_pm_event(0, ctr[0].event, ctr[0].unit_mask);
+
+       reset_value[0] = 0xFFFFFFFF - ctr[0].count;
+
+       /* global, used by cell_cpu_setup */
+       ctr_enabled |= 1;
+
+       /* Initialize the count for each SPU to the reset value */
+       for (i = 0; i < MAX_NUMNODES * NUM_SPUS_PER_NODE; i++)
+               spu_pm_cnt[i] = reset_value[0];
+
+       return 0;
+}
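
Two computations above are easy to miss: the debug-bus signal group and bit are carved out of one decimal-encoded event number, and the counter reset value is chosen so the 32-bit counter overflows after exactly ctr[0].count events. A small sketch of both (the event number is made up; SPU_PROFILE_EVENT_ADDR's real value is defined elsewhere in the driver):

    #include <stdio.h>

    int main(void)
    {
            unsigned int event = 2119;      /* hypothetical event number */
            unsigned int count = 100000;    /* events between samples */

            /* decimal split: hundreds pick the group, remainder the bit */
            printf("signal_group = %u, bit = %u\n", event / 100, event % 100);

            /* counter wraps past 0xFFFFFFFF after 'count' more events */
            printf("reset_value  = 0x%08x\n", 0xFFFFFFFFu - count);
            return 0;
    }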
+
+static int cell_reg_setup_ppu(struct op_counter_config *ctr,
+                       struct op_system_config *sys, int num_ctrs)
+{
+       /* routine is called once for all nodes */
+       int i, j, cpu;
+
        num_counters = num_ctrs;
 
        if (unlikely(num_ctrs > NR_PHYS_CTRS)) {
@@ -589,14 +772,6 @@ static int cell_reg_setup(struct op_counter_config *ctr,
                       __func__);
                return -EIO;
        }
-       pm_regs.group_control = 0;
-       pm_regs.debug_bus_control = 0;
-
-       /* setup the pm_control register */
-       memset(&pm_regs.pm_cntrl, 0, sizeof(struct pm_cntrl));
-       pm_regs.pm_cntrl.stop_at_max = 1;
-       pm_regs.pm_cntrl.trace_mode = 0;
-       pm_regs.pm_cntrl.freeze = 1;
 
        set_count_mode(sys->enable_kernel, sys->enable_user);
 
@@ -665,6 +840,63 @@ static int cell_reg_setup(struct op_counter_config *ctr,
 }
 
 
+/* This function is called once for all cpus combined */
+static int cell_reg_setup(struct op_counter_config *ctr,
+                       struct op_system_config *sys, int num_ctrs)
+{
+       int ret = 0;
+
+       spu_cycle_reset = 0;
+
+       /* initialize the spu_addr_trace value; it will be reset if
+        * doing SPU event profiling.
+        */
+       pm_regs.group_control = 0;
+       pm_regs.debug_bus_control = 0;
+       pm_regs.pm_cntrl.stop_at_max = 1;
+       pm_regs.pm_cntrl.trace_mode = 0;
+       pm_regs.pm_cntrl.freeze = 1;
+       pm_regs.pm_cntrl.trace_buf_ovflw = 0;
+       pm_regs.pm_cntrl.spu_addr_trace = 0;
+
+       /*
+        * For all events except PPU CYCLEs, each node will need to make
+        * the rtas cbe-perftools call to setup and reset the debug bus.
+        * Make the token lookup call once and store it in the global
+        * variable pm_rtas_token.
+        */
+       pm_rtas_token = rtas_token("ibm,cbe-perftools");
+
+       if (unlikely(pm_rtas_token == RTAS_UNKNOWN_SERVICE)) {
+               printk(KERN_ERR
+                      "%s: rtas token ibm,cbe-perftools unknown\n",
+                      __func__);
+               return -EIO;
+       }
+
+       if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
+               profiling_mode = SPU_PROFILING_CYCLES;
+               ret = cell_reg_setup_spu_cycles(ctr, sys, num_ctrs);
+       } else if ((ctr[0].event >= SPU_EVENT_NUM_START) &&
+                  (ctr[0].event <= SPU_EVENT_NUM_STOP)) {
+               profiling_mode = SPU_PROFILING_EVENTS;
+               spu_cycle_reset = ctr[0].count;
+
+               /* for SPU event profiling, need to setup the
+                * pm_signal array with the events to route the
+                * SPU PC before making the FW call.  Note, only
+                * one SPU event for profiling can be specified
+                * at a time.
+                */
+               ret = cell_reg_setup_spu_events(ctr, sys, num_ctrs);
+       } else {
+               profiling_mode = PPU_PROFILING;
+               ret = cell_reg_setup_ppu(ctr, sys, num_ctrs);
+       }
+
+       return ret;
+}
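
The dispatcher above derives the profiling mode from the first counter's event number alone. A standalone sketch of the selection rule (the numeric constants below are placeholders; the real SPU_CYCLES_EVENT_NUM and SPU_EVENT_NUM_START/STOP come from the driver's headers):

    #include <stdio.h>

    enum profiling_mode {
            PPU_PROFILING,
            SPU_PROFILING_CYCLES,
            SPU_PROFILING_EVENTS
    };

    #define SPU_CYCLES_EVENT_NUM    2       /* placeholder value */
    #define SPU_EVENT_NUM_START     4100    /* placeholder range */
    #define SPU_EVENT_NUM_STOP      4399

    static enum profiling_mode pick_mode(int event)
    {
            if (event == SPU_CYCLES_EVENT_NUM)
                    return SPU_PROFILING_CYCLES;
            if (event >= SPU_EVENT_NUM_START && event <= SPU_EVENT_NUM_STOP)
                    return SPU_PROFILING_EVENTS;
            return PPU_PROFILING;   /* anything else is a PPU event */
    }

    int main(void)
    {
            printf("%d %d %d\n", pick_mode(2), pick_mode(4200), pick_mode(77));
            return 0;
    }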
 
 /* This function is called once for each cpu */
 static int cell_cpu_setup(struct op_counter_config *cntr)
@@ -672,8 +904,13 @@ static int cell_cpu_setup(struct op_counter_config *cntr)
        u32 cpu = smp_processor_id();
        u32 num_enabled = 0;
        int i;
+       int ret;
 
-       if (spu_cycle_reset)
+       /* Cycle based SPU profiling does not use the performance
+        * counters.  The trace array is configured to collect
+        * the data.
+        */
+       if (profiling_mode == SPU_PROFILING_CYCLES)
                return 0;
 
        /* There is one performance monitor per processor chip (i.e. node),
@@ -686,7 +923,6 @@ static int cell_cpu_setup(struct op_counter_config *cntr)
        cbe_disable_pm(cpu);
        cbe_disable_pm_interrupts(cpu);
 
-       cbe_write_pm(cpu, pm_interval, 0);
        cbe_write_pm(cpu, pm_start_stop, 0);
        cbe_write_pm(cpu, group_control, pm_regs.group_control);
        cbe_write_pm(cpu, debug_bus_control, pm_regs.debug_bus_control);
@@ -703,7 +939,20 @@ static int cell_cpu_setup(struct op_counter_config *cntr)
         * The pm_rtas_activate_signals will return -EIO if the FW
         * call failed.
         */
-       return pm_rtas_activate_signals(cbe_cpu_to_node(cpu), num_enabled);
+       if (profiling_mode == SPU_PROFILING_EVENTS) {
+               /* For SPU event profiling we also need to set up
+                * the pm interval timer
+                */
+               ret = pm_rtas_activate_signals(cbe_cpu_to_node(cpu),
+                                              num_enabled + 2);
+               /* store PC from debug bus to Trace buffer as often
+                * as possible (every 10 cycles)
+                */
+               cbe_write_pm(cpu, pm_interval, NUM_INTERVAL_CYC);
+               return ret;
+       } else {
+               return pm_rtas_activate_signals(cbe_cpu_to_node(cpu),
+                                               num_enabled);
+       }
 }
 
 #define ENTRIES         303
@@ -885,7 +1134,122 @@ static struct notifier_block cpu_freq_notifier_block = {
 };
 #endif
 
-static int cell_global_start_spu(struct op_counter_config *ctr)
+/*
+ * Note the generic OProfile stop calls do not support returning
+ * an error on stop.  Hence, we will not return an error if the FW
+ * calls fail on stop.  Failure to reset the debug bus is not an issue.
+ * Failure to disable the SPU profiling is not an issue.  The FW calls
+ * to enable the performance counters and debug bus will work even if
+ * the hardware was not cleanly reset.
+ */
+static void cell_global_stop_spu_cycles(void)
+{
+       int subfunc, rtn_value;
+       unsigned int lfsr_value;
+       int cpu;
+
+       oprofile_running = 0;
+       smp_wmb();
+
+#ifdef CONFIG_CPU_FREQ
+       cpufreq_unregister_notifier(&cpu_freq_notifier_block,
+                                   CPUFREQ_TRANSITION_NOTIFIER);
+#endif
+
+       for_each_online_cpu(cpu) {
+               if (cbe_get_hw_thread_id(cpu))
+                       continue;
+
+               subfunc = 3;    /*
+                                * 2 - activate SPU tracing,
+                                * 3 - deactivate
+                                */
+               lfsr_value = 0x8f100000;
+
+               rtn_value = rtas_call(spu_rtas_token, 3, 1, NULL,
+                                     subfunc, cbe_cpu_to_node(cpu),
+                                     lfsr_value);
+
+               if (unlikely(rtn_value != 0)) {
+                       printk(KERN_ERR
+                              "%s: rtas call ibm,cbe-spu-perftools " \
+                              "failed, return = %d\n",
+                              __func__, rtn_value);
+               }
+
+               /* Deactivate the signals */
+               pm_rtas_reset_signals(cbe_cpu_to_node(cpu));
+       }
+
+       stop_spu_profiling_cycles();
+}
+
+static void cell_global_stop_spu_events(void)
+{
+       int cpu;
+
+       oprofile_running = 0;
+
+       stop_spu_profiling_events();
+       smp_wmb();
+
+       for_each_online_cpu(cpu) {
+               if (cbe_get_hw_thread_id(cpu))
+                       continue;
+
+               cbe_sync_irq(cbe_cpu_to_node(cpu));
+               /* Stop the counters */
+               cbe_disable_pm(cpu);
+               cbe_write_pm07_control(cpu, 0, 0);
+
+               /* Deactivate the signals */
+               pm_rtas_reset_signals(cbe_cpu_to_node(cpu));
+
+               /* Deactivate interrupts */
+               cbe_disable_pm_interrupts(cpu);
+       }
+       del_timer_sync(&timer_spu_event_swap);
+}
+
+static void cell_global_stop_ppu(void)
+{
+       int cpu;
+
+       /*
+        * This routine will be called once for the system.
+        * There is one performance monitor per node, so we
+        * only need to perform this function once per node.
+        */
+       del_timer_sync(&timer_virt_cntr);
+       oprofile_running = 0;
+       smp_wmb();
+
+       for_each_online_cpu(cpu) {
+               if (cbe_get_hw_thread_id(cpu))
+                       continue;
+
+               cbe_sync_irq(cbe_cpu_to_node(cpu));
+               /* Stop the counters */
+               cbe_disable_pm(cpu);
+
+               /* Deactivate the signals */
+               pm_rtas_reset_signals(cbe_cpu_to_node(cpu));
+
+               /* Deactivate interrupts */
+               cbe_disable_pm_interrupts(cpu);
+       }
+}
+
+static void cell_global_stop(void)
+{
+       if (profiling_mode == PPU_PROFILING)
+               cell_global_stop_ppu();
+       else if (profiling_mode == SPU_PROFILING_EVENTS)
+               cell_global_stop_spu_events();
+       else
+               cell_global_stop_spu_cycles();
+}
+
+static int cell_global_start_spu_cycles(struct op_counter_config *ctr)
 {
        int subfunc;
        unsigned int lfsr_value;
@@ -951,18 +1315,18 @@ static int cell_global_start_spu(struct op_counter_config *ctr)
 
                /* start profiling */
                ret = rtas_call(spu_rtas_token, 3, 1, NULL, subfunc,
-                 cbe_cpu_to_node(cpu), lfsr_value);
+                               cbe_cpu_to_node(cpu), lfsr_value);
 
                if (unlikely(ret != 0)) {
                        printk(KERN_ERR
-                              "%s: rtas call ibm,cbe-spu-perftools failed, return = %d\n",
-                              __func__, ret);
+                              "%s: rtas call ibm,cbe-spu-perftools failed, " \
+                              "return = %d\n", __func__, ret);
                        rtas_error = -EIO;
                        goto out;
                }
        }
 
-       rtas_error = start_spu_profiling(spu_cycle_reset);
+       rtas_error = start_spu_profiling_cycles(spu_cycle_reset);
        if (rtas_error)
                goto out_stop;
 
@@ -970,11 +1334,74 @@ static int cell_global_start_spu(struct op_counter_config *ctr)
        return 0;
 
 out_stop:
-       cell_global_stop_spu();         /* clean up the PMU/debug bus */
+       cell_global_stop_spu_cycles();  /* clean up the PMU/debug bus */
 out:
        return rtas_error;
 }
 
+static int cell_global_start_spu_events(struct op_counter_config *ctr)
+{
+       int cpu;
+       u32 interrupt_mask = 0;
+       int rtn = 0;
+
+       hdw_thread = 0;
+
+       /* SPU event profiling uses the performance counters to generate
+        * an interrupt.  The hardware is set up to store the SPU program
+        * counter into the trace array.  The occurrence mode is used to
+        * enable storing data to the trace buffer.  The bits are set
+        * to send/store the SPU address in the trace buffer.  The debug
+        * bus must be set up to route the SPU program counter onto the
+        * debug bus.  The occurrence data in the trace buffer is not used.
+        */
+
+       /* This routine gets called once for the system.
+        * There is one performance monitor per node, so we
+        * only need to perform this function once per node.
+        */
+
+       for_each_online_cpu(cpu) {
+               if (cbe_get_hw_thread_id(cpu))
+                       continue;
+
+               /*
+                * Setup SPU event-based profiling.
+                * Set perf_mon_control bit 0 to a zero before
+                * enabling spu collection hardware.
+                *
+                * Only support one SPU event on one SPU per node.
+                */
+               if (ctr_enabled & 1) {
+                       cbe_write_ctr(cpu, 0, reset_value[0]);
+                       enable_ctr(cpu, 0, pm_regs.pm07_cntrl);
+                       interrupt_mask |= CBE_PM_CTR_OVERFLOW_INTR(0);
+               } else {
+                       /* Disable counter */
+                       cbe_write_pm07_control(cpu, 0, 0);
+               }
+
+               cbe_get_and_clear_pm_interrupts(cpu);
+               cbe_enable_pm_interrupts(cpu, hdw_thread, interrupt_mask);
+               cbe_enable_pm(cpu);
+
+               /* clear the trace buffer */
+               cbe_write_pm(cpu, trace_address, 0);
+       }
+
+       /* Start the timer to time slice collecting the event profile
+        * on each of the SPUs.  Note, can collect profile on one SPU
+        * per node at a time.
+        */
+       start_spu_event_swap();
+       start_spu_profiling_events();
+       oprofile_running = 1;
+       smp_wmb();
+
+       return rtn;
+}
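
Here only counter 0 can contribute to interrupt_mask; cell_global_start_ppu() below builds the same mask across all enabled counters. A sketch of that construction with an illustrative bit layout (the real CBE_PM_CTR_OVERFLOW_INTR lives in the Cell PMU headers):

    #include <stdio.h>

    /* illustrative stand-in: one overflow-interrupt bit per counter */
    #define CTR_OVERFLOW_INTR(i)    (1u << (31 - (i)))

    int main(void)
    {
            unsigned int ctr_enabled = 0x5; /* counters 0 and 2 enabled */
            unsigned int mask = 0;
            int i;

            for (i = 0; i < 4; i++)
                    if (ctr_enabled & (1u << i))
                            mask |= CTR_OVERFLOW_INTR(i);

            printf("interrupt_mask = 0x%08x\n", mask);
            return 0;
    }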
+
 static int cell_global_start_ppu(struct op_counter_config *ctr)
 {
        u32 cpu, i;
@@ -994,8 +1421,7 @@ static int cell_global_start_ppu(struct op_counter_config *ctr)
                        if (ctr_enabled & (1 << i)) {
                                cbe_write_ctr(cpu, i, reset_value[i]);
                                enable_ctr(cpu, i, pm_regs.pm07_cntrl);
-                               interrupt_mask |=
-                                   CBE_PM_CTR_OVERFLOW_INTR(i);
+                               interrupt_mask |= CBE_PM_CTR_OVERFLOW_INTR(i);
                        } else {
                                /* Disable counter */
                                cbe_write_pm07_control(cpu, i, 0);
@@ -1024,99 +1450,162 @@ static int cell_global_start_ppu(struct op_counter_config *ctr)
 
 static int cell_global_start(struct op_counter_config *ctr)
 {
-       if (spu_cycle_reset)
-               return cell_global_start_spu(ctr);
+       if (profiling_mode == SPU_PROFILING_CYCLES)
+               return cell_global_start_spu_cycles(ctr);
+       else if (profiling_mode == SPU_PROFILING_EVENTS)
+               return cell_global_start_spu_events(ctr);
        else
                return cell_global_start_ppu(ctr);
 }
 
-/*
- * Note the generic OProfile stop calls do not support returning
- * an error on stop.  Hence, will not return an error if the FW
- * calls fail on stop. Failure to reset the debug bus is not an issue.
- * Failure to disable the SPU profiling is not an issue.  The FW calls
- * to enable the performance counters and debug bus will work even if
- * the hardware was not cleanly reset.
+
+/* The SPU interrupt handler
+ *
+ * SPU event profiling works as follows:
+ * pm_signal[0] holds the one SPU event to be measured.  It is routed on
+ * the debug bus using word 0 or 1.  The values of pm_signal[1] and
+ * pm_signal[2] contain the necessary events to route the SPU program
+ * counter for the selected SPU onto the debug bus using words 2 and 3.
+ * The pm_interval register is set up to write the SPU PC value into the
+ * trace buffer at the maximum rate possible.  The trace buffer is configured
+ * to store the PCs, wrapping when it is full.  The performance counter is
+ * initialized to the max hardware count minus the number of events, N,
+ * between samples.  Once the N events have occurred, a HW counter overflow
+ * occurs, causing the generation of a HW counter interrupt, which also stops
+ * the writing of the SPU PC values to the trace buffer.  Hence the last PC
+ * written to the trace buffer is the SPU PC that we want.  Unfortunately,
+ * we have to read from the beginning of the trace buffer to get to the
+ * last value written.  We just hope the PPU has nothing better to do than
+ * service this interrupt.  The PC for the specific SPU being profiled is
+ * extracted from the trace buffer, processed and stored.  The trace buffer
+ * is cleared, interrupts are cleared, and the counter is reset to max - N.
+ * A kernel timer is used to periodically call the routine spu_evnt_swap()
+ * to switch to the next physical SPU in the node to profile in round robin
+ * order.  This way data is collected for all SPUs on the node.  It does mean
+ * that we need to use a relatively small value of N to ensure enough samples
+ * are collected on each SPU, since each SPU is only being profiled 1/8 of
+ * the time.  It may also be necessary to use a longer sample collection
+ * period.
  */
-static void cell_global_stop_spu(void)
+static void cell_handle_interrupt_spu(struct pt_regs *regs,
+                                     struct op_counter_config *ctr)
 {
-       int subfunc, rtn_value;
-       unsigned int lfsr_value;
-       int cpu;
+       u32 cpu;
+       u64 trace_entry;
+       u32 interrupt_mask;
+       u64 trace_buffer[2];
+       u64 last_trace_buffer;
+       u32 sample;
+       u32 trace_addr;
+       unsigned long sample_array_lock_flags;
+       int spu_num;
+       unsigned long flags;
 
-       oprofile_running = 0;
+       /* Make sure spu event interrupt handler and spu event swap
+        * don't access the counters simultaneously.
+        */
+       cpu = smp_processor_id();
+       spin_lock_irqsave(&cntr_lock, flags);
 
-#ifdef CONFIG_CPU_FREQ
-       cpufreq_unregister_notifier(&cpu_freq_notifier_block,
-                                   CPUFREQ_TRANSITION_NOTIFIER);
-#endif
+       cbe_disable_pm(cpu);
 
-       for_each_online_cpu(cpu) {
-               if (cbe_get_hw_thread_id(cpu))
-                       continue;
+       interrupt_mask = cbe_get_and_clear_pm_interrupts(cpu);
 
-               subfunc = 3;    /*
-                                * 2 - activate SPU tracing,
-                                * 3 - deactivate
-                                */
-               lfsr_value = 0x8f100000;
+       sample = 0xABCDEF;
+       trace_entry = 0xfedcba;
+       last_trace_buffer = 0xdeadbeef;
 
-               rtn_value = rtas_call(spu_rtas_token, 3, 1, NULL,
-                                     subfunc, cbe_cpu_to_node(cpu),
-                                     lfsr_value);
+       if ((oprofile_running == 1) && (interrupt_mask != 0)) {
+               /* disable writes to trace buff */
+               cbe_write_pm(cpu, pm_interval, 0);
 
-               if (unlikely(rtn_value != 0)) {
-                       printk(KERN_ERR
-                              "%s: rtas call ibm,cbe-spu-perftools failed, return = %d\n",
-                              __func__, rtn_value);
+               /* Only one perf cntr, cntr 0, is being used.  The SPU
+                * PC values will be read from the trace buffer, so just
+                * reset the counter.
+                */
+               if ((interrupt_mask & CBE_PM_CTR_OVERFLOW_INTR(0))
+                   && ctr[0].enabled)
+                       cbe_write_ctr(cpu, 0, reset_value[0]);
+
+               trace_addr = cbe_read_pm(cpu, trace_address);
+
+               while (!(trace_addr & CBE_PM_TRACE_BUF_EMPTY)) {
+                       /* There is data in the trace buffer to process
+                        * Read the buffer until you get to the last
+                        * entry.  This is the value we want.
+                        */
+
+                       cbe_read_trace_buffer(cpu, trace_buffer);
+                       trace_addr = cbe_read_pm(cpu, trace_address);
                }
 
-               /* Deactivate the signals */
-               pm_rtas_reset_signals(cbe_cpu_to_node(cpu));
-       }
+               /* SPU Address 16 bit count format for 128 bit
+                * HW trace buffer is used for the SPU PC storage
+                *    HDR bits          0:15
+                *    SPU Addr 0 bits   16:31
+                *    SPU Addr 1 bits   32:47
+                *    unused bits       48:127
+                *
+                * HDR: bit4 = 1 SPU Address 0 valid
+                * HDR: bit5 = 1 SPU Address 1 valid
+                *  - unfortunately, the valid bits don't seem to work
+                *
+                * Note trace_buffer[0] holds bits 0:63 of the HW
+                * trace buffer, trace_buffer[1] holds bits 64:127
+                */
 
-       stop_spu_profiling();
-}
+               trace_entry = trace_buffer[0] & 0x00000000FFFF0000;
 
-static void cell_global_stop_ppu(void)
-{
-       int cpu;
+               /* only the top 16 bits of the 18-bit SPU PC address
+                * are stored in the trace buffer, hence shift right
+                * by 16 - 2 bits */
+               sample = trace_entry >> 14;
+               last_trace_buffer = trace_buffer[0];
 
-       /*
-        * This routine will be called once for the system.
-        * There is one performance monitor per node, so we
-        * only need to perform this function once per node.
-        */
-       del_timer_sync(&timer_virt_cntr);
-       oprofile_running = 0;
-       smp_wmb();
+               spu_num = spu_evnt_phys_spu_indx
+                       + (cbe_cpu_to_node(cpu) * NUM_SPUS_PER_NODE);
 
-       for_each_online_cpu(cpu) {
-               if (cbe_get_hw_thread_id(cpu))
-                       continue;
+               /* make sure only one process at a time is calling
+                * spu_sync_buffer()
+                */
+               spin_lock_irqsave(&oprof_spu_smpl_arry_lck,
+                                 sample_array_lock_flags);
+               spu_sync_buffer(spu_num, &sample, 1);
+               spin_unlock_irqrestore(&oprof_spu_smpl_arry_lck,
+                                      sample_array_lock_flags);
 
-               cbe_sync_irq(cbe_cpu_to_node(cpu));
-               /* Stop the counters */
-               cbe_disable_pm(cpu);
+               smp_wmb();    /* ensure spu event buffer updates are written;
+                              * don't want events intermingled... */
 
-               /* Deactivate the signals */
-               pm_rtas_reset_signals(cbe_cpu_to_node(cpu));
+               /* The counters were frozen by the interrupt.
+                * Reenable the interrupt and restart the counters.
+                */
+               cbe_write_pm(cpu, pm_interval, NUM_INTERVAL_CYC);
+               cbe_enable_pm_interrupts(cpu, hdw_thread,
+                                        virt_cntr_inter_mask);
 
-               /* Deactivate interrupts */
-               cbe_disable_pm_interrupts(cpu);
-       }
-}
+               /* clear the trace buffer, re-enable writes to trace buff */
+               cbe_write_pm(cpu, trace_address, 0);
+               cbe_write_pm(cpu, pm_interval, NUM_INTERVAL_CYC);
 
-static void cell_global_stop(void)
-{
-       if (spu_cycle_reset)
-               cell_global_stop_spu();
-       else
-               cell_global_stop_ppu();
+               /* The writes to the various performance counters only write
+                * to a latch.  The new values (interrupt setting bits, reset
+                * counter value etc.) are not copied to the actual registers
+                * until the performance monitor is enabled.  In order to get
+                * this to work as desired, the performance monitor needs to
+                * be disabled while writing to the latches.  This is a
+                * HW design issue.
+                */
+               write_pm_cntrl(cpu);
+               cbe_enable_pm(cpu);
+       }
+       spin_unlock_irqrestore(&cntr_lock, flags);
 }
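
The decode above keeps only the 16-bit SPU address field selected by the 0x00000000FFFF0000 mask and shifts right by 14: the stored value is the top 16 bits of an 18-bit, word-aligned SPU PC, so a right shift of 16 followed by a left shift of 2 recovers the address. A userspace sketch of the extraction:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* low 64 bits of a trace entry, as in trace_buffer[0];
             * the 16-bit SPU address field here holds 0x1234 */
            uint64_t entry = 0x12340000ull;

            uint64_t field = entry & 0x00000000FFFF0000ull;
            /* >> 16 to align the field, << 2 to restore the PC */
            uint32_t pc = (uint32_t)(field >> 14);

            printf("sample PC = 0x%05x\n", (unsigned)pc); /* 0x048d0 */
            return 0;
    }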
 
-static void cell_handle_interrupt(struct pt_regs *regs,
-                               struct op_counter_config *ctr)
+static void cell_handle_interrupt_ppu(struct pt_regs *regs,
+                                     struct op_counter_config *ctr)
 {
        u32 cpu;
        u64 pc;
@@ -1132,7 +1621,7 @@ static void cell_handle_interrupt(struct pt_regs *regs,
         * routine are not running at the same time. See the
         * cell_virtual_cntr() routine for additional comments.
         */
-       spin_lock_irqsave(&virt_cntr_lock, flags);
+       spin_lock_irqsave(&cntr_lock, flags);
 
        /*
         * Need to disable and reenable the performance counters
@@ -1185,7 +1674,16 @@ static void cell_handle_interrupt(struct pt_regs *regs,
                 */
                cbe_enable_pm(cpu);
        }
-       spin_unlock_irqrestore(&virt_cntr_lock, flags);
+       spin_unlock_irqrestore(&cntr_lock, flags);
+}
+
+static void cell_handle_interrupt(struct pt_regs *regs,
+                                 struct op_counter_config *ctr)
+{
+       if (profiling_mode == PPU_PROFILING)
+               cell_handle_interrupt_ppu(regs, ctr);
+       else
+               cell_handle_interrupt_spu(regs, ctr);
 }
 
 /*
@@ -1195,7 +1693,8 @@ static void cell_handle_interrupt(struct pt_regs *regs,
  */
 static int cell_sync_start(void)
 {
-       if (spu_cycle_reset)
+       if ((profiling_mode == SPU_PROFILING_CYCLES) ||
+           (profiling_mode == SPU_PROFILING_EVENTS))
                return spu_sync_start();
        else
                return DO_GENERIC_SYNC;
@@ -1203,7 +1702,8 @@ static int cell_sync_start(void)
 
 static int cell_sync_stop(void)
 {
-       if (spu_cycle_reset)
+       if ((profiling_mode == SPU_PROFILING_CYCLES) ||
+           (profiling_mode == SPU_PROFILING_EVENTS))
                return spu_sync_stop();
        else
                return 1;
index dfe3c7f3439a71f1cedb51ea0ca5ae71575ca329..fc71d8a6709b53f34306fc6a286ad7d69770b372 100644 (file)
@@ -9,7 +9,7 @@
 #define _ASM_S390_CHPID_H _ASM_S390_CHPID_H
 
 #include <linux/string.h>
-#include <asm/types.h>
+#include <linux/types.h>
 
 #define __MAX_CHPID 255
 
index d38d0cf62d4bdd1406f0b23f2fe5811b6586f352..807997f7414b15f8b5c186da14882a0378250b5b 100644 (file)
@@ -8,6 +8,7 @@
 #ifndef _ASM_CHSC_H
 #define _ASM_CHSC_H
 
+#include <linux/types.h>
 #include <asm/chpid.h>
 #include <asm/schid.h>
 
index 50196857d27a8e55dc4939333bb44fbc5d23fa0b..39ae03294794f23b852a233e724d8a1b98342c6e 100644 (file)
@@ -1,5 +1,8 @@
 #ifndef S390_CMB_H
 #define S390_CMB_H
+
+#include <linux/types.h>
+
 /**
  * struct cmbdata - channel measurement block data for user space
  * @size: size of the stored data
index 55b2b80cdf6e0926da9663a52292fafb0ca27762..e2db6f16d9c8bd6273ba0b26e56d9f2efe8c62e4 100644 (file)
@@ -14,6 +14,7 @@
 
 #ifndef DASD_H
 #define DASD_H
+#include <linux/types.h>
 #include <linux/ioctl.h>
 
 #define DASD_IOCTL_LETTER 'D'
@@ -78,6 +79,7 @@ typedef struct dasd_information2_t {
 #define DASD_FEATURE_USEDIAG        0x02
 #define DASD_FEATURE_INITIAL_ONLINE  0x04
 #define DASD_FEATURE_ERPLOG         0x08
+#define DASD_FEATURE_FAILFAST       0x10
 
 #define DASD_PARTN_BITS 2
 
index d74002f9579482f1bb13fc0a223b88802e6c1826..e1f54654e3ae5f9abdbc6b01668a49a7d4b6af26 100644 (file)
@@ -13,7 +13,7 @@
  *    Author(s): Carsten Otte <cotte@de.ibm.com>
  *               Christian Borntraeger <borntraeger@de.ibm.com>
  */
-#include <asm/types.h>
+#include <linux/types.h>
 
 /* for KVM_GET_IRQCHIP and KVM_SET_IRQCHIP */
 struct kvm_pic_state {
index 397d93fba3a77c762a8002805ce8ef5b175b5e8d..8cc113f9252352d3070ee4f03c0bb5a9aa833f7a 100644 (file)
@@ -68,11 +68,7 @@ typedef unsigned short __kernel_old_dev_t;
 #endif /* __s390x__ */
 
 typedef struct {
-#if defined(__KERNEL__) || defined(__USE_ALL)
         int     val[2];
-#else                        /* !defined(__KERNEL__) && !defined(__USE_ALL)*/
-        int     __val[2];
-#endif                       /* !defined(__KERNEL__) && !defined(__USE_ALL)*/
 } __kernel_fsid_t;
 
 
index 5396f9f122634bda4d0ce33bd78bd1747856e7d2..8920025c3c02ed3bb95919861bf4b16b823a731f 100644 (file)
@@ -272,12 +272,15 @@ typedef struct
 #define PSW_ASC_SECONDARY      0x0000800000000000UL
 #define PSW_ASC_HOME           0x0000C00000000000UL
 
-extern long psw_user32_bits;
-
 #endif /* __s390x__ */
 
+#ifdef __KERNEL__
 extern long psw_kernel_bits;
 extern long psw_user_bits;
+#ifdef CONFIG_64BIT
+extern long psw_user32_bits;
+#endif
+#endif
 
 /* This macro merges a NEW PSW mask specified by the user into
    the currently active PSW mask CURRENT, modifying only those
index 930d378ef75a80b79b54a0a2a25cf5e7af1cb1b5..06cbd1e8c9433e551323ef932e5d55a8965fc37b 100644 (file)
@@ -10,6 +10,7 @@
  */
 #ifndef __ASM_S390_QETH_IOCTL_H__
 #define __ASM_S390_QETH_IOCTL_H__
+#include <linux/types.h>
 #include <linux/ioctl.h>
 
 #define SIOC_QETH_ARP_SET_NO_ENTRIES    (SIOCDEVPRIVATE)
index 825503cf3dc2fa6f97a92b927debf3ed71add376..3e4d401b4e45978a9cad899495d11a05efa07437 100644 (file)
@@ -1,6 +1,8 @@
 #ifndef ASM_SCHID_H
 #define ASM_SCHID_H
 
+#include <linux/types.h>
+
 struct subchannel_id {
        __u32 cssid : 8;
        __u32 : 4;
index bd9321aa55a956d9b0fce55d6579b28c4a4c5eb7..eb18dc1f327bf906187358713a92e1c7066417e3 100644 (file)
@@ -9,7 +9,7 @@
  *    Author(s): Martin Schwidefsky (schwidefsky@de.ibm.com)
  */
 
-#include <asm/types.h>
+#include <linux/types.h>
 
 #ifndef __s390x__
 # define __SWAB_64_THRU_32__
index 41c547656130e75bb206312f3175d23bd2e868a7..3dc3fc228812a3ce22070d8b52705c211b23ea47 100644 (file)
@@ -9,11 +9,7 @@
 #ifndef _S390_TYPES_H
 #define _S390_TYPES_H
 
-#ifndef __s390x__
-# include <asm-generic/int-ll64.h>
-#else
-# include <asm-generic/int-l64.h>
-#endif
+#include <asm-generic/int-ll64.h>
 
 #ifndef __ASSEMBLY__
 
index 6b1896345edad2f0b2f49f26b17d4f82ad44d766..a65afc91e8aa86dab1f859797033da334ac226d0 100644 (file)
@@ -54,7 +54,5 @@ long sys_sigreturn(void);
 long sys_rt_sigreturn(void);
 long sys32_sigreturn(void);
 long sys32_rt_sigreturn(void);
-long old_select(struct sel_arg_struct __user *arg);
-long sys_ptrace(long request, long pid, long addr, long data);
 
 #endif /* _ENTRY_H */
index 9c0ccb532a4564fa5457a97e64bffbaf41ea4a79..2d337cbb9329c33e07f9dfccb2f51daa94269e30 100644 (file)
@@ -685,7 +685,8 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
        if (MACHINE_HAS_IEEE)
                lowcore->extended_save_area_addr = (u32) save_area;
 #else
-       BUG_ON(vdso_alloc_per_cpu(smp_processor_id(), lowcore));
+       if (vdso_alloc_per_cpu(smp_processor_id(), lowcore))
+               BUG();
 #endif
        set_prefix((u32)(unsigned long) lowcore);
        local_mcck_enable();
index 4fe952e557ac3bf683bb9535ff89b553035aad25..c34be4568b80fcc75250e3fe018994fa5683843e 100644 (file)
@@ -103,25 +103,6 @@ out:
        return error;
 }
 
-#ifndef CONFIG_64BIT
-struct sel_arg_struct {
-       unsigned long n;
-       fd_set __user *inp, *outp, *exp;
-       struct timeval __user *tvp;
-};
-
-asmlinkage long old_select(struct sel_arg_struct __user *arg)
-{
-       struct sel_arg_struct a;
-
-       if (copy_from_user(&a, arg, sizeof(a)))
-               return -EFAULT;
-       /* sys_select() does the appropriate kernel locking */
-       return sys_select(a.n, a.inp, a.outp, a.exp, a.tvp);
-
-}
-#endif /* CONFIG_64BIT */
-
 /*
  * sys_ipc() is the de-multiplexer for the SysV IPC calls..
  *
index 25a6a82f1c02309e8258553435112e44d9ec4b64..690e17819686534c9061b03e5408354bdb00f5d8 100644 (file)
@@ -322,7 +322,8 @@ static int __init vdso_init(void)
        vdso64_pagelist[vdso64_pages - 1] = virt_to_page(vdso_data);
        vdso64_pagelist[vdso64_pages] = NULL;
 #ifndef CONFIG_SMP
-       BUG_ON(vdso_alloc_per_cpu(0, S390_lowcore));
+       if (vdso_alloc_per_cpu(0, &S390_lowcore))
+               BUG();
 #endif
        vdso_init_cr5();
 #endif /* CONFIG_64BIT */
index c32f29c3d70c7a62de324fe94b35d376bafae6f2..ad8acfc949fbdb5d0a7f59375bf567770780686e 100644 (file)
@@ -9,10 +9,6 @@
  * it under the terms of the GNU General Public License (version 2 only)
  * as published by the Free Software Foundation.
  */
-#include <asm/vdso.h>
-#include <asm/asm-offsets.h>
-#include <asm/unistd.h>
-
 #include <asm/vdso.h>
 #include <asm/asm-offsets.h>
 #include <asm/unistd.h>
index a0775e1f08df8a73d9d06127f0b675b976785a76..8300309698fa3abce3b9b84a5884aba432c372fe 100644 (file)
@@ -47,7 +47,7 @@ static int __diag_ipl_functions(struct kvm_vcpu *vcpu)
        vcpu->run->s390_reset_flags |= KVM_S390_RESET_IPL;
        vcpu->run->s390_reset_flags |= KVM_S390_RESET_CPU_INIT;
        vcpu->run->exit_reason = KVM_EXIT_S390_RESET;
-       VCPU_EVENT(vcpu, 3, "requesting userspace resets %lx",
+       VCPU_EVENT(vcpu, 3, "requesting userspace resets %llx",
          vcpu->run->s390_reset_flags);
        return -EREMOTE;
 }
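
The %lx to %llx conversions in this and the following s390 hunks follow from the asm/types.h change above: with asm-generic/int-ll64.h, u64 is unsigned long long on 64-bit s390 too, so u64 values need the ll length modifier. A two-line illustration:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long flags = 0x12ULL; /* a u64 under int-ll64.h */

            printf("requesting userspace resets %llx\n", flags);
            return 0;
    }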
index 2960702b48246488087546669b1459524ead2958..f4fe28a2521a10aa7e5d8c99d58c5e83603b631e 100644 (file)
@@ -160,7 +160,7 @@ static void __do_deliver_interrupt(struct kvm_vcpu *vcpu,
                break;
 
        case KVM_S390_INT_VIRTIO:
-               VCPU_EVENT(vcpu, 4, "interrupt: virtio parm:%x,parm64:%lx",
+               VCPU_EVENT(vcpu, 4, "interrupt: virtio parm:%x,parm64:%llx",
                           inti->ext.ext_params, inti->ext.ext_params2);
                vcpu->stat.deliver_virtio_interrupt++;
                rc = put_guest_u16(vcpu, __LC_EXT_INT_CODE, 0x2603);
@@ -360,7 +360,7 @@ int kvm_s390_handle_wait(struct kvm_vcpu *vcpu)
        vcpu->arch.ckc_timer.expires = jiffies + sltime;
 
        add_timer(&vcpu->arch.ckc_timer);
-       VCPU_EVENT(vcpu, 5, "enabled wait timer:%lx jiffies", sltime);
+       VCPU_EVENT(vcpu, 5, "enabled wait timer:%llx jiffies", sltime);
 no_timer:
        spin_lock_bh(&vcpu->arch.local_int.float_int->lock);
        spin_lock_bh(&vcpu->arch.local_int.lock);
@@ -491,7 +491,7 @@ int kvm_s390_inject_vm(struct kvm *kvm,
 
        switch (s390int->type) {
        case KVM_S390_INT_VIRTIO:
-               VM_EVENT(kvm, 5, "inject: virtio parm:%x,parm64:%lx",
+               VM_EVENT(kvm, 5, "inject: virtio parm:%x,parm64:%llx",
                         s390int->parm, s390int->parm64);
                inti->type = s390int->type;
                inti->ext.ext_params = s390int->parm;
index cce40ff2913bbf851059c8ea0f8d604d4eb94304..3605df45dd419fb2082b193c8e7438bf5b815dfe 100644 (file)
@@ -118,7 +118,7 @@ static int handle_store_cpu_address(struct kvm_vcpu *vcpu)
                goto out;
        }
 
-       VCPU_EVENT(vcpu, 5, "storing cpu address to %lx", useraddr);
+       VCPU_EVENT(vcpu, 5, "storing cpu address to %llx", useraddr);
 out:
        return 0;
 }
index fdcb93bc6d11d3126093409580f9243e7d1b1e8a..6c43625bb1a5ecbee71f8249546e297178ab8980 100644 (file)
@@ -9,7 +9,6 @@ typedef struct {
        mm_context_id_t         id;
        void                    *vdso;
 #else
-       struct vm_list_struct   *vmlist;
        unsigned long           end_brk;
 #endif
 #ifdef CONFIG_BINFMT_ELF_FDPIC
index 9fa9dcdf344baeb25ccfa79db6222b5aae470ba5..e02a359d2aa5c6bcbfb9c5465cfaa8f6bd79e007 100644 (file)
@@ -300,7 +300,7 @@ static inline int test_and_change_bit(int nr, volatile unsigned long *addr)
        return oldbit;
 }
 
-static inline int constant_test_bit(int nr, const volatile unsigned long *addr)
+static inline int constant_test_bit(unsigned int nr, const volatile unsigned long *addr)
 {
        return ((1UL << (nr % BITS_PER_LONG)) &
                (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0;
index 744aa7fc49d5e875403b6ffc64c0a2cae74f9c55..76b8cd953deed9f8a50d572cdc52b5edb68bc3b7 100644 (file)
@@ -201,6 +201,12 @@ struct chipset {
        void (*f)(int num, int slot, int func);
 };
 
+/*
+ * Only works for devices on the root bus.  If you add any devices
+ * not on bus 0, re-add another loop level in early_quirks().  But
+ * be careful, because at least the Nvidia quirk here relies on
+ * only matching on bus 0.
+ */
 static struct chipset early_qrk[] __initdata = {
        { PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
          PCI_CLASS_BRIDGE_PCI, PCI_ANY_ID, QFLAG_APPLY_ONCE, nvidia_bugs },
@@ -267,17 +273,17 @@ static int __init check_dev_quirk(int num, int slot, int func)
 
 void __init early_quirks(void)
 {
-       int num, slot, func;
+       int slot, func;
 
        if (!early_pci_allowed())
                return;
 
        /* Poor man's PCI discovery */
-       for (num = 0; num < 32; num++)
-               for (slot = 0; slot < 32; slot++)
-                       for (func = 0; func < 8; func++) {
-                               /* Only probe function 0 on single fn devices */
-                               if (check_dev_quirk(num, slot, func))
-                                       break;
-                       }
+       /* Only scan the root bus */
+       for (slot = 0; slot < 32; slot++)
+               for (func = 0; func < 8; func++) {
+                       /* Only probe function 0 on single fn devices */
+                       if (check_dev_quirk(0, slot, func))
+                               break;
+               }
 }
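
The rewritten scan walks bus 0 only; the inner break still skips functions 1-7 of single-function devices because check_dev_quirk() returns nonzero for them. A standalone sketch of that control flow (single_fn_device() is a made-up stand-in for the real header-type test):

    #include <stdio.h>

    /* made-up stand-in: every fourth slot is a single-function device */
    static int single_fn_device(int slot)
    {
            return (slot % 4) == 0;
    }

    /* nonzero stops probing the remaining functions of this slot */
    static int check_dev_quirk(int num, int slot, int func)
    {
            printf("probe %02x:%02x.%x\n", num, slot, func);
            return func == 0 && single_fn_device(slot);
    }

    int main(void)
    {
            int slot, func;

            /* poor man's PCI discovery, root bus only */
            for (slot = 0; slot < 8; slot++)        /* 32 in the real scan */
                    for (func = 0; func < 8; func++)
                            if (check_dev_quirk(0, slot, func))
                                    break;
            return 0;
    }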
index 98658f25f542b0be2a62116f1dcd61cb6d4e09a7..8fdf06e4edf93c4e93b37d71907bb274a0ebbfc3 100644 (file)
@@ -2,7 +2,7 @@
  * @file op_model_amd.c
  * athlon / K7 / K8 / Family 10h model-specific MSR operations
  *
- * @remark Copyright 2002-2008 OProfile authors
+ * @remark Copyright 2002-2009 OProfile authors
  * @remark Read the file COPYING
  *
  * @author John Levon
@@ -10,7 +10,7 @@
  * @author Graydon Hoare
  * @author Robert Richter <robert.richter@amd.com>
  * @author Barry Kasindorf
-*/
+ */
 
 #include <linux/oprofile.h>
 #include <linux/device.h>
@@ -60,53 +60,10 @@ static unsigned long reset_value[NUM_COUNTERS];
 #define IBS_OP_LOW_VALID_BIT           (1ULL<<18)      /* bit 18 */
 #define IBS_OP_LOW_ENABLE              (1ULL<<17)      /* bit 17 */
 
-/* Codes used in cpu_buffer.c */
-/* This produces duplicate code, need to be fixed */
-#define IBS_FETCH_BEGIN 3
-#define IBS_OP_BEGIN    4
-
-/*
- * The function interface needs to be fixed, something like add
- * data. Should then be added to linux/oprofile.h.
- */
-extern void
-oprofile_add_ibs_sample(struct pt_regs * const regs,
-                       unsigned int * const ibs_sample, int ibs_code);
-
-struct ibs_fetch_sample {
-       /* MSRC001_1031 IBS Fetch Linear Address Register */
-       unsigned int ibs_fetch_lin_addr_low;
-       unsigned int ibs_fetch_lin_addr_high;
-       /* MSRC001_1030 IBS Fetch Control Register */
-       unsigned int ibs_fetch_ctl_low;
-       unsigned int ibs_fetch_ctl_high;
-       /* MSRC001_1032 IBS Fetch Physical Address Register */
-       unsigned int ibs_fetch_phys_addr_low;
-       unsigned int ibs_fetch_phys_addr_high;
-};
-
-struct ibs_op_sample {
-       /* MSRC001_1034 IBS Op Logical Address Register (IbsRIP) */
-       unsigned int ibs_op_rip_low;
-       unsigned int ibs_op_rip_high;
-       /* MSRC001_1035 IBS Op Data Register */
-       unsigned int ibs_op_data1_low;
-       unsigned int ibs_op_data1_high;
-       /* MSRC001_1036 IBS Op Data 2 Register */
-       unsigned int ibs_op_data2_low;
-       unsigned int ibs_op_data2_high;
-       /* MSRC001_1037 IBS Op Data 3 Register */
-       unsigned int ibs_op_data3_low;
-       unsigned int ibs_op_data3_high;
-       /* MSRC001_1038 IBS DC Linear Address Register (IbsDcLinAd) */
-       unsigned int ibs_dc_linear_low;
-       unsigned int ibs_dc_linear_high;
-       /* MSRC001_1039 IBS DC Physical Address Register (IbsDcPhysAd) */
-       unsigned int ibs_dc_phys_low;
-       unsigned int ibs_dc_phys_high;
-};
+#define IBS_FETCH_SIZE 6
+#define IBS_OP_SIZE    12
 
-static int ibs_allowed;        /* AMD Family10h and later */
+static int has_ibs;    /* AMD Family10h and later */
 
 struct op_ibs_config {
        unsigned long op_enabled;
@@ -197,31 +154,29 @@ static inline int
 op_amd_handle_ibs(struct pt_regs * const regs,
                  struct op_msrs const * const msrs)
 {
-       unsigned int low, high;
-       struct ibs_fetch_sample ibs_fetch;
-       struct ibs_op_sample ibs_op;
+       u32 low, high;
+       u64 msr;
+       struct op_entry entry;
 
-       if (!ibs_allowed)
+       if (!has_ibs)
                return 1;
 
        if (ibs_config.fetch_enabled) {
                rdmsr(MSR_AMD64_IBSFETCHCTL, low, high);
                if (high & IBS_FETCH_HIGH_VALID_BIT) {
-                       ibs_fetch.ibs_fetch_ctl_high = high;
-                       ibs_fetch.ibs_fetch_ctl_low = low;
-                       rdmsr(MSR_AMD64_IBSFETCHLINAD, low, high);
-                       ibs_fetch.ibs_fetch_lin_addr_high = high;
-                       ibs_fetch.ibs_fetch_lin_addr_low = low;
-                       rdmsr(MSR_AMD64_IBSFETCHPHYSAD, low, high);
-                       ibs_fetch.ibs_fetch_phys_addr_high = high;
-                       ibs_fetch.ibs_fetch_phys_addr_low = low;
-
-                       oprofile_add_ibs_sample(regs,
-                                               (unsigned int *)&ibs_fetch,
-                                               IBS_FETCH_BEGIN);
+                       rdmsrl(MSR_AMD64_IBSFETCHLINAD, msr);
+                       oprofile_write_reserve(&entry, regs, msr,
+                                              IBS_FETCH_CODE, IBS_FETCH_SIZE);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       oprofile_add_data(&entry, low);
+                       oprofile_add_data(&entry, high);
+                       rdmsrl(MSR_AMD64_IBSFETCHPHYSAD, msr);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       oprofile_write_commit(&entry);
 
                        /* reenable the IRQ */
-                       rdmsr(MSR_AMD64_IBSFETCHCTL, low, high);
                        high &= ~IBS_FETCH_HIGH_VALID_BIT;
                        high |= IBS_FETCH_HIGH_ENABLE;
                        low &= IBS_FETCH_LOW_MAX_CNT_MASK;
@@ -232,30 +187,29 @@ op_amd_handle_ibs(struct pt_regs * const regs,
        if (ibs_config.op_enabled) {
                rdmsr(MSR_AMD64_IBSOPCTL, low, high);
                if (low & IBS_OP_LOW_VALID_BIT) {
-                       rdmsr(MSR_AMD64_IBSOPRIP, low, high);
-                       ibs_op.ibs_op_rip_low = low;
-                       ibs_op.ibs_op_rip_high = high;
-                       rdmsr(MSR_AMD64_IBSOPDATA, low, high);
-                       ibs_op.ibs_op_data1_low = low;
-                       ibs_op.ibs_op_data1_high = high;
-                       rdmsr(MSR_AMD64_IBSOPDATA2, low, high);
-                       ibs_op.ibs_op_data2_low = low;
-                       ibs_op.ibs_op_data2_high = high;
-                       rdmsr(MSR_AMD64_IBSOPDATA3, low, high);
-                       ibs_op.ibs_op_data3_low = low;
-                       ibs_op.ibs_op_data3_high = high;
-                       rdmsr(MSR_AMD64_IBSDCLINAD, low, high);
-                       ibs_op.ibs_dc_linear_low = low;
-                       ibs_op.ibs_dc_linear_high = high;
-                       rdmsr(MSR_AMD64_IBSDCPHYSAD, low, high);
-                       ibs_op.ibs_dc_phys_low = low;
-                       ibs_op.ibs_dc_phys_high = high;
+                       rdmsrl(MSR_AMD64_IBSOPRIP, msr);
+                       oprofile_write_reserve(&entry, regs, msr,
+                                              IBS_OP_CODE, IBS_OP_SIZE);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       rdmsrl(MSR_AMD64_IBSOPDATA, msr);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       rdmsrl(MSR_AMD64_IBSOPDATA2, msr);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       rdmsrl(MSR_AMD64_IBSOPDATA3, msr);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       rdmsrl(MSR_AMD64_IBSDCLINAD, msr);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       rdmsrl(MSR_AMD64_IBSDCPHYSAD, msr);
+                       oprofile_add_data(&entry, (u32)msr);
+                       oprofile_add_data(&entry, (u32)(msr >> 32));
+                       oprofile_write_commit(&entry);
 
                        /* reenable the IRQ */
-                       oprofile_add_ibs_sample(regs,
-                                               (unsigned int *)&ibs_op,
-                                               IBS_OP_BEGIN);
-                       rdmsr(MSR_AMD64_IBSOPCTL, low, high);
                        high = 0;
                        low &= ~IBS_OP_LOW_VALID_BIT;
                        low |= IBS_OP_LOW_ENABLE;
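
IBS_FETCH_SIZE (6) and IBS_OP_SIZE (12) above count the 32-bit words each record carries: the fetch path logs three register pairs, the op path six MSRs, with every 64-bit value split into low and high u32 halves for oprofile_add_data(). A userspace sketch of that split (add_data() stands in for the oprofile call):

    #include <stdio.h>
    #include <stdint.h>

    /* stand-in for oprofile_add_data(): record one 32-bit word */
    static void add_data(uint32_t word)
    {
            printf("0x%08x\n", (unsigned)word);
    }

    int main(void)
    {
            uint64_t msr = 0x1122334455667788ull;   /* as read by rdmsrl() */

            add_data((uint32_t)msr);                /* low half first */
            add_data((uint32_t)(msr >> 32));        /* then the high half */
            return 0;
    }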
@@ -305,14 +259,14 @@ static void op_amd_start(struct op_msrs const * const msrs)
        }
 
 #ifdef CONFIG_OPROFILE_IBS
-       if (ibs_allowed && ibs_config.fetch_enabled) {
+       if (has_ibs && ibs_config.fetch_enabled) {
                low = (ibs_config.max_cnt_fetch >> 4) & 0xFFFF;
                high = ((ibs_config.rand_en & 0x1) << 25) /* bit 57 */
                        + IBS_FETCH_HIGH_ENABLE;
                wrmsr(MSR_AMD64_IBSFETCHCTL, low, high);
        }
 
-       if (ibs_allowed && ibs_config.op_enabled) {
+       if (has_ibs && ibs_config.op_enabled) {
                low = ((ibs_config.max_cnt_op >> 4) & 0xFFFF)
                        + ((ibs_config.dispatched_ops & 0x1) << 19) /* bit 19 */
                        + IBS_OP_LOW_ENABLE;
@@ -341,14 +295,14 @@ static void op_amd_stop(struct op_msrs const * const msrs)
        }
 
 #ifdef CONFIG_OPROFILE_IBS
-       if (ibs_allowed && ibs_config.fetch_enabled) {
+       if (has_ibs && ibs_config.fetch_enabled) {
                /* clear max count and enable */
                low = 0;
                high = 0;
                wrmsr(MSR_AMD64_IBSFETCHCTL, low, high);
        }
 
-       if (ibs_allowed && ibs_config.op_enabled) {
+       if (has_ibs && ibs_config.op_enabled) {
                /* clear max count and enable */
                low = 0;
                high = 0;
@@ -409,6 +363,7 @@ static int init_ibs_nmi(void)
                                       | IBSCTL_LVTOFFSETVAL);
                pci_read_config_dword(cpu_cfg, IBSCTL, &value);
                if (value != (ibs_eilvt_off | IBSCTL_LVTOFFSETVAL)) {
+                       pci_dev_put(cpu_cfg);
                        printk(KERN_DEBUG "Failed to setup IBS LVT offset, "
                                "IBSCTL = 0x%08x", value);
                        return 1;
@@ -436,20 +391,20 @@ static int init_ibs_nmi(void)
 /* uninitialize the APIC for the IBS interrupts if needed */
 static void clear_ibs_nmi(void)
 {
-       if (ibs_allowed)
+       if (has_ibs)
                on_each_cpu(apic_clear_ibs_nmi_per_cpu, NULL, 1);
 }
 
 /* initialize the APIC for the IBS interrupts if available */
 static void ibs_init(void)
 {
-       ibs_allowed = boot_cpu_has(X86_FEATURE_IBS);
+       has_ibs = boot_cpu_has(X86_FEATURE_IBS);
 
-       if (!ibs_allowed)
+       if (!has_ibs)
                return;
 
        if (init_ibs_nmi()) {
-               ibs_allowed = 0;
+               has_ibs = 0;
                return;
        }
 
@@ -458,7 +413,7 @@ static void ibs_init(void)
 
 static void ibs_exit(void)
 {
-       if (!ibs_allowed)
+       if (!has_ibs)
                return;
 
        clear_ibs_nmi();
@@ -478,7 +433,7 @@ static int setup_ibs_files(struct super_block *sb, struct dentry *root)
        if (ret)
                return ret;
 
-       if (!ibs_allowed)
+       if (!has_ibs)
                return ret;
 
        /* model specific files */
index 5ea7d7713fca42d0c5b091393669e20b46a9ca0f..a53496828b76949a3e0b50c82fa4372d8c6f21dd 100644 (file)
@@ -1,26 +1,30 @@
 /*
- * hvc_iucv.c - z/VM IUCV back-end for the Hypervisor Console (HVC)
+ * hvc_iucv.c - z/VM IUCV hypervisor console (HVC) device driver
  *
- * This back-end for HVC provides terminal access via
+ * This HVC device driver provides terminal access using
  * z/VM IUCV communication paths.
  *
- * Copyright IBM Corp. 2008.
+ * Copyright IBM Corp. 2008
  *
  * Author(s):  Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
  */
 #define KMSG_COMPONENT         "hvc_iucv"
+#define pr_fmt(fmt)            KMSG_COMPONENT ": " fmt
 
 #include <linux/types.h>
 #include <asm/ebcdic.h>
+#include <linux/delay.h>
+#include <linux/init.h>
 #include <linux/mempool.h>
 #include <linux/module.h>
 #include <linux/tty.h>
+#include <linux/wait.h>
 #include <net/iucv/iucv.h>
 
 #include "hvc_console.h"
 
 
-/* HVC backend for z/VM IUCV */
+/* General device driver settings */
 #define HVC_IUCV_MAGIC         0xc9e4c3e5
 #define MAX_HVC_IUCV_LINES     HVC_ALLOC_TTY_ADAPTERS
 #define MEMPOOL_MIN_NR         (PAGE_SIZE / sizeof(struct iucv_tty_buffer)/4)
 #define MSG_TYPE_WINSIZE       0x08    /* Terminal window size update */
 #define MSG_TYPE_DATA          0x10    /* Terminal data */
 
-#define MSG_SIZE(s)            ((s) + offsetof(struct iucv_tty_msg, data))
 struct iucv_tty_msg {
        u8      version;                /* Message version */
        u8      type;                   /* Message type */
-#define MSG_MAX_DATALEN                (~(u16)0)
+#define MSG_MAX_DATALEN                ((u16)(~0))
        u16     datalen;                /* Payload length */
        u8      data[];                 /* Payload buffer */
 } __attribute__((packed));
+#define MSG_SIZE(s)            ((s) + offsetof(struct iucv_tty_msg, data))
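
MSG_SIZE() above computes a message footprint as payload length plus the offset of the flexible data[] array, the standard offsetof() idiom for variable-length messages. A minimal standalone sketch:

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    struct msg {
            uint8_t  version;
            uint8_t  type;
            uint16_t datalen;
            uint8_t  data[];        /* flexible array member: payload */
    };

    #define MSG_SIZE(s)     ((s) + offsetof(struct msg, data))

    int main(void)
    {
            /* bytes needed for a message with 100 payload bytes */
            printf("MSG_SIZE(100) = %zu\n", MSG_SIZE(100));
            return 0;
    }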
 
 enum iucv_state_t {
        IUCV_DISCONN    = 0,
@@ -54,19 +58,26 @@ enum tty_state_t {
 };
 
 struct hvc_iucv_private {
-       struct hvc_struct       *hvc; /* HVC console struct reference */
+       struct hvc_struct       *hvc;           /* HVC struct reference */
        u8                      srv_name[8];    /* IUCV service name (ebcdic) */
+       unsigned char           is_console;     /* Linux console usage flag */
        enum iucv_state_t       iucv_state;     /* IUCV connection status */
        enum tty_state_t        tty_state;      /* TTY status */
        struct iucv_path        *path;          /* IUCV path pointer */
        spinlock_t              lock;           /* hvc_iucv_private lock */
+#define SNDBUF_SIZE            (PAGE_SIZE)     /* must be < MSG_MAX_DATALEN */
+       void                    *sndbuf;        /* send buffer            */
+       size_t                  sndbuf_len;     /* length of send buffer  */
+#define QUEUE_SNDBUF_DELAY     (HZ / 25)
+       struct delayed_work     sndbuf_work;    /* work: send iucv msg(s) */
+       wait_queue_head_t       sndbuf_waitq;   /* wait for send completion */
        struct list_head        tty_outqueue;   /* outgoing IUCV messages */
        struct list_head        tty_inqueue;    /* incoming IUCV messages */
 };
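
The new sndbuf fields buffer terminal output and flush it from delayed work after QUEUE_SNDBUF_DELAY (HZ/25) jiffies. A minimal module-style sketch of that delayed-work pattern, assuming nothing about the driver beyond the fields above (the demo names and message are made up; the driver's actual worker sends the buffered data as an IUCV message):

    #include <linux/module.h>
    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    static struct delayed_work flush_work;

    /* runs in process context about HZ/25 jiffies after being queued */
    static void flush_fn(struct work_struct *work)
    {
            pr_info("flushing buffered terminal output\n");
    }

    static int __init demo_init(void)
    {
            INIT_DELAYED_WORK(&flush_work, flush_fn);
            schedule_delayed_work(&flush_work, HZ / 25);
            return 0;
    }

    static void __exit demo_exit(void)
    {
            cancel_delayed_work_sync(&flush_work);
    }

    module_init(demo_init);
    module_exit(demo_exit);
    MODULE_LICENSE("GPL");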
 
 struct iucv_tty_buffer {
        struct list_head        list;   /* list pointer */
-       struct iucv_message     msg;    /* store an incoming IUCV message */
+       struct iucv_message     msg;    /* store an IUCV message */
        size_t                  offset; /* data buffer offset */
        struct iucv_tty_msg     *mbuf;  /* buffer to store input/output data */
 };
@@ -78,11 +89,12 @@ static void hvc_iucv_msg_pending(struct iucv_path *, struct iucv_message *);
 static void hvc_iucv_msg_complete(struct iucv_path *, struct iucv_message *);
 
 
-/* Kernel module parameters */
-static unsigned long hvc_iucv_devices;
+/* Kernel module parameter: use one terminal device as default */
+static unsigned long hvc_iucv_devices = 1;
 
 /* Array of allocated hvc iucv tty lines... */
 static struct hvc_iucv_private *hvc_iucv_table[MAX_HVC_IUCV_LINES];
+#define IUCV_HVC_CON_IDX       (0)
 
 /* Kmem cache and mempool for iucv_tty_buffer elements */
 static struct kmem_cache *hvc_iucv_buffer_cache;
@@ -112,7 +124,7 @@ struct hvc_iucv_private *hvc_iucv_get_private(uint32_t num)
 }
 
 /**
- * alloc_tty_buffer() - Returns a new struct iucv_tty_buffer element.
+ * alloc_tty_buffer() - Return a new struct iucv_tty_buffer element.
  * @size:      Size of the internal buffer used to store data.
  * @flags:     Memory allocation flags passed to mempool.
  *
@@ -120,7 +132,6 @@ struct hvc_iucv_private *hvc_iucv_get_private(uint32_t num)
  * allocates an internal data buffer with the specified size @size.
  * Note: The total message size arises from the internal buffer size and the
  *      members of the iucv_tty_msg structure.
- *
  * The function returns NULL if memory allocation has failed.
  */
 static struct iucv_tty_buffer *alloc_tty_buffer(size_t size, gfp_t flags)
@@ -130,7 +141,7 @@ static struct iucv_tty_buffer *alloc_tty_buffer(size_t size, gfp_t flags)
        bufp = mempool_alloc(hvc_iucv_mempool, flags);
        if (!bufp)
                return NULL;
-       memset(bufp, 0, sizeof(struct iucv_tty_buffer));
+       memset(bufp, 0, sizeof(*bufp));
 
        if (size > 0) {
                bufp->msg.length = MSG_SIZE(size);
@@ -149,9 +160,6 @@ static struct iucv_tty_buffer *alloc_tty_buffer(size_t size, gfp_t flags)
 /**
  * destroy_tty_buffer() - destroy struct iucv_tty_buffer element.
  * @bufp:      Pointer to a struct iucv_tty_buffer element, SHALL NOT be NULL.
- *
- * The destroy_tty_buffer() function frees the internal data buffer and returns
- * the struct iucv_tty_buffer element back to the mempool for freeing.
  */
 static void destroy_tty_buffer(struct iucv_tty_buffer *bufp)
 {
@@ -161,11 +169,7 @@ static void destroy_tty_buffer(struct iucv_tty_buffer *bufp)
 
 /**
  * destroy_tty_buffer_list() - call destroy_tty_buffer() for each list element.
- * @list:      List head pointer to a list containing struct iucv_tty_buffer
- *             elements.
- *
- * Calls destroy_tty_buffer() for each struct iucv_tty_buffer element in the
- * list @list.
+ * @list:      List containing struct iucv_tty_buffer elements.
  */
 static void destroy_tty_buffer_list(struct list_head *list)
 {
@@ -178,24 +182,24 @@ static void destroy_tty_buffer_list(struct list_head *list)
 }
 
 /**
- * hvc_iucv_write() - Receive IUCV message write data to HVC console buffer.
- * @priv:              Pointer to hvc_iucv_private structure.
- * @buf:               HVC console buffer for writing received terminal data.
- * @count:             HVC console buffer size.
+ * hvc_iucv_write() - Receive IUCV message & write data to HVC buffer.
+ * @priv:              Pointer to struct hvc_iucv_private
+ * @buf:               HVC buffer for writing received terminal data.
+ * @count:             HVC buffer size.
  * @has_more_data:     Pointer to an int variable.
  *
  * The function picks up pending messages from the input queue and receives
  * the message data that is then written to the specified buffer @buf.
- * If the buffer size @count is less than the data message size, then the
+ * If the buffer size @count is less than the data message size, the
  * message is kept on the input queue and @has_more_data is set to 1.
- * If the message data has been entirely written, the message is removed from
+ * If all message data has been written, the message is removed from
  * the input queue.
  *
  * The function returns the number of bytes written to the terminal, zero if
  * there are no pending data messages available or if there is no established
  * IUCV path.
  * If the IUCV path has been severed, then -EPIPE is returned to cause a
- * hang up (that is issued by the HVC console layer).
+ * hang up (that is issued by the HVC layer).
  */
 static int hvc_iucv_write(struct hvc_iucv_private *priv,
                          char *buf, int count, int *has_more_data)
@@ -204,12 +208,12 @@ static int hvc_iucv_write(struct hvc_iucv_private *priv,
        int written;
        int rc;
 
-       /* Immediately return if there is no IUCV connection */
+       /* immediately return if there is no IUCV connection */
        if (priv->iucv_state == IUCV_DISCONN)
                return 0;
 
-       /* If the IUCV path has been severed, return -EPIPE to inform the
-        * hvc console layer to hang up the tty device. */
+       /* if the IUCV path has been severed, return -EPIPE to inform the
+        * HVC layer to hang up the tty device. */
        if (priv->iucv_state == IUCV_SEVERED)
                return -EPIPE;
 
@@ -217,7 +221,7 @@ static int hvc_iucv_write(struct hvc_iucv_private *priv,
        if (list_empty(&priv->tty_inqueue))
                return 0;
 
-       /* receive a iucv message and flip data to the tty (ldisc) */
+       /* receive an iucv message and flip data to the tty (ldisc) */
        rb = list_first_entry(&priv->tty_inqueue, struct iucv_tty_buffer, list);
 
        written = 0;
@@ -260,7 +264,7 @@ static int hvc_iucv_write(struct hvc_iucv_private *priv,
        case MSG_TYPE_WINSIZE:
                if (rb->mbuf->datalen != sizeof(struct winsize))
                        break;
-               hvc_resize(priv->hvc, *((struct winsize *)rb->mbuf->data));
+               hvc_resize(priv->hvc, *((struct winsize *) rb->mbuf->data));
                break;
 
        case MSG_TYPE_ERROR:    /* ignored ... */
@@ -284,10 +288,9 @@ out_written:
  * @buf:       Pointer to a buffer to store data
  * @count:     Size of buffer available for writing
  *
- * The hvc_console thread calls this method to read characters from
- * the terminal backend. If an IUCV communication path has been established,
- * pending IUCV messages are received and data is copied into buffer @buf
- * up to @count bytes.
+ * The HVC thread calls this method to read characters from the back-end.
+ * If an IUCV communication path has been established, pending IUCV messages
+ * are received and data is copied into buffer @buf up to @count bytes.
  *
  * Locking:    The routine gets called under an irqsave() spinlock; and
  *             the routine locks the struct hvc_iucv_private->lock to call
@@ -318,66 +321,122 @@ static int hvc_iucv_get_chars(uint32_t vtermno, char *buf, int count)
 }
 
 /**
- * hvc_iucv_send() - Send an IUCV message containing terminal data.
+ * hvc_iucv_queue() - Buffer terminal data for sending.
  * @priv:      Pointer to struct hvc_iucv_private instance.
  * @buf:       Buffer containing data to send.
- * @size:      Size of buffer and amount of data to send.
+ * @count:     Size of buffer and amount of data to send.
+ *
+ * The function queues data for sending. To actually send the buffered data,
+ * a work queue function is scheduled (with QUEUE_SNDBUF_DELAY).
+ * The function returns the number of data bytes that have been buffered.
  *
- * If an IUCV communication path is established, the function copies the buffer
- * data to a newly allocated struct iucv_tty_buffer element, sends the data and
- * puts the element to the outqueue.
+ * If the device is not connected, data is ignored and the function returns
+ * @count.
+ * If the buffer is full, the function returns 0.
+ * If an existing IUCV communication path has been severed, -EPIPE is returned
+ * (that can be passed to HVC layer to cause a tty hangup).
+ */
+static int hvc_iucv_queue(struct hvc_iucv_private *priv, const char *buf,
+                         int count)
+{
+       size_t len;
+
+       if (priv->iucv_state == IUCV_DISCONN)
+               return count;                   /* ignore data */
+
+       if (priv->iucv_state == IUCV_SEVERED)
+               return -EPIPE;
+
+       len = min_t(size_t, count, SNDBUF_SIZE - priv->sndbuf_len);
+       if (!len)
+               return 0;
+
+       memcpy(priv->sndbuf + priv->sndbuf_len, buf, len);
+       priv->sndbuf_len += len;
+
+       if (priv->iucv_state == IUCV_CONNECTED)
+               schedule_delayed_work(&priv->sndbuf_work, QUEUE_SNDBUF_DELAY);
+
+       return len;
+}
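
The queue/flush split keeps put_chars() from ever blocking: callers only copy into sndbuf under the spinlock, and the actual IUCV send runs later from the delayed work item in process context. A minimal sketch of the same buffer-and-flush pattern in isolation; the example_* names and EXAMPLE_* constants are hypothetical stand-ins, while min_t() and schedule_delayed_work() are the real kernel APIs used above:

	#include <linux/kernel.h>
	#include <linux/string.h>
	#include <linux/workqueue.h>

	#define EXAMPLE_BUF_SIZE	4096		/* stands in for SNDBUF_SIZE */
	#define EXAMPLE_FLUSH_DELAY	(HZ / 25)	/* stands in for QUEUE_SNDBUF_DELAY */

	struct example_priv {
		char			outbuf[EXAMPLE_BUF_SIZE];
		size_t			outbuf_len;	/* protected by the caller's lock */
		struct delayed_work	flush_work;	/* sends outbuf contents later */
	};

	/* non-blocking producer: copy what fits, schedule the flush, report it */
	static int example_queue(struct example_priv *p, const char *buf, int count)
	{
		size_t len = min_t(size_t, count, EXAMPLE_BUF_SIZE - p->outbuf_len);

		if (!len)
			return 0;		/* buffer full; caller retries later */

		memcpy(p->outbuf + p->outbuf_len, buf, len);
		p->outbuf_len += len;
		schedule_delayed_work(&p->flush_work, EXAMPLE_FLUSH_DELAY);
		return len;			/* accepted, but not yet on the wire */
	}
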
+
+/**
+ * hvc_iucv_send() - Send an IUCV message containing terminal data.
+ * @priv:      Pointer to struct hvc_iucv_private instance.
  *
- * If there is no IUCV communication path established, the function returns 0.
- * If an existing IUCV communicaton path has been severed, the function returns
- * -EPIPE (can be passed to HVC layer to cause a tty hangup).
+ * If an IUCV communication path has been established, the buffered output data
+ * is sent via an IUCV message and the number of bytes sent is returned.
+ * Returns 0 if there is no established IUCV communication path or
+ * -EPIPE if an existing IUCV communication path has been severed.
  */
-static int hvc_iucv_send(struct hvc_iucv_private *priv, const char *buf,
-                        int count)
+static int hvc_iucv_send(struct hvc_iucv_private *priv)
 {
        struct iucv_tty_buffer *sb;
-       int rc;
-       u16 len;
+       int rc, len;
 
        if (priv->iucv_state == IUCV_SEVERED)
                return -EPIPE;
 
        if (priv->iucv_state == IUCV_DISCONN)
-               return 0;
+               return -EIO;
 
-       len = min_t(u16, MSG_MAX_DATALEN, count);
+       if (!priv->sndbuf_len)
+               return 0;
 
        /* allocate internal buffer to store msg data and also compute total
         * message length */
-       sb = alloc_tty_buffer(len, GFP_ATOMIC);
+       sb = alloc_tty_buffer(priv->sndbuf_len, GFP_ATOMIC);
        if (!sb)
                return -ENOMEM;
 
-       sb->mbuf->datalen = len;
-       memcpy(sb->mbuf->data, buf, len);
+       memcpy(sb->mbuf->data, priv->sndbuf, priv->sndbuf_len);
+       sb->mbuf->datalen = (u16) priv->sndbuf_len;
+       sb->msg.length = MSG_SIZE(sb->mbuf->datalen);
 
        list_add_tail(&sb->list, &priv->tty_outqueue);
 
        rc = __iucv_message_send(priv->path, &sb->msg, 0, 0,
                                 (void *) sb->mbuf, sb->msg.length);
        if (rc) {
+               /* drop the message here; however we might want to handle
+                * 0x03 (msg limit reached) by trying again... */
                list_del(&sb->list);
                destroy_tty_buffer(sb);
-               len = 0;
        }
+       len = priv->sndbuf_len;
+       priv->sndbuf_len = 0;
 
        return len;
 }
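
On any send error the buffered data is dropped, and len still reports the bytes as consumed. The in-code comment above hints at a friendlier alternative for return code 0x03: keep sndbuf_len intact and retry via the existing sndbuf_work. A hedged sketch of that alternative error branch; this is not what the patch does, and the 0x03 value comes from the comment rather than from a header:

	rc = __iucv_message_send(priv->path, &sb->msg, 0, 0,
				 (void *) sb->mbuf, sb->msg.length);
	if (rc == 0x03) {			/* msg limit reached */
		list_del(&sb->list);
		destroy_tty_buffer(sb);
		schedule_delayed_work(&priv->sndbuf_work, QUEUE_SNDBUF_DELAY);
		return 0;	/* sndbuf_len kept; data goes out on retry */
	}
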
 
+/**
+ * hvc_iucv_sndbuf_work() - Send buffered data over IUCV
+ * @work:      Work structure.
+ *
+ * This work queue function sends buffered output data over IUCV and,
+ * if not all buffered data could be sent, reschedules itself.
+ */
+static void hvc_iucv_sndbuf_work(struct work_struct *work)
+{
+       struct hvc_iucv_private *priv;
+
+       priv = container_of(work, struct hvc_iucv_private, sndbuf_work.work);
+       if (!priv)
+               return;
+
+       spin_lock_bh(&priv->lock);
+       hvc_iucv_send(priv);
+       spin_unlock_bh(&priv->lock);
+}
+
 /**
  * hvc_iucv_put_chars() - HVC put_chars operation.
  * @vtermno:   HVC virtual terminal number.
  * @buf:       Pointer to an buffer to read data from
  * @count:     Size of buffer available for reading
  *
- * The hvc_console thread calls this method to write characters from
- * to the terminal backend.
- * The function calls hvc_iucv_send() under the lock of the
- * struct hvc_iucv_private instance that corresponds to the tty @vtermno.
+ * The HVC thread calls this method to write characters to the back-end.
+ * The function calls hvc_iucv_queue() to queue terminal data for sending.
  *
  * Locking:    The method gets called under an irqsave() spinlock; and
  *             locks struct hvc_iucv_private->lock.
@@ -385,7 +444,7 @@ static int hvc_iucv_send(struct hvc_iucv_private *priv, const char *buf,
 static int hvc_iucv_put_chars(uint32_t vtermno, const char *buf, int count)
 {
        struct hvc_iucv_private *priv = hvc_iucv_get_private(vtermno);
-       int sent;
+       int queued;
 
        if (count <= 0)
                return 0;
@@ -394,10 +453,10 @@ static int hvc_iucv_put_chars(uint32_t vtermno, const char *buf, int count)
                return -ENODEV;
 
        spin_lock(&priv->lock);
-       sent = hvc_iucv_send(priv, buf, count);
+       queued = hvc_iucv_queue(priv, buf, count);
        spin_unlock(&priv->lock);
 
-       return sent;
+       return queued;
 }
 
 /**
@@ -406,7 +465,7 @@ static int hvc_iucv_put_chars(uint32_t vtermno, const char *buf, int count)
  * @id:        Additional data (originally passed to hvc_alloc): the index of an struct
  *     hvc_iucv_private instance.
  *
- * The function sets the tty state to TTY_OPEN for the struct hvc_iucv_private
+ * The function sets the tty state to TTY_OPENED for the struct hvc_iucv_private
  * instance that is derived from @id. Always returns 0.
  *
  * Locking:    struct hvc_iucv_private->lock, spin_lock_bh
@@ -427,12 +486,8 @@ static int hvc_iucv_notifier_add(struct hvc_struct *hp, int id)
 }
 
 /**
- * hvc_iucv_cleanup() - Clean up function if the tty portion is finally closed.
+ * hvc_iucv_cleanup() - Clean up and reset a z/VM IUCV HVC instance.
  * @priv:      Pointer to the struct hvc_iucv_private instance.
- *
- * The functions severs the established IUCV communication path (if any), and
- * destroy struct iucv_tty_buffer elements from the in- and outqueue. Finally,
- * the functions resets the states to TTY_CLOSED and IUCV_DISCONN.
  */
 static void hvc_iucv_cleanup(struct hvc_iucv_private *priv)
 {
@@ -441,25 +496,62 @@ static void hvc_iucv_cleanup(struct hvc_iucv_private *priv)
 
        priv->tty_state = TTY_CLOSED;
        priv->iucv_state = IUCV_DISCONN;
+
+       priv->sndbuf_len = 0;
 }
 
 /**
- * hvc_iucv_notifier_hangup() - HVC notifier for tty hangups.
- * @hp: Pointer to the HVC device (struct hvc_struct)
- * @id: Additional data (originally passed to hvc_alloc): the index of an struct
- *     hvc_iucv_private instance.
+ * tty_outqueue_empty() - Test if the tty outq is empty
+ * @priv:      Pointer to struct hvc_iucv_private instance.
+ */
+static inline int tty_outqueue_empty(struct hvc_iucv_private *priv)
+{
+       int rc;
+
+       spin_lock_bh(&priv->lock);
+       rc = list_empty(&priv->tty_outqueue);
+       spin_unlock_bh(&priv->lock);
+
+       return rc;
+}
+
+/**
+ * flush_sndbuf_sync() - Flush send buffer and wait for completion
+ * @priv:      Pointer to struct hvc_iucv_private instance.
  *
- * This routine notifies the HVC backend that a tty hangup (carrier loss,
- * virtual or otherwise) has occured.
+ * The routine cancels a pending sndbuf work, calls hvc_iucv_send()
+ * to flush any buffered terminal output data and waits for completion.
+ */
+static void flush_sndbuf_sync(struct hvc_iucv_private *priv)
+{
+       int sync_wait;
+
+       cancel_delayed_work_sync(&priv->sndbuf_work);
+
+       spin_lock_bh(&priv->lock);
+       hvc_iucv_send(priv);            /* force sending buffered data */
+       sync_wait = !list_empty(&priv->tty_outqueue); /* anything queued? */
+       spin_unlock_bh(&priv->lock);
+
+       if (sync_wait)
+               wait_event_timeout(priv->sndbuf_waitq,
+                                  tty_outqueue_empty(priv), HZ);
+}
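
Note that wait_event_timeout() re-evaluates its condition with no locks held, which is why the check is wrapped in tty_outqueue_empty() with its own lock/unlock pair; the matching wake_up() sits in hvc_iucv_msg_complete() further down, and the HZ argument bounds the wait to about one second so close/hangup cannot block forever. Condensed, the handshake between the two sides looks like this (both fragments use the driver's real identifiers):

	/* waiter, process context (flush_sndbuf_sync): */
	wait_event_timeout(priv->sndbuf_waitq,
			   tty_outqueue_empty(priv), HZ);

	/* waker, IUCV completion callback (hvc_iucv_msg_complete): */
	spin_lock(&priv->lock);
	/* ...remove the delivered message from tty_outqueue... */
	wake_up(&priv->sndbuf_waitq);
	spin_unlock(&priv->lock);
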
+
+/**
+ * hvc_iucv_notifier_hangup() - HVC notifier for TTY hangups.
+ * @hp:                Pointer to the HVC device (struct hvc_struct)
+ * @id:                Additional data (originally passed to hvc_alloc):
+ *             the index of a struct hvc_iucv_private instance.
  *
- * The HVC backend for z/VM IUCV ignores virtual hangups (vhangup()), to keep
- * an existing IUCV communication path established.
+ * This routine notifies the HVC back-end that a tty hangup (carrier loss,
+ * virtual or otherwise) has occurred.
+ * The z/VM IUCV HVC device driver ignores virtual hangups (vhangup())
+ * to keep an existing IUCV communication path established.
  * (Background: vhangup() is called from user space (by getty or login) to
  *             disable writing to the tty by other applications).
- *
- * If the tty has been opened (e.g. getty) and an established IUCV path has been
- * severed (we caused the tty hangup in that case), then the functions invokes
- * hvc_iucv_cleanup() to clean up.
+ * If the tty has been opened and an established IUCV path has been severed
+ * (we caused the tty hangup), the function calls hvc_iucv_cleanup().
  *
  * Locking:    struct hvc_iucv_private->lock
  */
@@ -471,12 +563,12 @@ static void hvc_iucv_notifier_hangup(struct hvc_struct *hp, int id)
        if (!priv)
                return;
 
+       flush_sndbuf_sync(priv);
+
        spin_lock_bh(&priv->lock);
        /* NOTE: If the hangup was scheduled by ourself (from the iucv
-        *       path_servered callback [IUCV_SEVERED]), then we have to
-        *       finally clean up the tty backend structure and set state to
-        *       TTY_CLOSED.
-        *
+        *       path_severed callback [IUCV_SEVERED]), we have to clean up
+        *       our structure and set the state to TTY_CLOSED.
         *       If the tty was hung up otherwise (e.g. vhangup()), then we
         *       ignore this hangup and keep an established IUCV path open...
         *       (...the reason is that we are not able to connect back to the
@@ -494,10 +586,9 @@ static void hvc_iucv_notifier_hangup(struct hvc_struct *hp, int id)
  * @id:                Additional data (originally passed to hvc_alloc):
  *             the index of an struct hvc_iucv_private instance.
  *
- * This routine notifies the HVC backend that the last tty device file
- * descriptor has been closed.
- * The function calls hvc_iucv_cleanup() to clean up the struct hvc_iucv_private
- * instance.
+ * This routine notifies the HVC back-end that the last tty device fd has been
+ * closed.  The function calls hvc_iucv_cleanup() to clean up the struct
+ * hvc_iucv_private instance.
  *
  * Locking:    struct hvc_iucv_private->lock
  */
@@ -510,6 +601,8 @@ static void hvc_iucv_notifier_del(struct hvc_struct *hp, int id)
        if (!priv)
                return;
 
+       flush_sndbuf_sync(priv);
+
        spin_lock_bh(&priv->lock);
        path = priv->path;              /* save reference to IUCV path */
        priv->path = NULL;
@@ -527,20 +620,18 @@ static void hvc_iucv_notifier_del(struct hvc_struct *hp, int id)
 /**
  * hvc_iucv_path_pending() - IUCV handler to process a connection request.
  * @path:      Pending path (struct iucv_path)
- * @ipvmid:    Originator z/VM system identifier
+ * @ipvmid:    z/VM system identifier of originator
  * @ipuser:    User specified data for this path
  *             (AF_IUCV: port/service name and originator port)
  *
- * The function uses the @ipuser data to check to determine if the pending
- * path belongs to a terminal managed by this HVC backend.
- * If the check is successful, then an additional check is done to ensure
- * that a terminal cannot be accessed multiple times (only one connection
- * to a terminal is allowed). In that particular case, the pending path is
- * severed. If it is the first connection, the pending path is accepted and
- * associated to the struct hvc_iucv_private. The iucv state is updated to
- * reflect that a communication path has been established.
+ * The function uses the @ipuser data to determine if the pending path belongs
+ * to a terminal managed by this device driver.
+ * If the path belongs to this driver, ensure that the terminal is not accessed
+ * multiple times (only one connection to a terminal is allowed).
+ * If the terminal is not yet connected, the pending path is accepted and is
+ * associated to the appropriate struct hvc_iucv_private instance.
  *
- * Returns 0 if the path belongs to a terminal managed by the this HVC backend;
+ * Returns 0 if @path belongs to a terminal managed by this device driver;
  * otherwise returns -ENODEV in order to dispatch this path to other handlers.
  *
  * Locking:    struct hvc_iucv_private->lock
@@ -559,7 +650,6 @@ static      int hvc_iucv_path_pending(struct iucv_path *path,
                        priv = hvc_iucv_table[i];
                        break;
                }
-
        if (!priv)
                return -ENODEV;
 
@@ -588,6 +678,9 @@ static      int hvc_iucv_path_pending(struct iucv_path *path,
        priv->path = path;
        priv->iucv_state = IUCV_CONNECTED;
 
+       /* flush buffered output data... */
+       schedule_delayed_work(&priv->sndbuf_work, 5);
+
 out_path_handled:
        spin_unlock(&priv->lock);
        return 0;
@@ -603,8 +696,7 @@ out_path_handled:
  * sets the iucv state to IUCV_SEVERED for the associated struct
  * hvc_iucv_private instance. Later, the IUCV_SEVERED state triggers a tty
  * hangup (hvc_iucv_get_chars() / hvc_iucv_write()).
- *
- * If tty portion of the HVC is closed then clean up the outqueue in addition.
+ * If the tty portion of the HVC is closed, clean up the outqueue.
  *
  * Locking:    struct hvc_iucv_private->lock
  */
@@ -615,15 +707,25 @@ static void hvc_iucv_path_severed(struct iucv_path *path, u8 ipuser[16])
        spin_lock(&priv->lock);
        priv->iucv_state = IUCV_SEVERED;
 
-       /* NOTE: If the tty has not yet been opened by a getty program
-        *       (e.g. to see console messages), then cleanup the
-        *       hvc_iucv_private structure to allow re-connects.
+       /* If the tty has not yet been opened, clean up the hvc_iucv_private
+        * structure to allow re-connects.
+        * This is also done for our console device because console hangups
+        * are handled specially and no notifier is called by HVC.
+        * The tty session is active (TTY_OPENED) and ready for re-connects...
         *
-        *       If the tty has been opened, the get_chars() callback returns
-        *       -EPIPE to signal the hvc console layer to hang up the tty. */
+        * If it has been opened, let get_chars() return -EPIPE to signal the
+        * HVC layer to hang up the tty.
+        * If so, we need to wake up the HVC thread to call get_chars()...
+        */
        priv->path = NULL;
        if (priv->tty_state == TTY_CLOSED)
                hvc_iucv_cleanup(priv);
+       else if (priv->is_console) {
+               hvc_iucv_cleanup(priv);
+               priv->tty_state = TTY_OPENED;
+       } else
+               hvc_kick();
        spin_unlock(&priv->lock);
 
        /* finally sever path (outside of priv->lock due to lock ordering) */
@@ -636,9 +738,9 @@ static void hvc_iucv_path_severed(struct iucv_path *path, u8 ipuser[16])
  * @path:      Pending path (struct iucv_path)
  * @msg:       Pointer to the IUCV message
  *
- * The function stores an incoming message on the input queue for later
+ * The function puts an incoming message on the input queue for later
  * processing (by hvc_iucv_get_chars() / hvc_iucv_write()).
- * However, if the tty has not yet been opened, the message is rejected.
+ * If the tty has not yet been opened, the message is rejected.
  *
  * Locking:    struct hvc_iucv_private->lock
  */
@@ -648,6 +750,12 @@ static void hvc_iucv_msg_pending(struct iucv_path *path,
        struct hvc_iucv_private *priv = path->private;
        struct iucv_tty_buffer *rb;
 
+       /* reject messages that exceed max size of iucv_tty_msg->datalen */
+       if (msg->length > MSG_SIZE(MSG_MAX_DATALEN)) {
+               iucv_message_reject(path, msg);
+               return;
+       }
+
        spin_lock(&priv->lock);
 
        /* reject messages if tty has not yet been opened */
@@ -656,7 +764,7 @@ static void hvc_iucv_msg_pending(struct iucv_path *path,
                goto unlock_return;
        }
 
-       /* allocate buffer an empty buffer element */
+       /* allocate a tty buffer element to store the iucv msg only */
        rb = alloc_tty_buffer(0, GFP_ATOMIC);
        if (!rb) {
                iucv_message_reject(path, msg);
@@ -666,7 +774,7 @@ static void hvc_iucv_msg_pending(struct iucv_path *path,
 
        list_add_tail(&rb->list, &priv->tty_inqueue);
 
-       hvc_kick();     /* wakup hvc console thread */
+       hvc_kick();     /* wake up hvc thread */
 
 unlock_return:
        spin_unlock(&priv->lock);
@@ -677,10 +785,10 @@ unlock_return:
  * @path:      Pending path (struct iucv_path)
  * @msg:       Pointer to the IUCV message
  *
- * The function is called upon completion of message delivery and the
- * message is removed from the outqueue. Additional delivery information
- * can be found in msg->audit: rejected messages (0x040000 (IPADRJCT)) and
- * purged messages (0x010000 (IPADPGNR)).
+ * The function is called upon completion of message delivery to remove the
+ * message from the outqueue. Additional delivery information can be found
+ * in msg->audit: rejected messages (0x040000 (IPADRJCT)), and
+ *            purged messages   (0x010000 (IPADPGNR)).
  *
  * Locking:    struct hvc_iucv_private->lock
  */
@@ -697,6 +805,7 @@ static void hvc_iucv_msg_complete(struct iucv_path *path,
                        list_move(&ent->list, &list_remove);
                        break;
                }
+       wake_up(&priv->sndbuf_waitq);
        spin_unlock(&priv->lock);
        destroy_tty_buffer_list(&list_remove);
 }
@@ -713,13 +822,14 @@ static struct hv_ops hvc_iucv_ops = {
 
 /**
  * hvc_iucv_alloc() - Allocates a new struct hvc_iucv_private instance
- * @id:        hvc_iucv_table index
+ * @id:                        hvc_iucv_table index
+ * @is_console:                Flag indicating whether the instance is used as the Linux console
  *
- * This function allocates a new hvc_iucv_private struct and put the
- * instance into hvc_iucv_table at index @id.
+ * This function allocates a new hvc_iucv_private structure and stores
+ * the instance in hvc_iucv_table at index @id.
  * Returns 0 on success; otherwise non-zero.
  */
-static int __init hvc_iucv_alloc(int id)
+static int __init hvc_iucv_alloc(int id, unsigned int is_console)
 {
        struct hvc_iucv_private *priv;
        char name[9];
@@ -732,18 +842,33 @@ static int __init hvc_iucv_alloc(int id)
        spin_lock_init(&priv->lock);
        INIT_LIST_HEAD(&priv->tty_outqueue);
        INIT_LIST_HEAD(&priv->tty_inqueue);
+       INIT_DELAYED_WORK(&priv->sndbuf_work, hvc_iucv_sndbuf_work);
+       init_waitqueue_head(&priv->sndbuf_waitq);
+
+       priv->sndbuf = (void *) get_zeroed_page(GFP_KERNEL);
+       if (!priv->sndbuf) {
+               kfree(priv);
+               return -ENOMEM;
+       }
+
+       /* set console flag */
+       priv->is_console = is_console;
 
-       /* Finally allocate hvc */
-       priv->hvc = hvc_alloc(HVC_IUCV_MAGIC + id,
-                             HVC_IUCV_MAGIC + id, &hvc_iucv_ops, PAGE_SIZE);
+       /* finally allocate hvc */
+       priv->hvc = hvc_alloc(HVC_IUCV_MAGIC + id,      /* was PAGE_SIZE */
+                             HVC_IUCV_MAGIC + id, &hvc_iucv_ops, 256);
        if (IS_ERR(priv->hvc)) {
                rc = PTR_ERR(priv->hvc);
+               free_page((unsigned long) priv->sndbuf);
                kfree(priv);
                return rc;
        }
 
+       /* notify HVC thread instead of using polling */
+       priv->hvc->irq_requested = 1;
+
        /* setup iucv related information */
-       snprintf(name, 9, "ihvc%-4d", id);
+       snprintf(name, 9, "lnxhvc%-2d", id);
        memcpy(priv->srv_name, name, 8);
        ASCEBC(priv->srv_name, 8);
 
@@ -752,15 +877,16 @@ static int __init hvc_iucv_alloc(int id)
 }
 
 /**
- * hvc_iucv_init() - Initialization of HVC backend for z/VM IUCV
+ * hvc_iucv_init() - z/VM IUCV HVC device driver initialization
  */
 static int __init hvc_iucv_init(void)
 {
-       int rc, i;
+       int rc;
+       unsigned int i;
 
        if (!MACHINE_IS_VM) {
-               pr_warning("The z/VM IUCV Hypervisor console cannot be "
-                          "used without z/VM.\n");
+               pr_info("The z/VM IUCV HVC device driver cannot "
+                       "be used without z/VM\n");
                return -ENODEV;
        }
 
@@ -774,26 +900,33 @@ static int __init hvc_iucv_init(void)
                                           sizeof(struct iucv_tty_buffer),
                                           0, 0, NULL);
        if (!hvc_iucv_buffer_cache) {
-               pr_err("Not enough memory for driver initialization "
-                       "(rs=%d).\n", 1);
+               pr_err("Allocating memory failed with reason code=%d\n", 1);
                return -ENOMEM;
        }
 
        hvc_iucv_mempool = mempool_create_slab_pool(MEMPOOL_MIN_NR,
                                                    hvc_iucv_buffer_cache);
        if (!hvc_iucv_mempool) {
-               pr_err("Not enough memory for driver initialization "
-                       "(rs=%d).\n", 2);
+               pr_err("Allocating memory failed with reason code=%d\n", 2);
                kmem_cache_destroy(hvc_iucv_buffer_cache);
                return -ENOMEM;
        }
 
+       /* register the first terminal device as console
+        * (must be done before allocating hvc terminal devices) */
+       rc = hvc_instantiate(HVC_IUCV_MAGIC, IUCV_HVC_CON_IDX, &hvc_iucv_ops);
+       if (rc) {
+               pr_err("Registering HVC terminal device as "
+                      "Linux console failed\n");
+               goto out_error_memory;
+       }
+
        /* allocate hvc_iucv_private structs */
        for (i = 0; i < hvc_iucv_devices; i++) {
-               rc = hvc_iucv_alloc(i);
+               rc = hvc_iucv_alloc(i, (i == IUCV_HVC_CON_IDX) ? 1 : 0);
                if (rc) {
-                       pr_err("Could not create new z/VM IUCV HVC backend "
-                               "rc=%d.\n", rc);
+                       pr_err("Creating a new HVC terminal device "
+                               "failed with error code=%d\n", rc);
                        goto out_error_hvc;
                }
        }
@@ -801,7 +934,8 @@ static int __init hvc_iucv_init(void)
        /* register IUCV callback handler */
        rc = iucv_register(&hvc_iucv_handler, 0);
        if (rc) {
-               pr_err("Could not register iucv handler (rc=%d).\n", rc);
+               pr_err("Registering IUCV handlers failed with error code=%d\n",
+                       rc);
                goto out_error_iucv;
        }
 
@@ -816,21 +950,12 @@ out_error_hvc:
                                hvc_remove(hvc_iucv_table[i]->hvc);
                        kfree(hvc_iucv_table[i]);
                }
+out_error_memory:
        mempool_destroy(hvc_iucv_mempool);
        kmem_cache_destroy(hvc_iucv_buffer_cache);
        return rc;
 }
 
-/**
- * hvc_iucv_console_init() - Early console initialization
- */
-static int __init hvc_iucv_console_init(void)
-{
-       if (!MACHINE_IS_VM || !hvc_iucv_devices)
-               return -ENODEV;
-       return hvc_instantiate(HVC_IUCV_MAGIC, 0, &hvc_iucv_ops);
-}
-
 /**
  * hvc_iucv_config() - Parsing of hvc_iucv=  kernel command line parameter
  * @val:       Parameter value (numeric)
@@ -841,10 +966,5 @@ static     int __init hvc_iucv_config(char *val)
 }
 
 
-module_init(hvc_iucv_init);
-console_initcall(hvc_iucv_console_init);
+device_initcall(hvc_iucv_init);
 __setup("hvc_iucv=", hvc_iucv_config);
-
-MODULE_LICENSE("GPL");
-MODULE_DESCRIPTION("HVC back-end for z/VM IUCV.");
-MODULE_AUTHOR("Hendrik Brueckner <brueckner@linux.vnet.ibm.com>");
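
With the console_initcall gone, early console registration now happens inside hvc_iucv_init() itself: hvc_instantiate() is called for terminal index 0 before any hvc_alloc(). Because device_initcall() runs later in boot than console_initcall() did, messages printed before this point reach the console via the printk buffer replay that occurs when a console registers. Condensed from the init hunks above:

	/* register terminal 0 as the Linux console, then allocate all lines */
	rc = hvc_instantiate(HVC_IUCV_MAGIC, IUCV_HVC_CON_IDX, &hvc_iucv_ops);
	if (rc)
		goto out_error_memory;

	for (i = 0; i < hvc_iucv_devices; i++) {
		rc = hvc_iucv_alloc(i, (i == IUCV_HVC_CON_IDX) ? 1 : 0);
		if (rc)
			goto out_error_hvc;
	}
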
diff --git a/drivers/leds/Kconfig b/drivers/leds/Kconfig
index e7fb7d2fcbfc4792abc91993337620a562a3323c..a4a1ae2146300b0f87ba8dc6526d46d5db16606a 100644
@@ -63,6 +63,12 @@ config LEDS_WRAP
        help
          This option enables support for the PCEngines WRAP programmable LEDs.
 
+config LEDS_ALIX2
+       tristate "LED Support for ALIX.2 and ALIX.3 series"
+       depends on LEDS_CLASS && X86 && EXPERIMENTAL
+       help
+         This option enables support for the PCEngines ALIX.2 and ALIX.3 LEDs.
+
 config LEDS_H1940
        tristate "LED Support for iPAQ H1940 device"
        depends on LEDS_CLASS && ARCH_H1940
@@ -77,7 +83,7 @@ config LEDS_COBALT_QUBE
 
 config LEDS_COBALT_RAQ
        bool "LED Support for the Cobalt Raq series"
-       depends on LEDS_CLASS && MIPS_COBALT
+       depends on LEDS_CLASS=y && MIPS_COBALT
        select LEDS_TRIGGERS
        help
          This option enables support for the Cobalt Raq series LEDs.
@@ -158,6 +164,13 @@ config LEDS_PCA955X
          LED driver chips accessed via the I2C bus.  Supported
          devices include PCA9550, PCA9551, PCA9552, and PCA9553.
 
+config LEDS_WM8350
+       tristate "LED Support for WM8350 AudioPlus PMIC"
+       depends on LEDS_CLASS && MFD_WM8350
+       help
+         This option enables support for LEDs driven by the Wolfson
+         Microelectronics WM8350 AudioPlus PMIC.
+
 config LEDS_DA903X
        tristate "LED Support for DA9030/DA9034 PMIC"
        depends on LEDS_CLASS && PMIC_DA903X
diff --git a/drivers/leds/Makefile b/drivers/leds/Makefile
index e1967a29850e8b1c4ef446af8030ddfe1a547658..bc247cb02e8269857d71b4fca63de1d99801e22b 100644
@@ -11,6 +11,7 @@ obj-$(CONFIG_LEDS_S3C24XX)            += leds-s3c24xx.o
 obj-$(CONFIG_LEDS_AMS_DELTA)           += leds-ams-delta.o
 obj-$(CONFIG_LEDS_NET48XX)             += leds-net48xx.o
 obj-$(CONFIG_LEDS_WRAP)                        += leds-wrap.o
+obj-$(CONFIG_LEDS_ALIX2)               += leds-alix2.o
 obj-$(CONFIG_LEDS_H1940)               += leds-h1940.o
 obj-$(CONFIG_LEDS_COBALT_QUBE)         += leds-cobalt-qube.o
 obj-$(CONFIG_LEDS_COBALT_RAQ)          += leds-cobalt-raq.o
@@ -23,6 +24,7 @@ obj-$(CONFIG_LEDS_FSG)                        += leds-fsg.o
 obj-$(CONFIG_LEDS_PCA955X)             += leds-pca955x.o
 obj-$(CONFIG_LEDS_DA903X)              += leds-da903x.o
 obj-$(CONFIG_LEDS_HP_DISK)             += leds-hp-disk.o
+obj-$(CONFIG_LEDS_WM8350)              += leds-wm8350.o
 
 # LED Triggers
 obj-$(CONFIG_LEDS_TRIGGER_TIMER)       += ledtrig-timer.o
diff --git a/drivers/leds/led-class.c b/drivers/leds/led-class.c
index 6c4a326176d7be57d724d994a7ff5e971f50dc93..52f82e3ea13aae5b04a603938fba9dbbd0654712 100644
@@ -91,9 +91,29 @@ void led_classdev_resume(struct led_classdev *led_cdev)
 }
 EXPORT_SYMBOL_GPL(led_classdev_resume);
 
+static int led_suspend(struct device *dev, pm_message_t state)
+{
+       struct led_classdev *led_cdev = dev_get_drvdata(dev);
+
+       if (led_cdev->flags & LED_CORE_SUSPENDRESUME)
+               led_classdev_suspend(led_cdev);
+
+       return 0;
+}
+
+static int led_resume(struct device *dev)
+{
+       struct led_classdev *led_cdev = dev_get_drvdata(dev);
+
+       if (led_cdev->flags & LED_CORE_SUSPENDRESUME)
+               led_classdev_resume(led_cdev);
+
+       return 0;
+}
+
 /**
  * led_classdev_register - register a new object of led_classdev class.
- * @dev: The device to register.
+ * @parent: The parent device of the new LED class device.
  * @led_cdev: the led_classdev structure for this device.
  */
 int led_classdev_register(struct device *parent, struct led_classdev *led_cdev)
@@ -174,6 +194,8 @@ static int __init leds_init(void)
        leds_class = class_create(THIS_MODULE, "leds");
        if (IS_ERR(leds_class))
                return PTR_ERR(leds_class);
+       leds_class->suspend = led_suspend;
+       leds_class->resume = led_resume;
        return 0;
 }
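
These two class-level callbacks replace the per-driver suspend/resume boilerplate deleted in the hunks below: a driver now simply sets LED_CORE_SUSPENDRESUME in its led_classdev flags and the class core suspends and resumes it. A minimal opt-in with a hypothetical example driver (the flag and the registration call are the real API):

	#include <linux/leds.h>

	static void example_led_set(struct led_classdev *cdev,
				    enum led_brightness value)
	{
		/* drive the LED hardware here */
	}

	static struct led_classdev example_led = {
		.name		= "example::status",
		.brightness_set	= example_led_set,
		.flags		= LED_CORE_SUSPENDRESUME,	/* core handles PM */
	};

	/* in the platform driver's probe():
	 *	led_classdev_register(&pdev->dev, &example_led);
	 */
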
 
diff --git a/drivers/leds/leds-alix2.c b/drivers/leds/leds-alix2.c
new file mode 100644
index 0000000..ddbd773
--- /dev/null
+++ b/drivers/leds/leds-alix2.c
@@ -0,0 +1,181 @@
+/*
+ * LEDs driver for PCEngines ALIX.2 and ALIX.3
+ *
+ * Copyright (C) 2008 Constantin Baranov <const@mimas.ru>
+ */
+
+#include <linux/err.h>
+#include <linux/io.h>
+#include <linux/kernel.h>
+#include <linux/leds.h>
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/string.h>
+
+static int force = 0;
+module_param(force, bool, 0444);
+MODULE_PARM_DESC(force, "Assume system has ALIX.2 style LEDs");
+
+struct alix_led {
+       struct led_classdev cdev;
+       unsigned short port;
+       unsigned int on_value;
+       unsigned int off_value;
+};
+
+static void alix_led_set(struct led_classdev *led_cdev,
+                        enum led_brightness brightness)
+{
+       struct alix_led *led_dev =
+               container_of(led_cdev, struct alix_led, cdev);
+
+       if (brightness)
+               outl(led_dev->on_value, led_dev->port);
+       else
+               outl(led_dev->off_value, led_dev->port);
+}
+
+static struct alix_led alix_leds[] = {
+       {
+               .cdev = {
+                       .name = "alix:1",
+                       .brightness_set = alix_led_set,
+               },
+               .port = 0x6100,
+               .on_value = 1 << 22,
+               .off_value = 1 << 6,
+       },
+       {
+               .cdev = {
+                       .name = "alix:2",
+                       .brightness_set = alix_led_set,
+               },
+               .port = 0x6180,
+               .on_value = 1 << 25,
+               .off_value = 1 << 9,
+       },
+       {
+               .cdev = {
+                       .name = "alix:3",
+                       .brightness_set = alix_led_set,
+               },
+               .port = 0x6180,
+               .on_value = 1 << 27,
+               .off_value = 1 << 11,
+       },
+};
+
+static int __init alix_led_probe(struct platform_device *pdev)
+{
+       int i;
+       int ret;
+
+       for (i = 0; i < ARRAY_SIZE(alix_leds); i++) {
+               alix_leds[i].cdev.flags |= LED_CORE_SUSPENDRESUME;
+               ret = led_classdev_register(&pdev->dev, &alix_leds[i].cdev);
+               if (ret < 0)
+                       goto fail;
+       }
+       return 0;
+
+fail:
+       while (--i >= 0)
+               led_classdev_unregister(&alix_leds[i].cdev);
+       return ret;
+}
+
+static int alix_led_remove(struct platform_device *pdev)
+{
+       int i;
+
+       for (i = 0; i < ARRAY_SIZE(alix_leds); i++)
+               led_classdev_unregister(&alix_leds[i].cdev);
+       return 0;
+}
+
+static struct platform_driver alix_led_driver = {
+       .remove = alix_led_remove,
+       .driver = {
+               .name = KBUILD_MODNAME,
+               .owner = THIS_MODULE,
+       },
+};
+
+static int __init alix_present(void)
+{
+       const unsigned long bios_phys = 0x000f0000;
+       const size_t bios_len = 0x00010000;
+       const char alix_sig[] = "PC Engines ALIX.";
+       const size_t alix_sig_len = sizeof(alix_sig) - 1;
+
+       const char *bios_virt;
+       const char *scan_end;
+       const char *p;
+       int ret = 0;
+
+       if (force) {
+               printk(KERN_NOTICE "%s: forced to skip BIOS test, "
+                      "assume system has ALIX.2 style LEDs\n",
+                      KBUILD_MODNAME);
+               ret = 1;
+               goto out;
+       }
+
+       bios_virt = phys_to_virt(bios_phys);
+       scan_end = bios_virt + bios_len - (alix_sig_len + 2);
+       for (p = bios_virt; p < scan_end; p++) {
+               const char *tail;
+
+               if (memcmp(p, alix_sig, alix_sig_len) != 0)
+                       continue;
+
+               tail = p + alix_sig_len;
+               if ((tail[0] == '2' || tail[0] == '3') && tail[1] == '\0') {
+                       printk(KERN_INFO
+                              "%s: system is recognized as \"%s\"\n",
+                              KBUILD_MODNAME, p);
+                       ret = 1;
+                       break;
+               }
+       }
+
+out:
+       return ret;
+}
+
+static struct platform_device *pdev;
+
+static int __init alix_led_init(void)
+{
+       int ret;
+
+       if (!alix_present()) {
+               ret = -ENODEV;
+               goto out;
+       }
+
+       pdev = platform_device_register_simple(KBUILD_MODNAME, -1, NULL, 0);
+       if (!IS_ERR(pdev)) {
+               ret = platform_driver_probe(&alix_led_driver, alix_led_probe);
+               if (ret)
+                       platform_device_unregister(pdev);
+       } else
+               ret = PTR_ERR(pdev);
+
+out:
+       return ret;
+}
+
+static void __exit alix_led_exit(void)
+{
+       platform_device_unregister(pdev);
+       platform_driver_unregister(&alix_led_driver);
+}
+
+module_init(alix_led_init);
+module_exit(alix_led_exit);
+
+MODULE_AUTHOR("Constantin Baranov <const@mimas.ru>");
+MODULE_DESCRIPTION("PCEngines ALIX.2 and ALIX.3 LED driver");
+MODULE_LICENSE("GPL");
diff --git a/drivers/leds/leds-ams-delta.c b/drivers/leds/leds-ams-delta.c
index 1bd590bb3a6e8edce5aae8b672da8fe2a89bdb8a..446050759b4dfb93e18a7a3dcd5905cf2886785e 100644
@@ -79,37 +79,12 @@ static struct ams_delta_led ams_delta_leds[] = {
        },
 };
 
-#ifdef CONFIG_PM
-static int ams_delta_led_suspend(struct platform_device *dev,
-               pm_message_t state)
-{
-       int i;
-
-       for (i = 0; i < ARRAY_SIZE(ams_delta_leds); i++)
-               led_classdev_suspend(&ams_delta_leds[i].cdev);
-
-       return 0;
-}
-
-static int ams_delta_led_resume(struct platform_device *dev)
-{
-       int i;
-
-       for (i = 0; i < ARRAY_SIZE(ams_delta_leds); i++)
-               led_classdev_resume(&ams_delta_leds[i].cdev);
-
-       return 0;
-}
-#else
-#define ams_delta_led_suspend NULL
-#define ams_delta_led_resume NULL
-#endif
-
 static int ams_delta_led_probe(struct platform_device *pdev)
 {
        int i, ret;
 
        for (i = 0; i < ARRAY_SIZE(ams_delta_leds); i++) {
+               ams_delta_leds[i].cdev.flags |= LED_CORE_SUSPENDRESUME;
                ret = led_classdev_register(&pdev->dev,
                                &ams_delta_leds[i].cdev);
                if (ret < 0)
@@ -127,7 +102,7 @@ static int ams_delta_led_remove(struct platform_device *pdev)
 {
        int i;
 
-       for (i = 0; i < ARRAY_SIZE(ams_delta_leds); i--)
+       for (i = 0; i < ARRAY_SIZE(ams_delta_leds); i++)
                led_classdev_unregister(&ams_delta_leds[i].cdev);
 
        return 0;
@@ -136,8 +111,6 @@ static int ams_delta_led_remove(struct platform_device *pdev)
 static struct platform_driver ams_delta_led_driver = {
        .probe          = ams_delta_led_probe,
        .remove         = ams_delta_led_remove,
-       .suspend        = ams_delta_led_suspend,
-       .resume         = ams_delta_led_resume,
        .driver         = {
                .name = "ams-delta-led",
                .owner = THIS_MODULE,
@@ -151,7 +124,7 @@ static int __init ams_delta_led_init(void)
 
 static void __exit ams_delta_led_exit(void)
 {
-       return platform_driver_unregister(&ams_delta_led_driver);
+       platform_driver_unregister(&ams_delta_led_driver);
 }
 
 module_init(ams_delta_led_init);
diff --git a/drivers/leds/leds-clevo-mail.c b/drivers/leds/leds-clevo-mail.c
index eb3415e88f43e217fb8faa87a00b9c082773ce0f..1813c84ea5fccb10293e499a3721dacf94c54290 100644
@@ -142,6 +142,7 @@ static struct led_classdev clevo_mail_led = {
        .name                   = "clevo::mail",
        .brightness_set         = clevo_mail_led_set,
        .blink_set              = clevo_mail_led_blink,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static int __init clevo_mail_led_probe(struct platform_device *pdev)
@@ -155,29 +156,9 @@ static int clevo_mail_led_remove(struct platform_device *pdev)
        return 0;
 }
 
-#ifdef CONFIG_PM
-static int clevo_mail_led_suspend(struct platform_device *dev,
-                                 pm_message_t state)
-{
-       led_classdev_suspend(&clevo_mail_led);
-       return 0;
-}
-
-static int clevo_mail_led_resume(struct platform_device *dev)
-{
-       led_classdev_resume(&clevo_mail_led);
-       return 0;
-}
-#else
-#define clevo_mail_led_suspend    NULL
-#define clevo_mail_led_resume     NULL
-#endif
-
 static struct platform_driver clevo_mail_led_driver = {
        .probe          = clevo_mail_led_probe,
        .remove         = clevo_mail_led_remove,
-       .suspend        = clevo_mail_led_suspend,
-       .resume         = clevo_mail_led_resume,
        .driver         = {
                .name           = KBUILD_MODNAME,
                .owner          = THIS_MODULE,
diff --git a/drivers/leds/leds-fsg.c b/drivers/leds/leds-fsg.c
index 34935155c1c00077b6e94201cb4c5e6a534f1fad..5f7c9c5c09b1306c35211ed024aed989c1592b61 100644
@@ -99,64 +99,43 @@ static void fsg_led_ring_set(struct led_classdev *led_cdev,
 }
 
 
-
 static struct led_classdev fsg_wlan_led = {
        .name                   = "fsg:blue:wlan",
        .brightness_set         = fsg_led_wlan_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev fsg_wan_led = {
        .name                   = "fsg:blue:wan",
        .brightness_set         = fsg_led_wan_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev fsg_sata_led = {
        .name                   = "fsg:blue:sata",
        .brightness_set         = fsg_led_sata_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev fsg_usb_led = {
        .name                   = "fsg:blue:usb",
        .brightness_set         = fsg_led_usb_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev fsg_sync_led = {
        .name                   = "fsg:blue:sync",
        .brightness_set         = fsg_led_sync_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev fsg_ring_led = {
        .name                   = "fsg:blue:ring",
        .brightness_set         = fsg_led_ring_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 
-
-#ifdef CONFIG_PM
-static int fsg_led_suspend(struct platform_device *dev, pm_message_t state)
-{
-       led_classdev_suspend(&fsg_wlan_led);
-       led_classdev_suspend(&fsg_wan_led);
-       led_classdev_suspend(&fsg_sata_led);
-       led_classdev_suspend(&fsg_usb_led);
-       led_classdev_suspend(&fsg_sync_led);
-       led_classdev_suspend(&fsg_ring_led);
-       return 0;
-}
-
-static int fsg_led_resume(struct platform_device *dev)
-{
-       led_classdev_resume(&fsg_wlan_led);
-       led_classdev_resume(&fsg_wan_led);
-       led_classdev_resume(&fsg_sata_led);
-       led_classdev_resume(&fsg_usb_led);
-       led_classdev_resume(&fsg_sync_led);
-       led_classdev_resume(&fsg_ring_led);
-       return 0;
-}
-#endif
-
-
 static int fsg_led_probe(struct platform_device *pdev)
 {
        int ret;
@@ -232,10 +211,6 @@ static int fsg_led_remove(struct platform_device *pdev)
 static struct platform_driver fsg_led_driver = {
        .probe          = fsg_led_probe,
        .remove         = fsg_led_remove,
-#ifdef CONFIG_PM
-       .suspend        = fsg_led_suspend,
-       .resume         = fsg_led_resume,
-#endif
        .driver         = {
                .name           = "fsg-led",
        },
diff --git a/drivers/leds/leds-gpio.c b/drivers/leds/leds-gpio.c
index b13bd2950e956347b98bac9ed76e18efcabed612..2e3df08b649b6c2a3a2dd83b62233bc3505a1459 100644
@@ -105,6 +105,7 @@ static int gpio_led_probe(struct platform_device *pdev)
                }
                led_dat->cdev.brightness_set = gpio_led_set;
                led_dat->cdev.brightness = LED_OFF;
+               led_dat->cdev.flags |= LED_CORE_SUSPENDRESUME;
 
                gpio_direction_output(led_dat->gpio, led_dat->active_low);
 
@@ -154,44 +155,9 @@ static int __devexit gpio_led_remove(struct platform_device *pdev)
        return 0;
 }
 
-#ifdef CONFIG_PM
-static int gpio_led_suspend(struct platform_device *pdev, pm_message_t state)
-{
-       struct gpio_led_platform_data *pdata = pdev->dev.platform_data;
-       struct gpio_led_data *leds_data;
-       int i;
-
-       leds_data = platform_get_drvdata(pdev);
-
-       for (i = 0; i < pdata->num_leds; i++)
-               led_classdev_suspend(&leds_data[i].cdev);
-
-       return 0;
-}
-
-static int gpio_led_resume(struct platform_device *pdev)
-{
-       struct gpio_led_platform_data *pdata = pdev->dev.platform_data;
-       struct gpio_led_data *leds_data;
-       int i;
-
-       leds_data = platform_get_drvdata(pdev);
-
-       for (i = 0; i < pdata->num_leds; i++)
-               led_classdev_resume(&leds_data[i].cdev);
-
-       return 0;
-}
-#else
-#define gpio_led_suspend NULL
-#define gpio_led_resume NULL
-#endif
-
 static struct platform_driver gpio_led_driver = {
        .probe          = gpio_led_probe,
        .remove         = __devexit_p(gpio_led_remove),
-       .suspend        = gpio_led_suspend,
-       .resume         = gpio_led_resume,
        .driver         = {
                .name   = "leds-gpio",
                .owner  = THIS_MODULE,
diff --git a/drivers/leds/leds-hp-disk.c b/drivers/leds/leds-hp-disk.c
index 44fa757d82547ff82b705a29ee5d54bf8a3b179a..d786adc8c5e3616ecb8131f2b1d58d68d19a30c8 100644
@@ -68,25 +68,9 @@ static struct led_classdev hpled_led = {
        .name                   = "hp:red:hddprotection",
        .default_trigger        = "heartbeat",
        .brightness_set         = hpled_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
-#ifdef CONFIG_PM
-static int hpled_suspend(struct acpi_device *dev, pm_message_t state)
-{
-       led_classdev_suspend(&hpled_led);
-       return 0;
-}
-
-static int hpled_resume(struct acpi_device *dev)
-{
-       led_classdev_resume(&hpled_led);
-       return 0;
-}
-#else
-#define hpled_suspend NULL
-#define hpled_resume NULL
-#endif
-
 static int hpled_add(struct acpi_device *device)
 {
        int ret;
@@ -121,8 +105,6 @@ static struct acpi_driver leds_hp_driver = {
        .ops = {
                .add     = hpled_add,
                .remove  = hpled_remove,
-               .suspend = hpled_suspend,
-               .resume  = hpled_resume,
        }
 };
 
diff --git a/drivers/leds/leds-hp6xx.c b/drivers/leds/leds-hp6xx.c
index e8fb1baf8a5072e3d3454b3366e32d1c2c015140..e4ce1fd46338122b6e93b39f102a81be4ea621cb 100644
@@ -45,30 +45,16 @@ static struct led_classdev hp6xx_red_led = {
        .name                   = "hp6xx:red",
        .default_trigger        = "hp6xx-charge",
        .brightness_set         = hp6xxled_red_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev hp6xx_green_led = {
        .name                   = "hp6xx:green",
        .default_trigger        = "ide-disk",
        .brightness_set         = hp6xxled_green_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
-#ifdef CONFIG_PM
-static int hp6xxled_suspend(struct platform_device *dev, pm_message_t state)
-{
-       led_classdev_suspend(&hp6xx_red_led);
-       led_classdev_suspend(&hp6xx_green_led);
-       return 0;
-}
-
-static int hp6xxled_resume(struct platform_device *dev)
-{
-       led_classdev_resume(&hp6xx_red_led);
-       led_classdev_resume(&hp6xx_green_led);
-       return 0;
-}
-#endif
-
 static int hp6xxled_probe(struct platform_device *pdev)
 {
        int ret;
@@ -98,10 +84,6 @@ MODULE_ALIAS("platform:hp6xx-led");
 static struct platform_driver hp6xxled_driver = {
        .probe          = hp6xxled_probe,
        .remove         = hp6xxled_remove,
-#ifdef CONFIG_PM
-       .suspend        = hp6xxled_suspend,
-       .resume         = hp6xxled_resume,
-#endif
        .driver         = {
                .name           = "hp6xx-led",
                .owner          = THIS_MODULE,
diff --git a/drivers/leds/leds-net48xx.c b/drivers/leds/leds-net48xx.c
index 054360473c9490ea6e4251188dcbf4cad92fa310..93987a12da494594881dde8c375d94a246cd2254 100644
@@ -33,26 +33,9 @@ static void net48xx_error_led_set(struct led_classdev *led_cdev,
 static struct led_classdev net48xx_error_led = {
        .name           = "net48xx::error",
        .brightness_set = net48xx_error_led_set,
+       .flags          = LED_CORE_SUSPENDRESUME,
 };
 
-#ifdef CONFIG_PM
-static int net48xx_led_suspend(struct platform_device *dev,
-               pm_message_t state)
-{
-       led_classdev_suspend(&net48xx_error_led);
-       return 0;
-}
-
-static int net48xx_led_resume(struct platform_device *dev)
-{
-       led_classdev_resume(&net48xx_error_led);
-       return 0;
-}
-#else
-#define net48xx_led_suspend NULL
-#define net48xx_led_resume NULL
-#endif
-
 static int net48xx_led_probe(struct platform_device *pdev)
 {
        return led_classdev_register(&pdev->dev, &net48xx_error_led);
@@ -67,8 +50,6 @@ static int net48xx_led_remove(struct platform_device *pdev)
 static struct platform_driver net48xx_led_driver = {
        .probe          = net48xx_led_probe,
        .remove         = net48xx_led_remove,
-       .suspend        = net48xx_led_suspend,
-       .resume         = net48xx_led_resume,
        .driver         = {
                .name           = DRVNAME,
                .owner          = THIS_MODULE,
diff --git a/drivers/leds/leds-pca9532.c b/drivers/leds/leds-pca9532.c
index 4064d4f6b33b41a99d093d3e05c65f281ed60588..76ec7498e2d5eee3e8d762ec71f2c8c7e54bafd7 100644
@@ -16,6 +16,7 @@
 #include <linux/leds.h>
 #include <linux/input.h>
 #include <linux/mutex.h>
+#include <linux/workqueue.h>
 #include <linux/leds-pca9532.h>
 
 static const unsigned short normal_i2c[] = { /*0x60,*/ I2C_CLIENT_END};
@@ -34,6 +35,7 @@ struct pca9532_data {
        struct pca9532_led leds[16];
        struct mutex update_lock;
        struct input_dev    *idev;
+       struct work_struct work;
        u8 pwm[2];
        u8 psc[2];
 };
@@ -63,7 +65,7 @@ static struct i2c_driver pca9532_driver = {
  * as a compromise we average one pwm to the values requested by all
  * leds that are not ON/OFF.
  * */
-static int pca9532_setpwm(struct i2c_client *client, int pwm, int blink,
+static int pca9532_calcpwm(struct i2c_client *client, int pwm, int blink,
        enum led_brightness value)
 {
        int a = 0, b = 0, i = 0;
@@ -84,11 +86,17 @@ static int pca9532_setpwm(struct i2c_client *client, int pwm, int blink,
        b = b/a;
        if (b > 0xFF)
                return -EINVAL;
-       mutex_lock(&data->update_lock);
        data->pwm[pwm] = b;
+       data->psc[pwm] = blink;
+       return 0;
+}
+
+static int pca9532_setpwm(struct i2c_client *client, int pwm)
+{
+       struct pca9532_data *data = i2c_get_clientdata(client);
+       mutex_lock(&data->update_lock);
        i2c_smbus_write_byte_data(client, PCA9532_REG_PWM(pwm),
                data->pwm[pwm]);
-       data->psc[pwm] = blink;
        i2c_smbus_write_byte_data(client, PCA9532_REG_PSC(pwm),
                data->psc[pwm]);
        mutex_unlock(&data->update_lock);
@@ -124,11 +132,11 @@ static void pca9532_set_brightness(struct led_classdev *led_cdev,
                led->state = PCA9532_ON;
        else {
                led->state = PCA9532_PWM0; /* Thecus: hardcode one pwm */
-               err = pca9532_setpwm(led->client, 0, 0, value);
+               err = pca9532_calcpwm(led->client, 0, 0, value);
                if (err)
                        return; /* XXX: led api doesn't allow error code? */
        }
-       pca9532_setled(led);
+       schedule_work(&led->work);
 }
 
 static int pca9532_set_blink(struct led_classdev *led_cdev,
@@ -137,6 +145,7 @@ static int pca9532_set_blink(struct led_classdev *led_cdev,
        struct pca9532_led *led = ldev_to_led(led_cdev);
        struct i2c_client *client = led->client;
        int psc;
+       int err = 0;
 
        if (*delay_on == 0 && *delay_off == 0) {
        /* led subsystem ask us for a blink rate */
@@ -148,11 +157,15 @@ static int pca9532_set_blink(struct led_classdev *led_cdev,
 
        /* Thecus specific: only use PSC/PWM 0 */
        psc = (*delay_on * 152-1)/1000;
-       return pca9532_setpwm(client, 0, psc, led_cdev->brightness);
+       err = pca9532_calcpwm(client, 0, psc, led_cdev->brightness);
+       if (err)
+               return err;
+       schedule_work(&led->work);
+       return 0;
 }
 
-int pca9532_event(struct input_dev *dev, unsigned int type, unsigned int code,
-       int value)
+static int pca9532_event(struct input_dev *dev, unsigned int type,
+       unsigned int code, int value)
 {
        struct pca9532_data *data = input_get_drvdata(dev);
 
@@ -165,13 +178,28 @@ int pca9532_event(struct input_dev *dev, unsigned int type, unsigned int code,
        else
                data->pwm[1] = 0;
 
-       dev_info(&dev->dev, "setting beep to %d \n", data->pwm[1]);
+       schedule_work(&data->work);
+
+       return 0;
+}
+
+static void pca9532_input_work(struct work_struct *work)
+{
+       struct pca9532_data *data;
+       data = container_of(work, struct pca9532_data, work);
        mutex_lock(&data->update_lock);
        i2c_smbus_write_byte_data(data->client, PCA9532_REG_PWM(1),
                data->pwm[1]);
        mutex_unlock(&data->update_lock);
+}
 
-       return 0;
+static void pca9532_led_work(struct work_struct *work)
+{
+       struct pca9532_led *led;
+       led = container_of(work, struct pca9532_led, work);
+       if (led->state == PCA9532_PWM0)
+               pca9532_setpwm(led->client, 0);
+       pca9532_setled(led);
 }
 
 static int pca9532_configure(struct i2c_client *client,
@@ -204,8 +232,9 @@ static int pca9532_configure(struct i2c_client *client,
                        led->ldev.brightness = LED_OFF;
                        led->ldev.brightness_set = pca9532_set_brightness;
                        led->ldev.blink_set = pca9532_set_blink;
-                       if (led_classdev_register(&client->dev,
-                               &led->ldev) < 0)        {
+                       INIT_WORK(&led->work, pca9532_led_work);
+                       err = led_classdev_register(&client->dev, &led->ldev);
+                       if (err < 0) {
                                dev_err(&client->dev,
                                        "couldn't register LED %s\n",
                                        led->name);
@@ -233,9 +262,11 @@ static int pca9532_configure(struct i2c_client *client,
                                                BIT_MASK(SND_TONE);
                        data->idev->event = pca9532_event;
                        input_set_drvdata(data->idev, data);
+                       INIT_WORK(&data->work, pca9532_input_work);
                        err = input_register_device(data->idev);
                        if (err) {
                                input_free_device(data->idev);
+                               cancel_work_sync(&data->work);
                                data->idev = NULL;
                                goto exit;
                        }
@@ -252,18 +283,19 @@ exit:
                                break;
                        case PCA9532_TYPE_LED:
                                led_classdev_unregister(&data->leds[i].ldev);
+                               cancel_work_sync(&data->leds[i].work);
                                break;
                        case PCA9532_TYPE_N2100_BEEP:
                                if (data->idev != NULL) {
                                        input_unregister_device(data->idev);
                                        input_free_device(data->idev);
+                                       cancel_work_sync(&data->work);
                                        data->idev = NULL;
                                }
                                break;
                        }
 
        return err;
-
 }
 
 static int pca9532_probe(struct i2c_client *client,
@@ -271,12 +303,16 @@ static int pca9532_probe(struct i2c_client *client,
 {
        struct pca9532_data *data = i2c_get_clientdata(client);
        struct pca9532_platform_data *pca9532_pdata = client->dev.platform_data;
+       int err;
+
+       if (!pca9532_pdata)
+               return -EIO;
 
        if (!i2c_check_functionality(client->adapter,
                I2C_FUNC_SMBUS_BYTE_DATA))
                return -EIO;
 
-       data = kzalloc(sizeof(struct pca9532_data), GFP_KERNEL);
+       data = kzalloc(sizeof(*data), GFP_KERNEL);
        if (!data)
                return -ENOMEM;
 
@@ -285,12 +321,13 @@ static int pca9532_probe(struct i2c_client *client,
        data->client = client;
        mutex_init(&data->update_lock);
 
-       if (pca9532_pdata == NULL)
-               return -EIO;
-
-       pca9532_configure(client, data, pca9532_pdata);
-       return 0;
+       err = pca9532_configure(client, data, pca9532_pdata);
+       if (err) {
+               kfree(data);
+               i2c_set_clientdata(client, NULL);
+       }
 
+       return err;
 }
 
 static int pca9532_remove(struct i2c_client *client)
@@ -303,11 +340,13 @@ static int pca9532_remove(struct i2c_client *client)
                        break;
                case PCA9532_TYPE_LED:
                        led_classdev_unregister(&data->leds[i].ldev);
+                       cancel_work_sync(&data->leds[i].work);
                        break;
                case PCA9532_TYPE_N2100_BEEP:
                        if (data->idev != NULL) {
                                input_unregister_device(data->idev);
                                input_free_device(data->idev);
+                               cancel_work_sync(&data->work);
                                data->idev = NULL;
                        }
                        break;
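
The thread running a brightness_set() callback or an input event handler may not be allowed to sleep, while SMBus transfers can; hence the recurring pattern in this pca9532 patch of capturing state and deferring the actual I2C write to a work item. A minimal sketch of that pattern (the my_* names are illustrative, not part of the driver):

/* Sketch only: capture state, defer the sleeping SMBus write. */
#include <linux/i2c.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct my_dev {
	struct i2c_client *client;
	struct mutex lock;
	struct work_struct work;	/* INIT_WORK(&work, my_work_fn) in probe */
	u8 reg, val;			/* state captured for the worker */
};

/* May run in atomic context: record what to write, schedule the worker. */
static void my_set(struct my_dev *d, u8 reg, u8 val)
{
	d->reg = reg;
	d->val = val;
	schedule_work(&d->work);
}

/* Runs in process context, where SMBus transfers may sleep. */
static void my_work_fn(struct work_struct *work)
{
	struct my_dev *d = container_of(work, struct my_dev, work);

	mutex_lock(&d->lock);
	i2c_smbus_write_byte_data(d->client, d->reg, d->val);
	mutex_unlock(&d->lock);
}

This is also why the error and remove paths above gain cancel_work_sync(): once a work item may be pending, it must be cancelled before the memory backing it is freed.
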
index 25a07f2643ade687245f5f3e929207325e094da0..4d81131542ae68e71e203dae1f9e63194547ffef 100644
@@ -82,6 +82,7 @@ static int s3c24xx_led_probe(struct platform_device *dev)
        led->cdev.brightness_set = s3c24xx_led_set;
        led->cdev.default_trigger = pdata->def_trigger;
        led->cdev.name = pdata->name;
+       led->cdev.flags |= LED_CORE_SUSPENDRESUME;
 
        led->pdata = pdata;
 
@@ -111,33 +112,9 @@ static int s3c24xx_led_probe(struct platform_device *dev)
        return ret;
 }
 
-
-#ifdef CONFIG_PM
-static int s3c24xx_led_suspend(struct platform_device *dev, pm_message_t state)
-{
-       struct s3c24xx_gpio_led *led = pdev_to_gpio(dev);
-
-       led_classdev_suspend(&led->cdev);
-       return 0;
-}
-
-static int s3c24xx_led_resume(struct platform_device *dev)
-{
-       struct s3c24xx_gpio_led *led = pdev_to_gpio(dev);
-
-       led_classdev_resume(&led->cdev);
-       return 0;
-}
-#else
-#define s3c24xx_led_suspend NULL
-#define s3c24xx_led_resume NULL
-#endif
-
 static struct platform_driver s3c24xx_led_driver = {
        .probe          = s3c24xx_led_probe,
        .remove         = s3c24xx_led_remove,
-       .suspend        = s3c24xx_led_suspend,
-       .resume         = s3c24xx_led_resume,
        .driver         = {
                .name           = "s3c24xx_led",
                .owner          = THIS_MODULE,
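
Here and in leds-wrap.c further down, the open-coded suspend/resume hooks are replaced by the LED core's LED_CORE_SUSPENDRESUME flag; with the flag set, the class core handles suspend/resume on the driver's behalf, turning the LED off on suspend and restoring the last recorded brightness on resume. A converted driver reduces to roughly this (illustrative sketch, not from this patch):

/* Illustrative conversion: no driver suspend/resume hooks needed. */
#include <linux/leds.h>

static void my_led_set(struct led_classdev *cdev, enum led_brightness value)
{
	/* program the hardware */
}

static struct led_classdev my_led = {
	.name		= "my::led",
	.brightness_set	= my_led_set,
	.flags		= LED_CORE_SUSPENDRESUME,
};
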
diff --git a/drivers/leds/leds-wm8350.c b/drivers/leds/leds-wm8350.c
new file mode 100644
index 0000000..38c6bcb
--- /dev/null
@@ -0,0 +1,311 @@
+/*
+ * LED driver for WM8350 driven LEDs.
+ *
+ * Copyright(C) 2007, 2008 Wolfson Microelectronics PLC.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/platform_device.h>
+#include <linux/leds.h>
+#include <linux/err.h>
+#include <linux/mfd/wm8350/pmic.h>
+#include <linux/regulator/consumer.h>
+
+/* Microamps */
+static const int isink_cur[] = {
+       4,
+       5,
+       6,
+       7,
+       8,
+       10,
+       11,
+       14,
+       16,
+       19,
+       23,
+       27,
+       32,
+       39,
+       46,
+       54,
+       65,
+       77,
+       92,
+       109,
+       130,
+       154,
+       183,
+       218,
+       259,
+       308,
+       367,
+       436,
+       518,
+       616,
+       733,
+       872,
+       1037,
+       1233,
+       1466,
+       1744,
+       2073,
+       2466,
+       2933,
+       3487,
+       4147,
+       4932,
+       5865,
+       6975,
+       8294,
+       9864,
+       11730,
+       13949,
+       16589,
+       19728,
+       23460,
+       27899,
+       33178,
+       39455,
+       46920,
+       55798,
+       66355,
+       78910,
+       93840,
+       111596,
+       132710,
+       157820,
+       187681,
+       223191
+};
+
+#define to_wm8350_led(led_cdev) \
+       container_of(led_cdev, struct wm8350_led, cdev)
+
+static void wm8350_led_enable(struct wm8350_led *led)
+{
+       int ret;
+
+       if (led->enabled)
+               return;
+
+       ret = regulator_enable(led->isink);
+       if (ret != 0) {
+               dev_err(led->cdev.dev, "Failed to enable ISINK: %d\n", ret);
+               return;
+       }
+
+       ret = regulator_enable(led->dcdc);
+       if (ret != 0) {
+               dev_err(led->cdev.dev, "Failed to enable DCDC: %d\n", ret);
+               regulator_disable(led->isink);
+               return;
+       }
+
+       led->enabled = 1;
+}
+
+static void wm8350_led_disable(struct wm8350_led *led)
+{
+       int ret;
+
+       if (!led->enabled)
+               return;
+
+       ret = regulator_disable(led->dcdc);
+       if (ret != 0) {
+               dev_err(led->cdev.dev, "Failed to disable DCDC: %d\n", ret);
+               return;
+       }
+
+       ret = regulator_disable(led->isink);
+       if (ret != 0) {
+               dev_err(led->cdev.dev, "Failed to disable ISINK: %d\n", ret);
+               regulator_enable(led->dcdc);
+               return;
+       }
+
+       led->enabled = 0;
+}
+
+static void led_work(struct work_struct *work)
+{
+       struct wm8350_led *led = container_of(work, struct wm8350_led, work);
+       int ret;
+       int uA;
+       unsigned long flags;
+
+       mutex_lock(&led->mutex);
+
+       spin_lock_irqsave(&led->value_lock, flags);
+
+       if (led->value == LED_OFF) {
+               spin_unlock_irqrestore(&led->value_lock, flags);
+               wm8350_led_disable(led);
+               goto out;
+       }
+
+       /* This scales linearly into the index of valid current
+        * settings, which results in a linear scaling of perceived
+        * brightness due to the non-linear current settings provided
+        * by the hardware.
+        */
+       uA = (led->max_uA_index * led->value) / LED_FULL;
+       spin_unlock_irqrestore(&led->value_lock, flags);
+       BUG_ON(uA >= ARRAY_SIZE(isink_cur));
+
+       ret = regulator_set_current_limit(led->isink, isink_cur[uA],
+                                         isink_cur[uA]);
+       if (ret != 0)
+               dev_err(led->cdev.dev, "Failed to set %duA: %d\n",
+                       isink_cur[uA], ret);
+
+       wm8350_led_enable(led);
+
+out:
+       mutex_unlock(&led->mutex);
+}
+
+static void wm8350_led_set(struct led_classdev *led_cdev,
+                          enum led_brightness value)
+{
+       struct wm8350_led *led = to_wm8350_led(led_cdev);
+       unsigned long flags;
+
+       spin_lock_irqsave(&led->value_lock, flags);
+       led->value = value;
+       schedule_work(&led->work);
+       spin_unlock_irqrestore(&led->value_lock, flags);
+}
+
+static void wm8350_led_shutdown(struct platform_device *pdev)
+{
+       struct wm8350_led *led = platform_get_drvdata(pdev);
+
+       mutex_lock(&led->mutex);
+       led->value = LED_OFF;
+       wm8350_led_disable(led);
+       mutex_unlock(&led->mutex);
+}
+
+static int wm8350_led_probe(struct platform_device *pdev)
+{
+       struct regulator *isink, *dcdc;
+       struct wm8350_led *led;
+       struct wm8350_led_platform_data *pdata = pdev->dev.platform_data;
+       int ret, i;
+
+       if (pdata == NULL) {
+               dev_err(&pdev->dev, "no platform data\n");
+               return -ENODEV;
+       }
+
+       if (pdata->max_uA < isink_cur[0]) {
+               dev_err(&pdev->dev, "Invalid maximum current %duA\n",
+                       pdata->max_uA);
+               return -EINVAL;
+       }
+
+       isink = regulator_get(&pdev->dev, "led_isink");
+       if (IS_ERR(isink)) {
+               printk(KERN_ERR "%s: cant get ISINK\n", __func__);
+               return PTR_ERR(isink);
+       }
+
+       dcdc = regulator_get(&pdev->dev, "led_vcc");
+       if (IS_ERR(dcdc)) {
+               printk(KERN_ERR "%s: cant get DCDC\n", __func__);
+               ret = PTR_ERR(dcdc);
+               goto err_isink;
+       }
+
+       led = kzalloc(sizeof(*led), GFP_KERNEL);
+       if (led == NULL) {
+               ret = -ENOMEM;
+               goto err_dcdc;
+       }
+
+       led->cdev.brightness_set = wm8350_led_set;
+       led->cdev.default_trigger = pdata->default_trigger;
+       led->cdev.name = pdata->name;
+       led->cdev.flags |= LED_CORE_SUSPENDRESUME;
+       led->enabled = regulator_is_enabled(isink);
+       led->isink = isink;
+       led->dcdc = dcdc;
+
+       for (i = 0; i < ARRAY_SIZE(isink_cur) - 1; i++)
+               if (isink_cur[i] >= pdata->max_uA)
+                       break;
+       led->max_uA_index = i;
+       if (pdata->max_uA != isink_cur[i])
+               dev_warn(&pdev->dev,
+                        "Maximum current %duA is not directly supported,"
+                        " check platform data\n",
+                        pdata->max_uA);
+
+       spin_lock_init(&led->value_lock);
+       mutex_init(&led->mutex);
+       INIT_WORK(&led->work, led_work);
+       led->value = LED_OFF;
+       platform_set_drvdata(pdev, led);
+
+       ret = led_classdev_register(&pdev->dev, &led->cdev);
+       if (ret < 0)
+               goto err_led;
+
+       return 0;
+
+ err_led:
+       kfree(led);
+ err_dcdc:
+       regulator_put(dcdc);
+ err_isink:
+       regulator_put(isink);
+       return ret;
+}
+
+static int wm8350_led_remove(struct platform_device *pdev)
+{
+       struct wm8350_led *led = platform_get_drvdata(pdev);
+
+       led_classdev_unregister(&led->cdev);
+       flush_scheduled_work();
+       wm8350_led_disable(led);
+       regulator_put(led->dcdc);
+       regulator_put(led->isink);
+       kfree(led);
+       return 0;
+}
+
+static struct platform_driver wm8350_led_driver = {
+       .driver = {
+                  .name = "wm8350-led",
+                  .owner = THIS_MODULE,
+                  },
+       .probe = wm8350_led_probe,
+       .remove = wm8350_led_remove,
+       .shutdown = wm8350_led_shutdown,
+};
+
+static int __devinit wm8350_led_init(void)
+{
+       return platform_driver_register(&wm8350_led_driver);
+}
+module_init(wm8350_led_init);
+
+static void wm8350_led_exit(void)
+{
+       platform_driver_unregister(&wm8350_led_driver);
+}
+module_exit(wm8350_led_exit);
+
+MODULE_AUTHOR("Mark Brown");
+MODULE_DESCRIPTION("WM8350 LED driver");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("platform:wm8350-led");
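
The isink_cur[] table above grows by roughly 19% per step, so led_work()'s linear mapping of brightness into the table index yields an approximately exponential current ramp, which tracks the eye's roughly logarithmic response. A standalone illustration of the mapping (LED_FULL is 255 in the LED core; the truncated table is for demonstration only):

/* Standalone demo of led_work()'s brightness -> table-index mapping. */
#include <stdio.h>

#define LED_FULL 255			/* enum led_brightness maximum */

/* First entries of isink_cur[]; the full table ends at 223191 uA. */
static const int isink_cur[] = { 4, 5, 6, 7, 8, 10 };

static int brightness_to_index(int max_uA_index, int value)
{
	/* Linear in the index, hence roughly exponential in current. */
	return (max_uA_index * value) / LED_FULL;
}

int main(void)
{
	int v, max_uA_index = 5;	/* capped by the board's max_uA */

	for (v = 0; v <= LED_FULL; v += 51)
		printf("brightness %3d -> %d uA\n",
		       v, isink_cur[brightness_to_index(max_uA_index, v)]);
	return 0;
}

Because max_uA_index is clamped at probe time to the platform's max_uA, full brightness never selects a current above the board's limit.
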
index 2f3aa87f2a1f60d86ab7c7d4ada902462c992ed2..2982c86ac4cffb9e93da0a252b232a573c02ac22 100644
@@ -56,40 +56,21 @@ static struct led_classdev wrap_power_led = {
        .name                   = "wrap::power",
        .brightness_set         = wrap_power_led_set,
        .default_trigger        = "default-on",
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev wrap_error_led = {
        .name           = "wrap::error",
        .brightness_set = wrap_error_led_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
 static struct led_classdev wrap_extra_led = {
        .name           = "wrap::extra",
        .brightness_set = wrap_extra_led_set,
+       .flags                  = LED_CORE_SUSPENDRESUME,
 };
 
-#ifdef CONFIG_PM
-static int wrap_led_suspend(struct platform_device *dev,
-               pm_message_t state)
-{
-       led_classdev_suspend(&wrap_power_led);
-       led_classdev_suspend(&wrap_error_led);
-       led_classdev_suspend(&wrap_extra_led);
-       return 0;
-}
-
-static int wrap_led_resume(struct platform_device *dev)
-{
-       led_classdev_resume(&wrap_power_led);
-       led_classdev_resume(&wrap_error_led);
-       led_classdev_resume(&wrap_extra_led);
-       return 0;
-}
-#else
-#define wrap_led_suspend NULL
-#define wrap_led_resume NULL
-#endif
-
 static int wrap_led_probe(struct platform_device *pdev)
 {
        int ret;
@@ -127,8 +108,6 @@ static int wrap_led_remove(struct platform_device *pdev)
 static struct platform_driver wrap_led_driver = {
        .probe          = wrap_led_probe,
        .remove         = wrap_led_remove,
-       .suspend        = wrap_led_suspend,
-       .resume         = wrap_led_resume,
        .driver         = {
                .name           = DRVNAME,
                .owner          = THIS_MODULE,
index db681962d7bb14742d53dea274ea3c49e424b6dd..3d6531396dda094e41eb7f00b80d20acbbd244cd 100644
@@ -199,6 +199,7 @@ err_out:
 static void timer_trig_deactivate(struct led_classdev *led_cdev)
 {
        struct timer_trig_data *timer_data = led_cdev->trigger_data;
+       unsigned long on = 0, off = 0;
 
        if (timer_data) {
                device_remove_file(led_cdev->dev, &dev_attr_delay_on);
@@ -206,6 +207,10 @@ static void timer_trig_deactivate(struct led_classdev *led_cdev)
                del_timer_sync(&timer_data->timer);
                kfree(timer_data);
        }
+
+       /* If there is hardware support for blinking, stop it */
+       if (led_cdev->blink_set)
+               led_cdev->blink_set(led_cdev, &on, &off);
 }
 
 static struct led_trigger timer_led_trigger = {
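
On deactivation, the timer trigger now also asks drivers with hardware blink support to stop, by calling blink_set() with both delays zeroed. A driver-side sketch of honouring that request (the hw_blink_* helpers are hypothetical):

/* Driver-side sketch; hw_blink_* are hypothetical hardware helpers. */
#include <linux/leds.h>

extern void hw_blink_disable(void);
extern int hw_blink_program(unsigned long on_ms, unsigned long off_ms);

static int my_blink_set(struct led_classdev *cdev,
			unsigned long *delay_on, unsigned long *delay_off)
{
	/* 0/0 is the stop request timer_trig_deactivate() now issues. */
	if (!*delay_on && !*delay_off) {
		hw_blink_disable();
		return 0;
	}
	return hw_blink_program(*delay_on, *delay_off);
}
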
index 3a273ccef3f2660b127ab426755650950377827c..f92595c8f165b7223d66958f648ba183dbb5519e 100644
@@ -1453,6 +1453,9 @@ void wm8350_device_exit(struct wm8350 *wm8350)
 {
        int i;
 
+       for (i = 0; i < ARRAY_SIZE(wm8350->pmic.led); i++)
+               platform_device_unregister(wm8350->pmic.led[i].pdev);
+
        for (i = 0; i < ARRAY_SIZE(wm8350->pmic.pdev); i++)
                platform_device_unregister(wm8350->pmic.pdev[i]);
 
index 65e8294a9e29db35cfcff8a493682804e9dcffe3..9da5a4b811337562dbcf708dcad927756264381d 100644
@@ -1,11 +1,12 @@
 /**
  * @file buffer_sync.c
  *
- * @remark Copyright 2002 OProfile authors
+ * @remark Copyright 2002-2009 OProfile authors
  * @remark Read the file COPYING
  *
  * @author John Levon <levon@movementarian.org>
  * @author Barry Kasindorf
+ * @author Robert Richter <robert.richter@amd.com>
  *
  * This is the core of the buffer management. Each
  * CPU buffer is processed and entered into the
@@ -315,88 +316,73 @@ static void add_trace_begin(void)
        add_event_entry(TRACE_BEGIN_CODE);
 }
 
-#ifdef CONFIG_OPROFILE_IBS
-
-#define IBS_FETCH_CODE_SIZE    2
-#define IBS_OP_CODE_SIZE       5
-
-/*
- * Add IBS fetch and op entries to event buffer
- */
-static void add_ibs_begin(int cpu, int code, struct mm_struct *mm)
+static void add_data(struct op_entry *entry, struct mm_struct *mm)
 {
-       unsigned long rip;
-       int i, count;
-       unsigned long ibs_cookie = 0;
+       unsigned long code, pc, val;
+       unsigned long cookie;
        off_t offset;
-       struct op_sample *sample;
-
-       sample = cpu_buffer_read_entry(cpu);
-       if (!sample)
-               goto Error;
-       rip = sample->eip;
 
-#ifdef __LP64__
-       rip += sample->event << 32;
-#endif
+       if (!op_cpu_buffer_get_data(entry, &code))
+               return;
+       if (!op_cpu_buffer_get_data(entry, &pc))
+               return;
+       if (!op_cpu_buffer_get_size(entry))
+               return;
 
        if (mm) {
-               ibs_cookie = lookup_dcookie(mm, rip, &offset);
+               cookie = lookup_dcookie(mm, pc, &offset);
 
-               if (ibs_cookie == NO_COOKIE)
-                       offset = rip;
-               if (ibs_cookie == INVALID_COOKIE) {
+               if (cookie == NO_COOKIE)
+                       offset = pc;
+               if (cookie == INVALID_COOKIE) {
                        atomic_inc(&oprofile_stats.sample_lost_no_mapping);
-                       offset = rip;
+                       offset = pc;
                }
-               if (ibs_cookie != last_cookie) {
-                       add_cookie_switch(ibs_cookie);
-                       last_cookie = ibs_cookie;
+               if (cookie != last_cookie) {
+                       add_cookie_switch(cookie);
+                       last_cookie = cookie;
                }
        } else
-               offset = rip;
+               offset = pc;
 
        add_event_entry(ESCAPE_CODE);
        add_event_entry(code);
        add_event_entry(offset);        /* Offset from Dcookie */
 
-       /* we send the Dcookie offset, but send the raw Linear Add also*/
-       add_event_entry(sample->eip);
-       add_event_entry(sample->event);
-
-       if (code == IBS_FETCH_CODE)
-               count = IBS_FETCH_CODE_SIZE;    /*IBS FETCH is 2 int64s*/
-       else
-               count = IBS_OP_CODE_SIZE;       /*IBS OP is 5 int64s*/
-
-       for (i = 0; i < count; i++) {
-               sample = cpu_buffer_read_entry(cpu);
-               if (!sample)
-                       goto Error;
-               add_event_entry(sample->eip);
-               add_event_entry(sample->event);
-       }
-
-       return;
-
-Error:
-       return;
+       while (op_cpu_buffer_get_data(entry, &val))
+               add_event_entry(val);
 }
 
-#endif
-
-static void add_sample_entry(unsigned long offset, unsigned long event)
+static inline void add_sample_entry(unsigned long offset, unsigned long event)
 {
        add_event_entry(offset);
        add_event_entry(event);
 }
 
 
-static int add_us_sample(struct mm_struct *mm, struct op_sample *s)
+/*
+ * Add a sample to the global event buffer. If possible the
+ * sample is converted into a persistent dentry/offset pair
+ * for later lookup from userspace. Return 0 on failure.
+ */
+static int
+add_sample(struct mm_struct *mm, struct op_sample *s, int in_kernel)
 {
        unsigned long cookie;
        off_t offset;
 
+       if (in_kernel) {
+               add_sample_entry(s->eip, s->event);
+               return 1;
+       }
+
+       /* add userspace sample */
+
+       if (!mm) {
+               atomic_inc(&oprofile_stats.sample_lost_no_mm);
+               return 0;
+       }
+
        cookie = lookup_dcookie(mm, s->eip, &offset);
 
        if (cookie == INVALID_COOKIE) {
@@ -415,25 +401,6 @@ static int add_us_sample(struct mm_struct *mm, struct op_sample *s)
 }
 
 
-/* Add a sample to the global event buffer. If possible the
- * sample is converted into a persistent dentry/offset pair
- * for later lookup from userspace.
- */
-static int
-add_sample(struct mm_struct *mm, struct op_sample *s, int in_kernel)
-{
-       if (in_kernel) {
-               add_sample_entry(s->eip, s->event);
-               return 1;
-       } else if (mm) {
-               return add_us_sample(mm, s);
-       } else {
-               atomic_inc(&oprofile_stats.sample_lost_no_mm);
-       }
-       return 0;
-}
-
-
 static void release_mm(struct mm_struct *mm)
 {
        if (!mm)
@@ -526,66 +493,69 @@ void sync_buffer(int cpu)
 {
        struct mm_struct *mm = NULL;
        struct mm_struct *oldmm;
+       unsigned long val;
        struct task_struct *new;
        unsigned long cookie = 0;
        int in_kernel = 1;
        sync_buffer_state state = sb_buffer_start;
        unsigned int i;
        unsigned long available;
+       unsigned long flags;
+       struct op_entry entry;
+       struct op_sample *sample;
 
        mutex_lock(&buffer_mutex);
 
        add_cpu_switch(cpu);
 
-       cpu_buffer_reset(cpu);
-       available = cpu_buffer_entries(cpu);
+       op_cpu_buffer_reset(cpu);
+       available = op_cpu_buffer_entries(cpu);
 
        for (i = 0; i < available; ++i) {
-               struct op_sample *s = cpu_buffer_read_entry(cpu);
-               if (!s)
+               sample = op_cpu_buffer_read_entry(&entry, cpu);
+               if (!sample)
                        break;
 
-               if (is_code(s->eip)) {
-                       switch (s->event) {
-                       case 0:
-                       case CPU_IS_KERNEL:
+               if (is_code(sample->eip)) {
+                       flags = sample->event;
+                       if (flags & TRACE_BEGIN) {
+                               state = sb_bt_start;
+                               add_trace_begin();
+                       }
+                       if (flags & KERNEL_CTX_SWITCH) {
                                /* kernel/userspace switch */
-                               in_kernel = s->event;
+                               in_kernel = flags & IS_KERNEL;
                                if (state == sb_buffer_start)
                                        state = sb_sample_start;
-                               add_kernel_ctx_switch(s->event);
-                               break;
-                       case CPU_TRACE_BEGIN:
-                               state = sb_bt_start;
-                               add_trace_begin();
-                               break;
-#ifdef CONFIG_OPROFILE_IBS
-                       case IBS_FETCH_BEGIN:
-                               state = sb_bt_start;
-                               add_ibs_begin(cpu, IBS_FETCH_CODE, mm);
-                               break;
-                       case IBS_OP_BEGIN:
-                               state = sb_bt_start;
-                               add_ibs_begin(cpu, IBS_OP_CODE, mm);
-                               break;
-#endif
-                       default:
+                               add_kernel_ctx_switch(flags & IS_KERNEL);
+                       }
+                       if (flags & USER_CTX_SWITCH
+                           && op_cpu_buffer_get_data(&entry, &val)) {
                                /* userspace context switch */
+                               new = (struct task_struct *)val;
                                oldmm = mm;
-                               new = (struct task_struct *)s->event;
                                release_mm(oldmm);
                                mm = take_tasks_mm(new);
                                if (mm != oldmm)
                                        cookie = get_exec_dcookie(mm);
                                add_user_ctx_switch(new, cookie);
-                               break;
-                       }
-               } else if (state >= sb_bt_start &&
-                          !add_sample(mm, s, in_kernel)) {
-                       if (state == sb_bt_start) {
-                               state = sb_bt_ignore;
-                               atomic_inc(&oprofile_stats.bt_lost_no_mapping);
                        }
+                       if (op_cpu_buffer_get_size(&entry))
+                               add_data(&entry, mm);
+                       continue;
+               }
+
+               if (state < sb_bt_start)
+                       /* ignore sample */
+                       continue;
+
+               if (add_sample(mm, sample, in_kernel))
+                       continue;
+
+               /* ignore backtraces if failed to add a sample */
+               if (state == sb_bt_start) {
+                       state = sb_bt_ignore;
+                       atomic_inc(&oprofile_stats.bt_lost_no_mapping);
                }
        }
        release_mm(mm);
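
The rewritten loop replaces the old one-event-per-entry switch (CPU_IS_KERNEL, CPU_TRACE_BEGIN, the IBS codes) with independent flag bits, so a single escape entry can now announce a trace start, a kernel/user transition and a task switch at once, with optional payload words following. A standalone illustration of the decode order, using the flag values defined in cpu_buffer.h further down:

/* Standalone decode demo; values match cpu_buffer.h below. */
#include <stdio.h>

#define KERNEL_CTX_SWITCH	(1UL << 0)
#define IS_KERNEL		(1UL << 1)
#define TRACE_BEGIN		(1UL << 2)
#define USER_CTX_SWITCH		(1UL << 3)

static void decode(unsigned long flags)
{
	if (flags & TRACE_BEGIN)
		printf("backtrace follows\n");
	if (flags & KERNEL_CTX_SWITCH)
		printf("now in %s mode\n",
		       (flags & IS_KERNEL) ? "kernel" : "user");
	if (flags & USER_CTX_SWITCH)
		printf("task switch, task pointer follows as data\n");
}

int main(void)
{
	/* One escape entry can combine several events. */
	decode(TRACE_BEGIN | KERNEL_CTX_SWITCH | IS_KERNEL);
	return 0;
}
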
index 61090969158facb8ae1843b6eba255202fdf8c4a..2e03b6d796d39aac794e6c8c52d5de979f82a7c7 100644
@@ -1,11 +1,12 @@
 /**
  * @file cpu_buffer.c
  *
- * @remark Copyright 2002 OProfile authors
+ * @remark Copyright 2002-2009 OProfile authors
  * @remark Read the file COPYING
  *
  * @author John Levon <levon@movementarian.org>
  * @author Barry Kasindorf <barry.kasindorf@amd.com>
+ * @author Robert Richter <robert.richter@amd.com>
  *
  * Each CPU has a local buffer that stores PC value/event
  * pairs. We also log context switches when we notice them.
@@ -45,8 +46,8 @@
  * can be changed to a single buffer solution when the ring buffer
  * access is implemented as non-locking atomic code.
  */
-struct ring_buffer *op_ring_buffer_read;
-struct ring_buffer *op_ring_buffer_write;
+static struct ring_buffer *op_ring_buffer_read;
+static struct ring_buffer *op_ring_buffer_write;
 DEFINE_PER_CPU(struct oprofile_cpu_buffer, cpu_buffer);
 
 static void wq_sync_buffer(struct work_struct *work);
@@ -54,19 +55,9 @@ static void wq_sync_buffer(struct work_struct *work);
 #define DEFAULT_TIMER_EXPIRE (HZ / 10)
 static int work_enabled;
 
-void free_cpu_buffers(void)
-{
-       if (op_ring_buffer_read)
-               ring_buffer_free(op_ring_buffer_read);
-       op_ring_buffer_read = NULL;
-       if (op_ring_buffer_write)
-               ring_buffer_free(op_ring_buffer_write);
-       op_ring_buffer_write = NULL;
-}
-
 unsigned long oprofile_get_cpu_buffer_size(void)
 {
-       return fs_cpu_buffer_size;
+       return oprofile_cpu_buffer_size;
 }
 
 void oprofile_cpu_buffer_inc_smpl_lost(void)
@@ -77,11 +68,21 @@ void oprofile_cpu_buffer_inc_smpl_lost(void)
        cpu_buf->sample_lost_overflow++;
 }
 
+void free_cpu_buffers(void)
+{
+       if (op_ring_buffer_read)
+               ring_buffer_free(op_ring_buffer_read);
+       op_ring_buffer_read = NULL;
+       if (op_ring_buffer_write)
+               ring_buffer_free(op_ring_buffer_write);
+       op_ring_buffer_write = NULL;
+}
+
 int alloc_cpu_buffers(void)
 {
        int i;
 
-       unsigned long buffer_size = fs_cpu_buffer_size;
+       unsigned long buffer_size = oprofile_cpu_buffer_size;
 
        op_ring_buffer_read = ring_buffer_alloc(buffer_size, OP_BUFFER_FLAGS);
        if (!op_ring_buffer_read)
@@ -97,8 +98,6 @@ int alloc_cpu_buffers(void)
                b->last_is_kernel = -1;
                b->tracing = 0;
                b->buffer_size = buffer_size;
-               b->tail_pos = 0;
-               b->head_pos = 0;
                b->sample_received = 0;
                b->sample_lost_overflow = 0;
                b->backtrace_aborted = 0;
@@ -145,47 +144,156 @@ void end_cpu_work(void)
        flush_scheduled_work();
 }
 
-static inline int
-add_sample(struct oprofile_cpu_buffer *cpu_buf,
-          unsigned long pc, unsigned long event)
+/*
+ * This function prepares the cpu buffer to write a sample.
+ *
+ * Struct op_entry is used during operations on the ring buffer while
+ * struct op_sample contains the data that is stored in the ring
+ * buffer. Struct entry can be uninitialized. The function reserves a
+ * data array that is specified by size. Use
+ * op_cpu_buffer_write_commit() after preparing the sample. In case of
+ * errors a null pointer is returned, otherwise the pointer to the
+ * sample.
+ *
+ */
+struct op_sample
+*op_cpu_buffer_write_reserve(struct op_entry *entry, unsigned long size)
+{
+       entry->event = ring_buffer_lock_reserve
+               (op_ring_buffer_write, sizeof(struct op_sample) +
+                size * sizeof(entry->sample->data[0]), &entry->irq_flags);
+       if (entry->event)
+               entry->sample = ring_buffer_event_data(entry->event);
+       else
+               entry->sample = NULL;
+
+       if (!entry->sample)
+               return NULL;
+
+       entry->size = size;
+       entry->data = entry->sample->data;
+
+       return entry->sample;
+}
+
+int op_cpu_buffer_write_commit(struct op_entry *entry)
+{
+       return ring_buffer_unlock_commit(op_ring_buffer_write, entry->event,
+                                        entry->irq_flags);
+}
+
+struct op_sample *op_cpu_buffer_read_entry(struct op_entry *entry, int cpu)
+{
+       struct ring_buffer_event *e;
+       e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL);
+       if (e)
+               goto event;
+       if (ring_buffer_swap_cpu(op_ring_buffer_read,
+                                op_ring_buffer_write,
+                                cpu))
+               return NULL;
+       e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL);
+       if (e)
+               goto event;
+       return NULL;
+
+event:
+       entry->event = e;
+       entry->sample = ring_buffer_event_data(e);
+       entry->size = (ring_buffer_event_length(e) - sizeof(struct op_sample))
+               / sizeof(entry->sample->data[0]);
+       entry->data = entry->sample->data;
+       return entry->sample;
+}
+
+unsigned long op_cpu_buffer_entries(int cpu)
+{
+       return ring_buffer_entries_cpu(op_ring_buffer_read, cpu)
+               + ring_buffer_entries_cpu(op_ring_buffer_write, cpu);
+}
+
+static int
+op_add_code(struct oprofile_cpu_buffer *cpu_buf, unsigned long backtrace,
+           int is_kernel, struct task_struct *task)
 {
        struct op_entry entry;
-       int ret;
+       struct op_sample *sample;
+       unsigned long flags;
+       int size;
+
+       flags = 0;
+
+       if (backtrace)
+               flags |= TRACE_BEGIN;
+
+       /* notice a switch from user->kernel or vice versa */
+       is_kernel = !!is_kernel;
+       if (cpu_buf->last_is_kernel != is_kernel) {
+               cpu_buf->last_is_kernel = is_kernel;
+               flags |= KERNEL_CTX_SWITCH;
+               if (is_kernel)
+                       flags |= IS_KERNEL;
+       }
+
+       /* notice a task switch */
+       if (cpu_buf->last_task != task) {
+               cpu_buf->last_task = task;
+               flags |= USER_CTX_SWITCH;
+       }
+
+       if (!flags)
+               /* nothing to do */
+               return 0;
+
+       if (flags & USER_CTX_SWITCH)
+               size = 1;
+       else
+               size = 0;
+
+       sample = op_cpu_buffer_write_reserve(&entry, size);
+       if (!sample)
+               return -ENOMEM;
 
-       ret = cpu_buffer_write_entry(&entry);
-       if (ret)
-               return ret;
+       sample->eip = ESCAPE_CODE;
+       sample->event = flags;
 
-       entry.sample->eip = pc;
-       entry.sample->event = event;
+       if (size)
+               op_cpu_buffer_add_data(&entry, (unsigned long)task);
 
-       ret = cpu_buffer_write_commit(&entry);
-       if (ret)
-               return ret;
+       op_cpu_buffer_write_commit(&entry);
 
        return 0;
 }
 
 static inline int
-add_code(struct oprofile_cpu_buffer *buffer, unsigned long value)
+op_add_sample(struct oprofile_cpu_buffer *cpu_buf,
+             unsigned long pc, unsigned long event)
 {
-       return add_sample(buffer, ESCAPE_CODE, value);
+       struct op_entry entry;
+       struct op_sample *sample;
+
+       sample = op_cpu_buffer_write_reserve(&entry, 0);
+       if (!sample)
+               return -ENOMEM;
+
+       sample->eip = pc;
+       sample->event = event;
+
+       return op_cpu_buffer_write_commit(&entry);
 }
 
-/* This must be safe from any context. It's safe writing here
- * because of the head/tail separation of the writer and reader
- * of the CPU buffer.
+/*
+ * This must be safe from any context.
  *
  * is_kernel is needed because on some architectures you cannot
  * tell if you are in kernel or user space simply by looking at
  * pc. We tag this in the buffer by generating kernel enter/exit
  * events whenever is_kernel changes
  */
-static int log_sample(struct oprofile_cpu_buffer *cpu_buf, unsigned long pc,
-                     int is_kernel, unsigned long event)
+static int
+log_sample(struct oprofile_cpu_buffer *cpu_buf, unsigned long pc,
+          unsigned long backtrace, int is_kernel, unsigned long event)
 {
-       struct task_struct *task;
-
        cpu_buf->sample_received++;
 
        if (pc == ESCAPE_CODE) {
@@ -193,25 +301,10 @@ static int log_sample(struct oprofile_cpu_buffer *cpu_buf, unsigned long pc,
                return 0;
        }
 
-       is_kernel = !!is_kernel;
-
-       task = current;
-
-       /* notice a switch from user->kernel or vice versa */
-       if (cpu_buf->last_is_kernel != is_kernel) {
-               cpu_buf->last_is_kernel = is_kernel;
-               if (add_code(cpu_buf, is_kernel))
-                       goto fail;
-       }
-
-       /* notice a task switch */
-       if (cpu_buf->last_task != task) {
-               cpu_buf->last_task = task;
-               if (add_code(cpu_buf, (unsigned long)task))
-                       goto fail;
-       }
+       if (op_add_code(cpu_buf, backtrace, is_kernel, current))
+               goto fail;
 
-       if (add_sample(cpu_buf, pc, event))
+       if (op_add_sample(cpu_buf, pc, event))
                goto fail;
 
        return 1;
@@ -221,109 +314,102 @@ fail:
        return 0;
 }
 
-static int oprofile_begin_trace(struct oprofile_cpu_buffer *cpu_buf)
+static inline void oprofile_begin_trace(struct oprofile_cpu_buffer *cpu_buf)
 {
-       add_code(cpu_buf, CPU_TRACE_BEGIN);
        cpu_buf->tracing = 1;
-       return 1;
 }
 
-static void oprofile_end_trace(struct oprofile_cpu_buffer *cpu_buf)
+static inline void oprofile_end_trace(struct oprofile_cpu_buffer *cpu_buf)
 {
        cpu_buf->tracing = 0;
 }
 
-void oprofile_add_ext_sample(unsigned long pc, struct pt_regs * const regs,
-                               unsigned long event, int is_kernel)
+static inline void
+__oprofile_add_ext_sample(unsigned long pc, struct pt_regs * const regs,
+                         unsigned long event, int is_kernel)
 {
        struct oprofile_cpu_buffer *cpu_buf = &__get_cpu_var(cpu_buffer);
-
-       if (!backtrace_depth) {
-               log_sample(cpu_buf, pc, is_kernel, event);
-               return;
-       }
-
-       if (!oprofile_begin_trace(cpu_buf))
-               return;
+       unsigned long backtrace = oprofile_backtrace_depth;
 
        /*
         * if log_sample() fail we can't backtrace since we lost the
         * source of this event
         */
-       if (log_sample(cpu_buf, pc, is_kernel, event))
-               oprofile_ops.backtrace(regs, backtrace_depth);
+       if (!log_sample(cpu_buf, pc, backtrace, is_kernel, event))
+               /* failed */
+               return;
+
+       if (!backtrace)
+               return;
+
+       oprofile_begin_trace(cpu_buf);
+       oprofile_ops.backtrace(regs, backtrace);
        oprofile_end_trace(cpu_buf);
 }
 
+void oprofile_add_ext_sample(unsigned long pc, struct pt_regs * const regs,
+                            unsigned long event, int is_kernel)
+{
+       __oprofile_add_ext_sample(pc, regs, event, is_kernel);
+}
+
 void oprofile_add_sample(struct pt_regs * const regs, unsigned long event)
 {
        int is_kernel = !user_mode(regs);
        unsigned long pc = profile_pc(regs);
 
-       oprofile_add_ext_sample(pc, regs, event, is_kernel);
+       __oprofile_add_ext_sample(pc, regs, event, is_kernel);
 }
 
-#ifdef CONFIG_OPROFILE_IBS
-
-#define MAX_IBS_SAMPLE_SIZE 14
-
-void oprofile_add_ibs_sample(struct pt_regs * const regs,
-                            unsigned int * const ibs_sample, int ibs_code)
+/*
+ * Add samples with data to the ring buffer.
+ *
+ * Use oprofile_add_data(&entry, val) to add data and
+ * oprofile_write_commit(&entry) to commit the sample.
+ */
+void
+oprofile_write_reserve(struct op_entry *entry, struct pt_regs * const regs,
+                      unsigned long pc, int code, int size)
 {
+       struct op_sample *sample;
        int is_kernel = !user_mode(regs);
        struct oprofile_cpu_buffer *cpu_buf = &__get_cpu_var(cpu_buffer);
-       struct task_struct *task;
-       int fail = 0;
 
        cpu_buf->sample_received++;
 
-       /* notice a switch from user->kernel or vice versa */
-       if (cpu_buf->last_is_kernel != is_kernel) {
-               if (add_code(cpu_buf, is_kernel))
-                       goto fail;
-               cpu_buf->last_is_kernel = is_kernel;
-       }
-
-       /* notice a task switch */
-       if (!is_kernel) {
-               task = current;
-               if (cpu_buf->last_task != task) {
-                       if (add_code(cpu_buf, (unsigned long)task))
-                               goto fail;
-                       cpu_buf->last_task = task;
-               }
-       }
-
-       fail = fail || add_code(cpu_buf, ibs_code);
-       fail = fail || add_sample(cpu_buf, ibs_sample[0], ibs_sample[1]);
-       fail = fail || add_sample(cpu_buf, ibs_sample[2], ibs_sample[3]);
-       fail = fail || add_sample(cpu_buf, ibs_sample[4], ibs_sample[5]);
-
-       if (ibs_code == IBS_OP_BEGIN) {
-               fail = fail || add_sample(cpu_buf, ibs_sample[6], ibs_sample[7]);
-               fail = fail || add_sample(cpu_buf, ibs_sample[8], ibs_sample[9]);
-               fail = fail || add_sample(cpu_buf, ibs_sample[10], ibs_sample[11]);
-       }
+       /* no backtraces for samples with data */
+       if (op_add_code(cpu_buf, 0, is_kernel, current))
+               goto fail;
 
-       if (fail)
+       sample = op_cpu_buffer_write_reserve(entry, size + 2);
+       if (!sample)
                goto fail;
+       sample->eip = ESCAPE_CODE;
+       sample->event = 0;              /* no flags */
 
-       if (backtrace_depth)
-               oprofile_ops.backtrace(regs, backtrace_depth);
+       op_cpu_buffer_add_data(entry, code);
+       op_cpu_buffer_add_data(entry, pc);
 
        return;
 
 fail:
        cpu_buf->sample_lost_overflow++;
-       return;
 }
 
-#endif
+int oprofile_add_data(struct op_entry *entry, unsigned long val)
+{
+       return op_cpu_buffer_add_data(entry, val);
+}
+
+int oprofile_write_commit(struct op_entry *entry)
+{
+       return op_cpu_buffer_write_commit(entry);
+}
 
 void oprofile_add_pc(unsigned long pc, int is_kernel, unsigned long event)
 {
        struct oprofile_cpu_buffer *cpu_buf = &__get_cpu_var(cpu_buffer);
-       log_sample(cpu_buf, pc, is_kernel, event);
+       log_sample(cpu_buf, pc, 0, is_kernel, event);
 }
 
 void oprofile_add_trace(unsigned long pc)
@@ -340,7 +426,7 @@ void oprofile_add_trace(unsigned long pc)
        if (pc == ESCAPE_CODE)
                goto fail;
 
-       if (add_sample(cpu_buf, pc, 0))
+       if (op_add_sample(cpu_buf, pc, 0))
                goto fail;
 
        return;
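
The removed oprofile_add_ibs_sample() hard-coded the number of eip/event pairs per IBS record; its replacement is the generic reserve/add/commit triple exported above. A hedged sketch of how an architecture model driver would log a variable-length record with it (MY_SAMPLE_CODE and the helper name are illustrative; the real call sites live in the x86 model code, not shown in this hunk):

/* Hedged sketch of a model driver using the new API; MY_SAMPLE_CODE
 * is illustrative, not a real constant. */
#include <linux/oprofile.h>

#define MY_SAMPLE_CODE 13

static void log_hw_sample(struct pt_regs *regs, unsigned long pc,
			  unsigned long *raw, int count)
{
	struct op_entry entry;
	int i;

	/* Reserves count data words; code and pc are stored by the core. */
	oprofile_write_reserve(&entry, regs, pc, MY_SAMPLE_CODE, count);
	for (i = 0; i < count; i++)
		oprofile_add_data(&entry, raw[i]);
	oprofile_write_commit(&entry);
}
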
index aacb0f0bc56682d37aca262f0d72ad135fb7cedc..63f81c44846ae28b5bf2020d8fe445e4e4377d7b 100644
@@ -1,10 +1,11 @@
 /**
  * @file cpu_buffer.h
  *
- * @remark Copyright 2002 OProfile authors
+ * @remark Copyright 2002-2009 OProfile authors
  * @remark Read the file COPYING
  *
  * @author John Levon <levon@movementarian.org>
+ * @author Robert Richter <robert.richter@amd.com>
  */
 
 #ifndef OPROFILE_CPU_BUFFER_H
@@ -31,17 +32,12 @@ void end_cpu_work(void);
 struct op_sample {
        unsigned long eip;
        unsigned long event;
+       unsigned long data[0];
 };
 
-struct op_entry {
-       struct ring_buffer_event *event;
-       struct op_sample *sample;
-       unsigned long irq_flags;
-};
+struct op_entry;
 
 struct oprofile_cpu_buffer {
-       volatile unsigned long head_pos;
-       volatile unsigned long tail_pos;
        unsigned long buffer_size;
        struct task_struct *last_task;
        int last_is_kernel;
@@ -54,8 +50,6 @@ struct oprofile_cpu_buffer {
        struct delayed_work work;
 };
 
-extern struct ring_buffer *op_ring_buffer_read;
-extern struct ring_buffer *op_ring_buffer_write;
 DECLARE_PER_CPU(struct oprofile_cpu_buffer, cpu_buffer);
 
 /*
@@ -64,7 +58,7 @@ DECLARE_PER_CPU(struct oprofile_cpu_buffer, cpu_buffer);
  * reset these to invalid values; the next sample collected will
  * populate the buffer with proper values to initialize the buffer
  */
-static inline void cpu_buffer_reset(int cpu)
+static inline void op_cpu_buffer_reset(int cpu)
 {
        struct oprofile_cpu_buffer *cpu_buf = &per_cpu(cpu_buffer, cpu);
 
@@ -72,55 +66,48 @@ static inline void cpu_buffer_reset(int cpu)
        cpu_buf->last_task = NULL;
 }
 
-static inline int cpu_buffer_write_entry(struct op_entry *entry)
-{
-       entry->event = ring_buffer_lock_reserve(op_ring_buffer_write,
-                                               sizeof(struct op_sample),
-                                               &entry->irq_flags);
-       if (entry->event)
-               entry->sample = ring_buffer_event_data(entry->event);
-       else
-               entry->sample = NULL;
-
-       if (!entry->sample)
-               return -ENOMEM;
-
-       return 0;
-}
+struct op_sample
+*op_cpu_buffer_write_reserve(struct op_entry *entry, unsigned long size);
+int op_cpu_buffer_write_commit(struct op_entry *entry);
+struct op_sample *op_cpu_buffer_read_entry(struct op_entry *entry, int cpu);
+unsigned long op_cpu_buffer_entries(int cpu);
 
-static inline int cpu_buffer_write_commit(struct op_entry *entry)
+/* returns the remaining free size of data in the entry */
+static inline
+int op_cpu_buffer_add_data(struct op_entry *entry, unsigned long val)
 {
-       return ring_buffer_unlock_commit(op_ring_buffer_write, entry->event,
-                                        entry->irq_flags);
+       if (!entry->size)
+               return 0;
+       *entry->data = val;
+       entry->size--;
+       entry->data++;
+       return entry->size;
 }
 
-static inline struct op_sample *cpu_buffer_read_entry(int cpu)
+/* returns the size of data in the entry */
+static inline
+int op_cpu_buffer_get_size(struct op_entry *entry)
 {
-       struct ring_buffer_event *e;
-       e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL);
-       if (e)
-               return ring_buffer_event_data(e);
-       if (ring_buffer_swap_cpu(op_ring_buffer_read,
-                                op_ring_buffer_write,
-                                cpu))
-               return NULL;
-       e = ring_buffer_consume(op_ring_buffer_read, cpu, NULL);
-       if (e)
-               return ring_buffer_event_data(e);
-       return NULL;
+       return entry->size;
 }
 
-/* "acquire" as many cpu buffer slots as we can */
-static inline unsigned long cpu_buffer_entries(int cpu)
+/* returns 0 if empty or the size of data including the current value */
+static inline
+int op_cpu_buffer_get_data(struct op_entry *entry, unsigned long *val)
 {
-       return ring_buffer_entries_cpu(op_ring_buffer_read, cpu)
-               + ring_buffer_entries_cpu(op_ring_buffer_write, cpu);
+       int size = entry->size;
+       if (!size)
+               return 0;
+       *val = *entry->data;
+       entry->size--;
+       entry->data++;
+       return size;
 }
 
-/* transient events for the CPU buffer -> event buffer */
-#define CPU_IS_KERNEL 1
-#define CPU_TRACE_BEGIN 2
-#define IBS_FETCH_BEGIN 3
-#define IBS_OP_BEGIN    4
+/* extra data flags */
+#define KERNEL_CTX_SWITCH      (1UL << 0)
+#define IS_KERNEL              (1UL << 1)
+#define TRACE_BEGIN            (1UL << 2)
+#define USER_CTX_SWITCH                (1UL << 3)
 
 #endif /* OPROFILE_CPU_BUFFER_H */
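
The data[0] flexible array member plus these inline helpers turn each op_sample into a small cursor-based stream: writers append with op_cpu_buffer_add_data() until entry->size runs out, readers consume with op_cpu_buffer_get_data() in the same order, and the return values let callers detect truncation. A standalone mock of the cursor semantics (plain C, not kernel code):

/* Standalone mock of the entry cursor semantics (not kernel code):
 * size counts remaining slots, data points at the current slot. */
#include <stdio.h>

struct mock_entry {
	unsigned long *data;
	int size;
};

static int add_data(struct mock_entry *e, unsigned long val)
{
	if (!e->size)
		return 0;
	*e->data++ = val;
	return --e->size;	/* remaining free slots */
}

static int get_data(struct mock_entry *e, unsigned long *val)
{
	int size = e->size;

	if (!size)
		return 0;
	*val = *e->data++;
	e->size--;
	return size;		/* size including the value just read */
}

int main(void)
{
	unsigned long buf[3], val;
	struct mock_entry w = { buf, 3 }, r = { buf, 3 };

	add_data(&w, 10);
	add_data(&w, 20);
	add_data(&w, 30);
	while (get_data(&r, &val))	/* prints 10, 20, 30 */
		printf("%lu\n", val);
	return 0;
}
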
index 191a3202cecc072b8effc0ea5f0913ebde2ddc57..2b7ae366ceb151f1c197fd116b251d55fdeb3b74 100644
@@ -73,8 +73,8 @@ int alloc_event_buffer(void)
        unsigned long flags;
 
        spin_lock_irqsave(&oprofilefs_lock, flags);
-       buffer_size = fs_buffer_size;
-       buffer_watershed = fs_buffer_watershed;
+       buffer_size = oprofile_buffer_size;
+       buffer_watershed = oprofile_buffer_watershed;
        spin_unlock_irqrestore(&oprofilefs_lock, flags);
 
        if (buffer_watershed >= buffer_size)
index cd375907f26fcd3cc452f021fa893e51870941f9..3cffce90f82a9693999bb16c5741b6c2e4e842c2 100644
@@ -23,7 +23,7 @@
 struct oprofile_operations oprofile_ops;
 
 unsigned long oprofile_started;
-unsigned long backtrace_depth;
+unsigned long oprofile_backtrace_depth;
 static unsigned long is_setup;
 static DEFINE_MUTEX(start_mutex);
 
@@ -172,7 +172,7 @@ int oprofile_set_backtrace(unsigned long val)
                goto out;
        }
 
-       backtrace_depth = val;
+       oprofile_backtrace_depth = val;
 
 out:
        mutex_unlock(&start_mutex);
index 5df0c21a608ffbf48fa05b516fc9ff057509d148..c288d3c24b505ee5778eebe7f7b0f5fb3d82c23a 100644
@@ -21,12 +21,12 @@ void oprofile_stop(void);
 
 struct oprofile_operations;
 
-extern unsigned long fs_buffer_size;
-extern unsigned long fs_cpu_buffer_size;
-extern unsigned long fs_buffer_watershed;
+extern unsigned long oprofile_buffer_size;
+extern unsigned long oprofile_cpu_buffer_size;
+extern unsigned long oprofile_buffer_watershed;
 extern struct oprofile_operations oprofile_ops;
 extern unsigned long oprofile_started;
-extern unsigned long backtrace_depth;
+extern unsigned long oprofile_backtrace_depth;
 
 struct super_block;
 struct dentry;
index d8201998b0b7f8d93b5b7f1a654851374256dff9..5d36ffc30dd5cc82e5e8806aea88c4863a2cf219 100644
 #include "oprofile_stats.h"
 #include "oprof.h"
 
-#define FS_BUFFER_SIZE_DEFAULT         131072
-#define FS_CPU_BUFFER_SIZE_DEFAULT     8192
-#define FS_BUFFER_WATERSHED_DEFAULT    32768   /* FIXME: tune */
+#define BUFFER_SIZE_DEFAULT            131072
+#define CPU_BUFFER_SIZE_DEFAULT                8192
+#define BUFFER_WATERSHED_DEFAULT       32768   /* FIXME: tune */
 
-unsigned long fs_buffer_size;
-unsigned long fs_cpu_buffer_size;
-unsigned long fs_buffer_watershed;
+unsigned long oprofile_buffer_size;
+unsigned long oprofile_cpu_buffer_size;
+unsigned long oprofile_buffer_watershed;
 
 static ssize_t depth_read(struct file *file, char __user *buf, size_t count, loff_t *offset)
 {
-       return oprofilefs_ulong_to_user(backtrace_depth, buf, count, offset);
+       return oprofilefs_ulong_to_user(oprofile_backtrace_depth, buf, count,
+                                       offset);
 }
 
 
@@ -125,16 +126,16 @@ static const struct file_operations dump_fops = {
 void oprofile_create_files(struct super_block *sb, struct dentry *root)
 {
        /* reinitialize default values */
-       fs_buffer_size =        FS_BUFFER_SIZE_DEFAULT;
-       fs_cpu_buffer_size =    FS_CPU_BUFFER_SIZE_DEFAULT;
-       fs_buffer_watershed =   FS_BUFFER_WATERSHED_DEFAULT;
+       oprofile_buffer_size =          BUFFER_SIZE_DEFAULT;
+       oprofile_cpu_buffer_size =      CPU_BUFFER_SIZE_DEFAULT;
+       oprofile_buffer_watershed =     BUFFER_WATERSHED_DEFAULT;
 
        oprofilefs_create_file(sb, root, "enable", &enable_fops);
        oprofilefs_create_file_perm(sb, root, "dump", &dump_fops, 0666);
        oprofilefs_create_file(sb, root, "buffer", &event_buffer_fops);
-       oprofilefs_create_ulong(sb, root, "buffer_size", &fs_buffer_size);
-       oprofilefs_create_ulong(sb, root, "buffer_watershed", &fs_buffer_watershed);
-       oprofilefs_create_ulong(sb, root, "cpu_buffer_size", &fs_cpu_buffer_size);
+       oprofilefs_create_ulong(sb, root, "buffer_size", &oprofile_buffer_size);
+       oprofilefs_create_ulong(sb, root, "buffer_watershed", &oprofile_buffer_watershed);
+       oprofilefs_create_ulong(sb, root, "cpu_buffer_size", &oprofile_cpu_buffer_size);
        oprofilefs_create_file(sb, root, "cpu_type", &cpu_type_fops);
        oprofilefs_create_file(sb, root, "backtrace_depth", &depth_fops);
        oprofilefs_create_file(sb, root, "pointer_size", &pointer_size_fops);
index c68c496b2c499c30316e86bcb3f4237f79c9c4ff..7aa35248181bd4b021e5d638343e1bfd57459030 100644
@@ -1412,6 +1412,97 @@ int wm8350_register_regulator(struct wm8350 *wm8350, int reg,
 }
 EXPORT_SYMBOL_GPL(wm8350_register_regulator);
 
+/**
+ * wm8350_register_led - Register a WM8350 LED output
+ *
+ * @param wm8350 The WM8350 device to configure.
+ * @param lednum LED device index to create.
+ * @param dcdc The DCDC to use for the LED.
+ * @param isink The ISINK to use for the LED.
+ * @param pdata Configuration for the LED.
+ *
+ * The WM8350 supports the use of an ISINK together with a DCDC to
+ * provide a power-efficient LED driver.  This function registers the
+ * regulators and instantiates the platform device for a LED.  The
+ * operating modes for the LED regulators must be configured using
+ * wm8350_isink_set_flash(), wm8350_dcdc25_set_mode() and
+ * wm8350_dcdc_set_slot() prior to calling this function.
+ */
+int wm8350_register_led(struct wm8350 *wm8350, int lednum, int dcdc, int isink,
+                       struct wm8350_led_platform_data *pdata)
+{
+       struct wm8350_led *led;
+       struct platform_device *pdev;
+       int ret;
+
+       if (lednum > ARRAY_SIZE(wm8350->pmic.led) || lednum < 0) {
+               dev_err(wm8350->dev, "Invalid LED index %d\n", lednum);
+               return -ENODEV;
+       }
+
+       led = &wm8350->pmic.led[lednum];
+
+       if (led->pdev) {
+               dev_err(wm8350->dev, "LED %d already allocated\n", lednum);
+               return -EINVAL;
+       }
+
+       pdev = platform_device_alloc("wm8350-led", lednum);
+       if (pdev == NULL) {
+               dev_err(wm8350->dev, "Failed to allocate LED %d\n", lednum);
+               return -ENOMEM;
+       }
+
+       led->isink_consumer.dev = &pdev->dev;
+       led->isink_consumer.supply = "led_isink";
+       led->isink_init.num_consumer_supplies = 1;
+       led->isink_init.consumer_supplies = &led->isink_consumer;
+       led->isink_init.constraints.min_uA = 0;
+       led->isink_init.constraints.max_uA = pdata->max_uA;
+       led->isink_init.constraints.valid_ops_mask = REGULATOR_CHANGE_CURRENT;
+       led->isink_init.constraints.valid_modes_mask = REGULATOR_MODE_NORMAL;
+       ret = wm8350_register_regulator(wm8350, isink, &led->isink_init);
+       if (ret != 0) {
+               platform_device_put(pdev);
+               return ret;
+       }
+
+       led->dcdc_consumer.dev = &pdev->dev;
+       led->dcdc_consumer.supply = "led_vcc";
+       led->dcdc_init.num_consumer_supplies = 1;
+       led->dcdc_init.consumer_supplies = &led->dcdc_consumer;
+       led->dcdc_init.constraints.valid_modes_mask = REGULATOR_MODE_NORMAL;
+       ret = wm8350_register_regulator(wm8350, dcdc, &led->dcdc_init);
+       if (ret != 0) {
+               platform_device_put(pdev);
+               return ret;
+       }
+
+       switch (isink) {
+       case WM8350_ISINK_A:
+               wm8350->pmic.isink_A_dcdc = dcdc;
+               break;
+       case WM8350_ISINK_B:
+               wm8350->pmic.isink_B_dcdc = dcdc;
+               break;
+       }
+
+       pdev->dev.platform_data = pdata;
+       pdev->dev.parent = wm8350->dev;
+       ret = platform_device_add(pdev);
+       if (ret != 0) {
+               dev_err(wm8350->dev, "Failed to register LED %d: %d\n",
+                       lednum, ret);
+               platform_device_put(pdev);
+               return ret;
+       }
+
+       led->pdev = pdev;
+
+       return 0;
+}
+EXPORT_SYMBOL_GPL(wm8350_register_led);
+
 static struct platform_driver wm8350_regulator_driver = {
        .probe = wm8350_regulator_probe,
        .remove = wm8350_regulator_remove,
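
The kernel-doc above fixes the setup order: regulator operating modes first, then wm8350_register_led(). A hypothetical board-file snippet following that order (all values are examples; the mode-setting calls are elided because their arguments are board-specific):

/* Hypothetical machine-setup snippet; all values are examples. */
#include <linux/mfd/wm8350/core.h>
#include <linux/mfd/wm8350/pmic.h>

static struct wm8350_led_platform_data board_led_pdata = {
	.name			= "board::status",
	.default_trigger	= "heartbeat",
	.max_uA			= 27899,	/* one of the isink_cur[] steps */
};

static int board_init_led(struct wm8350 *wm8350)
{
	/* Step 1 (elided, arguments are board-specific): configure the
	 * regulator operating modes via wm8350_isink_set_flash(),
	 * wm8350_dcdc25_set_mode() and wm8350_dcdc_set_slot().
	 */

	/* Step 2: register LED 0 on DCDC5 + ISINK A. */
	return wm8350_register_led(wm8350, 0, WM8350_DCDC_5, WM8350_ISINK_A,
				   &board_led_pdata);
}
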
index 570ae59c1d5eb5aa0841c1ee661134d93922ea4c..bd5914994142707556b36b95fcb8b3e7042899fd 100644
@@ -336,6 +336,9 @@ static int
 dasd_state_ready_to_online(struct dasd_device * device)
 {
        int rc;
+       struct gendisk *disk;
+       struct disk_part_iter piter;
+       struct hd_struct *part;
 
        if (device->discipline->ready_to_online) {
                rc = device->discipline->ready_to_online(device);
@@ -343,8 +346,14 @@ dasd_state_ready_to_online(struct dasd_device * device)
                        return rc;
        }
        device->state = DASD_STATE_ONLINE;
-       if (device->block)
+       if (device->block) {
                dasd_schedule_block_bh(device->block);
+               disk = device->block->bdev->bd_disk;
+               disk_part_iter_init(&piter, disk, DISK_PITER_INCL_PART0);
+               while ((part = disk_part_iter_next(&piter)))
+                       kobject_uevent(&part_to_dev(part)->kobj, KOBJ_CHANGE);
+               disk_part_iter_exit(&piter);
+       }
        return 0;
 }
 
@@ -354,6 +363,9 @@ dasd_state_ready_to_online(struct dasd_device * device)
 static int dasd_state_online_to_ready(struct dasd_device *device)
 {
        int rc;
+       struct gendisk *disk;
+       struct disk_part_iter piter;
+       struct hd_struct *part;
 
        if (device->discipline->online_to_ready) {
                rc = device->discipline->online_to_ready(device);
@@ -361,6 +373,13 @@ static int dasd_state_online_to_ready(struct dasd_device *device)
                        return rc;
        }
        device->state = DASD_STATE_READY;
+       if (device->block) {
+               disk = device->block->bdev->bd_disk;
+               disk_part_iter_init(&piter, disk, DISK_PITER_INCL_PART0);
+               while ((part = disk_part_iter_next(&piter)))
+                       kobject_uevent(&part_to_dev(part)->kobj, KOBJ_CHANGE);
+               disk_part_iter_exit(&piter);
+       }
        return 0;
 }
 
index 2ef25731d197ed0f6a9ce2fdfdc1fc4358b9e460..300e28a531f80c938f880b68c246f921686c63bb 100644
@@ -206,6 +206,8 @@ dasd_feature_list(char *str, char **endp)
                        features |= DASD_FEATURE_USEDIAG;
                else if (len == 6 && !strncmp(str, "erplog", 6))
                        features |= DASD_FEATURE_ERPLOG;
+               else if (len == 8 && !strncmp(str, "failfast", 8))
+                       features |= DASD_FEATURE_FAILFAST;
                else {
                        MESSAGE(KERN_WARNING,
                                "unsupported feature: %*s, "
@@ -666,6 +668,51 @@ dasd_device_from_cdev(struct ccw_device *cdev)
  * SECTION: files in sysfs
  */
 
+/*
+ * failfast controls the behaviour if no path is available
+ */
+static ssize_t dasd_ff_show(struct device *dev, struct device_attribute *attr,
+                           char *buf)
+{
+       struct dasd_devmap *devmap;
+       int ff_flag;
+
+       devmap = dasd_find_busid(dev->bus_id);
+       if (!IS_ERR(devmap))
+               ff_flag = (devmap->features & DASD_FEATURE_FAILFAST) != 0;
+       else
+               ff_flag = (DASD_FEATURE_DEFAULT & DASD_FEATURE_FAILFAST) != 0;
+       return snprintf(buf, PAGE_SIZE, ff_flag ? "1\n" : "0\n");
+}
+
+static ssize_t dasd_ff_store(struct device *dev, struct device_attribute *attr,
+             const char *buf, size_t count)
+{
+       struct dasd_devmap *devmap;
+       int val;
+       char *endp;
+
+       devmap = dasd_devmap_from_cdev(to_ccwdev(dev));
+       if (IS_ERR(devmap))
+               return PTR_ERR(devmap);
+
+       val = simple_strtoul(buf, &endp, 0);
+       if (((endp + 1) < (buf + count)) || (val > 1))
+               return -EINVAL;
+
+       spin_lock(&dasd_devmap_lock);
+       if (val)
+               devmap->features |= DASD_FEATURE_FAILFAST;
+       else
+               devmap->features &= ~DASD_FEATURE_FAILFAST;
+       if (devmap->device)
+               devmap->device->features = devmap->features;
+       spin_unlock(&dasd_devmap_lock);
+       return count;
+}
+
+static DEVICE_ATTR(failfast, 0644, dasd_ff_show, dasd_ff_store);
+
 /*
  * readonly controls the readonly status of a dasd
  */
@@ -1020,6 +1067,7 @@ static struct attribute * dasd_attrs[] = {
        &dev_attr_use_diag.attr,
        &dev_attr_eer_enabled.attr,
        &dev_attr_erplog.attr,
+       &dev_attr_failfast.attr,
        NULL,
 };
 
index 7844461a995b7caf0b0b27ee32dfaf53f2d6cf69..ef2a5695205442a05c4ff817444066076df5d6e7 100644
@@ -544,7 +544,8 @@ static struct dasd_ccw_req *dasd_diag_build_cp(struct dasd_device *memdev,
        }
        cqr->retries = DIAG_MAX_RETRIES;
        cqr->buildclk = get_clock();
-       if (blk_noretry_request(req))
+       if (blk_noretry_request(req) ||
+           block->base->features & DASD_FEATURE_FAILFAST)
                set_bit(DASD_CQR_FLAGS_FAILFAST, &cqr->flags);
        cqr->startdev = memdev;
        cqr->memdev = memdev;
index bd2c52e20762e3593cb531be9b77552db0a2089a..bdb87998f364f276067470e672a0db49bfc21e26 100644
@@ -1700,7 +1700,8 @@ static struct dasd_ccw_req *dasd_eckd_build_cp(struct dasd_device *startdev,
                        recid++;
                }
        }
-       if (blk_noretry_request(req))
+       if (blk_noretry_request(req) ||
+           block->base->features & DASD_FEATURE_FAILFAST)
                set_bit(DASD_CQR_FLAGS_FAILFAST, &cqr->flags);
        cqr->startdev = startdev;
        cqr->memdev = startdev;
index 7d442aeff3d1163c6f6b728cdd0b9d2a1f2017e5..f1d176021694886fc1b6c2ba1f30ee6318c322c8 100644 (file)
@@ -355,7 +355,8 @@ static struct dasd_ccw_req *dasd_fba_build_cp(struct dasd_device * memdev,
                        recid++;
                }
        }
-       if (blk_noretry_request(req))
+       if (blk_noretry_request(req) ||
+           block->base->features & DASD_FEATURE_FAILFAST)
                set_bit(DASD_CQR_FLAGS_FAILFAST, &cqr->flags);
        cqr->startdev = memdev;
        cqr->memdev = memdev;
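
The same two-line test is added to all three request builders (DIAG, ECKD and FBA above): a request is marked fail-fast either because the block layer flagged it no-retry or because the whole device carries the new feature bit. A hypothetical helper capturing the shared pattern, using only symbols visible in these hunks (and the driver's private dasd_int.h):

    /* hypothetical helper; the patch open-codes this in each builder */
    static inline void example_set_failfast(struct dasd_ccw_req *cqr,
                                            struct request *req,
                                            struct dasd_block *block)
    {
            if (blk_noretry_request(req) ||
                (block->base->features & DASD_FEATURE_FAILFAST))
                    set_bit(DASD_CQR_FLAGS_FAILFAST, &cqr->flags);
    }
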
index 643033890e341d28188a3769ea35dcc56cf66fe0..0769ced52dbd64c3b5583f7c447971fd2e8eeb4d 100644 (file)
@@ -100,7 +100,7 @@ comment "S/390 tape interface support"
 
 config S390_TAPE_BLOCK
        bool "Support for tape block devices"
-       depends on S390_TAPE
+       depends on S390_TAPE && BLOCK
        help
          Select this option if you want to access your channel-attached tape
          devices using the block device interface.  This interface is similar
index f8a3b6967f691aa504711cf6b49867c8879ace5c..da7afb04e71ff8003c0b6139d169590a0b1caf2e 100644 (file)
@@ -169,6 +169,8 @@ static void setup_debugfs_entry(struct qdio_q *q, struct ccw_device *cdev)
                 q->nr);
        debugfs_queues[i] = debugfs_create_file(name, S_IFREG | S_IRUGO | S_IWUSR,
                                                debugfs_root, q, &debugfs_fops);
+       if (IS_ERR(debugfs_queues[i]))
+               debugfs_queues[i] = NULL;
 }
 
 void qdio_setup_debug_entries(struct qdio_irq *irq_ptr, struct ccw_device *cdev)
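
debugfs_create_file() can hand back an ERR_PTR as well as a valid dentry (its stub returns ERR_PTR(-ENODEV) when CONFIG_DEBUG_FS is disabled), and debugfs_remove(NULL) is a safe no-op, so normalizing errors to NULL keeps the teardown path simple. A stripped-down sketch of the idiom, with example_fops and the entry name hypothetical:

    #include <linux/debugfs.h>

    static struct dentry *example_entry;

    static void example_debugfs_setup(const struct file_operations *example_fops)
    {
            example_entry = debugfs_create_file("example", S_IRUGO, NULL,
                                                NULL, example_fops);
            if (IS_ERR(example_entry))
                    example_entry = NULL;   /* debugfs_remove(NULL) is a no-op */
    }
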
index 4a4dd9adc328cfc63f797cd2d785c14ed8082259..72facb9eb7db7661fbf8ef77cc4c66aca50b3184 100644 (file)
@@ -52,11 +52,11 @@ config LCD_ILI9320
          then say y to include a power driver for it.
 
 config LCD_TDO24M
-       tristate "Toppoly TDO24M LCD Panels support"
+       tristate "Toppoly TDO24M  and TDO35S LCD Panels support"
        depends on LCD_CLASS_DEVICE && SPI_MASTER
        default n
        help
-         If you have a Toppoly TDO24M series LCD panel, say y here to
+         If you have a Toppoly TDO24M/TDO35S series LCD panel, say y here to
          include the support for it.
 
 config LCD_VGG2432A4
@@ -123,17 +123,14 @@ config BACKLIGHT_ATMEL_PWM
          To compile this driver as a module, choose M here: the module will be
          called atmel-pwm-bl.
 
-config BACKLIGHT_CORGI
-       tristate "Generic (aka Sharp Corgi) Backlight Driver (DEPRECATED)"
+config BACKLIGHT_GENERIC
+       tristate "Generic (aka Sharp Corgi) Backlight Driver"
        depends on BACKLIGHT_CLASS_DEVICE
-       default n
+       default y
        help
          Say y to enable the generic platform backlight driver previously
          known as the Corgi backlight driver. If you have a Sharp Zaurus
-         SL-C7xx, SL-Cxx00 or SL-6000x say y. Most users can say n.
-
-         Note: this driver is marked as deprecated, try enable SPI and
-         use the new corgi_lcd driver with integrated backlight control
+         SL-C7xx, SL-Cxx00 or SL-6000x, say y.
 
 config BACKLIGHT_LOCOMO
        tristate "Sharp LOCOMO LCD/Backlight Driver"
index 103427de670357f4f974fd598b37ee41cd5f6ef2..363b3cb2f01b263ffd2b14b50c8bbdd694b7656f 100644 (file)
@@ -11,7 +11,7 @@ obj-$(CONFIG_LCD_TOSA)                   += tosa_lcd.o
 
 obj-$(CONFIG_BACKLIGHT_CLASS_DEVICE) += backlight.o
 obj-$(CONFIG_BACKLIGHT_ATMEL_PWM)    += atmel-pwm-bl.o
-obj-$(CONFIG_BACKLIGHT_CORGI)  += corgi_bl.o
+obj-$(CONFIG_BACKLIGHT_GENERIC)        += generic_bl.o
 obj-$(CONFIG_BACKLIGHT_HP680)  += hp680_bl.o
 obj-$(CONFIG_BACKLIGHT_LOCOMO) += locomolcd.o
 obj-$(CONFIG_BACKLIGHT_OMAP1)  += omap1_bl.o
index 0664fc032235d37903e14f607e0e798214e89abf..157057c79ca3760fc80d6df28236dc0bbf106044 100644 (file)
@@ -40,6 +40,10 @@ static int fb_notifier_callback(struct notifier_block *self,
                if (!bd->ops->check_fb ||
                    bd->ops->check_fb(evdata->info)) {
                        bd->props.fb_blank = *(int *)evdata->data;
+                       if (bd->props.fb_blank == FB_BLANK_UNBLANK)
+                               bd->props.state &= ~BL_CORE_FBBLANK;
+                       else
+                               bd->props.state |= BL_CORE_FBBLANK;
                        backlight_update_status(bd);
                }
        mutex_unlock(&bd->ops_lock);
@@ -80,20 +84,18 @@ static ssize_t backlight_show_power(struct device *dev,
 static ssize_t backlight_store_power(struct device *dev,
                struct device_attribute *attr, const char *buf, size_t count)
 {
-       int rc = -ENXIO;
-       char *endp;
+       int rc;
        struct backlight_device *bd = to_backlight_device(dev);
-       int power = simple_strtoul(buf, &endp, 0);
-       size_t size = endp - buf;
+       unsigned long power;
 
-       if (*endp && isspace(*endp))
-               size++;
-       if (size != count)
-               return -EINVAL;
+       rc = strict_strtoul(buf, 0, &power);
+       if (rc)
+               return rc;
 
+       rc = -ENXIO;
        mutex_lock(&bd->ops_lock);
        if (bd->ops) {
-               pr_debug("backlight: set power to %d\n", power);
+               pr_debug("backlight: set power to %lu\n", power);
                if (bd->props.power != power) {
                        bd->props.power = power;
                        backlight_update_status(bd);
@@ -116,28 +118,25 @@ static ssize_t backlight_show_brightness(struct device *dev,
 static ssize_t backlight_store_brightness(struct device *dev,
                struct device_attribute *attr, const char *buf, size_t count)
 {
-       int rc = -ENXIO;
-       char *endp;
+       int rc;
        struct backlight_device *bd = to_backlight_device(dev);
-       int brightness = simple_strtoul(buf, &endp, 0);
-       size_t size = endp - buf;
+       unsigned long brightness;
+
+       rc = strict_strtoul(buf, 0, &brightness);
+       if (rc)
+               return rc;
 
-       if (*endp && isspace(*endp))
-               size++;
-       if (size != count)
-               return -EINVAL;
+       rc = -ENXIO;
 
        mutex_lock(&bd->ops_lock);
        if (bd->ops) {
                if (brightness > bd->props.max_brightness)
                        rc = -EINVAL;
                else {
-                       pr_debug("backlight: set brightness to %d\n",
+                       pr_debug("backlight: set brightness to %lu\n",
                                 brightness);
-                       if (bd->props.brightness != brightness) {
-                               bd->props.brightness = brightness;
-                               backlight_update_status(bd);
-                       }
+                       bd->props.brightness = brightness;
+                       backlight_update_status(bd);
                        rc = count;
                }
        }
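
Both store handlers now delegate parsing to strict_strtoul(), which fails on trailing garbage but tolerates the single newline that sysfs writes usually carry, replacing the manual endp/size accounting removed above. A minimal sketch of the behaviour being relied on:

    #include <linux/kernel.h>   /* strict_strtoul() */

    static int example_parse(const char *buf)
    {
            unsigned long val;

            /* returns 0 for "128" or "128\n"; returns -EINVAL for "128x",
             * where simple_strtoul() would silently stop at the 'x' */
            return strict_strtoul(buf, 0, &val);
    }
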
@@ -170,6 +169,34 @@ static ssize_t backlight_show_actual_brightness(struct device *dev,
 
 static struct class *backlight_class;
 
+static int backlight_suspend(struct device *dev, pm_message_t state)
+{
+       struct backlight_device *bd = to_backlight_device(dev);
+
+       if (bd->ops->options & BL_CORE_SUSPENDRESUME) {
+               mutex_lock(&bd->ops_lock);
+               bd->props.state |= BL_CORE_SUSPENDED;
+               backlight_update_status(bd);
+               mutex_unlock(&bd->ops_lock);
+       }
+
+       return 0;
+}
+
+static int backlight_resume(struct device *dev)
+{
+       struct backlight_device *bd = to_backlight_device(dev);
+
+       if (bd->ops->options & BL_CORE_SUSPENDRESUME) {
+               mutex_lock(&bd->ops_lock);
+               bd->props.state &= ~BL_CORE_SUSPENDED;
+               backlight_update_status(bd);
+               mutex_unlock(&bd->ops_lock);
+       }
+
+       return 0;
+}
+
 static void bl_device_release(struct device *dev)
 {
        struct backlight_device *bd = to_backlight_device(dev);
@@ -286,6 +313,8 @@ static int __init backlight_class_init(void)
        }
 
        backlight_class->dev_attrs = bl_device_attributes;
+       backlight_class->suspend = backlight_suspend;
+       backlight_class->resume = backlight_resume;
        return 0;
 }
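
With these class hooks in place, a driver opts in to core-managed power handling by setting BL_CORE_SUSPENDRESUME in its ops and honouring the state bits in update_status(). A minimal sketch, all example_* names hypothetical:

    #include <linux/backlight.h>

    static void example_hw_set_level(int level);   /* hypothetical hw hook */

    static int example_update_status(struct backlight_device *bd)
    {
            int level = bd->props.brightness;

            /* the core sets these bits on suspend and on fb blank events */
            if (bd->props.state & (BL_CORE_SUSPENDED | BL_CORE_FBBLANK))
                    level = 0;

            example_hw_set_level(level);
            return 0;
    }

    static struct backlight_ops example_ops = {
            .options        = BL_CORE_SUSPENDRESUME,
            .update_status  = example_update_status,
    };
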
 
diff --git a/drivers/video/backlight/corgi_bl.c b/drivers/video/backlight/corgi_bl.c
deleted file mode 100644 (file)
index 4d4d037..0000000
+++ /dev/null
@@ -1,169 +0,0 @@
-/*
- *  Backlight Driver for Sharp Zaurus Handhelds (various models)
- *
- *  Copyright (c) 2004-2006 Richard Purdie
- *
- *  Based on Sharp's 2.4 Backlight Driver
- *
- *  This program is free software; you can redistribute it and/or modify
- *  it under the terms of the GNU General Public License version 2 as
- *  published by the Free Software Foundation.
- *
- */
-
-#include <linux/module.h>
-#include <linux/kernel.h>
-#include <linux/init.h>
-#include <linux/platform_device.h>
-#include <linux/mutex.h>
-#include <linux/fb.h>
-#include <linux/backlight.h>
-
-static int corgibl_intensity;
-static struct backlight_properties corgibl_data;
-static struct backlight_device *corgi_backlight_device;
-static struct generic_bl_info *bl_machinfo;
-
-static unsigned long corgibl_flags;
-#define CORGIBL_SUSPENDED     0x01
-#define CORGIBL_BATTLOW       0x02
-
-static int corgibl_send_intensity(struct backlight_device *bd)
-{
-       int intensity = bd->props.brightness;
-
-       if (bd->props.power != FB_BLANK_UNBLANK)
-               intensity = 0;
-       if (bd->props.fb_blank != FB_BLANK_UNBLANK)
-               intensity = 0;
-       if (corgibl_flags & CORGIBL_SUSPENDED)
-               intensity = 0;
-       if (corgibl_flags & CORGIBL_BATTLOW)
-               intensity &= bl_machinfo->limit_mask;
-
-       bl_machinfo->set_bl_intensity(intensity);
-
-       corgibl_intensity = intensity;
-
-       if (bl_machinfo->kick_battery)
-               bl_machinfo->kick_battery();
-
-       return 0;
-}
-
-#ifdef CONFIG_PM
-static int corgibl_suspend(struct platform_device *pdev, pm_message_t state)
-{
-       struct backlight_device *bd = platform_get_drvdata(pdev);
-
-       corgibl_flags |= CORGIBL_SUSPENDED;
-       backlight_update_status(bd);
-       return 0;
-}
-
-static int corgibl_resume(struct platform_device *pdev)
-{
-       struct backlight_device *bd = platform_get_drvdata(pdev);
-
-       corgibl_flags &= ~CORGIBL_SUSPENDED;
-       backlight_update_status(bd);
-       return 0;
-}
-#else
-#define corgibl_suspend        NULL
-#define corgibl_resume NULL
-#endif
-
-static int corgibl_get_intensity(struct backlight_device *bd)
-{
-       return corgibl_intensity;
-}
-
-/*
- * Called when the battery is low to limit the backlight intensity.
- * If limit==0 clear any limit, otherwise limit the intensity
- */
-void corgibl_limit_intensity(int limit)
-{
-       if (limit)
-               corgibl_flags |= CORGIBL_BATTLOW;
-       else
-               corgibl_flags &= ~CORGIBL_BATTLOW;
-       backlight_update_status(corgi_backlight_device);
-}
-EXPORT_SYMBOL(corgibl_limit_intensity);
-
-
-static struct backlight_ops corgibl_ops = {
-       .get_brightness = corgibl_get_intensity,
-       .update_status  = corgibl_send_intensity,
-};
-
-static int corgibl_probe(struct platform_device *pdev)
-{
-       struct generic_bl_info *machinfo = pdev->dev.platform_data;
-       const char *name = "generic-bl";
-
-       bl_machinfo = machinfo;
-       if (!machinfo->limit_mask)
-               machinfo->limit_mask = -1;
-
-       if (machinfo->name)
-               name = machinfo->name;
-
-       corgi_backlight_device = backlight_device_register (name,
-               &pdev->dev, NULL, &corgibl_ops);
-       if (IS_ERR (corgi_backlight_device))
-               return PTR_ERR (corgi_backlight_device);
-
-       platform_set_drvdata(pdev, corgi_backlight_device);
-
-       corgi_backlight_device->props.max_brightness = machinfo->max_intensity;
-       corgi_backlight_device->props.power = FB_BLANK_UNBLANK;
-       corgi_backlight_device->props.brightness = machinfo->default_intensity;
-       backlight_update_status(corgi_backlight_device);
-
-       printk("Corgi Backlight Driver Initialized.\n");
-       return 0;
-}
-
-static int corgibl_remove(struct platform_device *pdev)
-{
-       struct backlight_device *bd = platform_get_drvdata(pdev);
-
-       corgibl_data.power = 0;
-       corgibl_data.brightness = 0;
-       backlight_update_status(bd);
-
-       backlight_device_unregister(bd);
-
-       printk("Corgi Backlight Driver Unloaded\n");
-       return 0;
-}
-
-static struct platform_driver corgibl_driver = {
-       .probe          = corgibl_probe,
-       .remove         = corgibl_remove,
-       .suspend        = corgibl_suspend,
-       .resume         = corgibl_resume,
-       .driver         = {
-               .name   = "generic-bl",
-       },
-};
-
-static int __init corgibl_init(void)
-{
-       return platform_driver_register(&corgibl_driver);
-}
-
-static void __exit corgibl_exit(void)
-{
-       platform_driver_unregister(&corgibl_driver);
-}
-
-module_init(corgibl_init);
-module_exit(corgibl_exit);
-
-MODULE_AUTHOR("Richard Purdie <rpurdie@rpsys.net>");
-MODULE_DESCRIPTION("Corgi Backlight Driver");
-MODULE_LICENSE("GPL");
index 26add8898605e0bc08789e6bdf4a7961886e575b..b9fe62b475c63a3fa953ea182958496cd4037d52 100644 (file)
@@ -259,22 +259,18 @@ static int __init cr_backlight_init(void)
 {
        int ret = platform_driver_register(&cr_backlight_driver);
 
-       if (!ret) {
-               crp = platform_device_alloc("cr_backlight", -1);
-               if (!crp)
-                       return -ENOMEM;
+       if (ret)
+               return ret;
 
-               ret = platform_device_add(crp);
-
-               if (ret) {
-                       platform_device_put(crp);
-                       platform_driver_unregister(&cr_backlight_driver);
-               }
+       crp = platform_device_register_simple("cr_backlight", -1, NULL, 0);
+       if (IS_ERR(crp)) {
+               platform_driver_unregister(&cr_backlight_driver);
+               return PTR_ERR(crp);
        }
 
        printk("Carillo Ranch Backlight Driver Initialized.\n");
 
-       return ret;
+       return 0;
 }
 
 static void __exit cr_backlight_exit(void)
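
The hp680_bl and progear_bl inits below get the same treatment: platform_device_register_simple() folds platform_device_alloc() plus platform_device_add() into one call that returns an ERR_PTR on failure, so the error path collapses to a single branch. A generic sketch, assuming a hypothetical example_driver:

    #include <linux/platform_device.h>

    static struct platform_device *example_pdev;

    static int __init example_init(void)
    {
            int ret = platform_driver_register(&example_driver);

            if (ret)
                    return ret;

            example_pdev = platform_device_register_simple("example-bl", -1,
                                                           NULL, 0);
            if (IS_ERR(example_pdev)) {
                    platform_driver_unregister(&example_driver);
                    return PTR_ERR(example_pdev);
            }
            return 0;
    }
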
diff --git a/drivers/video/backlight/generic_bl.c b/drivers/video/backlight/generic_bl.c
new file mode 100644 (file)
index 0000000..6d27f62
--- /dev/null
@@ -0,0 +1,147 @@
+/*
+ *  Generic Backlight Driver
+ *
+ *  Copyright (c) 2004-2008 Richard Purdie
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License version 2 as
+ *  published by the Free Software Foundation.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/platform_device.h>
+#include <linux/mutex.h>
+#include <linux/fb.h>
+#include <linux/backlight.h>
+
+static int genericbl_intensity;
+static struct backlight_device *generic_backlight_device;
+static struct generic_bl_info *bl_machinfo;
+
+/* Flag to signal when the battery is low */
+#define GENERICBL_BATTLOW       BL_CORE_DRIVER1
+
+static int genericbl_send_intensity(struct backlight_device *bd)
+{
+       int intensity = bd->props.brightness;
+
+       if (bd->props.power != FB_BLANK_UNBLANK)
+               intensity = 0;
+       if (bd->props.state & BL_CORE_FBBLANK)
+               intensity = 0;
+       if (bd->props.state & BL_CORE_SUSPENDED)
+               intensity = 0;
+       if (bd->props.state & GENERICBL_BATTLOW)
+               intensity &= bl_machinfo->limit_mask;
+
+       bl_machinfo->set_bl_intensity(intensity);
+
+       genericbl_intensity = intensity;
+
+       if (bl_machinfo->kick_battery)
+               bl_machinfo->kick_battery();
+
+       return 0;
+}
+
+static int genericbl_get_intensity(struct backlight_device *bd)
+{
+       return genericbl_intensity;
+}
+
+/*
+ * Called when the battery is low to limit the backlight intensity.
+ * If limit == 0, clear any limit; otherwise limit the intensity.
+ */
+void corgibl_limit_intensity(int limit)
+{
+       struct backlight_device *bd = generic_backlight_device;
+
+       mutex_lock(&bd->ops_lock);
+       if (limit)
+               bd->props.state |= GENERICBL_BATTLOW;
+       else
+               bd->props.state &= ~GENERICBL_BATTLOW;
+       backlight_update_status(generic_backlight_device);
+       mutex_unlock(&bd->ops_lock);
+}
+EXPORT_SYMBOL(corgibl_limit_intensity);
+
+static struct backlight_ops genericbl_ops = {
+       .options = BL_CORE_SUSPENDRESUME,
+       .get_brightness = genericbl_get_intensity,
+       .update_status  = genericbl_send_intensity,
+};
+
+static int genericbl_probe(struct platform_device *pdev)
+{
+       struct generic_bl_info *machinfo = pdev->dev.platform_data;
+       const char *name = "generic-bl";
+       struct backlight_device *bd;
+
+       bl_machinfo = machinfo;
+       if (!machinfo->limit_mask)
+               machinfo->limit_mask = -1;
+
+       if (machinfo->name)
+               name = machinfo->name;
+
+       bd = backlight_device_register (name,
+               &pdev->dev, NULL, &genericbl_ops);
+       if (IS_ERR (bd))
+               return PTR_ERR (bd);
+
+       platform_set_drvdata(pdev, bd);
+
+       bd->props.max_brightness = machinfo->max_intensity;
+       bd->props.power = FB_BLANK_UNBLANK;
+       bd->props.brightness = machinfo->default_intensity;
+       backlight_update_status(bd);
+
+       generic_backlight_device = bd;
+
+       printk("Generic Backlight Driver Initialized.\n");
+       return 0;
+}
+
+static int genericbl_remove(struct platform_device *pdev)
+{
+       struct backlight_device *bd = platform_get_drvdata(pdev);
+
+       bd->props.power = 0;
+       bd->props.brightness = 0;
+       backlight_update_status(bd);
+
+       backlight_device_unregister(bd);
+
+       printk("Generic Backlight Driver Unloaded\n");
+       return 0;
+}
+
+static struct platform_driver genericbl_driver = {
+       .probe          = genericbl_probe,
+       .remove         = genericbl_remove,
+       .driver         = {
+               .name   = "generic-bl",
+       },
+};
+
+static int __init genericbl_init(void)
+{
+       return platform_driver_register(&genericbl_driver);
+}
+
+static void __exit genericbl_exit(void)
+{
+       platform_driver_unregister(&genericbl_driver);
+}
+
+module_init(genericbl_init);
+module_exit(genericbl_exit);
+
+MODULE_AUTHOR("Richard Purdie <rpurdie@rpsys.net>");
+MODULE_DESCRIPTION("Generic Backlight Driver");
+MODULE_LICENSE("GPL");
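
Board code feeds this driver through struct generic_bl_info platform data, just as it fed the old corgi driver, and the corgibl_limit_intensity() export keeps existing battery-monitor callers linking. A wiring sketch with every name and value hypothetical:

    #include <linux/backlight.h>
    #include <linux/platform_device.h>

    static void example_set_bl_intensity(int intensity)
    {
            /* program the hypothetical board's PWM or GPIO here */
    }

    static struct generic_bl_info example_bl_info = {
            .name              = "example-bl",
            .max_intensity     = 255,
            .default_intensity = 127,
            .limit_mask        = 0x0f,  /* applied on corgibl_limit_intensity(1) */
            .set_bl_intensity  = example_set_bl_intensity,
    };

    static struct platform_device example_bl_device = {
            .name = "generic-bl",       /* matches genericbl_driver above */
            .id   = -1,
            .dev  = { .platform_data = &example_bl_info },
    };
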
index d4cfed0b26d5f8a83242c73769ca650805a521a5..5be55a20d8c77326510f48a6ad0ba600b5c36435 100644 (file)
@@ -151,19 +151,15 @@ static int __init hp680bl_init(void)
        int ret;
 
        ret = platform_driver_register(&hp680bl_driver);
-       if (!ret) {
-               hp680bl_device = platform_device_alloc("hp680-bl", -1);
-               if (!hp680bl_device)
-                       return -ENOMEM;
-
-               ret = platform_device_add(hp680bl_device);
-
-               if (ret) {
-                       platform_device_put(hp680bl_device);
-                       platform_driver_unregister(&hp680bl_driver);
-               }
+       if (ret)
+               return ret;
+       hp680bl_device = platform_device_register_simple("hp680-bl", -1,
+                                                       NULL, 0);
+       if (IS_ERR(hp680bl_device)) {
+               platform_driver_unregister(&hp680bl_driver);
+               return PTR_ERR(hp680bl_device);
        }
-       return ret;
+       return 0;
 }
 
 static void __exit hp680bl_exit(void)
index 06964af761c630295851142dac7690fc1bf11ab5..65864c500455677a1ef540579c0817f32fd62478 100644 (file)
@@ -70,6 +70,7 @@ static int mbp_get_intensity(struct backlight_device *bd)
 }
 
 static struct backlight_ops mbp_ops = {
+       .options = BL_CORE_SUSPENDRESUME,
        .get_brightness = mbp_get_intensity,
        .update_status  = mbp_send_intensity,
 };
index 15fb4d58b5bcce35f4e8c0a9a00bd0d76246370e..9edaf24fd82d4e49f0bf439162fa156088cbc25e 100644 (file)
@@ -119,20 +119,16 @@ static int __init progearbl_init(void)
 {
        int ret = platform_driver_register(&progearbl_driver);
 
-       if (!ret) {
-               progearbl_device = platform_device_alloc("progear-bl", -1);
-               if (!progearbl_device)
-                       return -ENOMEM;
-
-               ret = platform_device_add(progearbl_device);
-
-               if (ret) {
-                       platform_device_put(progearbl_device);
-                       platform_driver_unregister(&progearbl_driver);
-               }
+       if (ret)
+               return ret;
+       progearbl_device = platform_device_register_simple("progear-bl", -1,
+                                                               NULL, 0);
+       if (IS_ERR(progearbl_device)) {
+               platform_driver_unregister(&progearbl_driver);
+               return PTR_ERR(progearbl_device);
        }
 
-       return ret;
+       return 0;
 }
 
 static void __exit progearbl_exit(void)
index 8427669162ea494b7f9630bc4ea0842a1d0efbd1..1dae7f8f3c6b47c7f4861ff824fff0a4d4dcf8f3 100644 (file)
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/device.h>
 #include <linux/spi/spi.h>
+#include <linux/spi/tdo24m.h>
 #include <linux/fb.h>
 #include <linux/lcd.h>
 
@@ -31,6 +32,9 @@ struct tdo24m {
        struct spi_transfer     xfer;
        uint8_t                 *buf;
 
+       int (*adj_mode)(struct tdo24m *lcd, int mode);
+       int color_invert;
+
        int                     power;
        int                     mode;
 };
@@ -66,7 +70,7 @@ static uint32_t lcd_panel_off[] = {
        CMD_NULL,
 };
 
-static uint32_t lcd_vga_pass_through[] = {
+static uint32_t lcd_vga_pass_through_tdo24m[] = {
        CMD1(0xB0, 0x16),
        CMD1(0xBC, 0x80),
        CMD1(0xE1, 0x00),
@@ -75,7 +79,7 @@ static uint32_t lcd_vga_pass_through[] = {
        CMD_NULL,
 };
 
-static uint32_t lcd_qvga_pass_through[] = {
+static uint32_t lcd_qvga_pass_through_tdo24m[] = {
        CMD1(0xB0, 0x16),
        CMD1(0xBC, 0x81),
        CMD1(0xE1, 0x00),
@@ -84,7 +88,7 @@ static uint32_t lcd_qvga_pass_through[] = {
        CMD_NULL,
 };
 
-static uint32_t lcd_vga_transfer[] = {
+static uint32_t lcd_vga_transfer_tdo24m[] = {
        CMD1(0xcf, 0x02),       /* Blanking period control (1) */
        CMD2(0xd0, 0x08, 0x04), /* Blanking period control (2) */
        CMD1(0xd1, 0x01),       /* CKV timing control on/off */
@@ -110,6 +114,35 @@ static uint32_t lcd_qvga_transfer[] = {
        CMD_NULL,
 };
 
+static uint32_t lcd_vga_pass_through_tdo35s[] = {
+       CMD1(0xB0, 0x16),
+       CMD1(0xBC, 0x80),
+       CMD1(0xE1, 0x00),
+       CMD1(0x3B, 0x00),
+       CMD_NULL,
+};
+
+static uint32_t lcd_qvga_pass_through_tdo35s[] = {
+       CMD1(0xB0, 0x16),
+       CMD1(0xBC, 0x81),
+       CMD1(0xE1, 0x00),
+       CMD1(0x3B, 0x22),
+       CMD_NULL,
+};
+
+static uint32_t lcd_vga_transfer_tdo35s[] = {
+       CMD1(0xcf, 0x02),       /* Blanking period control (1) */
+       CMD2(0xd0, 0x08, 0x04), /* Blanking period control (2) */
+       CMD1(0xd1, 0x01),       /* CKV timing control on/off */
+       CMD2(0xd2, 0x00, 0x1e), /* CKV 1,2 timing control */
+       CMD2(0xd3, 0x14, 0x28), /* OEV timing control */
+       CMD2(0xd4, 0x28, 0x64), /* ASW timing control (1) */
+       CMD1(0xd5, 0x28),       /* ASW timing control (2) */
+       CMD0(0x21),             /* Invert for normally black display */
+       CMD0(0x29),             /* Display on */
+       CMD_NULL,
+};
+
 static uint32_t lcd_panel_config[] = {
        CMD2(0xb8, 0xff, 0xf9), /* Output control */
        CMD0(0x11),             /* sleep out */
@@ -148,6 +181,8 @@ static int tdo24m_writes(struct tdo24m *lcd, uint32_t *array)
        int nparams, err = 0;
 
        for (; *p != CMD_NULL; p++) {
+               if (!lcd->color_invert && *p == CMD0(0x21))
+                       continue;
 
                nparams = (*p >> 30) & 0x3;
 
@@ -184,12 +219,33 @@ static int tdo24m_adj_mode(struct tdo24m *lcd, int mode)
 {
        switch (mode) {
        case MODE_VGA:
-               tdo24m_writes(lcd, lcd_vga_pass_through);
+               tdo24m_writes(lcd, lcd_vga_pass_through_tdo24m);
                tdo24m_writes(lcd, lcd_panel_config);
-               tdo24m_writes(lcd, lcd_vga_transfer);
+               tdo24m_writes(lcd, lcd_vga_transfer_tdo24m);
                break;
        case MODE_QVGA:
-               tdo24m_writes(lcd, lcd_qvga_pass_through);
+               tdo24m_writes(lcd, lcd_qvga_pass_through_tdo24m);
+               tdo24m_writes(lcd, lcd_panel_config);
+               tdo24m_writes(lcd, lcd_qvga_transfer);
+               break;
+       default:
+               return -EINVAL;
+       }
+
+       lcd->mode = mode;
+       return 0;
+}
+
+static int tdo35s_adj_mode(struct tdo24m *lcd, int mode)
+{
+       switch (mode) {
+       case MODE_VGA:
+               tdo24m_writes(lcd, lcd_vga_pass_through_tdo35s);
+               tdo24m_writes(lcd, lcd_panel_config);
+               tdo24m_writes(lcd, lcd_vga_transfer_tdo35s);
+               break;
+       case MODE_QVGA:
+               tdo24m_writes(lcd, lcd_qvga_pass_through_tdo35s);
                tdo24m_writes(lcd, lcd_panel_config);
                tdo24m_writes(lcd, lcd_qvga_transfer);
                break;
@@ -213,7 +269,7 @@ static int tdo24m_power_on(struct tdo24m *lcd)
        if (err)
                goto out;
 
-       err = tdo24m_adj_mode(lcd, lcd->mode);
+       err = lcd->adj_mode(lcd, lcd->mode);
 out:
        return err;
 }
@@ -262,7 +318,7 @@ static int tdo24m_set_mode(struct lcd_device *ld, struct fb_videomode *m)
        if (lcd->mode == mode)
                return 0;
 
-       return tdo24m_adj_mode(lcd, mode);
+       return lcd->adj_mode(lcd, mode);
 }
 
 static struct lcd_ops tdo24m_ops = {
@@ -276,8 +332,16 @@ static int __devinit tdo24m_probe(struct spi_device *spi)
        struct tdo24m *lcd;
        struct spi_message *m;
        struct spi_transfer *x;
+       struct tdo24m_platform_data *pdata;
+       enum tdo24m_model model;
        int err;
 
+       pdata = spi->dev.platform_data;
+       if (pdata)
+               model = pdata->model;
+       else
+               model = TDO24M;
+
        spi->bits_per_word = 8;
        spi->mode = SPI_MODE_3;
        err = spi_setup(spi);
@@ -306,6 +370,20 @@ static int __devinit tdo24m_probe(struct spi_device *spi)
        x->tx_buf = &lcd->buf[0];
        spi_message_add_tail(x, m);
 
+       switch (model) {
+       case TDO24M:
+               lcd->color_invert = 1;
+               lcd->adj_mode = tdo24m_adj_mode;
+               break;
+       case TDO35S:
+               lcd->adj_mode = tdo35s_adj_mode;
+               lcd->color_invert = 0;
+               break;
+       default:
+               dev_err(&spi->dev, "Unsupported model");
+               goto out_free;
+       }
+
        lcd->lcd_dev = lcd_device_register("tdo24m", &spi->dev,
                                        lcd, &tdo24m_ops);
        if (IS_ERR(lcd->lcd_dev)) {
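
The panel variant now comes from the platform data declared in the new <linux/spi/tdo24m.h> header and defaults to TDO24M when none is supplied, so existing boards are unaffected. A board-code sketch selecting the TDO35S variant, with bus number and clock rate hypothetical:

    #include <linux/spi/spi.h>
    #include <linux/spi/tdo24m.h>

    static struct tdo24m_platform_data example_lcd_pdata = {
            .model = TDO35S,
    };

    static struct spi_board_info example_spi_devices[] __initdata = {
            {
                    .modalias      = "tdo24m",      /* bound by driver name */
                    .max_speed_hz  = 1000000,
                    .bus_num       = 2,
                    .chip_select   = 0,
                    .platform_data = &example_lcd_pdata,
            },
    };
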
index 57a26649f1a5ed6eeb4d6eae7517c79534090cad..b7fbc75a62fc7ec6c0dd6097af8a355d677724a8 100644 (file)
@@ -39,6 +39,7 @@ struct tosa_lcd_data {
        struct i2c_client *i2c;
 
        int lcd_power;
+       bool is_vga;
 };
 
 static int tosa_tg_send(struct spi_device *spi, int adrs, uint8_t data)
@@ -81,8 +82,12 @@ static void tosa_lcd_tg_init(struct tosa_lcd_data *data)
 static void tosa_lcd_tg_on(struct tosa_lcd_data *data)
 {
        struct spi_device *spi = data->spi;
-       const int value = TG_REG0_COLOR | TG_REG0_UD | TG_REG0_LR;
-       tosa_tg_send(spi, TG_PNLCTL, value | TG_REG0_VQV); /* this depends on mode */
+       int value = TG_REG0_COLOR | TG_REG0_UD | TG_REG0_LR;
+
+       if (data->is_vga)
+               value |= TG_REG0_VQV;
+
+       tosa_tg_send(spi, TG_PNLCTL, value);
 
        /* TG LCD panel power up */
        tosa_tg_send(spi, TG_PINICTL,0x4);
@@ -142,9 +147,25 @@ static int tosa_lcd_get_power(struct lcd_device *lcd)
        return data->lcd_power;
 }
 
+static int tosa_lcd_set_mode(struct lcd_device *lcd, struct fb_videomode *mode)
+{
+       struct tosa_lcd_data *data = lcd_get_data(lcd);
+
+       if (mode->xres == 320 || mode->yres == 320)
+               data->is_vga = false;
+       else
+               data->is_vga = true;
+
+       if (POWER_IS_ON(data->lcd_power))
+               tosa_lcd_tg_on(data);
+
+       return 0;
+}
+
 static struct lcd_ops tosa_lcd_ops = {
        .set_power = tosa_lcd_set_power,
        .get_power = tosa_lcd_get_power,
+       .set_mode = tosa_lcd_set_mode,
 };
 
 static int __devinit tosa_lcd_probe(struct spi_device *spi)
@@ -156,6 +177,8 @@ static int __devinit tosa_lcd_probe(struct spi_device *spi)
        if (!data)
                return -ENOMEM;
 
+       data->is_vga = true; /* default to VGA mode */
+
        /*
         * bits_per_word cannot be configured in platform data
         */
index 593c7687d54ae2f487fad7c5c47e45530c314043..8e653b8a6f17dcbdaa46f2942393e48c6d398a8b 100644 (file)
@@ -137,7 +137,7 @@ static int vgg2432a4_lcd_init(struct ili9320 *lcd,
 
        ili9320_write(lcd, ILI9320_RGB_IF1, cfg->rgb_if1);
        ili9320_write(lcd, ILI9320_FRAMEMAKER, 0x0);
-       ili9320_write(lcd, ILI9320_RGB_IF2, ILI9320_RGBIF2_DPL);
+       ili9320_write(lcd, ILI9320_RGB_IF2, cfg->rgb_if2);
 
        ret = ili9320_write_regs(lcd, vgg_init1, ARRAY_SIZE(vgg_init1));
        if (ret != 0)
index 32883589ee54d6a741774b5e96dc703ad3dc4411..51307b0fdf0f6f8c4c7022c91f9fa47618c82771 100644 (file)
@@ -269,6 +269,25 @@ config OCFS2_FS_POSIX_ACL
          Posix Access Control Lists (ACLs) support permissions for users and
          groups beyond the owner/group/world scheme.
 
+config BTRFS_FS
+       tristate "Btrfs filesystem (EXPERIMENTAL) Unstable disk format"
+       depends on EXPERIMENTAL
+       select LIBCRC32C
+       select ZLIB_INFLATE
+       select ZLIB_DEFLATE
+       help
+         Btrfs is a new filesystem with extents, writable snapshotting,
+         support for multiple devices and many more features.
+
+         Btrfs is highly experimental, and THE DISK FORMAT IS NOT YET
+         FINALIZED.  You should say N here unless you are interested in
+         testing Btrfs with non-critical data.
+
+         To compile this file system support as a module, choose M here. The
+         module will be called btrfs.
+
+         If unsure, say N.
+
 endif # BLOCK
 
 source "fs/notify/Kconfig"
@@ -913,6 +932,58 @@ config CRAMFS
 
          If unsure, say N.
 
+config SQUASHFS
+       tristate "SquashFS 4.0 - Squashed file system support"
+       depends on BLOCK
+       select ZLIB_INFLATE
+       help
+         Saying Y here includes support for SquashFS 4.0 (a Compressed
+         Read-Only File System).  Squashfs is a highly compressed read-only
+         filesystem for Linux.  It uses zlib compression to compress
+         files, inodes and directories.  Inodes in the system are very small
+         and all blocks are packed to minimise data overhead.  Block sizes
+         greater than 4K are supported, up to a maximum of 1 Mbyte (default
+         block size 128K).  SquashFS 4.0 supports 64 bit filesystems and files
+         (larger than 4GB), full uid/gid information, hard links and
+         timestamps.
+
+         Squashfs is intended for general read-only filesystem use, for
+         archival use (i.e. in cases where a .tar.gz file may be used), and in
+         embedded systems where low overhead is needed.  Further information
+         and tools are available from http://squashfs.sourceforge.net.
+
+         If you want to compile this as a module ( = code which can be
+         inserted in and removed from the running kernel whenever you want),
+         say M here and read <file:Documentation/modules.txt>.  The module
+         will be called squashfs.  Note that the root file system (the one
+         containing the directory /) cannot be compiled as a module.
+
+         If unsure, say N.
+
+config SQUASHFS_EMBEDDED
+       bool "Additional option for memory-constrained systems"
+       depends on SQUASHFS
+       default n
+       help
+         Saying Y here allows you to specify the cache size.
+
+         If unsure, say N.
+
+config SQUASHFS_FRAGMENT_CACHE_SIZE
+       int "Number of fragments cached" if SQUASHFS_EMBEDDED
+       depends on SQUASHFS
+       default "3"
+       help
+         By default SquashFS caches the last 3 fragments read from
+         the filesystem.  Increasing this amount may mean SquashFS
+         has to re-read fragments less often from disk, at the expense
+         of extra system memory.  Decreasing this amount will mean
+         SquashFS uses less memory at the expense of extra reads from disk.
+
+         Note there must be at least one cached fragment.  Anything
+         much more than three will probably not make much difference.
+
 config VXFS_FS
        tristate "FreeVxFS file system support (VERITAS VxFS(TM) compatible)"
        depends on BLOCK
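
For a memory-constrained build, the three SquashFS options above combine along these lines (a .config sketch; the fragment-cache value is only an example):

    CONFIG_SQUASHFS=y
    CONFIG_SQUASHFS_EMBEDDED=y
    CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=1
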
index c830611550d3d95726fe40a565d2ee9ff153158a..38bc735c67ad2ebfe79a4d64d35e6ad01117c8d6 100644 (file)
@@ -74,6 +74,7 @@ obj-$(CONFIG_JBD)             += jbd/
 obj-$(CONFIG_JBD2)             += jbd2/
 obj-$(CONFIG_EXT2_FS)          += ext2/
 obj-$(CONFIG_CRAMFS)           += cramfs/
+obj-$(CONFIG_SQUASHFS)         += squashfs/
 obj-y                          += ramfs/
 obj-$(CONFIG_HUGETLBFS)                += hugetlbfs/
 obj-$(CONFIG_CODA_FS)          += coda/
@@ -119,4 +120,5 @@ obj-$(CONFIG_HOSTFS)                += hostfs/
 obj-$(CONFIG_HPPFS)            += hppfs/
 obj-$(CONFIG_DEBUG_FS)         += debugfs/
 obj-$(CONFIG_OCFS2_FS)         += ocfs2/
+obj-$(CONFIG_BTRFS_FS)         += btrfs/
 obj-$(CONFIG_GFS2_FS)           += gfs2/
index aa5b43205e3732a622e849d157b6227bb722ca81..f3e72c5c19f56ec0deee0614e1a625df511a58e0 100644 (file)
@@ -168,9 +168,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm,
        struct elf_fdpic_params exec_params, interp_params;
        struct elf_phdr *phdr;
        unsigned long stack_size, entryaddr;
-#ifndef CONFIG_MMU
-       unsigned long fullsize;
-#endif
 #ifdef ELF_FDPIC_PLAT_INIT
        unsigned long dynaddr;
 #endif
@@ -390,11 +387,6 @@ static int load_elf_fdpic_binary(struct linux_binprm *bprm,
                goto error_kill;
        }
 
-       /* expand the stack mapping to use up the entire allocation granule */
-       fullsize = kobjsize((char *) current->mm->start_brk);
-       if (!IS_ERR_VALUE(do_mremap(current->mm->start_brk, stack_size,
-                                   fullsize, 0, 0)))
-               stack_size = fullsize;
        up_write(&current->mm->mmap_sem);
 
        current->mm->brk = current->mm->start_brk;
@@ -1567,11 +1559,9 @@ end_coredump:
 static int elf_fdpic_dump_segments(struct file *file, size_t *size,
                           unsigned long *limit, unsigned long mm_flags)
 {
-       struct vm_list_struct *vml;
-
-       for (vml = current->mm->context.vmlist; vml; vml = vml->next) {
-       struct vm_area_struct *vma = vml->vma;
+       struct vm_area_struct *vma;
 
+       for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
                if (!maydump(vma, mm_flags))
                        continue;
 
@@ -1617,9 +1607,6 @@ static int elf_fdpic_core_dump(long signr, struct pt_regs *regs,
        elf_fpxregset_t *xfpu = NULL;
 #endif
        int thread_status_size = 0;
-#ifndef CONFIG_MMU
-       struct vm_list_struct *vml;
-#endif
        elf_addr_t *auxv;
        unsigned long mm_flags;
 
@@ -1685,13 +1672,7 @@ static int elf_fdpic_core_dump(long signr, struct pt_regs *regs,
        fill_prstatus(prstatus, current, signr);
        elf_core_copy_regs(&prstatus->pr_reg, regs);
 
-#ifdef CONFIG_MMU
        segs = current->mm->map_count;
-#else
-       segs = 0;
-       for (vml = current->mm->context.vmlist; vml; vml = vml->next)
-           segs++;
-#endif
 #ifdef ELF_CORE_EXTRA_PHDRS
        segs += ELF_CORE_EXTRA_PHDRS;
 #endif
@@ -1766,20 +1747,10 @@ static int elf_fdpic_core_dump(long signr, struct pt_regs *regs,
        mm_flags = current->mm->flags;
 
        /* write program headers for segments dump */
-       for (
-#ifdef CONFIG_MMU
-               vma = current->mm->mmap; vma; vma = vma->vm_next
-#else
-                       vml = current->mm->context.vmlist; vml; vml = vml->next
-#endif
-            ) {
+       for (vma = current->mm->mmap; vma; vma = vma->vm_next) {
                struct elf_phdr phdr;
                size_t sz;
 
-#ifndef CONFIG_MMU
-               vma = vml->vma;
-#endif
-
                sz = vma->vm_end - vma->vm_start;
 
                phdr.p_type = PT_LOAD;
index 7bbd5c6b37257ad5beba8180f9249d6f4eb1a6f3..5cebf0b37798422ff097260f657dc128ee2b0b58 100644 (file)
@@ -417,8 +417,8 @@ static int load_flat_file(struct linux_binprm * bprm,
        unsigned long textpos = 0, datapos = 0, result;
        unsigned long realdatastart = 0;
        unsigned long text_len, data_len, bss_len, stack_len, flags;
-       unsigned long len, reallen, memp = 0;
-       unsigned long extra, rlim;
+       unsigned long len, memp = 0;
+       unsigned long memp_size, extra, rlim;
        unsigned long *reloc = 0, *rp;
        struct inode *inode;
        int i, rev, relocs = 0;
@@ -543,17 +543,10 @@ static int load_flat_file(struct linux_binprm * bprm,
                }
 
                len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned long);
+               len = PAGE_ALIGN(len);
                down_write(&current->mm->mmap_sem);
                realdatastart = do_mmap(0, 0, len,
                        PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
-               /* Remap to use all availabe slack region space */
-               if (realdatastart && (realdatastart < (unsigned long)-4096)) {
-                       reallen = kobjsize((void *)realdatastart);
-                       if (reallen > len) {
-                               realdatastart = do_mremap(realdatastart, len,
-                                       reallen, MREMAP_FIXED, realdatastart);
-                       }
-               }
                up_write(&current->mm->mmap_sem);
 
                if (realdatastart == 0 || realdatastart >= (unsigned long)-4096) {
@@ -591,21 +584,14 @@ static int load_flat_file(struct linux_binprm * bprm,
 
                reloc = (unsigned long *) (datapos+(ntohl(hdr->reloc_start)-text_len));
                memp = realdatastart;
-
+               memp_size = len;
        } else {
 
                len = text_len + data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned long);
+               len = PAGE_ALIGN(len);
                down_write(&current->mm->mmap_sem);
                textpos = do_mmap(0, 0, len,
                        PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
-               /* Remap to use all availabe slack region space */
-               if (textpos && (textpos < (unsigned long) -4096)) {
-                       reallen = kobjsize((void *)textpos);
-                       if (reallen > len) {
-                               textpos = do_mremap(textpos, len, reallen,
-                                       MREMAP_FIXED, textpos);
-                       }
-               }
                up_write(&current->mm->mmap_sem);
 
                if (!textpos  || textpos >= (unsigned long) -4096) {
@@ -622,7 +608,7 @@ static int load_flat_file(struct linux_binprm * bprm,
                reloc = (unsigned long *) (textpos + ntohl(hdr->reloc_start) +
                                MAX_SHARED_LIBS * sizeof(unsigned long));
                memp = textpos;
-
+               memp_size = len;
 #ifdef CONFIG_BINFMT_ZFLAT
                /*
                 * load it all in and treat it like a RAM load from now on
@@ -680,10 +666,12 @@ static int load_flat_file(struct linux_binprm * bprm,
                 * set up the brk stuff, uses any slack left in data/bss/stack
                 * allocation.  We put the brk after the bss (between the bss
                 * and stack) like other platforms.
+                * Userspace code relies on the stack pointer starting out at
+                * an address right at the end of a page.
                 */
                current->mm->start_brk = datapos + data_len + bss_len;
                current->mm->brk = (current->mm->start_brk + 3) & ~3;
-               current->mm->context.end_brk = memp + kobjsize((void *) memp) - stack_len;
+               current->mm->context.end_brk = memp + memp_size - stack_len;
        }
 
        if (flags & FLAT_FLAG_KTRACE)
@@ -790,8 +778,8 @@ static int load_flat_file(struct linux_binprm * bprm,
 
        /* zero the BSS,  BRK and stack areas */
        memset((void*)(datapos + data_len), 0, bss_len + 
-                       (memp + kobjsize((void *) memp) - stack_len -   /* end brk */
-                       libinfo->lib_list[id].start_brk) +              /* start brk */
+                       (memp + memp_size - stack_len -         /* end brk */
+                       libinfo->lib_list[id].start_brk) +      /* start brk */
                        stack_len);
 
        return 0;
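
Recording memp_size when the region is mapped removes the dependence on kobjsize() slack: the brk and stack arithmetic now works purely from the page-aligned request length. A sketch of the resulting invariant, assuming 4 KiB pages:

    #include <linux/mm.h>   /* PAGE_ALIGN() */

    static unsigned long example_end_brk(unsigned long memp, unsigned long len,
                                         unsigned long stack_len)
    {
            /* PAGE_ALIGN(0x1234) == 0x2000 with 4 KiB pages, so end_brk
             * leaves the stack ending exactly on a page boundary, which
             * the comment above says userspace setup relies on */
            return memp + PAGE_ALIGN(len) - stack_len;
    }
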
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
new file mode 100644 (file)
index 0000000..d2cf5a5
--- /dev/null
@@ -0,0 +1,25 @@
+ifneq ($(KERNELRELEASE),)
+# kbuild part of makefile
+
+obj-$(CONFIG_BTRFS_FS) := btrfs.o
+btrfs-y := super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
+          file-item.o inode-item.o inode-map.o disk-io.o \
+          transaction.o inode.o file.o tree-defrag.o \
+          extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
+          extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
+          ref-cache.o export.o tree-log.o acl.o free-space-cache.o zlib.o \
+          compression.o
+else
+
+# Normal Makefile
+
+KERNELDIR := /lib/modules/`uname -r`/build
+all:
+       $(MAKE) -C $(KERNELDIR) M=`pwd` CONFIG_BTRFS_FS=m modules
+
+modules_install:
+       $(MAKE) -C $(KERNELDIR) M=`pwd` modules_install
+clean:
+       $(MAKE) -C $(KERNELDIR) M=`pwd` clean
+
+endif
diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
new file mode 100644 (file)
index 0000000..1d53b62
--- /dev/null
@@ -0,0 +1,351 @@
+/*
+ * Copyright (C) 2007 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/string.h>
+#include <linux/xattr.h>
+#include <linux/posix_acl_xattr.h>
+#include <linux/posix_acl.h>
+#include <linux/sched.h>
+
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "xattr.h"
+
+#ifdef CONFIG_FS_POSIX_ACL
+
+static void btrfs_update_cached_acl(struct inode *inode,
+                                   struct posix_acl **p_acl,
+                                   struct posix_acl *acl)
+{
+       spin_lock(&inode->i_lock);
+       if (*p_acl && *p_acl != BTRFS_ACL_NOT_CACHED)
+               posix_acl_release(*p_acl);
+       *p_acl = posix_acl_dup(acl);
+       spin_unlock(&inode->i_lock);
+}
+
+static struct posix_acl *btrfs_get_acl(struct inode *inode, int type)
+{
+       int size;
+       const char *name;
+       char *value = NULL;
+       struct posix_acl *acl = NULL, **p_acl;
+
+       switch (type) {
+       case ACL_TYPE_ACCESS:
+               name = POSIX_ACL_XATTR_ACCESS;
+               p_acl = &BTRFS_I(inode)->i_acl;
+               break;
+       case ACL_TYPE_DEFAULT:
+               name = POSIX_ACL_XATTR_DEFAULT;
+               p_acl = &BTRFS_I(inode)->i_default_acl;
+               break;
+       default:
+               return ERR_PTR(-EINVAL);
+       }
+
+       spin_lock(&inode->i_lock);
+       if (*p_acl != BTRFS_ACL_NOT_CACHED)
+               acl = posix_acl_dup(*p_acl);
+       spin_unlock(&inode->i_lock);
+
+       if (acl)
+               return acl;
+
+
+       size = __btrfs_getxattr(inode, name, "", 0);
+       if (size > 0) {
+               value = kzalloc(size, GFP_NOFS);
+               if (!value)
+                       return ERR_PTR(-ENOMEM);
+               size = __btrfs_getxattr(inode, name, value, size);
+               if (size > 0) {
+                       acl = posix_acl_from_xattr(value, size);
+                       btrfs_update_cached_acl(inode, p_acl, acl);
+               }
+               kfree(value);
+       } else if (size == -ENOENT) {
+               acl = NULL;
+               btrfs_update_cached_acl(inode, p_acl, acl);
+       }
+
+       return acl;
+}
+
+static int btrfs_xattr_get_acl(struct inode *inode, int type,
+                              void *value, size_t size)
+{
+       struct posix_acl *acl;
+       int ret = 0;
+
+       acl = btrfs_get_acl(inode, type);
+
+       if (IS_ERR(acl))
+               return PTR_ERR(acl);
+       if (acl == NULL)
+               return -ENODATA;
+       ret = posix_acl_to_xattr(acl, value, size);
+       posix_acl_release(acl);
+
+       return ret;
+}
+
+/*
+ * Needs to be called with fs_mutex held
+ */
+static int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
+{
+       int ret, size = 0;
+       const char *name;
+       struct posix_acl **p_acl;
+       char *value = NULL;
+       mode_t mode;
+
+       if (acl) {
+               ret = posix_acl_valid(acl);
+               if (ret < 0)
+                       return ret;
+               ret = 0;
+       }
+
+       switch (type) {
+       case ACL_TYPE_ACCESS:
+               mode = inode->i_mode;
+               ret = posix_acl_equiv_mode(acl, &mode);
+               if (ret < 0)
+                       return ret;
+               ret = 0;
+               inode->i_mode = mode;
+               name = POSIX_ACL_XATTR_ACCESS;
+               p_acl = &BTRFS_I(inode)->i_acl;
+               break;
+       case ACL_TYPE_DEFAULT:
+               if (!S_ISDIR(inode->i_mode))
+                       return acl ? -EINVAL : 0;
+               name = POSIX_ACL_XATTR_DEFAULT;
+               p_acl = &BTRFS_I(inode)->i_default_acl;
+               break;
+       default:
+               return -EINVAL;
+       }
+
+       if (acl) {
+               size = posix_acl_xattr_size(acl->a_count);
+               value = kmalloc(size, GFP_NOFS);
+               if (!value) {
+                       ret = -ENOMEM;
+                       goto out;
+               }
+
+               ret = posix_acl_to_xattr(acl, value, size);
+               if (ret < 0)
+                       goto out;
+       }
+
+       ret = __btrfs_setxattr(inode, name, value, size, 0);
+
+out:
+       kfree(value);
+
+       if (!ret)
+               btrfs_update_cached_acl(inode, p_acl, acl);
+
+       return ret;
+}
+
+static int btrfs_xattr_set_acl(struct inode *inode, int type,
+                              const void *value, size_t size)
+{
+       int ret = 0;
+       struct posix_acl *acl = NULL;
+
+       if (value) {
+               acl = posix_acl_from_xattr(value, size);
+               if (acl == NULL) {
+                       value = NULL;
+                       size = 0;
+               } else if (IS_ERR(acl)) {
+                       return PTR_ERR(acl);
+               }
+       }
+
+       ret = btrfs_set_acl(inode, acl, type);
+
+       posix_acl_release(acl);
+
+       return ret;
+}
+
+
+static int btrfs_xattr_acl_access_get(struct inode *inode, const char *name,
+                                     void *value, size_t size)
+{
+       return btrfs_xattr_get_acl(inode, ACL_TYPE_ACCESS, value, size);
+}
+
+static int btrfs_xattr_acl_access_set(struct inode *inode, const char *name,
+                                     const void *value, size_t size, int flags)
+{
+       return btrfs_xattr_set_acl(inode, ACL_TYPE_ACCESS, value, size);
+}
+
+static int btrfs_xattr_acl_default_get(struct inode *inode, const char *name,
+                                      void *value, size_t size)
+{
+       return btrfs_xattr_get_acl(inode, ACL_TYPE_DEFAULT, value, size);
+}
+
+static int btrfs_xattr_acl_default_set(struct inode *inode, const char *name,
+                              const void *value, size_t size, int flags)
+{
+       return btrfs_xattr_set_acl(inode, ACL_TYPE_DEFAULT, value, size);
+}
+
+int btrfs_check_acl(struct inode *inode, int mask)
+{
+       struct posix_acl *acl;
+       int error = -EAGAIN;
+
+       acl = btrfs_get_acl(inode, ACL_TYPE_ACCESS);
+
+       if (IS_ERR(acl))
+               return PTR_ERR(acl);
+       if (acl) {
+               error = posix_acl_permission(inode, acl, mask);
+               posix_acl_release(acl);
+       }
+
+       return error;
+}
+
+/*
+ * btrfs_init_acl is already generally called under fs_mutex, so the locking
+ * here has been written to work with that.  If that locking changes, the
+ * acl locking needs to be re-evaluated.
+ */
+int btrfs_init_acl(struct inode *inode, struct inode *dir)
+{
+       struct posix_acl *acl = NULL;
+       int ret = 0;
+
+       /* this happens with subvols */
+       if (!dir)
+               return 0;
+
+       if (!S_ISLNK(inode->i_mode)) {
+               if (IS_POSIXACL(dir)) {
+                       acl = btrfs_get_acl(dir, ACL_TYPE_DEFAULT);
+                       if (IS_ERR(acl))
+                               return PTR_ERR(acl);
+               }
+
+               if (!acl)
+                       inode->i_mode &= ~current->fs->umask;
+       }
+
+       if (IS_POSIXACL(dir) && acl) {
+               struct posix_acl *clone;
+               mode_t mode;
+
+               if (S_ISDIR(inode->i_mode)) {
+                       ret = btrfs_set_acl(inode, acl, ACL_TYPE_DEFAULT);
+                       if (ret)
+                               goto failed;
+               }
+               clone = posix_acl_clone(acl, GFP_NOFS);
+               ret = -ENOMEM;
+               if (!clone)
+                       goto failed;
+
+               mode = inode->i_mode;
+               ret = posix_acl_create_masq(clone, &mode);
+               if (ret >= 0) {
+                       inode->i_mode = mode;
+                       if (ret > 0) {
+                               /* we need an acl */
+                               ret = btrfs_set_acl(inode, clone,
+                                                   ACL_TYPE_ACCESS);
+                       }
+               }
+       }
+failed:
+       posix_acl_release(acl);
+
+       return ret;
+}
+
+int btrfs_acl_chmod(struct inode *inode)
+{
+       struct posix_acl *acl, *clone;
+       int ret = 0;
+
+       if (S_ISLNK(inode->i_mode))
+               return -EOPNOTSUPP;
+
+       if (!IS_POSIXACL(inode))
+               return 0;
+
+       acl = btrfs_get_acl(inode, ACL_TYPE_ACCESS);
+       if (IS_ERR(acl) || !acl)
+               return PTR_ERR(acl);
+
+       clone = posix_acl_clone(acl, GFP_KERNEL);
+       posix_acl_release(acl);
+       if (!clone)
+               return -ENOMEM;
+
+       ret = posix_acl_chmod_masq(clone, inode->i_mode);
+       if (!ret)
+               ret = btrfs_set_acl(inode, clone, ACL_TYPE_ACCESS);
+
+       posix_acl_release(clone);
+
+       return ret;
+}
+
+struct xattr_handler btrfs_xattr_acl_default_handler = {
+       .prefix = POSIX_ACL_XATTR_DEFAULT,
+       .get    = btrfs_xattr_acl_default_get,
+       .set    = btrfs_xattr_acl_default_set,
+};
+
+struct xattr_handler btrfs_xattr_acl_access_handler = {
+       .prefix = POSIX_ACL_XATTR_ACCESS,
+       .get    = btrfs_xattr_acl_access_get,
+       .set    = btrfs_xattr_acl_access_set,
+};
+
+#else /* CONFIG_FS_POSIX_ACL */
+
+int btrfs_acl_chmod(struct inode *inode)
+{
+       return 0;
+}
+
+int btrfs_init_acl(struct inode *inode, struct inode *dir)
+{
+       return 0;
+}
+
+int btrfs_check_acl(struct inode *inode, int mask)
+{
+       return 0;
+}
+
+#endif /* CONFIG_FS_POSIX_ACL */
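
The cache read path in btrfs_get_acl() uses a sentinel pointer (BTRFS_ACL_NOT_CACHED, defined elsewhere in this series) to tell "never read from disk" apart from "no ACL on this inode", all under i_lock, and a NULL result falls through to the xattr lookup. A stripped-down sketch of the pattern with generic names:

    #include <linux/fs.h>
    #include <linux/posix_acl.h>

    #define EXAMPLE_ACL_NOT_CACHED ((struct posix_acl *)-1)  /* sentinel */

    static struct posix_acl *example_cached_acl(struct inode *inode,
                                                struct posix_acl **slot)
    {
            struct posix_acl *acl = NULL;

            spin_lock(&inode->i_lock);
            if (*slot != EXAMPLE_ACL_NOT_CACHED)
                    acl = posix_acl_dup(*slot);     /* dup handles NULL */
            spin_unlock(&inode->i_lock);

            /* NULL means "uncached" or "no ACL"; the caller re-reads the
             * xattr in either case, as btrfs_get_acl() does above */
            return acl;
    }
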
diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
new file mode 100644 (file)
index 0000000..8e2fec0
--- /dev/null
@@ -0,0 +1,419 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/version.h>
+#include <linux/kthread.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/freezer.h>
+#include "async-thread.h"
+
+#define WORK_QUEUED_BIT 0
+#define WORK_DONE_BIT 1
+#define WORK_ORDER_DONE_BIT 2
+
+/*
+ * container for the kthread task pointer and the list of pending work
+ * One of these is allocated per thread.
+ */
+struct btrfs_worker_thread {
+       /* pool we belong to */
+       struct btrfs_workers *workers;
+
+       /* list of struct btrfs_work that are waiting for service */
+       struct list_head pending;
+
+       /* list of worker threads from struct btrfs_workers */
+       struct list_head worker_list;
+
+       /* kthread */
+       struct task_struct *task;
+
+       /* number of things on the pending list */
+       atomic_t num_pending;
+
+       unsigned long sequence;
+
+       /* protects the pending list. */
+       spinlock_t lock;
+
+       /* set to non-zero when this thread is already awake and kicking */
+       int working;
+
+       /* are we currently idle */
+       int idle;
+};
+
+/*
+ * helper function to move a thread onto the idle list after it
+ * has finished some requests.
+ */
+static void check_idle_worker(struct btrfs_worker_thread *worker)
+{
+       if (!worker->idle && atomic_read(&worker->num_pending) <
+           worker->workers->idle_thresh / 2) {
+               unsigned long flags;
+               spin_lock_irqsave(&worker->workers->lock, flags);
+               worker->idle = 1;
+               list_move(&worker->worker_list, &worker->workers->idle_list);
+               spin_unlock_irqrestore(&worker->workers->lock, flags);
+       }
+}
+
+/*
+ * helper function to move a thread off the idle list after new
+ * pending work is added.
+ */
+static void check_busy_worker(struct btrfs_worker_thread *worker)
+{
+       if (worker->idle && atomic_read(&worker->num_pending) >=
+           worker->workers->idle_thresh) {
+               unsigned long flags;
+               spin_lock_irqsave(&worker->workers->lock, flags);
+               worker->idle = 0;
+               list_move_tail(&worker->worker_list,
+                              &worker->workers->worker_list);
+               spin_unlock_irqrestore(&worker->workers->lock, flags);
+       }
+}
+
+static noinline int run_ordered_completions(struct btrfs_workers *workers,
+                                           struct btrfs_work *work)
+{
+       unsigned long flags;
+
+       if (!workers->ordered)
+               return 0;
+
+       set_bit(WORK_DONE_BIT, &work->flags);
+
+       spin_lock_irqsave(&workers->lock, flags);
+
+       while (!list_empty(&workers->order_list)) {
+               work = list_entry(workers->order_list.next,
+                                 struct btrfs_work, order_list);
+
+               if (!test_bit(WORK_DONE_BIT, &work->flags))
+                       break;
+
+               /* we are going to call the ordered done function, but
+                * we leave the work item on the list as a barrier so
+                * that later work items that are done don't have their
+                * functions called before this one returns
+                */
+               if (test_and_set_bit(WORK_ORDER_DONE_BIT, &work->flags))
+                       break;
+
+               spin_unlock_irqrestore(&workers->lock, flags);
+
+               work->ordered_func(work);
+
+               /* now take the lock again and call the freeing code */
+               spin_lock_irqsave(&workers->lock, flags);
+               list_del(&work->order_list);
+               work->ordered_free(work);
+       }
+
+       spin_unlock_irqrestore(&workers->lock, flags);
+       return 0;
+}
+
+/*
+ * main loop for servicing work items
+ */
+static int worker_loop(void *arg)
+{
+       struct btrfs_worker_thread *worker = arg;
+       struct list_head *cur;
+       struct btrfs_work *work;
+       do {
+               spin_lock_irq(&worker->lock);
+               while (!list_empty(&worker->pending)) {
+                       cur = worker->pending.next;
+                       work = list_entry(cur, struct btrfs_work, list);
+                       list_del(&work->list);
+                       clear_bit(WORK_QUEUED_BIT, &work->flags);
+
+                       work->worker = worker;
+                       spin_unlock_irq(&worker->lock);
+
+                       work->func(work);
+
+                       atomic_dec(&worker->num_pending);
+                       /*
+                        * unless this is an ordered work queue,
+                        * 'work' was probably freed by func above.
+                        */
+                       run_ordered_completions(worker->workers, work);
+
+                       spin_lock_irq(&worker->lock);
+                       check_idle_worker(worker);
+
+               }
+               worker->working = 0;
+               if (freezing(current)) {
+                       refrigerator();
+               } else {
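+                       /*
+                        * mark ourselves sleeping before dropping the
+                        * lock so a wakeup from btrfs_queue_worker can't
+                        * be lost between the unlock and schedule()
+                        */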
+                       set_current_state(TASK_INTERRUPTIBLE);
+                       spin_unlock_irq(&worker->lock);
+                       if (!kthread_should_stop())
+                               schedule();
+                       __set_current_state(TASK_RUNNING);
+               }
+       } while (!kthread_should_stop());
+       return 0;
+}
+
+/*
+ * this will wait for all the worker threads to shutdown
+ */
+int btrfs_stop_workers(struct btrfs_workers *workers)
+{
+       struct list_head *cur;
+       struct btrfs_worker_thread *worker;
+
+       list_splice_init(&workers->idle_list, &workers->worker_list);
+       while (!list_empty(&workers->worker_list)) {
+               cur = workers->worker_list.next;
+               worker = list_entry(cur, struct btrfs_worker_thread,
+                                   worker_list);
+               kthread_stop(worker->task);
+               list_del(&worker->worker_list);
+               kfree(worker);
+       }
+       return 0;
+}
+
+/*
+ * simple init on struct btrfs_workers
+ */
+void btrfs_init_workers(struct btrfs_workers *workers, char *name, int max)
+{
+       workers->num_workers = 0;
+       INIT_LIST_HEAD(&workers->worker_list);
+       INIT_LIST_HEAD(&workers->idle_list);
+       INIT_LIST_HEAD(&workers->order_list);
+       spin_lock_init(&workers->lock);
+       workers->max_workers = max;
+       workers->idle_thresh = 32;
+       workers->name = name;
+       workers->ordered = 0;
+}
+
+/*
+ * starts new worker threads.  This does not enforce the max worker
+ * count in case you need to temporarily go past it.
+ */
+int btrfs_start_workers(struct btrfs_workers *workers, int num_workers)
+{
+       struct btrfs_worker_thread *worker;
+       int ret = 0;
+       int i;
+
+       for (i = 0; i < num_workers; i++) {
+               worker = kzalloc(sizeof(*worker), GFP_NOFS);
+               if (!worker) {
+                       ret = -ENOMEM;
+                       goto fail;
+               }
+
+               INIT_LIST_HEAD(&worker->pending);
+               INIT_LIST_HEAD(&worker->worker_list);
+               spin_lock_init(&worker->lock);
+               atomic_set(&worker->num_pending, 0);
+               worker->task = kthread_run(worker_loop, worker,
+                                          "btrfs-%s-%d", workers->name,
+                                          workers->num_workers + i);
+               worker->workers = workers;
+               if (IS_ERR(worker->task)) {
+                       kfree(worker);
+                       ret = PTR_ERR(worker->task);
+                       goto fail;
+               }
+
+               spin_lock_irq(&workers->lock);
+               list_add_tail(&worker->worker_list, &workers->idle_list);
+               worker->idle = 1;
+               workers->num_workers++;
+               spin_unlock_irq(&workers->lock);
+       }
+       return 0;
+fail:
+       btrfs_stop_workers(workers);
+       return ret;
+}
+
+/*
+ * run through the list and find a worker thread that doesn't have a lot
+ * to do right now.  This can return null if we aren't yet at the thread
+ * count limit and all of the threads are busy.
+ */
+static struct btrfs_worker_thread *next_worker(struct btrfs_workers *workers)
+{
+       struct btrfs_worker_thread *worker;
+       struct list_head *next;
+       int enforce_min = workers->num_workers < workers->max_workers;
+
+       /*
+        * if we find an idle thread, don't move it to the end of the
+        * idle list.  This improves the chance that the next submission
+        * will reuse the same thread, and maybe catch it while it is still
+        * working
+        */
+       if (!list_empty(&workers->idle_list)) {
+               next = workers->idle_list.next;
+               worker = list_entry(next, struct btrfs_worker_thread,
+                                   worker_list);
+               return worker;
+       }
+       if (enforce_min || list_empty(&workers->worker_list))
+               return NULL;
+
+       /*
+        * if we pick a busy task, move the task to the end of the list.
+        * hopefully this will keep things somewhat evenly balanced.
+        * Do the move in batches based on the sequence number.  This groups
+        * requests submitted at roughly the same time onto the same worker.
+        */
+       next = workers->worker_list.next;
+       worker = list_entry(next, struct btrfs_worker_thread, worker_list);
+       atomic_inc(&worker->num_pending);
+       worker->sequence++;
+
+       if (worker->sequence % workers->idle_thresh == 0)
+               list_move_tail(next, &workers->worker_list);
+       return worker;
+}
+
+/*
+ * selects a worker thread to take the next job.  This will either find
+ * an idle worker, start a new worker up to the max count, or just return
+ * one of the existing busy workers.
+ */
+static struct btrfs_worker_thread *find_worker(struct btrfs_workers *workers)
+{
+       struct btrfs_worker_thread *worker;
+       unsigned long flags;
+
+again:
+       spin_lock_irqsave(&workers->lock, flags);
+       worker = next_worker(workers);
+       spin_unlock_irqrestore(&workers->lock, flags);
+
+       if (!worker) {
+               spin_lock_irqsave(&workers->lock, flags);
+               if (workers->num_workers >= workers->max_workers) {
+                       struct list_head *fallback = NULL;
+                       /*
+                        * we have failed to find any workers, just
+                        * return the fallback one
+                        */
+                       if (!list_empty(&workers->worker_list))
+                               fallback = workers->worker_list.next;
+                       if (!list_empty(&workers->idle_list))
+                               fallback = workers->idle_list.next;
+                       BUG_ON(!fallback);
+                       worker = list_entry(fallback,
+                                 struct btrfs_worker_thread, worker_list);
+                       spin_unlock_irqrestore(&workers->lock, flags);
+               } else {
+                       spin_unlock_irqrestore(&workers->lock, flags);
+                       /* we're below the limit, start another worker */
+                       btrfs_start_workers(workers, 1);
+                       goto again;
+               }
+       }
+       return worker;
+}
+
+/*
+ * btrfs_requeue_work just puts the work item back on the tail of the list
+ * it was taken from.  It is intended for use with long running work functions
+ * that make some progress and want to give the cpu up for others.
+ */
+int btrfs_requeue_work(struct btrfs_work *work)
+{
+       struct btrfs_worker_thread *worker = work->worker;
+       unsigned long flags;
+
+       if (test_and_set_bit(WORK_QUEUED_BIT, &work->flags))
+               goto out;
+
+       spin_lock_irqsave(&worker->lock, flags);
+       atomic_inc(&worker->num_pending);
+       list_add_tail(&work->list, &worker->pending);
+
+       /* by definition we're busy, take ourselves off the idle
+        * list
+        */
+       if (worker->idle) {
+               /*
+                * irqs are already off and flags holds the state saved
+                * above; a nested irqsave here would clobber it, so take
+                * the workers lock without saving flags again
+                */
+               spin_lock(&worker->workers->lock);
+               worker->idle = 0;
+               list_move_tail(&worker->worker_list,
+                              &worker->workers->worker_list);
+               spin_unlock(&worker->workers->lock);
+       }
+
+       spin_unlock_irqrestore(&worker->lock, flags);
+
+out:
+       return 0;
+}
+
+/*
+ * places a struct btrfs_work into the pending queue of one of the kthreads
+ */
+int btrfs_queue_worker(struct btrfs_workers *workers, struct btrfs_work *work)
+{
+       struct btrfs_worker_thread *worker;
+       unsigned long flags;
+       int wake = 0;
+
+       /* don't requeue something already on a list */
+       if (test_and_set_bit(WORK_QUEUED_BIT, &work->flags))
+               goto out;
+
+       worker = find_worker(workers);
+       if (workers->ordered) {
+               spin_lock_irqsave(&workers->lock, flags);
+               list_add_tail(&work->order_list, &workers->order_list);
+               spin_unlock_irqrestore(&workers->lock, flags);
+       } else {
+               INIT_LIST_HEAD(&work->order_list);
+       }
+
+       spin_lock_irqsave(&worker->lock, flags);
+       atomic_inc(&worker->num_pending);
+       check_busy_worker(worker);
+       list_add_tail(&work->list, &worker->pending);
+
+       /*
+        * avoid calling into wake_up_process if this thread has already
+        * been kicked
+        */
+       if (!worker->working)
+               wake = 1;
+       worker->working = 1;
+
+       spin_unlock_irqrestore(&worker->lock, flags);
+
+       if (wake)
+               wake_up_process(worker->task);
+out:
+       return 0;
+}
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
new file mode 100644 (file)
index 0000000..31be4ed
--- /dev/null
@@ -0,0 +1,101 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_ASYNC_THREAD_
+#define __BTRFS_ASYNC_THREAD_
+
+struct btrfs_worker_thread;
+
+/*
+ * This is similar to a workqueue, but it is meant to spread the operations
+ * across all available cpus instead of just the CPU that was used to
+ * queue the work.  There is also some batching introduced to try and
+ * cut down on context switches.
+ *
+ * By default threads are added on demand up to 2 * the number of cpus.
+ * Changing struct btrfs_workers->max_workers is one way to prevent
+ * demand creation of kthreads.
+ *
+ * the basic model of these worker threads is to embed a btrfs_work
+ * structure in your own data struct, and use container_of in a
+ * work function to get back to your data struct.
+ */
+struct btrfs_work {
+       /*
+        * func should be set to the function you want called
+        * your work struct is passed as the only arg
+        *
+        * ordered_func must be set for work sent to an ordered work queue,
+        * and it is called to complete a given work item in the same
+        * order they were sent to the queue.
+        */
+       void (*func)(struct btrfs_work *work);
+       void (*ordered_func)(struct btrfs_work *work);
+       void (*ordered_free)(struct btrfs_work *work);
+
+       /*
+        * flags should be set to zero.  It is used to make sure the
+        * struct is only inserted once into the list.
+        */
+       unsigned long flags;
+
+       /* don't touch these */
+       struct btrfs_worker_thread *worker;
+       struct list_head list;
+       struct list_head order_list;
+};
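+
+/*
+ * A sketch of the embedding pattern described above; struct my_job and
+ * my_func are hypothetical, not part of btrfs:
+ *
+ *     struct my_job {
+ *             struct btrfs_work work;
+ *             int payload;
+ *     };
+ *
+ *     static void my_func(struct btrfs_work *work)
+ *     {
+ *             struct my_job *job;
+ *
+ *             job = container_of(work, struct my_job, work);
+ *             ... process job->payload ...
+ *     }
+ */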
+
+struct btrfs_workers {
+       /* current number of running workers */
+       int num_workers;
+
+       /* max number of workers allowed.  changed by btrfs_start_workers */
+       int max_workers;
+
+       /* once a worker has this many requests or fewer, it is idle */
+       int idle_thresh;
+
+       /* force completions in the order they were queued */
+       int ordered;
+
+       /* list with all the work threads.  The workers on the idle list
+        * may be actively servicing jobs, but they haven't yet hit the
+        * idle thresh limit above.
+        */
+       struct list_head worker_list;
+       struct list_head idle_list;
+
+       /*
+        * when operating in ordered mode, this maintains the list
+        * of work items waiting for completion
+        */
+       struct list_head order_list;
+
+       /* lock for finding the next worker thread to queue on */
+       spinlock_t lock;
+
+       /* extra name for this worker, used in the kthread's name */
+       char *name;
+};
+
+int btrfs_queue_worker(struct btrfs_workers *workers, struct btrfs_work *work);
+int btrfs_start_workers(struct btrfs_workers *workers, int num_workers);
+int btrfs_stop_workers(struct btrfs_workers *workers);
+void btrfs_init_workers(struct btrfs_workers *workers, char *name, int max);
+int btrfs_requeue_work(struct btrfs_work *work);
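+
+/*
+ * A minimal lifecycle sketch based on the prototypes above; the job
+ * struct and my_func are hypothetical and error handling is omitted:
+ *
+ *     struct btrfs_workers workers;
+ *
+ *     btrfs_init_workers(&workers, "example", 4);
+ *     btrfs_start_workers(&workers, 1);
+ *
+ *     job->work.func = my_func;
+ *     job->work.flags = 0;
+ *     btrfs_queue_worker(&workers, &job->work);
+ *
+ *     btrfs_stop_workers(&workers);
+ */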
+#endif
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
new file mode 100644 (file)
index 0000000..a8c9693
--- /dev/null
@@ -0,0 +1,131 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_I__
+#define __BTRFS_I__
+
+#include "extent_map.h"
+#include "extent_io.h"
+#include "ordered-data.h"
+
+/* in memory btrfs inode */
+struct btrfs_inode {
+       /* which subvolume this inode belongs to */
+       struct btrfs_root *root;
+
+       /* key used to find this inode on disk.  This is used by the code
+        * to read in roots of subvolumes
+        */
+       struct btrfs_key location;
+
+       /* the extent_tree has caches of all the extent mappings to disk */
+       struct extent_map_tree extent_tree;
+
+       /* the io_tree does range state (DIRTY, LOCKED etc) */
+       struct extent_io_tree io_tree;
+
+       /* special utility tree used to record which mirrors have already been
+        * tried when checksums fail for a given block
+        */
+       struct extent_io_tree io_failure_tree;
+
+       /* held while inserting or deleting extents from files */
+       struct mutex extent_mutex;
+
+       /* held while logging the inode in tree-log.c */
+       struct mutex log_mutex;
+
+       /* used to order data wrt metadata */
+       struct btrfs_ordered_inode_tree ordered_tree;
+
+       /* standard acl pointers */
+       struct posix_acl *i_acl;
+       struct posix_acl *i_default_acl;
+
+       /* for keeping track of orphaned inodes */
+       struct list_head i_orphan;
+
+       /* list of all the delalloc inodes in the FS.  There are times we need
+        * to write all the delalloc pages to disk, and this list is used
+        * to walk them all.
+        */
+       struct list_head delalloc_inodes;
+
+       /* full 64 bit generation number, struct vfs_inode doesn't have a big
+        * enough field for this.
+        */
+       u64 generation;
+
+       /* sequence number for NFS changes */
+       u64 sequence;
+
+       /*
+        * transid of the trans_handle that last modified this inode
+        */
+       u64 last_trans;
+       /*
+        * transid that last logged this inode
+        */
+       u64 logged_trans;
+
+       /*
+        * trans that last made a change that should be fully fsync'd.  This
+        * gets reset to zero each time the inode is logged
+        */
+       u64 log_dirty_trans;
+
+       /* total number of bytes pending delalloc, used by stat to calc the
+        * real block usage of the file
+        */
+       u64 delalloc_bytes;
+
+       /*
+        * the size of the file stored in the metadata on disk.  data=ordered
+        * means the in-memory i_size might be larger than the size on disk
+        * because not all the blocks are written yet.
+        */
+       u64 disk_i_size;
+
+       /* flags field from the on disk inode */
+       u32 flags;
+
+       /*
+        * if this is a directory then index_cnt is the counter for the index
+        * number for new files that are created
+        */
+       u64 index_cnt;
+
+       /* the start of block group preferred for allocations. */
+       u64 block_group;
+
+       struct inode vfs_inode;
+};
+
+static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
+{
+       return container_of(inode, struct btrfs_inode, vfs_inode);
+}
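+/*
+ * e.g. BTRFS_I(inode)->root recovers the subvolume root from a plain
+ * VFS inode pointer; this works because every btrfs inode is embedded
+ * in a struct btrfs_inode as the vfs_inode field above.
+ */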
+
+static inline void btrfs_i_size_write(struct inode *inode, u64 size)
+{
+       inode->i_size = size;
+       BTRFS_I(inode)->disk_i_size = size;
+}
+
+
+#endif
diff --git a/fs/btrfs/compat.h b/fs/btrfs/compat.h
new file mode 100644 (file)
index 0000000..7c4503e
--- /dev/null
@@ -0,0 +1,7 @@
+#ifndef _COMPAT_H_
+#define _COMPAT_H_
+
+#define btrfs_drop_nlink(inode) drop_nlink(inode)
+#define btrfs_inc_nlink(inode) inc_nlink(inode)
+
+#endif /* _COMPAT_H_ */
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
new file mode 100644 (file)
index 0000000..ee848d8
--- /dev/null
@@ -0,0 +1,709 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/buffer_head.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
+#include <linux/mpage.h>
+#include <linux/swap.h>
+#include <linux/writeback.h>
+#include <linux/bit_spinlock.h>
+#include <linux/version.h>
+#include <linux/pagevec.h>
+#include "compat.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+#include "ordered-data.h"
+#include "compression.h"
+#include "extent_io.h"
+#include "extent_map.h"
+
+struct compressed_bio {
+       /* number of bios pending for this compressed extent */
+       atomic_t pending_bios;
+
+       /* the pages with the compressed data on them */
+       struct page **compressed_pages;
+
+       /* inode that owns this data */
+       struct inode *inode;
+
+       /* starting offset in the inode for our pages */
+       u64 start;
+
+       /* number of bytes in the inode we're working on */
+       unsigned long len;
+
+       /* number of bytes on disk */
+       unsigned long compressed_len;
+
+       /* number of compressed pages in the array */
+       unsigned long nr_pages;
+
+       /* IO errors */
+       int errors;
+       int mirror_num;
+
+       /* for reads, this is the bio we are copying the data into */
+       struct bio *orig_bio;
+
+       /*
+        * the start of a variable length array of checksums only
+        * used by reads
+        */
+       u32 sums;
+};
+
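+/*
+ * the read checksums live inline after the struct (see the sums field
+ * above), one per on-disk sector, so size the allocation for both
+ */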
+static inline int compressed_bio_size(struct btrfs_root *root,
+                                     unsigned long disk_size)
+{
+       u16 csum_size = btrfs_super_csum_size(&root->fs_info->super_copy);
+       return sizeof(struct compressed_bio) +
+               ((disk_size + root->sectorsize - 1) / root->sectorsize) *
+               csum_size;
+}
+
+static struct bio *compressed_bio_alloc(struct block_device *bdev,
+                                       u64 first_byte, gfp_t gfp_flags)
+{
+       struct bio *bio;
+       int nr_vecs;
+
+       nr_vecs = bio_get_nr_vecs(bdev);
+       bio = bio_alloc(gfp_flags, nr_vecs);
+
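+       /*
+        * under memory pressure, retry with progressively fewer vecs
+        * rather than failing the allocation outright
+        */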
+       if (bio == NULL && (current->flags & PF_MEMALLOC)) {
+               while (!bio && (nr_vecs /= 2))
+                       bio = bio_alloc(gfp_flags, nr_vecs);
+       }
+
+       if (bio) {
+               bio->bi_size = 0;
+               bio->bi_bdev = bdev;
+               bio->bi_sector = first_byte >> 9;
+       }
+       return bio;
+}
+
+static int check_compressed_csum(struct inode *inode,
+                                struct compressed_bio *cb,
+                                u64 disk_start)
+{
+       int ret;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct page *page;
+       unsigned long i;
+       char *kaddr;
+       u32 csum;
+       u32 *cb_sum = &cb->sums;
+
+       if (btrfs_test_flag(inode, NODATASUM))
+               return 0;
+
+       for (i = 0; i < cb->nr_pages; i++) {
+               page = cb->compressed_pages[i];
+               csum = ~(u32)0;
+
+               kaddr = kmap_atomic(page, KM_USER0);
+               csum = btrfs_csum_data(root, kaddr, csum, PAGE_CACHE_SIZE);
+               btrfs_csum_final(csum, (char *)&csum);
+               kunmap_atomic(kaddr, KM_USER0);
+
+               if (csum != *cb_sum) {
+                       printk(KERN_INFO "btrfs csum failed ino %lu "
+                              "extent %llu csum %u "
+                              "wanted %u mirror %d\n", inode->i_ino,
+                              (unsigned long long)disk_start,
+                              csum, *cb_sum, cb->mirror_num);
+                       ret = -EIO;
+                       goto fail;
+               }
+               cb_sum++;
+
+       }
+       ret = 0;
+fail:
+       return ret;
+}
+
+/* when we finish reading compressed pages from the disk, we
+ * decompress them and then run the bio end_io routines on the
+ * decompressed pages (in the inode address space).
+ *
+ * This allows the checksumming and other IO error handling routines
+ * to work normally
+ *
+ * The compressed pages are freed here, and it must be run
+ * in process context
+ */
+static void end_compressed_bio_read(struct bio *bio, int err)
+{
+       struct extent_io_tree *tree;
+       struct compressed_bio *cb = bio->bi_private;
+       struct inode *inode;
+       struct page *page;
+       unsigned long index;
+       int ret;
+
+       if (err)
+               cb->errors = 1;
+
+       /* if there are more bios still pending for this compressed
+        * extent, just exit
+        */
+       if (!atomic_dec_and_test(&cb->pending_bios))
+               goto out;
+
+       inode = cb->inode;
+       ret = check_compressed_csum(inode, cb, (u64)bio->bi_sector << 9);
+       if (ret)
+               goto csum_failed;
+
+       /* ok, we're the last bio for this extent, let's start
+        * the decompression.
+        */
+       tree = &BTRFS_I(inode)->io_tree;
+       ret = btrfs_zlib_decompress_biovec(cb->compressed_pages,
+                                       cb->start,
+                                       cb->orig_bio->bi_io_vec,
+                                       cb->orig_bio->bi_vcnt,
+                                       cb->compressed_len);
+csum_failed:
+       if (ret)
+               cb->errors = 1;
+
+       /* release the compressed pages */
+       for (index = 0; index < cb->nr_pages; index++) {
+               page = cb->compressed_pages[index];
+               page->mapping = NULL;
+               page_cache_release(page);
+       }
+
+       /* do io completion on the original bio */
+       if (cb->errors) {
+               bio_io_error(cb->orig_bio);
+       } else {
+               int bio_index = 0;
+               struct bio_vec *bvec = cb->orig_bio->bi_io_vec;
+
+               /*
+                * we have verified the checksum already, set page
+                * checked so the end_io handlers know about it
+                */
+               while (bio_index < cb->orig_bio->bi_vcnt) {
+                       SetPageChecked(bvec->bv_page);
+                       bvec++;
+                       bio_index++;
+               }
+               bio_endio(cb->orig_bio, 0);
+       }
+
+       /* finally free the cb struct */
+       kfree(cb->compressed_pages);
+       kfree(cb);
+out:
+       bio_put(bio);
+}
+
+/*
+ * Clear the writeback bits on all of the file
+ * pages for a compressed write
+ */
+static noinline int end_compressed_writeback(struct inode *inode, u64 start,
+                                            unsigned long ram_size)
+{
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       unsigned long end_index = (start + ram_size - 1) >> PAGE_CACHE_SHIFT;
+       struct page *pages[16];
+       unsigned long nr_pages = end_index - index + 1;
+       int i;
+       int ret;
+
+       while (nr_pages > 0) {
+               ret = find_get_pages_contig(inode->i_mapping, index,
+                                    min_t(unsigned long,
+                                    nr_pages, ARRAY_SIZE(pages)), pages);
+               if (ret == 0) {
+                       nr_pages -= 1;
+                       index += 1;
+                       continue;
+               }
+               for (i = 0; i < ret; i++) {
+                       end_page_writeback(pages[i]);
+                       page_cache_release(pages[i]);
+               }
+               nr_pages -= ret;
+               index += ret;
+       }
+       /* the inode may be gone now */
+       return 0;
+}
+
+/*
+ * do the cleanup once all the compressed pages hit the disk.
+ * This will clear writeback on the file pages and free the compressed
+ * pages.
+ *
+ * This also calls the writeback end hooks for the file pages so that
+ * metadata and checksums can be updated in the file.
+ */
+static void end_compressed_bio_write(struct bio *bio, int err)
+{
+       struct extent_io_tree *tree;
+       struct compressed_bio *cb = bio->bi_private;
+       struct inode *inode;
+       struct page *page;
+       unsigned long index;
+
+       if (err)
+               cb->errors = 1;
+
+       /* if there are more bios still pending for this compressed
+        * extent, just exit
+        */
+       if (!atomic_dec_and_test(&cb->pending_bios))
+               goto out;
+
+       /* ok, we're the last bio for this extent, step one is to
+        * call back into the FS and do all the end_io operations
+        */
+       inode = cb->inode;
+       tree = &BTRFS_I(inode)->io_tree;
+       cb->compressed_pages[0]->mapping = cb->inode->i_mapping;
+       tree->ops->writepage_end_io_hook(cb->compressed_pages[0],
+                                        cb->start,
+                                        cb->start + cb->len - 1,
+                                        NULL, 1);
+       cb->compressed_pages[0]->mapping = NULL;
+
+       end_compressed_writeback(inode, cb->start, cb->len);
+       /* note, our inode could be gone now */
+
+       /*
+        * release the compressed pages, these came from alloc_page and
+        * are not attached to the inode at all
+        */
+       for (index = 0; index < cb->nr_pages; index++) {
+               page = cb->compressed_pages[index];
+               page->mapping = NULL;
+               page_cache_release(page);
+       }
+
+       /* finally free the cb struct */
+       kfree(cb->compressed_pages);
+       kfree(cb);
+out:
+       bio_put(bio);
+}
+
+/*
+ * worker function to build and submit bios for previously compressed pages.
+ * The corresponding pages in the inode should be marked for writeback
+ * and the compressed pages should have a reference on them for dropping
+ * when the IO is complete.
+ *
+ * This also checksums the file bytes and gets things ready for
+ * the end io hooks.
+ */
+int btrfs_submit_compressed_write(struct inode *inode, u64 start,
+                                unsigned long len, u64 disk_start,
+                                unsigned long compressed_len,
+                                struct page **compressed_pages,
+                                unsigned long nr_pages)
+{
+       struct bio *bio = NULL;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct compressed_bio *cb;
+       unsigned long bytes_left;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       int page_index = 0;
+       struct page *page;
+       u64 first_byte = disk_start;
+       struct block_device *bdev;
+       int ret;
+
+       WARN_ON(start & ((u64)PAGE_CACHE_SIZE - 1));
+       cb = kmalloc(compressed_bio_size(root, compressed_len), GFP_NOFS);
+       if (!cb)
+               return -ENOMEM;
+       atomic_set(&cb->pending_bios, 0);
+       cb->errors = 0;
+       cb->inode = inode;
+       cb->start = start;
+       cb->len = len;
+       cb->mirror_num = 0;
+       cb->compressed_pages = compressed_pages;
+       cb->compressed_len = compressed_len;
+       cb->orig_bio = NULL;
+       cb->nr_pages = nr_pages;
+
+       bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
+
+       bio = compressed_bio_alloc(bdev, first_byte, GFP_NOFS);
+       bio->bi_private = cb;
+       bio->bi_end_io = end_compressed_bio_write;
+       atomic_inc(&cb->pending_bios);
+
+       /* create and submit bios for the compressed pages */
+       bytes_left = compressed_len;
+       for (page_index = 0; page_index < cb->nr_pages; page_index++) {
+               page = compressed_pages[page_index];
+               page->mapping = inode->i_mapping;
+               if (bio->bi_size)
+                       ret = io_tree->ops->merge_bio_hook(page, 0,
+                                                          PAGE_CACHE_SIZE,
+                                                          bio, 0);
+               else
+                       ret = 0;
+
+               page->mapping = NULL;
+               if (ret || bio_add_page(bio, page, PAGE_CACHE_SIZE, 0) <
+                   PAGE_CACHE_SIZE) {
+                       bio_get(bio);
+
+                       /*
+                        * inc the count before we submit the bio so
+                        * we know the end IO handler won't happen before
+                        * we inc the count.  Otherwise, the cb might get
+                        * freed before we're done setting it up
+                        */
+                       atomic_inc(&cb->pending_bios);
+                       ret = btrfs_bio_wq_end_io(root->fs_info, bio, 0);
+                       BUG_ON(ret);
+
+                       ret = btrfs_csum_one_bio(root, inode, bio, start, 1);
+                       BUG_ON(ret);
+
+                       ret = btrfs_map_bio(root, WRITE, bio, 0, 1);
+                       BUG_ON(ret);
+
+                       bio_put(bio);
+
+                       bio = compressed_bio_alloc(bdev, first_byte, GFP_NOFS);
+                       bio->bi_private = cb;
+                       bio->bi_end_io = end_compressed_bio_write;
+                       bio_add_page(bio, page, PAGE_CACHE_SIZE, 0);
+               }
+               if (bytes_left < PAGE_CACHE_SIZE) {
+                       printk("bytes left %lu compress len %lu nr %lu\n",
+                              bytes_left, cb->compressed_len, cb->nr_pages);
+               }
+               bytes_left -= PAGE_CACHE_SIZE;
+               first_byte += PAGE_CACHE_SIZE;
+               cond_resched();
+       }
+       bio_get(bio);
+
+       ret = btrfs_bio_wq_end_io(root->fs_info, bio, 0);
+       BUG_ON(ret);
+
+       ret = btrfs_csum_one_bio(root, inode, bio, start, 1);
+       BUG_ON(ret);
+
+       ret = btrfs_map_bio(root, WRITE, bio, 0, 1);
+       BUG_ON(ret);
+
+       bio_put(bio);
+       return 0;
+}
+
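+/*
+ * readahead helper: allocate extra pages past the end of the original
+ * bio and add them to it so that one decompression pass can fill more
+ * of the page cache.  It tolerates a few pages that are already cached
+ * and stops at the first page that doesn't map to the same compressed
+ * extent on disk.
+ */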
+static noinline int add_ra_bio_pages(struct inode *inode,
+                                    u64 compressed_end,
+                                    struct compressed_bio *cb)
+{
+       unsigned long end_index;
+       unsigned long page_index;
+       u64 last_offset;
+       u64 isize = i_size_read(inode);
+       int ret;
+       struct page *page;
+       unsigned long nr_pages = 0;
+       struct extent_map *em;
+       struct address_space *mapping = inode->i_mapping;
+       struct pagevec pvec;
+       struct extent_map_tree *em_tree;
+       struct extent_io_tree *tree;
+       u64 end;
+       int misses = 0;
+
+       page = cb->orig_bio->bi_io_vec[cb->orig_bio->bi_vcnt - 1].bv_page;
+       last_offset = (page_offset(page) + PAGE_CACHE_SIZE);
+       em_tree = &BTRFS_I(inode)->extent_tree;
+       tree = &BTRFS_I(inode)->io_tree;
+
+       if (isize == 0)
+               return 0;
+
+       end_index = (i_size_read(inode) - 1) >> PAGE_CACHE_SHIFT;
+
+       pagevec_init(&pvec, 0);
+       while (last_offset < compressed_end) {
+               page_index = last_offset >> PAGE_CACHE_SHIFT;
+
+               if (page_index > end_index)
+                       break;
+
+               rcu_read_lock();
+               page = radix_tree_lookup(&mapping->page_tree, page_index);
+               rcu_read_unlock();
+               if (page) {
+                       misses++;
+                       if (misses > 4)
+                               break;
+                       goto next;
+               }
+
+               page = alloc_page(mapping_gfp_mask(mapping) | GFP_NOFS);
+               if (!page)
+                       break;
+
+               page->index = page_index;
+               /*
+                * what we want to do here is call add_to_page_cache_lru,
+                * but that isn't exported, so we reproduce it here
+                */
+               if (add_to_page_cache(page, mapping,
+                                     page->index, GFP_NOFS)) {
+                       page_cache_release(page);
+                       goto next;
+               }
+
+               /* open coding of lru_cache_add, also not exported */
+               page_cache_get(page);
+               if (!pagevec_add(&pvec, page))
+                       __pagevec_lru_add_file(&pvec);
+
+               end = last_offset + PAGE_CACHE_SIZE - 1;
+               /*
+                * at this point, we have a locked page in the page cache
+                * for these bytes in the file.  But, we have to make
+                * sure they map to this compressed extent on disk.
+                */
+               set_page_extent_mapped(page);
+               lock_extent(tree, last_offset, end, GFP_NOFS);
+               spin_lock(&em_tree->lock);
+               em = lookup_extent_mapping(em_tree, last_offset,
+                                          PAGE_CACHE_SIZE);
+               spin_unlock(&em_tree->lock);
+
+               if (!em || last_offset < em->start ||
+                   (last_offset + PAGE_CACHE_SIZE > extent_map_end(em)) ||
+                   (em->block_start >> 9) != cb->orig_bio->bi_sector) {
+                       free_extent_map(em);
+                       unlock_extent(tree, last_offset, end, GFP_NOFS);
+                       unlock_page(page);
+                       page_cache_release(page);
+                       break;
+               }
+               free_extent_map(em);
+
+               if (page->index == end_index) {
+                       char *userpage;
+                       size_t zero_offset = isize & (PAGE_CACHE_SIZE - 1);
+
+                       if (zero_offset) {
+                               int zeros;
+                               zeros = PAGE_CACHE_SIZE - zero_offset;
+                               userpage = kmap_atomic(page, KM_USER0);
+                               memset(userpage + zero_offset, 0, zeros);
+                               flush_dcache_page(page);
+                               kunmap_atomic(userpage, KM_USER0);
+                       }
+               }
+
+               ret = bio_add_page(cb->orig_bio, page,
+                                  PAGE_CACHE_SIZE, 0);
+
+               if (ret == PAGE_CACHE_SIZE) {
+                       nr_pages++;
+                       page_cache_release(page);
+               } else {
+                       unlock_extent(tree, last_offset, end, GFP_NOFS);
+                       unlock_page(page);
+                       page_cache_release(page);
+                       break;
+               }
+next:
+               last_offset += PAGE_CACHE_SIZE;
+       }
+       if (pagevec_count(&pvec))
+               __pagevec_lru_add_file(&pvec);
+       return 0;
+}
+
+/*
+ * for a compressed read, the bio we get passed has all the inode pages
+ * in it.  We don't actually do IO on those pages but allocate new ones
+ * to hold the compressed pages on disk.
+ *
+ * bio->bi_sector points to the compressed extent on disk
+ * bio->bi_io_vec points to all of the inode pages
+ * bio->bi_vcnt is a count of pages
+ *
+ * After the compressed pages are read, we copy the bytes into the
+ * bio we were passed and then call the bio end_io calls
+ */
+int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
+                                int mirror_num, unsigned long bio_flags)
+{
+       struct extent_io_tree *tree;
+       struct extent_map_tree *em_tree;
+       struct compressed_bio *cb;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       unsigned long uncompressed_len = bio->bi_vcnt * PAGE_CACHE_SIZE;
+       unsigned long compressed_len;
+       unsigned long nr_pages;
+       unsigned long page_index;
+       struct page *page;
+       struct block_device *bdev;
+       struct bio *comp_bio;
+       u64 cur_disk_byte = (u64)bio->bi_sector << 9;
+       u64 em_len;
+       u64 em_start;
+       struct extent_map *em;
+       int ret;
+       u32 *sums;
+
+       tree = &BTRFS_I(inode)->io_tree;
+       em_tree = &BTRFS_I(inode)->extent_tree;
+
+       /* we need the actual starting offset of this extent in the file */
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree,
+                                  page_offset(bio->bi_io_vec->bv_page),
+                                  PAGE_CACHE_SIZE);
+       spin_unlock(&em_tree->lock);
+
+       compressed_len = em->block_len;
+       cb = kmalloc(compressed_bio_size(root, compressed_len), GFP_NOFS);
+       if (!cb) {
+               free_extent_map(em);
+               return -ENOMEM;
+       }
+       atomic_set(&cb->pending_bios, 0);
+       cb->errors = 0;
+       cb->inode = inode;
+       cb->mirror_num = mirror_num;
+       sums = &cb->sums;
+
+       cb->start = em->orig_start;
+       em_len = em->len;
+       em_start = em->start;
+
+       free_extent_map(em);
+       em = NULL;
+
+       cb->len = uncompressed_len;
+       cb->compressed_len = compressed_len;
+       cb->orig_bio = bio;
+
+       nr_pages = (compressed_len + PAGE_CACHE_SIZE - 1) /
+                                PAGE_CACHE_SIZE;
+       cb->compressed_pages = kmalloc(sizeof(struct page *) * nr_pages,
+                                      GFP_NOFS);
+       if (!cb->compressed_pages) {
+               kfree(cb);
+               return -ENOMEM;
+       }
+       bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
+
+       for (page_index = 0; page_index < nr_pages; page_index++) {
+               cb->compressed_pages[page_index] = alloc_page(GFP_NOFS |
+                                                             __GFP_HIGHMEM);
+       }
+       cb->nr_pages = nr_pages;
+
+       add_ra_bio_pages(inode, em_start + em_len, cb);
+
+       /* include any pages we added in add_ra_bio_pages */
+       uncompressed_len = bio->bi_vcnt * PAGE_CACHE_SIZE;
+       cb->len = uncompressed_len;
+
+       comp_bio = compressed_bio_alloc(bdev, cur_disk_byte, GFP_NOFS);
+       comp_bio->bi_private = cb;
+       comp_bio->bi_end_io = end_compressed_bio_read;
+       atomic_inc(&cb->pending_bios);
+
+       for (page_index = 0; page_index < nr_pages; page_index++) {
+               page = cb->compressed_pages[page_index];
+               page->mapping = inode->i_mapping;
+               page->index = em_start >> PAGE_CACHE_SHIFT;
+
+               if (comp_bio->bi_size)
+                       ret = tree->ops->merge_bio_hook(page, 0,
+                                                       PAGE_CACHE_SIZE,
+                                                       comp_bio, 0);
+               else
+                       ret = 0;
+
+               page->mapping = NULL;
+               if (ret || bio_add_page(comp_bio, page, PAGE_CACHE_SIZE, 0) <
+                   PAGE_CACHE_SIZE) {
+                       bio_get(comp_bio);
+
+                       ret = btrfs_bio_wq_end_io(root->fs_info, comp_bio, 0);
+                       BUG_ON(ret);
+
+                       /*
+                        * inc the count before we submit the bio so
+                        * we know the end IO handler won't happen before
+                        * we inc the count.  Otherwise, the cb might get
+                        * freed before we're done setting it up
+                        */
+                       atomic_inc(&cb->pending_bios);
+
+                       if (!btrfs_test_flag(inode, NODATASUM)) {
+                               btrfs_lookup_bio_sums(root, inode, comp_bio,
+                                                     sums);
+                       }
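+                       /*
+                        * step past the checksums for the sectors this
+                        * bio covered
+                        */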
+                       sums += (comp_bio->bi_size + root->sectorsize - 1) /
+                               root->sectorsize;
+
+                       ret = btrfs_map_bio(root, READ, comp_bio,
+                                           mirror_num, 0);
+                       BUG_ON(ret);
+
+                       bio_put(comp_bio);
+
+                       comp_bio = compressed_bio_alloc(bdev, cur_disk_byte,
+                                                       GFP_NOFS);
+                       comp_bio->bi_private = cb;
+                       comp_bio->bi_end_io = end_compressed_bio_read;
+
+                       bio_add_page(comp_bio, page, PAGE_CACHE_SIZE, 0);
+               }
+               cur_disk_byte += PAGE_CACHE_SIZE;
+       }
+       bio_get(comp_bio);
+
+       ret = btrfs_bio_wq_end_io(root->fs_info, comp_bio, 0);
+       BUG_ON(ret);
+
+       if (!btrfs_test_flag(inode, NODATASUM))
+               btrfs_lookup_bio_sums(root, inode, comp_bio, sums);
+
+       ret = btrfs_map_bio(root, READ, comp_bio, mirror_num, 0);
+       BUG_ON(ret);
+
+       bio_put(comp_bio);
+       return 0;
+}
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
new file mode 100644 (file)
index 0000000..421f5b4
--- /dev/null
@@ -0,0 +1,47 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_COMPRESSION_
+#define __BTRFS_COMPRESSION_
+
+int btrfs_zlib_decompress(unsigned char *data_in,
+                         struct page *dest_page,
+                         unsigned long start_byte,
+                         size_t srclen, size_t destlen);
+int btrfs_zlib_compress_pages(struct address_space *mapping,
+                             u64 start, unsigned long len,
+                             struct page **pages,
+                             unsigned long nr_dest_pages,
+                             unsigned long *out_pages,
+                             unsigned long *total_in,
+                             unsigned long *total_out,
+                             unsigned long max_out);
+int btrfs_zlib_decompress_biovec(struct page **pages_in,
+                             u64 disk_start,
+                             struct bio_vec *bvec,
+                             int vcnt,
+                             size_t srclen);
+void btrfs_zlib_exit(void);
+int btrfs_submit_compressed_write(struct inode *inode, u64 start,
+                                 unsigned long len, u64 disk_start,
+                                 unsigned long compressed_len,
+                                 struct page **compressed_pages,
+                                 unsigned long nr_pages);
+int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
+                                int mirror_num, unsigned long bio_flags);
+#endif
diff --git a/fs/btrfs/crc32c.h b/fs/btrfs/crc32c.h
new file mode 100644 (file)
index 0000000..6e1b3de
--- /dev/null
@@ -0,0 +1,29 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_CRC32C__
+#define __BTRFS_CRC32C__
+#include <linux/crc32c.h>
+
+/*
+ * this file used to do more for selecting the HW version of crc32c,
+ * perhaps it will again one day.
+ */
+#define btrfs_crc32c(seed, data, length) crc32c(seed, data, length)
+#endif
+
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
new file mode 100644 (file)
index 0000000..9e46c07
--- /dev/null
@@ -0,0 +1,3953 @@
+/*
+ * Copyright (C) 2007,2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "print-tree.h"
+#include "locking.h"
+
+static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_path *path, int level);
+static int split_leaf(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *ins_key,
+                     struct btrfs_path *path, int data_size, int extend);
+static int push_node_left(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, struct extent_buffer *dst,
+                         struct extent_buffer *src, int empty);
+static int balance_node_right(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct extent_buffer *dst_buf,
+                             struct extent_buffer *src_buf);
+static int del_ptr(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                  struct btrfs_path *path, int level, int slot);
+
+inline void btrfs_init_path(struct btrfs_path *p)
+{
+       memset(p, 0, sizeof(*p));
+}
+
+struct btrfs_path *btrfs_alloc_path(void)
+{
+       struct btrfs_path *path;
+       path = kmem_cache_alloc(btrfs_path_cachep, GFP_NOFS);
+       if (path) {
+               btrfs_init_path(path);
+               path->reada = 1;
+       }
+       return path;
+}
+
+/* this also releases the path */
+void btrfs_free_path(struct btrfs_path *p)
+{
+       btrfs_release_path(NULL, p);
+       kmem_cache_free(btrfs_path_cachep, p);
+}
+
+/*
+ * path release drops references on the extent buffers in the path
+ * and it drops any locks held by this path
+ *
+ * It is safe to call this on paths that hold no locks or extent buffers.
+ */
+noinline void btrfs_release_path(struct btrfs_root *root, struct btrfs_path *p)
+{
+       int i;
+
+       for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
+               p->slots[i] = 0;
+               if (!p->nodes[i])
+                       continue;
+               if (p->locks[i]) {
+                       btrfs_tree_unlock(p->nodes[i]);
+                       p->locks[i] = 0;
+               }
+               free_extent_buffer(p->nodes[i]);
+               p->nodes[i] = NULL;
+       }
+}
+
+/*
+ * safely gets a reference on the root node of a tree.  A lock
+ * is not taken, so a concurrent writer may put a different node
+ * at the root of the tree.  See btrfs_lock_root_node for the
+ * looping required.
+ *
+ * The extent buffer returned by this has a reference taken, so
+ * it won't disappear.  It may stop being the root of the tree
+ * at any time because there are no locks held.
+ */
+struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
+{
+       struct extent_buffer *eb;
+       spin_lock(&root->node_lock);
+       eb = root->node;
+       extent_buffer_get(eb);
+       spin_unlock(&root->node_lock);
+       return eb;
+}
+
+/* loop around taking references on and locking the root node of the
+ * tree until you end up with a lock on the root.  A locked buffer
+ * is returned, with a reference held.
+ */
+struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root)
+{
+       struct extent_buffer *eb;
+
+       while (1) {
+               eb = btrfs_root_node(root);
+               btrfs_tree_lock(eb);
+
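+               /*
+                * the root may have changed while we waited for the
+                * tree lock; recheck under node_lock and retry if so
+                */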
+               spin_lock(&root->node_lock);
+               if (eb == root->node) {
+                       spin_unlock(&root->node_lock);
+                       break;
+               }
+               spin_unlock(&root->node_lock);
+
+               btrfs_tree_unlock(eb);
+               free_extent_buffer(eb);
+       }
+       return eb;
+}
+
+/* cowonly roots (everything that is not a reference counted cow subvolume)
+ * just get put onto a simple dirty list.  transaction.c walks this list to
+ * make sure they get properly updated on disk.
+ */
+static void add_root_to_dirty_list(struct btrfs_root *root)
+{
+       if (root->track_dirty && list_empty(&root->dirty_list)) {
+               list_add(&root->dirty_list,
+                        &root->fs_info->dirty_cowonly_roots);
+       }
+}
+
+/*
+ * used by snapshot creation to make a copy of a root for a tree with
+ * a given objectid.  The buffer with the new root node is returned in
+ * cow_ret, and this func returns zero on success or a negative error code.
+ */
+int btrfs_copy_root(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root,
+                     struct extent_buffer *buf,
+                     struct extent_buffer **cow_ret, u64 new_root_objectid)
+{
+       struct extent_buffer *cow;
+       u32 nritems;
+       int ret = 0;
+       int level;
+       struct btrfs_root *new_root;
+
+       new_root = kmalloc(sizeof(*new_root), GFP_NOFS);
+       if (!new_root)
+               return -ENOMEM;
+
+       memcpy(new_root, root, sizeof(*new_root));
+       new_root->root_key.objectid = new_root_objectid;
+
+       WARN_ON(root->ref_cows && trans->transid !=
+               root->fs_info->running_transaction->transid);
+       WARN_ON(root->ref_cows && trans->transid != root->last_trans);
+
+       level = btrfs_header_level(buf);
+       nritems = btrfs_header_nritems(buf);
+
+       cow = btrfs_alloc_free_block(trans, new_root, buf->len, 0,
+                                    new_root_objectid, trans->transid,
+                                    level, buf->start, 0);
+       if (IS_ERR(cow)) {
+               kfree(new_root);
+               return PTR_ERR(cow);
+       }
+
+       copy_extent_buffer(cow, buf, 0, 0, cow->len);
+       btrfs_set_header_bytenr(cow, cow->start);
+       btrfs_set_header_generation(cow, trans->transid);
+       btrfs_set_header_owner(cow, new_root_objectid);
+       btrfs_clear_header_flag(cow, BTRFS_HEADER_FLAG_WRITTEN);
+
+       write_extent_buffer(cow, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(cow),
+                           BTRFS_FSID_SIZE);
+
+       WARN_ON(btrfs_header_generation(buf) > trans->transid);
+       ret = btrfs_inc_ref(trans, new_root, buf, cow, NULL);
+       kfree(new_root);
+
+       if (ret)
+               return ret;
+
+       btrfs_mark_buffer_dirty(cow);
+       *cow_ret = cow;
+       return 0;
+}
+
+/*
+ * does the dirty work in cow of a single block.  The parent block (if
+ * supplied) is updated to point to the new cow copy.  The new buffer is marked
+ * dirty and returned locked.  If you modify the block it needs to be marked
+ * dirty again.
+ *
+ * search_start -- an allocation hint for the new block
+ *
+ * empty_size -- a hint that you plan on doing more cow.  This is the size in
+ * bytes the allocator should try to find free next to the block it returns.
+ * This is just a hint and may be ignored by the allocator.
+ *
+ * prealloc_dest -- if you have already reserved a destination for the cow,
+ * this uses that block instead of allocating a new one.
+ * btrfs_alloc_reserved_extent is used to finish the allocation.
+ */
+static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            struct extent_buffer *buf,
+                            struct extent_buffer *parent, int parent_slot,
+                            struct extent_buffer **cow_ret,
+                            u64 search_start, u64 empty_size,
+                            u64 prealloc_dest)
+{
+       u64 parent_start;
+       struct extent_buffer *cow;
+       u32 nritems;
+       int ret = 0;
+       int level;
+       int unlock_orig = 0;
+
+       if (*cow_ret == buf)
+               unlock_orig = 1;
+
+       WARN_ON(!btrfs_tree_locked(buf));
+
+       if (parent)
+               parent_start = parent->start;
+       else
+               parent_start = 0;
+
+       WARN_ON(root->ref_cows && trans->transid !=
+               root->fs_info->running_transaction->transid);
+       WARN_ON(root->ref_cows && trans->transid != root->last_trans);
+
+       level = btrfs_header_level(buf);
+       nritems = btrfs_header_nritems(buf);
+
+       if (prealloc_dest) {
+               struct btrfs_key ins;
+
+               ins.objectid = prealloc_dest;
+               ins.offset = buf->len;
+               ins.type = BTRFS_EXTENT_ITEM_KEY;
+
+               ret = btrfs_alloc_reserved_extent(trans, root, parent_start,
+                                                 root->root_key.objectid,
+                                                 trans->transid, level, &ins);
+               BUG_ON(ret);
+               cow = btrfs_init_new_buffer(trans, root, prealloc_dest,
+                                           buf->len);
+       } else {
+               cow = btrfs_alloc_free_block(trans, root, buf->len,
+                                            parent_start,
+                                            root->root_key.objectid,
+                                            trans->transid, level,
+                                            search_start, empty_size);
+       }
+       if (IS_ERR(cow))
+               return PTR_ERR(cow);
+
+       copy_extent_buffer(cow, buf, 0, 0, cow->len);
+       btrfs_set_header_bytenr(cow, cow->start);
+       btrfs_set_header_generation(cow, trans->transid);
+       btrfs_set_header_owner(cow, root->root_key.objectid);
+       btrfs_clear_header_flag(cow, BTRFS_HEADER_FLAG_WRITTEN);
+
+       write_extent_buffer(cow, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(cow),
+                           BTRFS_FSID_SIZE);
+
+       WARN_ON(btrfs_header_generation(buf) > trans->transid);
+       if (btrfs_header_generation(buf) != trans->transid) {
+               u32 nr_extents;
+               ret = btrfs_inc_ref(trans, root, buf, cow, &nr_extents);
+               if (ret)
+                       return ret;
+
+               ret = btrfs_cache_ref(trans, root, buf, nr_extents);
+               WARN_ON(ret);
+       } else if (btrfs_header_owner(buf) == BTRFS_TREE_RELOC_OBJECTID) {
+               /*
+                * There are only two places that can drop a reference to
+                * a tree block owned by a living reloc tree: one is here,
+                * the other is btrfs_drop_subtree.  In both places we
+                * check the reference count while the tree block is
+                * locked.  Furthermore, if the reference count is one, it
+                * won't get increased by someone else.
+                */
+               u32 refs;
+               ret = btrfs_lookup_extent_ref(trans, root, buf->start,
+                                             buf->len, &refs);
+               BUG_ON(ret);
+               if (refs == 1) {
+                       ret = btrfs_update_ref(trans, root, buf, cow,
+                                              0, nritems);
+                       clean_tree_block(trans, root, buf);
+               } else {
+                       ret = btrfs_inc_ref(trans, root, buf, cow, NULL);
+               }
+               BUG_ON(ret);
+       } else {
+               ret = btrfs_update_ref(trans, root, buf, cow, 0, nritems);
+               if (ret)
+                       return ret;
+               clean_tree_block(trans, root, buf);
+       }
+
+       if (root->root_key.objectid == BTRFS_TREE_RELOC_OBJECTID) {
+               ret = btrfs_reloc_tree_cache_ref(trans, root, cow, buf->start);
+               WARN_ON(ret);
+       }
+
+       if (buf == root->node) {
+               WARN_ON(parent && parent != buf);
+
+               spin_lock(&root->node_lock);
+               root->node = cow;
+               extent_buffer_get(cow);
+               spin_unlock(&root->node_lock);
+
+               if (buf != root->commit_root) {
+                       btrfs_free_extent(trans, root, buf->start,
+                                         buf->len, buf->start,
+                                         root->root_key.objectid,
+                                         btrfs_header_generation(buf),
+                                         level, 1);
+               }
+               free_extent_buffer(buf);
+               add_root_to_dirty_list(root);
+       } else {
+               btrfs_set_node_blockptr(parent, parent_slot,
+                                       cow->start);
+               WARN_ON(trans->transid == 0);
+               btrfs_set_node_ptr_generation(parent, parent_slot,
+                                             trans->transid);
+               btrfs_mark_buffer_dirty(parent);
+               WARN_ON(btrfs_header_generation(parent) != trans->transid);
+               btrfs_free_extent(trans, root, buf->start, buf->len,
+                                 parent_start, btrfs_header_owner(parent),
+                                 btrfs_header_generation(parent), level, 1);
+       }
+       if (unlock_orig)
+               btrfs_tree_unlock(buf);
+       free_extent_buffer(buf);
+       btrfs_mark_buffer_dirty(cow);
+       *cow_ret = cow;
+       return 0;
+}
+
+/*
+ * cows a single block, see __btrfs_cow_block for the real work.
+ * This version of it has extra checks so that a block isn't cow'd more than
+ * once per transaction, as long as it hasn't been written yet
+ */
+noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, struct extent_buffer *buf,
+                   struct extent_buffer *parent, int parent_slot,
+                   struct extent_buffer **cow_ret, u64 prealloc_dest)
+{
+       u64 search_start;
+       int ret;
+
+       if (trans->transaction != root->fs_info->running_transaction) {
+               printk(KERN_CRIT "trans %llu running %llu\n",
+                      (unsigned long long)trans->transid,
+                      (unsigned long long)
+                      root->fs_info->running_transaction->transid);
+               WARN_ON(1);
+       }
+       if (trans->transid != root->fs_info->generation) {
+               printk(KERN_CRIT "trans %llu generation %llu\n",
+                      (unsigned long long)trans->transid,
+                      (unsigned long long)root->fs_info->generation);
+               WARN_ON(1);
+       }
+
+       spin_lock(&root->fs_info->hash_lock);
+       if (btrfs_header_generation(buf) == trans->transid &&
+           btrfs_header_owner(buf) == root->root_key.objectid &&
+           !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
+               *cow_ret = buf;
+               spin_unlock(&root->fs_info->hash_lock);
+               WARN_ON(prealloc_dest);
+               return 0;
+       }
+       spin_unlock(&root->fs_info->hash_lock);
+       search_start = buf->start & ~((u64)(1024 * 1024 * 1024) - 1);
+       ret = __btrfs_cow_block(trans, root, buf, parent,
+                                parent_slot, cow_ret, search_start, 0,
+                                prealloc_dest);
+       return ret;
+}
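+/*
+ * Worked example for the search_start hint above: it rounds buf->start
+ * down to a 1GB boundary, so a block at byte 3,500,000,000 yields a
+ * hint of 3,221,225,472 (3 * 1GB), keeping the cow copy in the same
+ * 1GB region as the original when the allocator can honor it.
+ */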
+
+/*
+ * helper function for defrag to decide if two blocks pointed to by a
+ * node are actually close by
+ */
+static int close_blocks(u64 blocknr, u64 other, u32 blocksize)
+{
+       if (blocknr < other && other - (blocknr + blocksize) < 32768)
+               return 1;
+       if (blocknr > other && blocknr - (other + blocksize) < 32768)
+               return 1;
+       return 0;
+}
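+/*
+ * Example, assuming a 4K blocksize: blocknr 0 and other 20000 leave a
+ * gap of 20000 - (0 + 4096) = 15904 bytes, which is under 32768, so
+ * the two blocks count as close and defrag leaves them where they are.
+ */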
+
+/*
+ * compare two keys in a memcmp fashion
+ */
+static int comp_keys(struct btrfs_disk_key *disk, struct btrfs_key *k2)
+{
+       struct btrfs_key k1;
+
+       btrfs_disk_key_to_cpu(&k1, disk);
+
+       if (k1.objectid > k2->objectid)
+               return 1;
+       if (k1.objectid < k2->objectid)
+               return -1;
+       if (k1.type > k2->type)
+               return 1;
+       if (k1.type < k2->type)
+               return -1;
+       if (k1.offset > k2->offset)
+               return 1;
+       if (k1.offset < k2->offset)
+               return -1;
+       return 0;
+}
+
+/*
+ * same as comp_keys, only with two btrfs_key structs
+ */
+static int comp_cpu_keys(struct btrfs_key *k1, struct btrfs_key *k2)
+{
+       if (k1->objectid > k2->objectid)
+               return 1;
+       if (k1->objectid < k2->objectid)
+               return -1;
+       if (k1->type > k2->type)
+               return 1;
+       if (k1->type < k2->type)
+               return -1;
+       if (k1->offset > k2->offset)
+               return 1;
+       if (k1->offset < k2->offset)
+               return -1;
+       return 0;
+}
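+/*
+ * In both comparators the precedence is objectid, then type, then
+ * offset.  So (256, X, 0) sorts before (256, Y, 0) whenever key type X
+ * is numerically smaller than Y, and any key with objectid 256 sorts
+ * before any key with objectid 257.
+ */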
+
+/*
+ * this is used by the defrag code to go through all the
+ * leaves pointed to by a node and reallocate them so that
+ * disk order is close to key order
+ */
+int btrfs_realloc_node(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root, struct extent_buffer *parent,
+                      int start_slot, int cache_only, u64 *last_ret,
+                      struct btrfs_key *progress)
+{
+       struct extent_buffer *cur;
+       u64 blocknr;
+       u64 gen;
+       u64 search_start = *last_ret;
+       u64 last_block = 0;
+       u64 other;
+       u32 parent_nritems;
+       int end_slot;
+       int i;
+       int err = 0;
+       int parent_level;
+       int uptodate;
+       u32 blocksize;
+       int progress_passed = 0;
+       struct btrfs_disk_key disk_key;
+
+       parent_level = btrfs_header_level(parent);
+       if (cache_only && parent_level != 1)
+               return 0;
+
+       WARN_ON(trans->transaction != root->fs_info->running_transaction);
+       WARN_ON(trans->transid != root->fs_info->generation);
+
+       parent_nritems = btrfs_header_nritems(parent);
+       blocksize = btrfs_level_size(root, parent_level - 1);
+       end_slot = parent_nritems;
+
+       if (parent_nritems == 1)
+               return 0;
+
+       for (i = start_slot; i < end_slot; i++) {
+               int close = 1;
+
+               if (!parent->map_token) {
+                       map_extent_buffer(parent,
+                                       btrfs_node_key_ptr_offset(i),
+                                       sizeof(struct btrfs_key_ptr),
+                                       &parent->map_token, &parent->kaddr,
+                                       &parent->map_start, &parent->map_len,
+                                       KM_USER1);
+               }
+               btrfs_node_key(parent, &disk_key, i);
+               if (!progress_passed && comp_keys(&disk_key, progress) < 0)
+                       continue;
+
+               progress_passed = 1;
+               blocknr = btrfs_node_blockptr(parent, i);
+               gen = btrfs_node_ptr_generation(parent, i);
+               if (last_block == 0)
+                       last_block = blocknr;
+
+               if (i > 0) {
+                       other = btrfs_node_blockptr(parent, i - 1);
+                       close = close_blocks(blocknr, other, blocksize);
+               }
+               if (!close && i < end_slot - 2) {
+                       other = btrfs_node_blockptr(parent, i + 1);
+                       close = close_blocks(blocknr, other, blocksize);
+               }
+               if (close) {
+                       last_block = blocknr;
+                       continue;
+               }
+               if (parent->map_token) {
+                       unmap_extent_buffer(parent, parent->map_token,
+                                           KM_USER1);
+                       parent->map_token = NULL;
+               }
+
+               cur = btrfs_find_tree_block(root, blocknr, blocksize);
+               if (cur)
+                       uptodate = btrfs_buffer_uptodate(cur, gen);
+               else
+                       uptodate = 0;
+               if (!cur || !uptodate) {
+                       if (cache_only) {
+                               free_extent_buffer(cur);
+                               continue;
+                       }
+                       if (!cur) {
+                               cur = read_tree_block(root, blocknr,
+                                                        blocksize, gen);
+                       } else if (!uptodate) {
+                               btrfs_read_buffer(cur, gen);
+                       }
+               }
+               if (search_start == 0)
+                       search_start = last_block;
+
+               btrfs_tree_lock(cur);
+               err = __btrfs_cow_block(trans, root, cur, parent, i,
+                                       &cur, search_start,
+                                       min(16 * blocksize,
+                                           (end_slot - i) * blocksize), 0);
+               if (err) {
+                       btrfs_tree_unlock(cur);
+                       free_extent_buffer(cur);
+                       break;
+               }
+               search_start = cur->start;
+               last_block = cur->start;
+               *last_ret = search_start;
+               btrfs_tree_unlock(cur);
+               free_extent_buffer(cur);
+       }
+       if (parent->map_token) {
+               unmap_extent_buffer(parent, parent->map_token,
+                                   KM_USER1);
+               parent->map_token = NULL;
+       }
+       return err;
+}
+
+/*
+ * Leaf data grows from the end of the block toward the front.
+ * This returns the offset of the start of the last item's data,
+ * which is the end of the leaf data stack.
+ */
+static inline unsigned int leaf_data_end(struct btrfs_root *root,
+                                        struct extent_buffer *leaf)
+{
+       u32 nr = btrfs_header_nritems(leaf);
+       if (nr == 0)
+               return BTRFS_LEAF_DATA_SIZE(root);
+       return btrfs_item_offset_nr(leaf, nr - 1);
+}
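+/*
+ * Worked example: in a leaf holding two items with 100 and 60 bytes of
+ * data, item 0's data occupies the last 100 bytes of the data area and
+ * item 1's data sits directly in front of it, so leaf_data_end()
+ * returns BTRFS_LEAF_DATA_SIZE(root) - 160.
+ */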
+
+/*
+ * extra debugging checks to make sure all the keys in a node are
+ * well formed and in the proper order
+ */
+static int check_node(struct btrfs_root *root, struct btrfs_path *path,
+                     int level)
+{
+       struct extent_buffer *parent = NULL;
+       struct extent_buffer *node = path->nodes[level];
+       struct btrfs_disk_key parent_key;
+       struct btrfs_disk_key node_key;
+       int parent_slot;
+       int slot;
+       struct btrfs_key cpukey;
+       u32 nritems = btrfs_header_nritems(node);
+
+       if (path->nodes[level + 1])
+               parent = path->nodes[level + 1];
+
+       slot = path->slots[level];
+       BUG_ON(nritems == 0);
+       if (parent) {
+               parent_slot = path->slots[level + 1];
+               btrfs_node_key(parent, &parent_key, parent_slot);
+               btrfs_node_key(node, &node_key, 0);
+               BUG_ON(memcmp(&parent_key, &node_key,
+                             sizeof(struct btrfs_disk_key)));
+               BUG_ON(btrfs_node_blockptr(parent, parent_slot) !=
+                      btrfs_header_bytenr(node));
+       }
+       BUG_ON(nritems > BTRFS_NODEPTRS_PER_BLOCK(root));
+       if (slot != 0) {
+               btrfs_node_key_to_cpu(node, &cpukey, slot - 1);
+               btrfs_node_key(node, &node_key, slot);
+               BUG_ON(comp_keys(&node_key, &cpukey) <= 0);
+       }
+       if (slot < nritems - 1) {
+               btrfs_node_key_to_cpu(node, &cpukey, slot + 1);
+               btrfs_node_key(node, &node_key, slot);
+               BUG_ON(comp_keys(&node_key, &cpukey) >= 0);
+       }
+       return 0;
+}
+
+/*
+ * extra checking to make sure all the items in a leaf are
+ * well formed and in the proper order
+ */
+static int check_leaf(struct btrfs_root *root, struct btrfs_path *path,
+                     int level)
+{
+       struct extent_buffer *leaf = path->nodes[level];
+       struct extent_buffer *parent = NULL;
+       int parent_slot;
+       struct btrfs_key cpukey;
+       struct btrfs_disk_key parent_key;
+       struct btrfs_disk_key leaf_key;
+       int slot = path->slots[0];
+
+       u32 nritems = btrfs_header_nritems(leaf);
+
+       if (path->nodes[level + 1])
+               parent = path->nodes[level + 1];
+
+       if (nritems == 0)
+               return 0;
+
+       if (parent) {
+               parent_slot = path->slots[level + 1];
+               btrfs_node_key(parent, &parent_key, parent_slot);
+               btrfs_item_key(leaf, &leaf_key, 0);
+
+               BUG_ON(memcmp(&parent_key, &leaf_key,
+                      sizeof(struct btrfs_disk_key)));
+               BUG_ON(btrfs_node_blockptr(parent, parent_slot) !=
+                      btrfs_header_bytenr(leaf));
+       }
+       if (slot != 0 && slot < nritems - 1) {
+               btrfs_item_key(leaf, &leaf_key, slot);
+               btrfs_item_key_to_cpu(leaf, &cpukey, slot - 1);
+               if (comp_keys(&leaf_key, &cpukey) <= 0) {
+                       btrfs_print_leaf(root, leaf);
+                       printk(KERN_CRIT "slot %d offset bad key\n", slot);
+                       BUG_ON(1);
+               }
+               if (btrfs_item_offset_nr(leaf, slot - 1) !=
+                      btrfs_item_end_nr(leaf, slot)) {
+                       btrfs_print_leaf(root, leaf);
+                       printk(KERN_CRIT "slot %d offset bad\n", slot);
+                       BUG_ON(1);
+               }
+       }
+       if (slot < nritems - 1) {
+               btrfs_item_key(leaf, &leaf_key, slot);
+               btrfs_item_key_to_cpu(leaf, &cpukey, slot + 1);
+               BUG_ON(comp_keys(&leaf_key, &cpukey) >= 0);
+               if (btrfs_item_offset_nr(leaf, slot) !=
+                       btrfs_item_end_nr(leaf, slot + 1)) {
+                       btrfs_print_leaf(root, leaf);
+                       printk(KERN_CRIT "slot %d offset bad\n", slot);
+                       BUG_ON(1);
+               }
+       }
+       BUG_ON(btrfs_item_offset_nr(leaf, 0) +
+              btrfs_item_size_nr(leaf, 0) != BTRFS_LEAF_DATA_SIZE(root));
+       return 0;
+}
+
+static noinline int check_block(struct btrfs_root *root,
+                               struct btrfs_path *path, int level)
+{
+       /* the extra sanity checks below are currently disabled */
+       return 0;
+       if (level == 0)
+               return check_leaf(root, path, level);
+       return check_node(root, path, level);
+}
+
+/*
+ * search for key in the extent_buffer.  The items start at offset p,
+ * and they are item_size apart.  There are 'max' items in p.
+ *
+ * the slot in the array is returned via slot, and it points to
+ * the place where you would insert key if it is not found in
+ * the array.
+ *
+ * slot may point to max if the key is bigger than all of the keys
+ */
+static noinline int generic_bin_search(struct extent_buffer *eb,
+                                      unsigned long p,
+                                      int item_size, struct btrfs_key *key,
+                                      int max, int *slot)
+{
+       int low = 0;
+       int high = max;
+       int mid;
+       int ret;
+       struct btrfs_disk_key *tmp = NULL;
+       struct btrfs_disk_key unaligned;
+       unsigned long offset;
+       char *map_token = NULL;
+       char *kaddr = NULL;
+       unsigned long map_start = 0;
+       unsigned long map_len = 0;
+       int err;
+
+       while (low < high) {
+               mid = (low + high) / 2;
+               offset = p + mid * item_size;
+
+               if (!map_token || offset < map_start ||
+                   (offset + sizeof(struct btrfs_disk_key)) >
+                   map_start + map_len) {
+                       if (map_token) {
+                               unmap_extent_buffer(eb, map_token, KM_USER0);
+                               map_token = NULL;
+                       }
+
+                       err = map_private_extent_buffer(eb, offset,
+                                               sizeof(struct btrfs_disk_key),
+                                               &map_token, &kaddr,
+                                               &map_start, &map_len, KM_USER0);
+
+                       if (!err) {
+                               tmp = (struct btrfs_disk_key *)(kaddr + offset -
+                                                       map_start);
+                       } else {
+                               read_extent_buffer(eb, &unaligned,
+                                                  offset, sizeof(unaligned));
+                               tmp = &unaligned;
+                       }
+
+               } else {
+                       tmp = (struct btrfs_disk_key *)(kaddr + offset -
+                                                       map_start);
+               }
+               ret = comp_keys(tmp, key);
+
+               if (ret < 0)
+                       low = mid + 1;
+               else if (ret > 0)
+                       high = mid;
+               else {
+                       *slot = mid;
+                       if (map_token)
+                               unmap_extent_buffer(eb, map_token, KM_USER0);
+                       return 0;
+               }
+       }
+       *slot = low;
+       if (map_token)
+               unmap_extent_buffer(eb, map_token, KM_USER0);
+       return 1;
+}
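+/*
+ * Return convention for generic_bin_search: an exact match returns 0
+ * with *slot set to the matching index; a miss returns 1 with *slot set
+ * to the insertion point.  Searching for 5 among keys {2, 4, 8}, for
+ * example, yields ret == 1 and *slot == 2, just before the 8.
+ */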
+
+/*
+ * simple bin_search frontend that does the right thing for
+ * leaves vs nodes
+ */
+static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
+                     int level, int *slot)
+{
+       if (level == 0) {
+               return generic_bin_search(eb,
+                                         offsetof(struct btrfs_leaf, items),
+                                         sizeof(struct btrfs_item),
+                                         key, btrfs_header_nritems(eb),
+                                         slot);
+       } else {
+               return generic_bin_search(eb,
+                                         offsetof(struct btrfs_node, ptrs),
+                                         sizeof(struct btrfs_key_ptr),
+                                         key, btrfs_header_nritems(eb),
+                                         slot);
+       }
+}
+
+/* given a node and slot number, this reads the block it points to.  The
+ * extent buffer is returned with a reference taken (but unlocked).
+ * NULL is returned on error.
+ */
+static noinline struct extent_buffer *read_node_slot(struct btrfs_root *root,
+                                  struct extent_buffer *parent, int slot)
+{
+       int level = btrfs_header_level(parent);
+       if (slot < 0)
+               return NULL;
+       if (slot >= btrfs_header_nritems(parent))
+               return NULL;
+
+       BUG_ON(level == 0);
+
+       return read_tree_block(root, btrfs_node_blockptr(parent, slot),
+                      btrfs_level_size(root, level - 1),
+                      btrfs_node_ptr_generation(parent, slot));
+}
+
+/*
+ * node level balancing, used to make sure nodes are in proper order for
+ * item deletion.  We balance from the top down, so we have to make sure
+ * that a deletion won't leave a node completely empty later on.
+ */
+static noinline int balance_level(struct btrfs_trans_handle *trans,
+                        struct btrfs_root *root,
+                        struct btrfs_path *path, int level)
+{
+       struct extent_buffer *right = NULL;
+       struct extent_buffer *mid;
+       struct extent_buffer *left = NULL;
+       struct extent_buffer *parent = NULL;
+       int ret = 0;
+       int wret;
+       int pslot;
+       int orig_slot = path->slots[level];
+       int err_on_enospc = 0;
+       u64 orig_ptr;
+
+       if (level == 0)
+               return 0;
+
+       mid = path->nodes[level];
+       WARN_ON(!path->locks[level]);
+       WARN_ON(btrfs_header_generation(mid) != trans->transid);
+
+       orig_ptr = btrfs_node_blockptr(mid, orig_slot);
+
+       pslot = 0;
+       if (level < BTRFS_MAX_LEVEL - 1) {
+               parent = path->nodes[level + 1];
+               pslot = path->slots[level + 1];
+       }
+
+       /*
+        * deal with the case where there is only one pointer in the root
+        * by promoting the node below to a root
+        */
+       if (!parent) {
+               struct extent_buffer *child;
+
+               if (btrfs_header_nritems(mid) != 1)
+                       return 0;
+
+               /* promote the child to a root */
+               child = read_node_slot(root, mid, 0);
+               BUG_ON(!child);
+               btrfs_tree_lock(child);
+               ret = btrfs_cow_block(trans, root, child, mid, 0, &child, 0);
+               BUG_ON(ret);
+
+               spin_lock(&root->node_lock);
+               root->node = child;
+               spin_unlock(&root->node_lock);
+
+               ret = btrfs_update_extent_ref(trans, root, child->start,
+                                             mid->start, child->start,
+                                             root->root_key.objectid,
+                                             trans->transid, level - 1);
+               BUG_ON(ret);
+
+               add_root_to_dirty_list(root);
+               btrfs_tree_unlock(child);
+               path->locks[level] = 0;
+               path->nodes[level] = NULL;
+               clean_tree_block(trans, root, mid);
+               btrfs_tree_unlock(mid);
+               /* once for the path */
+               free_extent_buffer(mid);
+               ret = btrfs_free_extent(trans, root, mid->start, mid->len,
+                                       mid->start, root->root_key.objectid,
+                                       btrfs_header_generation(mid),
+                                       level, 1);
+               /* once for the root ptr */
+               free_extent_buffer(mid);
+               return ret;
+       }
+       if (btrfs_header_nritems(mid) >
+           BTRFS_NODEPTRS_PER_BLOCK(root) / 4)
+               return 0;
+
+       if (btrfs_header_nritems(mid) < 2)
+               err_on_enospc = 1;
+
+       left = read_node_slot(root, parent, pslot - 1);
+       if (left) {
+               btrfs_tree_lock(left);
+               wret = btrfs_cow_block(trans, root, left,
+                                      parent, pslot - 1, &left, 0);
+               if (wret) {
+                       ret = wret;
+                       goto enospc;
+               }
+       }
+       right = read_node_slot(root, parent, pslot + 1);
+       if (right) {
+               btrfs_tree_lock(right);
+               wret = btrfs_cow_block(trans, root, right,
+                                      parent, pslot + 1, &right, 0);
+               if (wret) {
+                       ret = wret;
+                       goto enospc;
+               }
+       }
+
+       /* first, try to make some room in the middle buffer */
+       if (left) {
+               orig_slot += btrfs_header_nritems(left);
+               wret = push_node_left(trans, root, left, mid, 1);
+               if (wret < 0)
+                       ret = wret;
+               if (btrfs_header_nritems(mid) < 2)
+                       err_on_enospc = 1;
+       }
+
+       /*
+        * then try to empty the right most buffer into the middle
+        */
+       if (right) {
+               wret = push_node_left(trans, root, mid, right, 1);
+               if (wret < 0 && wret != -ENOSPC)
+                       ret = wret;
+               if (btrfs_header_nritems(right) == 0) {
+                       u64 bytenr = right->start;
+                       u64 generation = btrfs_header_generation(parent);
+                       u32 blocksize = right->len;
+
+                       clean_tree_block(trans, root, right);
+                       btrfs_tree_unlock(right);
+                       free_extent_buffer(right);
+                       right = NULL;
+                       wret = del_ptr(trans, root, path, level + 1, pslot +
+                                      1);
+                       if (wret)
+                               ret = wret;
+                       wret = btrfs_free_extent(trans, root, bytenr,
+                                                blocksize, parent->start,
+                                                btrfs_header_owner(parent),
+                                                generation, level, 1);
+                       if (wret)
+                               ret = wret;
+               } else {
+                       struct btrfs_disk_key right_key;
+                       btrfs_node_key(right, &right_key, 0);
+                       btrfs_set_node_key(parent, &right_key, pslot + 1);
+                       btrfs_mark_buffer_dirty(parent);
+               }
+       }
+       if (btrfs_header_nritems(mid) == 1) {
+               /*
+                * we're not allowed to leave a node with one item in the
+                * tree during a delete.  A deletion from lower in the tree
+                * could try to delete the only pointer in this node.
+                * So, pull some keys from the left.
+                * There has to be a left pointer at this point because
+                * otherwise we would have pulled some pointers from the
+                * right
+                */
+               BUG_ON(!left);
+               wret = balance_node_right(trans, root, mid, left);
+               if (wret < 0) {
+                       ret = wret;
+                       goto enospc;
+               }
+               if (wret == 1) {
+                       wret = push_node_left(trans, root, left, mid, 1);
+                       if (wret < 0)
+                               ret = wret;
+               }
+               BUG_ON(wret == 1);
+       }
+       if (btrfs_header_nritems(mid) == 0) {
+               /* we've managed to empty the middle node, drop it */
+               u64 root_gen = btrfs_header_generation(parent);
+               u64 bytenr = mid->start;
+               u32 blocksize = mid->len;
+
+               clean_tree_block(trans, root, mid);
+               btrfs_tree_unlock(mid);
+               free_extent_buffer(mid);
+               mid = NULL;
+               wret = del_ptr(trans, root, path, level + 1, pslot);
+               if (wret)
+                       ret = wret;
+               wret = btrfs_free_extent(trans, root, bytenr, blocksize,
+                                        parent->start,
+                                        btrfs_header_owner(parent),
+                                        root_gen, level, 1);
+               if (wret)
+                       ret = wret;
+       } else {
+               /* update the parent key to reflect our changes */
+               struct btrfs_disk_key mid_key;
+               btrfs_node_key(mid, &mid_key, 0);
+               btrfs_set_node_key(parent, &mid_key, pslot);
+               btrfs_mark_buffer_dirty(parent);
+       }
+
+       /* update the path */
+       if (left) {
+               if (btrfs_header_nritems(left) > orig_slot) {
+                       extent_buffer_get(left);
+                       /* left was locked after cow */
+                       path->nodes[level] = left;
+                       path->slots[level + 1] -= 1;
+                       path->slots[level] = orig_slot;
+                       if (mid) {
+                               btrfs_tree_unlock(mid);
+                               free_extent_buffer(mid);
+                       }
+               } else {
+                       orig_slot -= btrfs_header_nritems(left);
+                       path->slots[level] = orig_slot;
+               }
+       }
+       /* double check we haven't messed things up */
+       check_block(root, path, level);
+       BUG_ON(orig_ptr !=
+              btrfs_node_blockptr(path->nodes[level], path->slots[level]));
+enospc:
+       if (right) {
+               btrfs_tree_unlock(right);
+               free_extent_buffer(right);
+       }
+       if (left) {
+               if (path->nodes[level] != left)
+                       btrfs_tree_unlock(left);
+               free_extent_buffer(left);
+       }
+       return ret;
+}
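+/*
+ * Sizing note for balance_level: a node is only rebalanced once it
+ * drops below BTRFS_NODEPTRS_PER_BLOCK(root) / 4 pointers.  Assuming 4K
+ * tree blocks and 33-byte key pointers, that threshold works out to
+ * roughly 30 pointers.
+ */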
+
+/* Node balancing for insertion.  Here we only split or push nodes around
+ * when they are completely full.  This is also done top down, so we
+ * have to be pessimistic.
+ */
+static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path, int level)
+{
+       struct extent_buffer *right = NULL;
+       struct extent_buffer *mid;
+       struct extent_buffer *left = NULL;
+       struct extent_buffer *parent = NULL;
+       int ret = 0;
+       int wret;
+       int pslot;
+       int orig_slot = path->slots[level];
+       u64 orig_ptr;
+
+       if (level == 0)
+               return 1;
+
+       mid = path->nodes[level];
+       WARN_ON(btrfs_header_generation(mid) != trans->transid);
+       orig_ptr = btrfs_node_blockptr(mid, orig_slot);
+
+       pslot = 0;
+       if (level < BTRFS_MAX_LEVEL - 1) {
+               parent = path->nodes[level + 1];
+               pslot = path->slots[level + 1];
+       }
+
+       if (!parent)
+               return 1;
+
+       left = read_node_slot(root, parent, pslot - 1);
+
+       /* first, try to make some room in the middle buffer */
+       if (left) {
+               u32 left_nr;
+
+               btrfs_tree_lock(left);
+               left_nr = btrfs_header_nritems(left);
+               if (left_nr >= BTRFS_NODEPTRS_PER_BLOCK(root) - 1) {
+                       wret = 1;
+               } else {
+                       ret = btrfs_cow_block(trans, root, left, parent,
+                                             pslot - 1, &left, 0);
+                       if (ret)
+                               wret = 1;
+                       else {
+                               wret = push_node_left(trans, root,
+                                                     left, mid, 0);
+                       }
+               }
+               if (wret < 0)
+                       ret = wret;
+               if (wret == 0) {
+                       struct btrfs_disk_key disk_key;
+                       orig_slot += left_nr;
+                       btrfs_node_key(mid, &disk_key, 0);
+                       btrfs_set_node_key(parent, &disk_key, pslot);
+                       btrfs_mark_buffer_dirty(parent);
+                       if (btrfs_header_nritems(left) > orig_slot) {
+                               path->nodes[level] = left;
+                               path->slots[level + 1] -= 1;
+                               path->slots[level] = orig_slot;
+                               btrfs_tree_unlock(mid);
+                               free_extent_buffer(mid);
+                       } else {
+                               orig_slot -=
+                                       btrfs_header_nritems(left);
+                               path->slots[level] = orig_slot;
+                               btrfs_tree_unlock(left);
+                               free_extent_buffer(left);
+                       }
+                       return 0;
+               }
+               btrfs_tree_unlock(left);
+               free_extent_buffer(left);
+       }
+       right = read_node_slot(root, parent, pslot + 1);
+
+       /*
+        * then try to empty the right most buffer into the middle
+        */
+       if (right) {
+               u32 right_nr;
+               btrfs_tree_lock(right);
+               right_nr = btrfs_header_nritems(right);
+               if (right_nr >= BTRFS_NODEPTRS_PER_BLOCK(root) - 1) {
+                       wret = 1;
+               } else {
+                       ret = btrfs_cow_block(trans, root, right,
+                                             parent, pslot + 1,
+                                             &right, 0);
+                       if (ret)
+                               wret = 1;
+                       else {
+                               wret = balance_node_right(trans, root,
+                                                         right, mid);
+                       }
+               }
+               if (wret < 0)
+                       ret = wret;
+               if (wret == 0) {
+                       struct btrfs_disk_key disk_key;
+
+                       btrfs_node_key(right, &disk_key, 0);
+                       btrfs_set_node_key(parent, &disk_key, pslot + 1);
+                       btrfs_mark_buffer_dirty(parent);
+
+                       if (btrfs_header_nritems(mid) <= orig_slot) {
+                               path->nodes[level] = right;
+                               path->slots[level + 1] += 1;
+                               path->slots[level] = orig_slot -
+                                       btrfs_header_nritems(mid);
+                               btrfs_tree_unlock(mid);
+                               free_extent_buffer(mid);
+                       } else {
+                               btrfs_tree_unlock(right);
+                               free_extent_buffer(right);
+                       }
+                       return 0;
+               }
+               btrfs_tree_unlock(right);
+               free_extent_buffer(right);
+       }
+       return 1;
+}
+
+/*
+ * readahead one full node of leaves, finding things that are close
+ * to the block in 'slot', and triggering ra on them.
+ */
+static noinline void reada_for_search(struct btrfs_root *root,
+                                     struct btrfs_path *path,
+                                     int level, int slot, u64 objectid)
+{
+       struct extent_buffer *node;
+       struct btrfs_disk_key disk_key;
+       u32 nritems;
+       u64 search;
+       u64 lowest_read;
+       u64 highest_read;
+       u64 nread = 0;
+       int direction = path->reada;
+       struct extent_buffer *eb;
+       u32 nr;
+       u32 blocksize;
+       u32 nscan = 0;
+
+       if (level != 1)
+               return;
+
+       if (!path->nodes[level])
+               return;
+
+       node = path->nodes[level];
+
+       search = btrfs_node_blockptr(node, slot);
+       blocksize = btrfs_level_size(root, level - 1);
+       eb = btrfs_find_tree_block(root, search, blocksize);
+       if (eb) {
+               free_extent_buffer(eb);
+               return;
+       }
+
+       highest_read = search;
+       lowest_read = search;
+
+       nritems = btrfs_header_nritems(node);
+       nr = slot;
+       while (1) {
+               if (direction < 0) {
+                       if (nr == 0)
+                               break;
+                       nr--;
+               } else if (direction > 0) {
+                       nr++;
+                       if (nr >= nritems)
+                               break;
+               }
+               if (path->reada < 0 && objectid) {
+                       btrfs_node_key(node, &disk_key, nr);
+                       if (btrfs_disk_key_objectid(&disk_key) != objectid)
+                               break;
+               }
+               search = btrfs_node_blockptr(node, nr);
+               if ((search >= lowest_read && search <= highest_read) ||
+                   (search < lowest_read && lowest_read - search <= 16384) ||
+                   (search > highest_read && search - highest_read <= 16384)) {
+                       readahead_tree_block(root, search, blocksize,
+                                    btrfs_node_ptr_generation(node, nr));
+                       nread += blocksize;
+               }
+               nscan++;
+               if (path->reada < 2 && (nread > (64 * 1024) || nscan > 32))
+                       break;
+
+               if (nread > (256 * 1024) || nscan > 128)
+                       break;
+
+               if (search < lowest_read)
+                       lowest_read = search;
+               if (search > highest_read)
+                       highest_read = search;
+       }
+}
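+/*
+ * The readahead window above is deliberately small: only blocks within
+ * 16K of the range already read are fetched, and the loop stops after
+ * 64K (or 32 blocks scanned) when path->reada < 2, or after 256K (or
+ * 128 blocks) otherwise.
+ */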
+
+/*
+ * when we walk down the tree, it is usually safe to unlock the higher layers
+ * in the tree.  The exceptions are when our path goes through slot 0, because
+ * operations on the tree might require changing key pointers higher up in the
+ * tree.
+ *
+ * callers might also have set path->keep_locks, which tells this code to keep
+ * the lock if the path points to the last slot in the block.  This is part of
+ * walking through the tree, and selecting the next slot in the higher block.
+ *
+ * lowest_unlock sets the lowest level in the tree we're allowed to unlock.  so
+ * if lowest_unlock is 1, level 0 won't be unlocked
+ */
+static noinline void unlock_up(struct btrfs_path *path, int level,
+                              int lowest_unlock)
+{
+       int i;
+       int skip_level = level;
+       int no_skips = 0;
+       struct extent_buffer *t;
+
+       for (i = level; i < BTRFS_MAX_LEVEL; i++) {
+               if (!path->nodes[i])
+                       break;
+               if (!path->locks[i])
+                       break;
+               if (!no_skips && path->slots[i] == 0) {
+                       skip_level = i + 1;
+                       continue;
+               }
+               if (!no_skips && path->keep_locks) {
+                       u32 nritems;
+                       t = path->nodes[i];
+                       nritems = btrfs_header_nritems(t);
+                       if (nritems < 1 || path->slots[i] >= nritems - 1) {
+                               skip_level = i + 1;
+                               continue;
+                       }
+               }
+               if (skip_level < i && i >= lowest_unlock)
+                       no_skips = 1;
+
+               t = path->nodes[i];
+               if (i >= lowest_unlock && i > skip_level && path->locks[i]) {
+                       btrfs_tree_unlock(t);
+                       path->locks[i] = 0;
+               }
+       }
+}
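+/*
+ * Example: while descending with lowest_unlock == 1, if path->slots[2]
+ * is 0 then the level-3 parent stays locked, because an insert or
+ * delete in the level-2 block could change its first key, which would
+ * then have to be propagated into the parent.
+ */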
+
+/*
+ * look for key in the tree.  path is filled in with nodes along the way.
+ * if key is found, we return zero and you can find the item in the leaf
+ * level of the path (level 0)
+ *
+ * If the key isn't found, the path points to the slot where it should
+ * be inserted, and 1 is returned.  If there are other errors during the
+ * search a negative error number is returned.
+ *
+ * if ins_len > 0, nodes and leaves will be split as we walk down the
+ * tree.  if ins_len < 0, nodes will be merged as we walk down the tree
+ * (if possible)
+ */
+int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *key, struct btrfs_path *p, int
+                     ins_len, int cow)
+{
+       struct extent_buffer *b;
+       struct extent_buffer *tmp;
+       int slot;
+       int ret;
+       int level;
+       int should_reada = p->reada;
+       int lowest_unlock = 1;
+       int blocksize;
+       u8 lowest_level = 0;
+       u64 blocknr;
+       u64 gen;
+       struct btrfs_key prealloc_block;
+
+       lowest_level = p->lowest_level;
+       WARN_ON(lowest_level && ins_len > 0);
+       WARN_ON(p->nodes[0] != NULL);
+
+       if (ins_len < 0)
+               lowest_unlock = 2;
+
+       prealloc_block.objectid = 0;
+
+again:
+       if (p->skip_locking)
+               b = btrfs_root_node(root);
+       else
+               b = btrfs_lock_root_node(root);
+
+       while (b) {
+               level = btrfs_header_level(b);
+
+               /*
+                * setup the path here so we can release it under lock
+                * contention with the cow code
+                */
+               p->nodes[level] = b;
+               if (!p->skip_locking)
+                       p->locks[level] = 1;
+
+               if (cow) {
+                       int wret;
+
+                       /* check whether we can skip cowing this block */
+                       spin_lock(&root->fs_info->hash_lock);
+                       if (btrfs_header_generation(b) == trans->transid &&
+                           btrfs_header_owner(b) == root->root_key.objectid &&
+                           !btrfs_header_flag(b, BTRFS_HEADER_FLAG_WRITTEN)) {
+                               spin_unlock(&root->fs_info->hash_lock);
+                               goto cow_done;
+                       }
+                       spin_unlock(&root->fs_info->hash_lock);
+
+                       /* ok, we have to cow, is our old prealloc the right
+                        * size?
+                        */
+                       if (prealloc_block.objectid &&
+                           prealloc_block.offset != b->len) {
+                               btrfs_free_reserved_extent(root,
+                                          prealloc_block.objectid,
+                                          prealloc_block.offset);
+                               prealloc_block.objectid = 0;
+                       }
+
+                       /*
+                        * for higher level blocks, try not to allocate blocks
+                        * with the block and the parent locks held.
+                        */
+                       if (level > 1 && !prealloc_block.objectid &&
+                           btrfs_path_lock_waiting(p, level)) {
+                               u32 size = b->len;
+                               u64 hint = b->start;
+
+                               btrfs_release_path(root, p);
+                               ret = btrfs_reserve_extent(trans, root,
+                                                          size, size, 0,
+                                                          hint, (u64)-1,
+                                                          &prealloc_block, 0);
+                               BUG_ON(ret);
+                               goto again;
+                       }
+
+                       wret = btrfs_cow_block(trans, root, b,
+                                              p->nodes[level + 1],
+                                              p->slots[level + 1],
+                                              &b, prealloc_block.objectid);
+                       prealloc_block.objectid = 0;
+                       if (wret) {
+                               free_extent_buffer(b);
+                               ret = wret;
+                               goto done;
+                       }
+               }
+cow_done:
+               BUG_ON(!cow && ins_len);
+               WARN_ON(level != btrfs_header_level(b));
+               level = btrfs_header_level(b);
+
+               p->nodes[level] = b;
+               if (!p->skip_locking)
+                       p->locks[level] = 1;
+
+               ret = check_block(root, p, level);
+               if (ret) {
+                       ret = -1;
+                       goto done;
+               }
+
+               ret = bin_search(b, key, level, &slot);
+               if (level != 0) {
+                       if (ret && slot > 0)
+                               slot -= 1;
+                       p->slots[level] = slot;
+                       if ((p->search_for_split || ins_len > 0) &&
+                           btrfs_header_nritems(b) >=
+                           BTRFS_NODEPTRS_PER_BLOCK(root) - 3) {
+                               int sret = split_node(trans, root, p, level);
+                               BUG_ON(sret > 0);
+                               if (sret) {
+                                       ret = sret;
+                                       goto done;
+                               }
+                               b = p->nodes[level];
+                               slot = p->slots[level];
+                       } else if (ins_len < 0) {
+                               int sret = balance_level(trans, root, p,
+                                                        level);
+                               if (sret) {
+                                       ret = sret;
+                                       goto done;
+                               }
+                               b = p->nodes[level];
+                               if (!b) {
+                                       btrfs_release_path(NULL, p);
+                                       goto again;
+                               }
+                               slot = p->slots[level];
+                               BUG_ON(btrfs_header_nritems(b) == 1);
+                       }
+                       unlock_up(p, level, lowest_unlock);
+
+                       /* this is only true while dropping a snapshot */
+                       if (level == lowest_level) {
+                               ret = 0;
+                               goto done;
+                       }
+
+                       blocknr = btrfs_node_blockptr(b, slot);
+                       gen = btrfs_node_ptr_generation(b, slot);
+                       blocksize = btrfs_level_size(root, level - 1);
+
+                       tmp = btrfs_find_tree_block(root, blocknr, blocksize);
+                       if (tmp && btrfs_buffer_uptodate(tmp, gen)) {
+                               b = tmp;
+                       } else {
+                               /*
+                                * reduce lock contention at high levels
+                                * of the btree by dropping locks before
+                                * we read.
+                                */
+                               if (level > 1) {
+                                       btrfs_release_path(NULL, p);
+                                       if (tmp)
+                                               free_extent_buffer(tmp);
+                                       if (should_reada)
+                                               reada_for_search(root, p,
+                                                                level, slot,
+                                                                key->objectid);
+
+                                       tmp = read_tree_block(root, blocknr,
+                                                        blocksize, gen);
+                                       if (tmp)
+                                               free_extent_buffer(tmp);
+                                       goto again;
+                               } else {
+                                       if (tmp)
+                                               free_extent_buffer(tmp);
+                                       if (should_reada)
+                                               reada_for_search(root, p,
+                                                                level, slot,
+                                                                key->objectid);
+                                       b = read_node_slot(root, b, slot);
+                               }
+                       }
+                       if (!p->skip_locking)
+                               btrfs_tree_lock(b);
+               } else {
+                       p->slots[level] = slot;
+                       if (ins_len > 0 &&
+                           btrfs_leaf_free_space(root, b) < ins_len) {
+                               int sret = split_leaf(trans, root, key,
+                                                     p, ins_len, ret == 0);
+                               BUG_ON(sret > 0);
+                               if (sret) {
+                                       ret = sret;
+                                       goto done;
+                               }
+                       }
+                       if (!p->search_for_split)
+                               unlock_up(p, level, lowest_unlock);
+                       goto done;
+               }
+       }
+       ret = 1;
+done:
+       if (prealloc_block.objectid) {
+               btrfs_free_reserved_extent(root,
+                          prealloc_block.objectid,
+                          prealloc_block.offset);
+       }
+
+       return ret;
+}
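+/*
+ * A minimal read-only caller sketch (names like 'ino' are just for
+ * illustration):
+ *
+ *     struct btrfs_path *path = btrfs_alloc_path();
+ *     struct btrfs_key key;
+ *     int ret;
+ *
+ *     key.objectid = ino;
+ *     key.type = BTRFS_INODE_ITEM_KEY;
+ *     key.offset = 0;
+ *     ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+ *     if (ret == 0)
+ *             the item is at path->nodes[0], slot path->slots[0]
+ *     btrfs_free_path(path);
+ *
+ * With ins_len == 0 and cow == 0 this is a pure lookup, so the
+ * transaction handle may be NULL.
+ */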
+
+int btrfs_merge_path(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root,
+                    struct btrfs_key *node_keys,
+                    u64 *nodes, int lowest_level)
+{
+       struct extent_buffer *eb;
+       struct extent_buffer *parent;
+       struct btrfs_key key;
+       u64 bytenr;
+       u64 generation;
+       u32 blocksize;
+       int level;
+       int slot;
+       int key_match;
+       int ret;
+
+       eb = btrfs_lock_root_node(root);
+       ret = btrfs_cow_block(trans, root, eb, NULL, 0, &eb, 0);
+       BUG_ON(ret);
+
+       parent = eb;
+       while (1) {
+               level = btrfs_header_level(parent);
+               if (level == 0 || level <= lowest_level)
+                       break;
+
+               ret = bin_search(parent, &node_keys[lowest_level], level,
+                                &slot);
+               if (ret && slot > 0)
+                       slot--;
+
+               bytenr = btrfs_node_blockptr(parent, slot);
+               if (nodes[level - 1] == bytenr)
+                       break;
+
+               blocksize = btrfs_level_size(root, level - 1);
+               generation = btrfs_node_ptr_generation(parent, slot);
+               btrfs_node_key_to_cpu(eb, &key, slot);
+               key_match = !memcmp(&key, &node_keys[level - 1], sizeof(key));
+
+               if (generation == trans->transid) {
+                       eb = read_tree_block(root, bytenr, blocksize,
+                                            generation);
+                       btrfs_tree_lock(eb);
+               }
+
+               /*
+                * if the node keys match and the node pointer hasn't been
+                * modified in the running transaction, we can merge the
+                * path.  for blocks owned by reloc trees, the node pointer
+                * check is skipped; this is because these blocks are fully
+                * controlled by the space balance code, no one else can
+                * modify them.
+                */
+               if (!nodes[level - 1] || !key_match ||
+                   (generation == trans->transid &&
+                    btrfs_header_owner(eb) != BTRFS_TREE_RELOC_OBJECTID)) {
+                       if (level == 1 || level == lowest_level + 1) {
+                               if (generation == trans->transid) {
+                                       btrfs_tree_unlock(eb);
+                                       free_extent_buffer(eb);
+                               }
+                               break;
+                       }
+
+                       if (generation != trans->transid) {
+                               eb = read_tree_block(root, bytenr, blocksize,
+                                               generation);
+                               btrfs_tree_lock(eb);
+                       }
+
+                       ret = btrfs_cow_block(trans, root, eb, parent, slot,
+                                             &eb, 0);
+                       BUG_ON(ret);
+
+                       if (root->root_key.objectid ==
+                           BTRFS_TREE_RELOC_OBJECTID) {
+                               if (!nodes[level - 1]) {
+                                       nodes[level - 1] = eb->start;
+                                       memcpy(&node_keys[level - 1], &key,
+                                              sizeof(node_keys[0]));
+                               } else {
+                                       WARN_ON(1);
+                               }
+                       }
+
+                       btrfs_tree_unlock(parent);
+                       free_extent_buffer(parent);
+                       parent = eb;
+                       continue;
+               }
+
+               btrfs_set_node_blockptr(parent, slot, nodes[level - 1]);
+               btrfs_set_node_ptr_generation(parent, slot, trans->transid);
+               btrfs_mark_buffer_dirty(parent);
+
+               ret = btrfs_inc_extent_ref(trans, root,
+                                       nodes[level - 1],
+                                       blocksize, parent->start,
+                                       btrfs_header_owner(parent),
+                                       btrfs_header_generation(parent),
+                                       level - 1);
+               BUG_ON(ret);
+
+               /*
+                * If the block was created in the running transaction,
+                * it's possible this is the last reference to it, so we
+                * should drop the subtree.
+                */
+               if (generation == trans->transid) {
+                       ret = btrfs_drop_subtree(trans, root, eb, parent);
+                       BUG_ON(ret);
+                       btrfs_tree_unlock(eb);
+                       free_extent_buffer(eb);
+               } else {
+                       ret = btrfs_free_extent(trans, root, bytenr,
+                                       blocksize, parent->start,
+                                       btrfs_header_owner(parent),
+                                       btrfs_header_generation(parent),
+                                       level - 1, 1);
+                       BUG_ON(ret);
+               }
+               break;
+       }
+       btrfs_tree_unlock(parent);
+       free_extent_buffer(parent);
+       return 0;
+}
+
+/*
+ * adjust the pointers going up the tree, starting at level
+ * making sure the right key of each node points to 'key'.
+ * This is used after shifting pointers to the left, so it stops
+ * fixing up pointers when a given leaf/node is not in slot 0 of the
+ * higher levels
+ *
+ * If this fails to write a tree block, it returns -1, but continues
+ * fixing up the blocks in ram so the tree is consistent.
+ */
+static int fixup_low_keys(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, struct btrfs_path *path,
+                         struct btrfs_disk_key *key, int level)
+{
+       int i;
+       int ret = 0;
+       struct extent_buffer *t;
+
+       for (i = level; i < BTRFS_MAX_LEVEL; i++) {
+               int tslot = path->slots[i];
+               if (!path->nodes[i])
+                       break;
+               t = path->nodes[i];
+               btrfs_set_node_key(t, key, tslot);
+               btrfs_mark_buffer_dirty(path->nodes[i]);
+               if (tslot != 0)
+                       break;
+       }
+       return ret;
+}
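+
+/*
+ * Illustrative walk-through (hypothetical tree): when the first item
+ * of a leaf changes, the parent's copy of that key only needs fixing
+ * if the leaf sits in slot 0 of the parent, and likewise at every
+ * level above; that is why the loop stops at the first nonzero
+ * tslot.
+ */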
+
+/*
+ * update item key.
+ *
+ * This function isn't completely safe. It's the caller's responsibility
+ * to ensure that the new key won't break the sort order.
+ */
+int btrfs_set_item_key_safe(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, struct btrfs_path *path,
+                           struct btrfs_key *new_key)
+{
+       struct btrfs_disk_key disk_key;
+       struct extent_buffer *eb;
+       int slot;
+
+       eb = path->nodes[0];
+       slot = path->slots[0];
+       if (slot > 0) {
+               btrfs_item_key(eb, &disk_key, slot - 1);
+               if (comp_keys(&disk_key, new_key) >= 0)
+                       return -1;
+       }
+       if (slot < btrfs_header_nritems(eb) - 1) {
+               btrfs_item_key(eb, &disk_key, slot + 1);
+               if (comp_keys(&disk_key, new_key) <= 0)
+                       return -1;
+       }
+
+       btrfs_cpu_key_to_disk(&disk_key, new_key);
+       btrfs_set_item_key(eb, &disk_key, slot);
+       btrfs_mark_buffer_dirty(eb);
+       if (slot == 0)
+               fixup_low_keys(trans, root, path, &disk_key, 1);
+       return 0;
+}
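+
+/*
+ * Worked example (hypothetical keys): with the neighbouring items at
+ * slot - 1 and slot + 1 holding keys that sort as 20 and 30, a
+ * new_key sorting as 25 passes both comp_keys() checks and is
+ * written in place; a new_key sorting as 5 fails the check against
+ * slot - 1 and the function returns -1 without touching the leaf.
+ */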
+
+/*
+ * try to push data from one node into the next node left in the
+ * tree.
+ *
+ * returns 0 if some ptrs were pushed left, < 0 if there was some horrible
+ * error, and > 0 if there was no room in the left hand block.
+ */
+static int push_node_left(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, struct extent_buffer *dst,
+                         struct extent_buffer *src, int empty)
+{
+       int push_items = 0;
+       int src_nritems;
+       int dst_nritems;
+       int ret = 0;
+
+       src_nritems = btrfs_header_nritems(src);
+       dst_nritems = btrfs_header_nritems(dst);
+       push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems;
+       WARN_ON(btrfs_header_generation(src) != trans->transid);
+       WARN_ON(btrfs_header_generation(dst) != trans->transid);
+
+       if (!empty && src_nritems <= 8)
+               return 1;
+
+       if (push_items <= 0)
+               return 1;
+
+       if (empty) {
+               push_items = min(src_nritems, push_items);
+               if (push_items < src_nritems) {
+                       /* leave at least 8 pointers in the node if
+                        * we aren't going to empty it
+                        */
+                       if (src_nritems - push_items < 8) {
+                               if (push_items <= 8)
+                                       return 1;
+                               push_items -= 8;
+                       }
+               }
+       } else
+               push_items = min(src_nritems - 8, push_items);
+
+       copy_extent_buffer(dst, src,
+                          btrfs_node_key_ptr_offset(dst_nritems),
+                          btrfs_node_key_ptr_offset(0),
+                          push_items * sizeof(struct btrfs_key_ptr));
+
+       if (push_items < src_nritems) {
+               memmove_extent_buffer(src, btrfs_node_key_ptr_offset(0),
+                                     btrfs_node_key_ptr_offset(push_items),
+                                     (src_nritems - push_items) *
+                                     sizeof(struct btrfs_key_ptr));
+       }
+       btrfs_set_header_nritems(src, src_nritems - push_items);
+       btrfs_set_header_nritems(dst, dst_nritems + push_items);
+       btrfs_mark_buffer_dirty(src);
+       btrfs_mark_buffer_dirty(dst);
+
+       ret = btrfs_update_ref(trans, root, src, dst, dst_nritems, push_items);
+       BUG_ON(ret);
+
+       return ret;
+}
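+
+/*
+ * Worked example (hypothetical counts): with empty == 0,
+ * src_nritems == 12 and room for 20 more pointers in dst, push_items
+ * becomes min(12 - 8, 20) == 4, preserving the 8-pointer reserve in
+ * src; with src_nritems <= 8 the function returns 1 without moving
+ * anything.
+ */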
+
+/*
+ * try to push data from one node into the next node right in the
+ * tree.
+ *
+ * returns 0 if some ptrs were pushed, < 0 if there was some horrible
+ * error, and > 0 if there was no room in the right hand block.
+ *
+ * this will only push up to 1/2 the contents of the left node over
+ */
+static int balance_node_right(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct extent_buffer *dst,
+                             struct extent_buffer *src)
+{
+       int push_items = 0;
+       int max_push;
+       int src_nritems;
+       int dst_nritems;
+       int ret = 0;
+
+       WARN_ON(btrfs_header_generation(src) != trans->transid);
+       WARN_ON(btrfs_header_generation(dst) != trans->transid);
+
+       src_nritems = btrfs_header_nritems(src);
+       dst_nritems = btrfs_header_nritems(dst);
+       push_items = BTRFS_NODEPTRS_PER_BLOCK(root) - dst_nritems;
+       if (push_items <= 0)
+               return 1;
+
+       if (src_nritems < 4)
+               return 1;
+
+       max_push = src_nritems / 2 + 1;
+       /* don't try to empty the node */
+       if (max_push >= src_nritems)
+               return 1;
+
+       if (max_push < push_items)
+               push_items = max_push;
+
+       memmove_extent_buffer(dst, btrfs_node_key_ptr_offset(push_items),
+                                     btrfs_node_key_ptr_offset(0),
+                                     (dst_nritems) *
+                                     sizeof(struct btrfs_key_ptr));
+
+       copy_extent_buffer(dst, src,
+                          btrfs_node_key_ptr_offset(0),
+                          btrfs_node_key_ptr_offset(src_nritems - push_items),
+                          push_items * sizeof(struct btrfs_key_ptr));
+
+       btrfs_set_header_nritems(src, src_nritems - push_items);
+       btrfs_set_header_nritems(dst, dst_nritems + push_items);
+
+       btrfs_mark_buffer_dirty(src);
+       btrfs_mark_buffer_dirty(dst);
+
+       ret = btrfs_update_ref(trans, root, src, dst, 0, push_items);
+       BUG_ON(ret);
+
+       return ret;
+}
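+
+/*
+ * Worked example (hypothetical counts): src_nritems == 9 gives
+ * max_push = 9 / 2 + 1 == 5, so at most 5 of the 9 pointers move
+ * right no matter how much room dst has; src_nritems < 4 or a full
+ * dst returns 1 immediately.
+ */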
+
+/*
+ * helper function to insert a new root level in the tree.
+ * A new node is allocated, and a single item is inserted to
+ * point to the existing root
+ *
+ * returns zero on success or < 0 on failure.
+ */
+static noinline int insert_new_root(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          struct btrfs_path *path, int level)
+{
+       u64 lower_gen;
+       struct extent_buffer *lower;
+       struct extent_buffer *c;
+       struct extent_buffer *old;
+       struct btrfs_disk_key lower_key;
+       int ret;
+
+       BUG_ON(path->nodes[level]);
+       BUG_ON(path->nodes[level-1] != root->node);
+
+       lower = path->nodes[level-1];
+       if (level == 1)
+               btrfs_item_key(lower, &lower_key, 0);
+       else
+               btrfs_node_key(lower, &lower_key, 0);
+
+       c = btrfs_alloc_free_block(trans, root, root->nodesize, 0,
+                                  root->root_key.objectid, trans->transid,
+                                  level, root->node->start, 0);
+       if (IS_ERR(c))
+               return PTR_ERR(c);
+
+       memset_extent_buffer(c, 0, 0, root->nodesize);
+       btrfs_set_header_nritems(c, 1);
+       btrfs_set_header_level(c, level);
+       btrfs_set_header_bytenr(c, c->start);
+       btrfs_set_header_generation(c, trans->transid);
+       btrfs_set_header_owner(c, root->root_key.objectid);
+
+       write_extent_buffer(c, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(c),
+                           BTRFS_FSID_SIZE);
+
+       write_extent_buffer(c, root->fs_info->chunk_tree_uuid,
+                           (unsigned long)btrfs_header_chunk_tree_uuid(c),
+                           BTRFS_UUID_SIZE);
+
+       btrfs_set_node_key(c, &lower_key, 0);
+       btrfs_set_node_blockptr(c, 0, lower->start);
+       lower_gen = btrfs_header_generation(lower);
+       WARN_ON(lower_gen != trans->transid);
+
+       btrfs_set_node_ptr_generation(c, 0, lower_gen);
+
+       btrfs_mark_buffer_dirty(c);
+
+       spin_lock(&root->node_lock);
+       old = root->node;
+       root->node = c;
+       spin_unlock(&root->node_lock);
+
+       ret = btrfs_update_extent_ref(trans, root, lower->start,
+                                     lower->start, c->start,
+                                     root->root_key.objectid,
+                                     trans->transid, level - 1);
+       BUG_ON(ret);
+
+       /* the super has an extra ref to root->node */
+       free_extent_buffer(old);
+
+       add_root_to_dirty_list(root);
+       extent_buffer_get(c);
+       path->nodes[level] = c;
+       path->locks[level] = 1;
+       path->slots[level] = 0;
+       return 0;
+}
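+
+/*
+ * Sketch of the effect (hypothetical tree):
+ *
+ *   before:  root = L (level N)    after:  root = c (level N + 1)
+ *                                          c[0] -> L
+ *
+ * The single key_ptr in 'c' carries L's first key, so lookups
+ * descend through the new level unchanged.
+ */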
+
+/*
+ * worker function to insert a single pointer in a node.
+ * the node should have enough room for the pointer already
+ *
+ * slot and level indicate where you want the key to go, and
+ * bytenr is the block the key points to.
+ *
+ * returns zero on success and < 0 on any error
+ */
+static int insert_ptr(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_path *path, struct btrfs_disk_key
+                     *key, u64 bytenr, int slot, int level)
+{
+       struct extent_buffer *lower;
+       int nritems;
+
+       BUG_ON(!path->nodes[level]);
+       lower = path->nodes[level];
+       nritems = btrfs_header_nritems(lower);
+       if (slot > nritems)
+               BUG();
+       if (nritems == BTRFS_NODEPTRS_PER_BLOCK(root))
+               BUG();
+       if (slot != nritems) {
+               memmove_extent_buffer(lower,
+                             btrfs_node_key_ptr_offset(slot + 1),
+                             btrfs_node_key_ptr_offset(slot),
+                             (nritems - slot) * sizeof(struct btrfs_key_ptr));
+       }
+       btrfs_set_node_key(lower, key, slot);
+       btrfs_set_node_blockptr(lower, slot, bytenr);
+       WARN_ON(trans->transid == 0);
+       btrfs_set_node_ptr_generation(lower, slot, trans->transid);
+       btrfs_set_header_nritems(lower, nritems + 1);
+       btrfs_mark_buffer_dirty(lower);
+       return 0;
+}
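+
+/*
+ * Worked example (hypothetical layout): inserting at slot 2 of a
+ * node with nritems == 5 memmoves the key_ptrs in slots 2..4 up by
+ * one sizeof(struct btrfs_key_ptr), writes the new key/bytenr pair
+ * at slot 2 and bumps nritems to 6; inserting at slot == nritems
+ * skips the memmove and simply appends.
+ */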
+
+/*
+ * split the node at the specified level in path in two.
+ * The path is corrected to point to the appropriate node after the split
+ *
+ * Before splitting, this tries to make some room in the node by pushing
+ * left and right; if either one works, it returns right away.
+ *
+ * returns 0 on success and < 0 on failure
+ */
+static noinline int split_node(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              struct btrfs_path *path, int level)
+{
+       struct extent_buffer *c;
+       struct extent_buffer *split;
+       struct btrfs_disk_key disk_key;
+       int mid;
+       int ret;
+       int wret;
+       u32 c_nritems;
+
+       c = path->nodes[level];
+       WARN_ON(btrfs_header_generation(c) != trans->transid);
+       if (c == root->node) {
+               /* trying to split the root, let's make a new one */
+               ret = insert_new_root(trans, root, path, level + 1);
+               if (ret)
+                       return ret;
+       } else {
+               ret = push_nodes_for_insert(trans, root, path, level);
+               c = path->nodes[level];
+               if (!ret && btrfs_header_nritems(c) <
+                   BTRFS_NODEPTRS_PER_BLOCK(root) - 3)
+                       return 0;
+               if (ret < 0)
+                       return ret;
+       }
+
+       c_nritems = btrfs_header_nritems(c);
+
+       split = btrfs_alloc_free_block(trans, root, root->nodesize,
+                                       path->nodes[level + 1]->start,
+                                       root->root_key.objectid,
+                                       trans->transid, level, c->start, 0);
+       if (IS_ERR(split))
+               return PTR_ERR(split);
+
+       btrfs_set_header_flags(split, btrfs_header_flags(c));
+       btrfs_set_header_level(split, btrfs_header_level(c));
+       btrfs_set_header_bytenr(split, split->start);
+       btrfs_set_header_generation(split, trans->transid);
+       btrfs_set_header_owner(split, root->root_key.objectid);
+       btrfs_set_header_flags(split, 0);
+       write_extent_buffer(split, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(split),
+                           BTRFS_FSID_SIZE);
+       write_extent_buffer(split, root->fs_info->chunk_tree_uuid,
+                           (unsigned long)btrfs_header_chunk_tree_uuid(split),
+                           BTRFS_UUID_SIZE);
+
+       mid = (c_nritems + 1) / 2;
+
+       copy_extent_buffer(split, c,
+                          btrfs_node_key_ptr_offset(0),
+                          btrfs_node_key_ptr_offset(mid),
+                          (c_nritems - mid) * sizeof(struct btrfs_key_ptr));
+       btrfs_set_header_nritems(split, c_nritems - mid);
+       btrfs_set_header_nritems(c, mid);
+       ret = 0;
+
+       btrfs_mark_buffer_dirty(c);
+       btrfs_mark_buffer_dirty(split);
+
+       btrfs_node_key(split, &disk_key, 0);
+       wret = insert_ptr(trans, root, path, &disk_key, split->start,
+                         path->slots[level + 1] + 1,
+                         level + 1);
+       if (wret)
+               ret = wret;
+
+       ret = btrfs_update_ref(trans, root, c, split, 0, c_nritems - mid);
+       BUG_ON(ret);
+
+       if (path->slots[level] >= mid) {
+               path->slots[level] -= mid;
+               btrfs_tree_unlock(c);
+               free_extent_buffer(c);
+               path->nodes[level] = split;
+               path->slots[level + 1] += 1;
+       } else {
+               btrfs_tree_unlock(split);
+               free_extent_buffer(split);
+       }
+       return ret;
+}
+
+/*
+ * how many bytes are required to store the items in a leaf.  start
+ * and nr indicate which items in the leaf to check.  This totals up the
+ * space used both by the item structs and the item data
+ */
+static int leaf_space_used(struct extent_buffer *l, int start, int nr)
+{
+       int data_len;
+       int nritems = btrfs_header_nritems(l);
+       int end = min(nritems, start + nr) - 1;
+
+       if (!nr)
+               return 0;
+       data_len = btrfs_item_end_nr(l, start);
+       data_len = data_len - btrfs_item_offset_nr(l, end);
+       data_len += sizeof(struct btrfs_item) * nr;
+       WARN_ON(data_len < 0);
+       return data_len;
+}
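+
+/*
+ * Worked example (hypothetical sizes): for leaf_space_used(l, 2, 2)
+ * with item 2 holding 100 bytes of data and item 3 holding 50, the
+ * leaf data is packed back to front, so btrfs_item_end_nr(l, 2) -
+ * btrfs_item_offset_nr(l, 3) == 150; adding 2 * sizeof(struct
+ * btrfs_item) accounts for the two item headers as well.
+ */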
+
+/*
+ * The space between the end of the leaf items and
+ * the start of the leaf data.  IOW, how much room
+ * the leaf has left for both items and data
+ */
+noinline int btrfs_leaf_free_space(struct btrfs_root *root,
+                                  struct extent_buffer *leaf)
+{
+       int nritems = btrfs_header_nritems(leaf);
+       int ret;
+       ret = BTRFS_LEAF_DATA_SIZE(root) - leaf_space_used(leaf, 0, nritems);
+       if (ret < 0) {
+               printk(KERN_CRIT "leaf free space ret %d, leaf data size %lu, "
+                      "used %d nritems %d\n",
+                      ret, (unsigned long) BTRFS_LEAF_DATA_SIZE(root),
+                      leaf_space_used(leaf, 0, nritems), nritems);
+       }
+       return ret;
+}
+
+/*
+ * push some data in the path leaf to the right, trying to free up at
+ * least data_size bytes.
+ *
+ * returns 1 if the push failed because the other node didn't have enough
+ * room, 0 if everything worked out and < 0 if there were major errors.
+ */
+static int push_leaf_right(struct btrfs_trans_handle *trans, struct btrfs_root
+                          *root, struct btrfs_path *path, int data_size,
+                          int empty)
+{
+       struct extent_buffer *left = path->nodes[0];
+       struct extent_buffer *right;
+       struct extent_buffer *upper;
+       struct btrfs_disk_key disk_key;
+       int slot;
+       u32 i;
+       int free_space;
+       int push_space = 0;
+       int push_items = 0;
+       struct btrfs_item *item;
+       u32 left_nritems;
+       u32 nr;
+       u32 right_nritems;
+       u32 data_end;
+       u32 this_item_size;
+       int ret;
+
+       slot = path->slots[1];
+       if (!path->nodes[1])
+               return 1;
+
+       upper = path->nodes[1];
+       if (slot >= btrfs_header_nritems(upper) - 1)
+               return 1;
+
+       WARN_ON(!btrfs_tree_locked(path->nodes[1]));
+
+       right = read_node_slot(root, upper, slot + 1);
+       btrfs_tree_lock(right);
+       free_space = btrfs_leaf_free_space(root, right);
+       if (free_space < data_size)
+               goto out_unlock;
+
+       /* cow and double check */
+       ret = btrfs_cow_block(trans, root, right, upper,
+                             slot + 1, &right, 0);
+       if (ret)
+               goto out_unlock;
+
+       free_space = btrfs_leaf_free_space(root, right);
+       if (free_space < data_size)
+               goto out_unlock;
+
+       left_nritems = btrfs_header_nritems(left);
+       if (left_nritems == 0)
+               goto out_unlock;
+
+       if (empty)
+               nr = 0;
+       else
+               nr = 1;
+
+       if (path->slots[0] >= left_nritems)
+               push_space += data_size;
+
+       i = left_nritems - 1;
+       while (i >= nr) {
+               item = btrfs_item_nr(left, i);
+
+               if (!empty && push_items > 0) {
+                       if (path->slots[0] > i)
+                               break;
+                       if (path->slots[0] == i) {
+                               int space = btrfs_leaf_free_space(root, left);
+                               if (space + push_space * 2 > free_space)
+                                       break;
+                       }
+               }
+
+               if (path->slots[0] == i)
+                       push_space += data_size;
+
+               if (!left->map_token) {
+                       map_extent_buffer(left, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &left->map_token, &left->kaddr,
+                                       &left->map_start, &left->map_len,
+                                       KM_USER1);
+               }
+
+               this_item_size = btrfs_item_size(left, item);
+               if (this_item_size + sizeof(*item) + push_space > free_space)
+                       break;
+
+               push_items++;
+               push_space += this_item_size + sizeof(*item);
+               if (i == 0)
+                       break;
+               i--;
+       }
+       if (left->map_token) {
+               unmap_extent_buffer(left, left->map_token, KM_USER1);
+               left->map_token = NULL;
+       }
+
+       if (push_items == 0)
+               goto out_unlock;
+
+       if (!empty && push_items == left_nritems)
+               WARN_ON(1);
+
+       /* push left to right */
+       right_nritems = btrfs_header_nritems(right);
+
+       push_space = btrfs_item_end_nr(left, left_nritems - push_items);
+       push_space -= leaf_data_end(root, left);
+
+       /* make room in the right data area */
+       data_end = leaf_data_end(root, right);
+       memmove_extent_buffer(right,
+                             btrfs_leaf_data(right) + data_end - push_space,
+                             btrfs_leaf_data(right) + data_end,
+                             BTRFS_LEAF_DATA_SIZE(root) - data_end);
+
+       /* copy from the left data area */
+       copy_extent_buffer(right, left, btrfs_leaf_data(right) +
+                    BTRFS_LEAF_DATA_SIZE(root) - push_space,
+                    btrfs_leaf_data(left) + leaf_data_end(root, left),
+                    push_space);
+
+       memmove_extent_buffer(right, btrfs_item_nr_offset(push_items),
+                             btrfs_item_nr_offset(0),
+                             right_nritems * sizeof(struct btrfs_item));
+
+       /* copy the items from left to right */
+       copy_extent_buffer(right, left, btrfs_item_nr_offset(0),
+                  btrfs_item_nr_offset(left_nritems - push_items),
+                  push_items * sizeof(struct btrfs_item));
+
+       /* update the item pointers */
+       right_nritems += push_items;
+       btrfs_set_header_nritems(right, right_nritems);
+       push_space = BTRFS_LEAF_DATA_SIZE(root);
+       for (i = 0; i < right_nritems; i++) {
+               item = btrfs_item_nr(right, i);
+               if (!right->map_token) {
+                       map_extent_buffer(right, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &right->map_token, &right->kaddr,
+                                       &right->map_start, &right->map_len,
+                                       KM_USER1);
+               }
+               push_space -= btrfs_item_size(right, item);
+               btrfs_set_item_offset(right, item, push_space);
+       }
+
+       if (right->map_token) {
+               unmap_extent_buffer(right, right->map_token, KM_USER1);
+               right->map_token = NULL;
+       }
+       left_nritems -= push_items;
+       btrfs_set_header_nritems(left, left_nritems);
+
+       if (left_nritems)
+               btrfs_mark_buffer_dirty(left);
+       btrfs_mark_buffer_dirty(right);
+
+       ret = btrfs_update_ref(trans, root, left, right, 0, push_items);
+       BUG_ON(ret);
+
+       btrfs_item_key(right, &disk_key, 0);
+       btrfs_set_node_key(upper, &disk_key, slot + 1);
+       btrfs_mark_buffer_dirty(upper);
+
+       /* then fixup the leaf pointer in the path */
+       if (path->slots[0] >= left_nritems) {
+               path->slots[0] -= left_nritems;
+               if (btrfs_header_nritems(path->nodes[0]) == 0)
+                       clean_tree_block(trans, root, path->nodes[0]);
+               btrfs_tree_unlock(path->nodes[0]);
+               free_extent_buffer(path->nodes[0]);
+               path->nodes[0] = right;
+               path->slots[1] += 1;
+       } else {
+               btrfs_tree_unlock(right);
+               free_extent_buffer(right);
+       }
+       return 0;
+
+out_unlock:
+       btrfs_tree_unlock(right);
+       free_extent_buffer(right);
+       return 1;
+}
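+
+/*
+ * Worked example (hypothetical sizes): with free_space == 300 in the
+ * right leaf and the left leaf's last three items each needing 120,
+ * 100 and 90 bytes (data plus sizeof(struct btrfs_item)), the loop
+ * walking down from left_nritems - 1 accepts the 120- and 100-byte
+ * items, stops at the 90-byte one (220 + 90 > 300), and pushes
+ * push_items == 2.
+ */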
+
+/*
+ * push some data in the path leaf to the left, trying to free up at
+ * least data_size bytes.  returns zero if the push worked, nonzero otherwise
+ */
+static int push_leaf_left(struct btrfs_trans_handle *trans, struct btrfs_root
+                         *root, struct btrfs_path *path, int data_size,
+                         int empty)
+{
+       struct btrfs_disk_key disk_key;
+       struct extent_buffer *right = path->nodes[0];
+       struct extent_buffer *left;
+       int slot;
+       int i;
+       int free_space;
+       int push_space = 0;
+       int push_items = 0;
+       struct btrfs_item *item;
+       u32 old_left_nritems;
+       u32 right_nritems;
+       u32 nr;
+       int ret = 0;
+       int wret;
+       u32 this_item_size;
+       u32 old_left_item_size;
+
+       slot = path->slots[1];
+       if (slot == 0)
+               return 1;
+       if (!path->nodes[1])
+               return 1;
+
+       right_nritems = btrfs_header_nritems(right);
+       if (right_nritems == 0)
+               return 1;
+
+       WARN_ON(!btrfs_tree_locked(path->nodes[1]));
+
+       left = read_node_slot(root, path->nodes[1], slot - 1);
+       btrfs_tree_lock(left);
+       free_space = btrfs_leaf_free_space(root, left);
+       if (free_space < data_size) {
+               ret = 1;
+               goto out;
+       }
+
+       /* cow and double check */
+       ret = btrfs_cow_block(trans, root, left,
+                             path->nodes[1], slot - 1, &left, 0);
+       if (ret) {
+               /* we hit -ENOSPC, but it isn't fatal here */
+               ret = 1;
+               goto out;
+       }
+
+       free_space = btrfs_leaf_free_space(root, left);
+       if (free_space < data_size) {
+               ret = 1;
+               goto out;
+       }
+
+       if (empty)
+               nr = right_nritems;
+       else
+               nr = right_nritems - 1;
+
+       for (i = 0; i < nr; i++) {
+               item = btrfs_item_nr(right, i);
+               if (!right->map_token) {
+                       map_extent_buffer(right, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &right->map_token, &right->kaddr,
+                                       &right->map_start, &right->map_len,
+                                       KM_USER1);
+               }
+
+               if (!empty && push_items > 0) {
+                       if (path->slots[0] < i)
+                               break;
+                       if (path->slots[0] == i) {
+                               int space = btrfs_leaf_free_space(root, right);
+                               if (space + push_space * 2 > free_space)
+                                       break;
+                       }
+               }
+
+               if (path->slots[0] == i)
+                       push_space += data_size;
+
+               this_item_size = btrfs_item_size(right, item);
+               if (this_item_size + sizeof(*item) + push_space > free_space)
+                       break;
+
+               push_items++;
+               push_space += this_item_size + sizeof(*item);
+       }
+
+       if (right->map_token) {
+               unmap_extent_buffer(right, right->map_token, KM_USER1);
+               right->map_token = NULL;
+       }
+
+       if (push_items == 0) {
+               ret = 1;
+               goto out;
+       }
+       if (!empty && push_items == btrfs_header_nritems(right))
+               WARN_ON(1);
+
+       /* push data from right to left */
+       copy_extent_buffer(left, right,
+                          btrfs_item_nr_offset(btrfs_header_nritems(left)),
+                          btrfs_item_nr_offset(0),
+                          push_items * sizeof(struct btrfs_item));
+
+       push_space = BTRFS_LEAF_DATA_SIZE(root) -
+                    btrfs_item_offset_nr(right, push_items - 1);
+
+       copy_extent_buffer(left, right, btrfs_leaf_data(left) +
+                    leaf_data_end(root, left) - push_space,
+                    btrfs_leaf_data(right) +
+                    btrfs_item_offset_nr(right, push_items - 1),
+                    push_space);
+       old_left_nritems = btrfs_header_nritems(left);
+       BUG_ON(old_left_nritems <= 0);
+
+       old_left_item_size = btrfs_item_offset_nr(left, old_left_nritems - 1);
+       for (i = old_left_nritems; i < old_left_nritems + push_items; i++) {
+               u32 ioff;
+
+               item = btrfs_item_nr(left, i);
+               if (!left->map_token) {
+                       map_extent_buffer(left, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &left->map_token, &left->kaddr,
+                                       &left->map_start, &left->map_len,
+                                       KM_USER1);
+               }
+
+               ioff = btrfs_item_offset(left, item);
+               btrfs_set_item_offset(left, item,
+                     ioff - (BTRFS_LEAF_DATA_SIZE(root) - old_left_item_size));
+       }
+       btrfs_set_header_nritems(left, old_left_nritems + push_items);
+       if (left->map_token) {
+               unmap_extent_buffer(left, left->map_token, KM_USER1);
+               left->map_token = NULL;
+       }
+
+       /* fixup right node */
+       if (push_items > right_nritems) {
+               printk(KERN_CRIT "push items %d nr %u\n", push_items,
+                      right_nritems);
+               WARN_ON(1);
+       }
+
+       if (push_items < right_nritems) {
+               push_space = btrfs_item_offset_nr(right, push_items - 1) -
+                                                 leaf_data_end(root, right);
+               memmove_extent_buffer(right, btrfs_leaf_data(right) +
+                                     BTRFS_LEAF_DATA_SIZE(root) - push_space,
+                                     btrfs_leaf_data(right) +
+                                     leaf_data_end(root, right), push_space);
+
+               memmove_extent_buffer(right, btrfs_item_nr_offset(0),
+                             btrfs_item_nr_offset(push_items),
+                            (btrfs_header_nritems(right) - push_items) *
+                            sizeof(struct btrfs_item));
+       }
+       right_nritems -= push_items;
+       btrfs_set_header_nritems(right, right_nritems);
+       push_space = BTRFS_LEAF_DATA_SIZE(root);
+       for (i = 0; i < right_nritems; i++) {
+               item = btrfs_item_nr(right, i);
+
+               if (!right->map_token) {
+                       map_extent_buffer(right, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &right->map_token, &right->kaddr,
+                                       &right->map_start, &right->map_len,
+                                       KM_USER1);
+               }
+
+               push_space = push_space - btrfs_item_size(right, item);
+               btrfs_set_item_offset(right, item, push_space);
+       }
+       if (right->map_token) {
+               unmap_extent_buffer(right, right->map_token, KM_USER1);
+               right->map_token = NULL;
+       }
+
+       btrfs_mark_buffer_dirty(left);
+       if (right_nritems)
+               btrfs_mark_buffer_dirty(right);
+
+       ret = btrfs_update_ref(trans, root, right, left,
+                              old_left_nritems, push_items);
+       BUG_ON(ret);
+
+       btrfs_item_key(right, &disk_key, 0);
+       wret = fixup_low_keys(trans, root, path, &disk_key, 1);
+       if (wret)
+               ret = wret;
+
+       /* then fixup the leaf pointer in the path */
+       if (path->slots[0] < push_items) {
+               path->slots[0] += old_left_nritems;
+               if (btrfs_header_nritems(path->nodes[0]) == 0)
+                       clean_tree_block(trans, root, path->nodes[0]);
+               btrfs_tree_unlock(path->nodes[0]);
+               free_extent_buffer(path->nodes[0]);
+               path->nodes[0] = left;
+               path->slots[1] -= 1;
+       } else {
+               btrfs_tree_unlock(left);
+               free_extent_buffer(left);
+               path->slots[0] -= push_items;
+       }
+       BUG_ON(path->slots[0] < 0);
+       return ret;
+out:
+       btrfs_tree_unlock(left);
+       free_extent_buffer(left);
+       return ret;
+}
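+
+/*
+ * Worked example (hypothetical offsets): with
+ * BTRFS_LEAF_DATA_SIZE(root) == 4096 and the left leaf's last item
+ * data starting at offset 3000 (old_left_item_size == 3000), a
+ * pushed item whose data sat at offset 4000 in the right leaf is
+ * rebased to 4000 - (4096 - 3000) == 2904, packing it directly below
+ * the existing left data.
+ */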
+
+/*
+ * split the path's leaf in two, making sure there is at least data_size
+ * available for the resulting leaf level of the path.
+ *
+ * returns 0 if all went well and < 0 on failure.
+ */
+static noinline int split_leaf(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              struct btrfs_key *ins_key,
+                              struct btrfs_path *path, int data_size,
+                              int extend)
+{
+       struct extent_buffer *l;
+       u32 nritems;
+       int mid;
+       int slot;
+       struct extent_buffer *right;
+       int data_copy_size;
+       int rt_data_off;
+       int i;
+       int ret = 0;
+       int wret;
+       int double_split;
+       int num_doubles = 0;
+       struct btrfs_disk_key disk_key;
+
+       /* first try to make some room by pushing left and right */
+       if (data_size && ins_key->type != BTRFS_DIR_ITEM_KEY) {
+               wret = push_leaf_right(trans, root, path, data_size, 0);
+               if (wret < 0)
+                       return wret;
+               if (wret) {
+                       wret = push_leaf_left(trans, root, path, data_size, 0);
+                       if (wret < 0)
+                               return wret;
+               }
+               l = path->nodes[0];
+
+               /* did the pushes work? */
+               if (btrfs_leaf_free_space(root, l) >= data_size)
+                       return 0;
+       }
+
+       if (!path->nodes[1]) {
+               ret = insert_new_root(trans, root, path, 1);
+               if (ret)
+                       return ret;
+       }
+again:
+       double_split = 0;
+       l = path->nodes[0];
+       slot = path->slots[0];
+       nritems = btrfs_header_nritems(l);
+       mid = (nritems + 1) / 2;
+
+       right = btrfs_alloc_free_block(trans, root, root->leafsize,
+                                       path->nodes[1]->start,
+                                       root->root_key.objectid,
+                                       trans->transid, 0, l->start, 0);
+       if (IS_ERR(right)) {
+               BUG_ON(1);
+               return PTR_ERR(right);
+       }
+
+       memset_extent_buffer(right, 0, 0, sizeof(struct btrfs_header));
+       btrfs_set_header_bytenr(right, right->start);
+       btrfs_set_header_generation(right, trans->transid);
+       btrfs_set_header_owner(right, root->root_key.objectid);
+       btrfs_set_header_level(right, 0);
+       write_extent_buffer(right, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(right),
+                           BTRFS_FSID_SIZE);
+
+       write_extent_buffer(right, root->fs_info->chunk_tree_uuid,
+                           (unsigned long)btrfs_header_chunk_tree_uuid(right),
+                           BTRFS_UUID_SIZE);
+       if (mid <= slot) {
+               if (nritems == 1 ||
+                   leaf_space_used(l, mid, nritems - mid) + data_size >
+                       BTRFS_LEAF_DATA_SIZE(root)) {
+                       if (slot >= nritems) {
+                               btrfs_cpu_key_to_disk(&disk_key, ins_key);
+                               btrfs_set_header_nritems(right, 0);
+                               wret = insert_ptr(trans, root, path,
+                                                 &disk_key, right->start,
+                                                 path->slots[1] + 1, 1);
+                               if (wret)
+                                       ret = wret;
+
+                               btrfs_tree_unlock(path->nodes[0]);
+                               free_extent_buffer(path->nodes[0]);
+                               path->nodes[0] = right;
+                               path->slots[0] = 0;
+                               path->slots[1] += 1;
+                               btrfs_mark_buffer_dirty(right);
+                               return ret;
+                       }
+                       mid = slot;
+                       if (mid != nritems &&
+                           leaf_space_used(l, mid, nritems - mid) +
+                           data_size > BTRFS_LEAF_DATA_SIZE(root)) {
+                               double_split = 1;
+                       }
+               }
+       } else {
+               if (leaf_space_used(l, 0, mid) + data_size >
+                       BTRFS_LEAF_DATA_SIZE(root)) {
+                       if (!extend && data_size && slot == 0) {
+                               btrfs_cpu_key_to_disk(&disk_key, ins_key);
+                               btrfs_set_header_nritems(right, 0);
+                               wret = insert_ptr(trans, root, path,
+                                                 &disk_key,
+                                                 right->start,
+                                                 path->slots[1], 1);
+                               if (wret)
+                                       ret = wret;
+                               btrfs_tree_unlock(path->nodes[0]);
+                               free_extent_buffer(path->nodes[0]);
+                               path->nodes[0] = right;
+                               path->slots[0] = 0;
+                               if (path->slots[1] == 0) {
+                                       wret = fixup_low_keys(trans, root,
+                                                     path, &disk_key, 1);
+                                       if (wret)
+                                               ret = wret;
+                               }
+                               btrfs_mark_buffer_dirty(right);
+                               return ret;
+                       } else if ((extend || !data_size) && slot == 0) {
+                               mid = 1;
+                       } else {
+                               mid = slot;
+                               if (mid != nritems &&
+                                   leaf_space_used(l, mid, nritems - mid) +
+                                   data_size > BTRFS_LEAF_DATA_SIZE(root)) {
+                                       double_split = 1;
+                               }
+                       }
+               }
+       }
+       nritems = nritems - mid;
+       btrfs_set_header_nritems(right, nritems);
+       data_copy_size = btrfs_item_end_nr(l, mid) - leaf_data_end(root, l);
+
+       copy_extent_buffer(right, l, btrfs_item_nr_offset(0),
+                          btrfs_item_nr_offset(mid),
+                          nritems * sizeof(struct btrfs_item));
+
+       copy_extent_buffer(right, l,
+                    btrfs_leaf_data(right) + BTRFS_LEAF_DATA_SIZE(root) -
+                    data_copy_size, btrfs_leaf_data(l) +
+                    leaf_data_end(root, l), data_copy_size);
+
+       rt_data_off = BTRFS_LEAF_DATA_SIZE(root) -
+                     btrfs_item_end_nr(l, mid);
+
+       for (i = 0; i < nritems; i++) {
+               struct btrfs_item *item = btrfs_item_nr(right, i);
+               u32 ioff;
+
+               if (!right->map_token) {
+                       map_extent_buffer(right, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &right->map_token, &right->kaddr,
+                                       &right->map_start, &right->map_len,
+                                       KM_USER1);
+               }
+
+               ioff = btrfs_item_offset(right, item);
+               btrfs_set_item_offset(right, item, ioff + rt_data_off);
+       }
+
+       if (right->map_token) {
+               unmap_extent_buffer(right, right->map_token, KM_USER1);
+               right->map_token = NULL;
+       }
+
+       btrfs_set_header_nritems(l, mid);
+       ret = 0;
+       btrfs_item_key(right, &disk_key, 0);
+       wret = insert_ptr(trans, root, path, &disk_key, right->start,
+                         path->slots[1] + 1, 1);
+       if (wret)
+               ret = wret;
+
+       btrfs_mark_buffer_dirty(right);
+       btrfs_mark_buffer_dirty(l);
+       BUG_ON(path->slots[0] != slot);
+
+       ret = btrfs_update_ref(trans, root, l, right, 0, nritems);
+       BUG_ON(ret);
+
+       if (mid <= slot) {
+               btrfs_tree_unlock(path->nodes[0]);
+               free_extent_buffer(path->nodes[0]);
+               path->nodes[0] = right;
+               path->slots[0] -= mid;
+               path->slots[1] += 1;
+       } else {
+               btrfs_tree_unlock(right);
+               free_extent_buffer(right);
+       }
+
+       BUG_ON(path->slots[0] < 0);
+
+       if (double_split) {
+               BUG_ON(num_doubles != 0);
+               num_doubles++;
+               goto again;
+       }
+       return ret;
+}
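+
+/*
+ * Worked example (hypothetical counts): nritems == 11 gives
+ * mid = (11 + 1) / 2 == 6, so slots 6..10 move to the new right
+ * leaf; if the half receiving the insertion still cannot hold
+ * data_size, mid is moved to the insertion slot itself and
+ * double_split forces exactly one more pass through 'again'.
+ */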
+
+/*
+ * This function splits a single item into two items,
+ * giving 'new_key' to the new item and splitting the
+ * old one at split_offset (from the start of the item).
+ *
+ * The path may be released by this operation.  After
+ * the split, the path is pointing to the old item.  The
+ * new item is going to be in the same node as the old one.
+ *
+ * Note, the item being split must be small enough to live alone on
+ * a tree block with room for one extra struct btrfs_item
+ *
+ * This allows us to split the item in place, keeping a lock on the
+ * leaf the entire time.
+ */
+int btrfs_split_item(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root,
+                    struct btrfs_path *path,
+                    struct btrfs_key *new_key,
+                    unsigned long split_offset)
+{
+       u32 item_size;
+       struct extent_buffer *leaf;
+       struct btrfs_key orig_key;
+       struct btrfs_item *item;
+       struct btrfs_item *new_item;
+       int ret = 0;
+       int slot;
+       u32 nritems;
+       u32 orig_offset;
+       struct btrfs_disk_key disk_key;
+       char *buf;
+
+       leaf = path->nodes[0];
+       btrfs_item_key_to_cpu(leaf, &orig_key, path->slots[0]);
+       if (btrfs_leaf_free_space(root, leaf) >= sizeof(struct btrfs_item))
+               goto split;
+
+       item_size = btrfs_item_size_nr(leaf, path->slots[0]);
+       btrfs_release_path(root, path);
+
+       path->search_for_split = 1;
+       path->keep_locks = 1;
+
+       ret = btrfs_search_slot(trans, root, &orig_key, path, 0, 1);
+       path->search_for_split = 0;
+
+       /* if our item isn't there or got smaller, return now */
+       if (ret != 0 || item_size != btrfs_item_size_nr(path->nodes[0],
+                                                       path->slots[0])) {
+               path->keep_locks = 0;
+               return -EAGAIN;
+       }
+
+       ret = split_leaf(trans, root, &orig_key, path,
+                        sizeof(struct btrfs_item), 1);
+       path->keep_locks = 0;
+       BUG_ON(ret);
+
+       leaf = path->nodes[0];
+       BUG_ON(btrfs_leaf_free_space(root, leaf) < sizeof(struct btrfs_item));
+
+split:
+       item = btrfs_item_nr(leaf, path->slots[0]);
+       orig_offset = btrfs_item_offset(leaf, item);
+       item_size = btrfs_item_size(leaf, item);
+
+       buf = kmalloc(item_size, GFP_NOFS);
+       if (!buf)
+               return -ENOMEM;
+       read_extent_buffer(leaf, buf, btrfs_item_ptr_offset(leaf,
+                           path->slots[0]), item_size);
+       slot = path->slots[0] + 1;
+       leaf = path->nodes[0];
+
+       nritems = btrfs_header_nritems(leaf);
+
+       if (slot != nritems) {
+               /* shift the items */
+               memmove_extent_buffer(leaf, btrfs_item_nr_offset(slot + 1),
+                             btrfs_item_nr_offset(slot),
+                             (nritems - slot) * sizeof(struct btrfs_item));
+
+       }
+
+       btrfs_cpu_key_to_disk(&disk_key, new_key);
+       btrfs_set_item_key(leaf, &disk_key, slot);
+
+       new_item = btrfs_item_nr(leaf, slot);
+
+       btrfs_set_item_offset(leaf, new_item, orig_offset);
+       btrfs_set_item_size(leaf, new_item, item_size - split_offset);
+
+       btrfs_set_item_offset(leaf, item,
+                             orig_offset + item_size - split_offset);
+       btrfs_set_item_size(leaf, item, split_offset);
+
+       btrfs_set_header_nritems(leaf, nritems + 1);
+
+       /* write the data for the start of the original item */
+       write_extent_buffer(leaf, buf,
+                           btrfs_item_ptr_offset(leaf, path->slots[0]),
+                           split_offset);
+
+       /* write the data for the new item */
+       write_extent_buffer(leaf, buf + split_offset,
+                           btrfs_item_ptr_offset(leaf, slot),
+                           item_size - split_offset);
+       btrfs_mark_buffer_dirty(leaf);
+
+       ret = 0;
+       if (btrfs_leaf_free_space(root, leaf) < 0) {
+               btrfs_print_leaf(root, leaf);
+               BUG();
+       }
+       kfree(buf);
+       return ret;
+}
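+
+/*
+ * Worked example (hypothetical sizes): splitting a 100-byte item at
+ * orig_offset == 3000 with split_offset == 60 leaves the original
+ * item at offset 3000 + 100 - 60 == 3040 with size 60, and the new
+ * item at offset 3000 with size 40, so the two halves still cover
+ * exactly the old [3000, 3100) data range.
+ */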
+
+/*
+ * make the item pointed to by the path smaller.  new_size indicates
+ * how small to make it, and from_end tells us if we just chop bytes
+ * off the end of the item or if we shift the item to chop bytes off
+ * the front.
+ */
+int btrfs_truncate_item(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root,
+                       struct btrfs_path *path,
+                       u32 new_size, int from_end)
+{
+       int ret = 0;
+       int slot;
+       int slot_orig;
+       struct extent_buffer *leaf;
+       struct btrfs_item *item;
+       u32 nritems;
+       unsigned int data_end;
+       unsigned int old_data_start;
+       unsigned int old_size;
+       unsigned int size_diff;
+       int i;
+
+       slot_orig = path->slots[0];
+       leaf = path->nodes[0];
+       slot = path->slots[0];
+
+       old_size = btrfs_item_size_nr(leaf, slot);
+       if (old_size == new_size)
+               return 0;
+
+       nritems = btrfs_header_nritems(leaf);
+       data_end = leaf_data_end(root, leaf);
+
+       old_data_start = btrfs_item_offset_nr(leaf, slot);
+
+       size_diff = old_size - new_size;
+
+       BUG_ON(slot < 0);
+       BUG_ON(slot >= nritems);
+
+       /*
+        * item0..itemN ... dataN.offset..dataN.size .. data0.size
+        */
+       /* first correct the data pointers */
+       for (i = slot; i < nritems; i++) {
+               u32 ioff;
+               item = btrfs_item_nr(leaf, i);
+
+               if (!leaf->map_token) {
+                       map_extent_buffer(leaf, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &leaf->map_token, &leaf->kaddr,
+                                       &leaf->map_start, &leaf->map_len,
+                                       KM_USER1);
+               }
+
+               ioff = btrfs_item_offset(leaf, item);
+               btrfs_set_item_offset(leaf, item, ioff + size_diff);
+       }
+
+       if (leaf->map_token) {
+               unmap_extent_buffer(leaf, leaf->map_token, KM_USER1);
+               leaf->map_token = NULL;
+       }
+
+       /* shift the data */
+       if (from_end) {
+               memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
+                             data_end + size_diff, btrfs_leaf_data(leaf) +
+                             data_end, old_data_start + new_size - data_end);
+       } else {
+               struct btrfs_disk_key disk_key;
+               u64 offset;
+
+               btrfs_item_key(leaf, &disk_key, slot);
+
+               if (btrfs_disk_key_type(&disk_key) == BTRFS_EXTENT_DATA_KEY) {
+                       unsigned long ptr;
+                       struct btrfs_file_extent_item *fi;
+
+                       fi = btrfs_item_ptr(leaf, slot,
+                                           struct btrfs_file_extent_item);
+                       fi = (struct btrfs_file_extent_item *)(
+                            (unsigned long)fi - size_diff);
+
+                       if (btrfs_file_extent_type(leaf, fi) ==
+                           BTRFS_FILE_EXTENT_INLINE) {
+                               ptr = btrfs_item_ptr_offset(leaf, slot);
+                               memmove_extent_buffer(leaf, ptr,
+                                     (unsigned long)fi,
+                                     offsetof(struct btrfs_file_extent_item,
+                                                disk_bytenr));
+                       }
+               }
+
+               memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
+                             data_end + size_diff, btrfs_leaf_data(leaf) +
+                             data_end, old_data_start - data_end);
+
+               offset = btrfs_disk_key_offset(&disk_key);
+               btrfs_set_disk_key_offset(&disk_key, offset + size_diff);
+               btrfs_set_item_key(leaf, &disk_key, slot);
+               if (slot == 0)
+                       fixup_low_keys(trans, root, path, &disk_key, 1);
+       }
+
+       item = btrfs_item_nr(leaf, slot);
+       btrfs_set_item_size(leaf, item, new_size);
+       btrfs_mark_buffer_dirty(leaf);
+
+       ret = 0;
+       if (btrfs_leaf_free_space(root, leaf) < 0) {
+               btrfs_print_leaf(root, leaf);
+               BUG();
+       }
+       return ret;
+}
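+
+/*
+ * Worked example (hypothetical sizes): shrinking a 100-byte item to
+ * 64 bytes makes size_diff == 36; the data offset of the item and of
+ * everything after it is raised by 36 and the memmove slides the
+ * packed data up by the same amount, reclaiming 36 bytes between the
+ * item headers and the data area.
+ */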
+
+/*
+ * make the item pointed to by the path bigger, data_size is the new size.
+ */
+int btrfs_extend_item(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root, struct btrfs_path *path,
+                     u32 data_size)
+{
+       int ret = 0;
+       int slot;
+       int slot_orig;
+       struct extent_buffer *leaf;
+       struct btrfs_item *item;
+       u32 nritems;
+       unsigned int data_end;
+       unsigned int old_data;
+       unsigned int old_size;
+       int i;
+
+       slot_orig = path->slots[0];
+       leaf = path->nodes[0];
+
+       nritems = btrfs_header_nritems(leaf);
+       data_end = leaf_data_end(root, leaf);
+
+       if (btrfs_leaf_free_space(root, leaf) < data_size) {
+               btrfs_print_leaf(root, leaf);
+               BUG();
+       }
+       slot = path->slots[0];
+       old_data = btrfs_item_end_nr(leaf, slot);
+
+       BUG_ON(slot < 0);
+       if (slot >= nritems) {
+               btrfs_print_leaf(root, leaf);
+               printk(KERN_CRIT "slot %d too large, nritems %d\n",
+                      slot, nritems);
+               BUG_ON(1);
+       }
+
+       /*
+        * item0..itemN ... dataN.offset..dataN.size .. data0.size
+        */
+       /* first correct the data pointers */
+       for (i = slot; i < nritems; i++) {
+               u32 ioff;
+               item = btrfs_item_nr(leaf, i);
+
+               if (!leaf->map_token) {
+                       map_extent_buffer(leaf, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &leaf->map_token, &leaf->kaddr,
+                                       &leaf->map_start, &leaf->map_len,
+                                       KM_USER1);
+               }
+               ioff = btrfs_item_offset(leaf, item);
+               btrfs_set_item_offset(leaf, item, ioff - data_size);
+       }
+
+       if (leaf->map_token) {
+               unmap_extent_buffer(leaf, leaf->map_token, KM_USER1);
+               leaf->map_token = NULL;
+       }
+
+       /* shift the data */
+       memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
+                     data_end - data_size, btrfs_leaf_data(leaf) +
+                     data_end, old_data - data_end);
+
+       data_end = old_data;
+       old_size = btrfs_item_size_nr(leaf, slot);
+       item = btrfs_item_nr(leaf, slot);
+       btrfs_set_item_size(leaf, item, old_size + data_size);
+       btrfs_mark_buffer_dirty(leaf);
+
+       ret = 0;
+       if (btrfs_leaf_free_space(root, leaf) < 0) {
+               btrfs_print_leaf(root, leaf);
+               BUG();
+       }
+       return ret;
+}
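+
+/*
+ * Worked example (hypothetical sizes): growing an item by
+ * data_size == 16 lowers the data offset of the item and of
+ * everything after it by 16, memmoves their packed data down by 16
+ * bytes and adds 16 to the item's size, so the new room appears at
+ * the end of the item's existing data for the caller to fill.
+ */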
+
+/*
+ * Given a key and some data, insert items into the tree.
+ * This does all the path init required, making room in the tree if needed.
+ * Returns the number of keys that were inserted.
+ */
+int btrfs_insert_some_items(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           struct btrfs_path *path,
+                           struct btrfs_key *cpu_key, u32 *data_size,
+                           int nr)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_item *item;
+       int ret = 0;
+       int slot;
+       int i;
+       u32 nritems;
+       u32 total_data = 0;
+       u32 total_size = 0;
+       unsigned int data_end;
+       struct btrfs_disk_key disk_key;
+       struct btrfs_key found_key;
+
+       for (i = 0; i < nr; i++) {
+               if (total_size + data_size[i] + sizeof(struct btrfs_item) >
+                   BTRFS_LEAF_DATA_SIZE(root)) {
+                       nr = i;
+                       break;
+               }
+               total_data += data_size[i];
+               total_size += data_size[i] + sizeof(struct btrfs_item);
+       }
+       BUG_ON(nr == 0);
+
+       ret = btrfs_search_slot(trans, root, cpu_key, path, total_size, 1);
+       if (ret == 0)
+               return -EEXIST;
+       if (ret < 0)
+               goto out;
+
+       leaf = path->nodes[0];
+
+       nritems = btrfs_header_nritems(leaf);
+       data_end = leaf_data_end(root, leaf);
+
+       if (btrfs_leaf_free_space(root, leaf) < total_size) {
+               for (i = nr - 1; i >= 0; i--) {
+                       total_data -= data_size[i];
+                       total_size -= data_size[i] + sizeof(struct btrfs_item);
+                       if (total_size < btrfs_leaf_free_space(root, leaf))
+                               break;
+               }
+               nr = i;
+       }
+
+       slot = path->slots[0];
+       BUG_ON(slot < 0);
+
+       if (slot != nritems) {
+               unsigned int old_data = btrfs_item_end_nr(leaf, slot);
+
+               item = btrfs_item_nr(leaf, slot);
+               btrfs_item_key_to_cpu(leaf, &found_key, slot);
+
+               /* figure out how many keys we can insert in here */
+               total_data = data_size[0];
+               for (i = 1; i < nr; i++) {
+                       if (comp_cpu_keys(&found_key, cpu_key + i) <= 0)
+                               break;
+                       total_data += data_size[i];
+               }
+               nr = i;
+
+               if (old_data < data_end) {
+                       btrfs_print_leaf(root, leaf);
+                       printk(KERN_CRIT "slot %d old_data %d data_end %d\n",
+                              slot, old_data, data_end);
+                       BUG_ON(1);
+               }
+               /*
+                * item0..itemN ... dataN.offset..dataN.size .. data0.size
+                */
+               /* first correct the data pointers */
+               WARN_ON(leaf->map_token);
+               for (i = slot; i < nritems; i++) {
+                       u32 ioff;
+
+                       item = btrfs_item_nr(leaf, i);
+                       if (!leaf->map_token) {
+                               map_extent_buffer(leaf, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &leaf->map_token, &leaf->kaddr,
+                                       &leaf->map_start, &leaf->map_len,
+                                       KM_USER1);
+                       }
+
+                       ioff = btrfs_item_offset(leaf, item);
+                       btrfs_set_item_offset(leaf, item, ioff - total_data);
+               }
+               if (leaf->map_token) {
+                       unmap_extent_buffer(leaf, leaf->map_token, KM_USER1);
+                       leaf->map_token = NULL;
+               }
+
+               /* shift the items */
+               memmove_extent_buffer(leaf, btrfs_item_nr_offset(slot + nr),
+                             btrfs_item_nr_offset(slot),
+                             (nritems - slot) * sizeof(struct btrfs_item));
+
+               /* shift the data */
+               memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
+                             data_end - total_data, btrfs_leaf_data(leaf) +
+                             data_end, old_data - data_end);
+               data_end = old_data;
+       } else {
+               /*
+                * this sucks but it has to be done: if we are inserting at
+                * the end of the leaf, only insert 1 of the items, since we
+                * have no way of knowing what's on the next leaf and we'd have
+                * to drop our current locks to figure it out
+                */
+               nr = 1;
+       }
+
+       /* setup the item for the new data */
+       for (i = 0; i < nr; i++) {
+               btrfs_cpu_key_to_disk(&disk_key, cpu_key + i);
+               btrfs_set_item_key(leaf, &disk_key, slot + i);
+               item = btrfs_item_nr(leaf, slot + i);
+               btrfs_set_item_offset(leaf, item, data_end - data_size[i]);
+               data_end -= data_size[i];
+               btrfs_set_item_size(leaf, item, data_size[i]);
+       }
+       btrfs_set_header_nritems(leaf, nritems + nr);
+       btrfs_mark_buffer_dirty(leaf);
+
+       ret = 0;
+       if (slot == 0) {
+               btrfs_cpu_key_to_disk(&disk_key, cpu_key);
+               ret = fixup_low_keys(trans, root, path, &disk_key, 1);
+       }
+
+       if (btrfs_leaf_free_space(root, leaf) < 0) {
+               btrfs_print_leaf(root, leaf);
+               BUG();
+       }
+out:
+       if (!ret)
+               ret = nr;
+       return ret;
+}
+
+/*
+ * Given a set of keys and item sizes, set up empty items in the tree;
+ * the caller writes the actual item data afterwards.
+ * This does all the path init required, making room in the tree if needed.
+ */
+int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           struct btrfs_path *path,
+                           struct btrfs_key *cpu_key, u32 *data_size,
+                           int nr)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_item *item;
+       int ret = 0;
+       int slot;
+       int slot_orig;
+       int i;
+       u32 nritems;
+       u32 total_size = 0;
+       u32 total_data = 0;
+       unsigned int data_end;
+       struct btrfs_disk_key disk_key;
+
+       for (i = 0; i < nr; i++)
+               total_data += data_size[i];
+
+       total_size = total_data + (nr * sizeof(struct btrfs_item));
+       ret = btrfs_search_slot(trans, root, cpu_key, path, total_size, 1);
+       if (ret == 0)
+               return -EEXIST;
+       if (ret < 0)
+               goto out;
+
+       slot_orig = path->slots[0];
+       leaf = path->nodes[0];
+
+       nritems = btrfs_header_nritems(leaf);
+       data_end = leaf_data_end(root, leaf);
+
+       if (btrfs_leaf_free_space(root, leaf) < total_size) {
+               btrfs_print_leaf(root, leaf);
+               printk(KERN_CRIT "not enough freespace need %u have %d\n",
+                      total_size, btrfs_leaf_free_space(root, leaf));
+               BUG();
+       }
+
+       slot = path->slots[0];
+       BUG_ON(slot < 0);
+
+       if (slot != nritems) {
+               unsigned int old_data = btrfs_item_end_nr(leaf, slot);
+
+               if (old_data < data_end) {
+                       btrfs_print_leaf(root, leaf);
+                       printk(KERN_CRIT "slot %d old_data %d data_end %d\n",
+                              slot, old_data, data_end);
+                       BUG_ON(1);
+               }
+               /*
+                * item0..itemN ... dataN.offset..dataN.size .. data0.size
+                */
+               /* first correct the data pointers */
+               WARN_ON(leaf->map_token);
+               for (i = slot; i < nritems; i++) {
+                       u32 ioff;
+
+                       item = btrfs_item_nr(leaf, i);
+                       if (!leaf->map_token) {
+                               map_extent_buffer(leaf, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &leaf->map_token, &leaf->kaddr,
+                                       &leaf->map_start, &leaf->map_len,
+                                       KM_USER1);
+                       }
+
+                       ioff = btrfs_item_offset(leaf, item);
+                       btrfs_set_item_offset(leaf, item, ioff - total_data);
+               }
+               if (leaf->map_token) {
+                       unmap_extent_buffer(leaf, leaf->map_token, KM_USER1);
+                       leaf->map_token = NULL;
+               }
+
+               /* shift the items */
+               memmove_extent_buffer(leaf, btrfs_item_nr_offset(slot + nr),
+                             btrfs_item_nr_offset(slot),
+                             (nritems - slot) * sizeof(struct btrfs_item));
+
+               /* shift the data */
+               memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
+                             data_end - total_data, btrfs_leaf_data(leaf) +
+                             data_end, old_data - data_end);
+               data_end = old_data;
+       }
+
+       /* setup the item for the new data */
+       for (i = 0; i < nr; i++) {
+               btrfs_cpu_key_to_disk(&disk_key, cpu_key + i);
+               btrfs_set_item_key(leaf, &disk_key, slot + i);
+               item = btrfs_item_nr(leaf, slot + i);
+               btrfs_set_item_offset(leaf, item, data_end - data_size[i]);
+               data_end -= data_size[i];
+               btrfs_set_item_size(leaf, item, data_size[i]);
+       }
+       btrfs_set_header_nritems(leaf, nritems + nr);
+       btrfs_mark_buffer_dirty(leaf);
+
+       ret = 0;
+       if (slot == 0) {
+               btrfs_cpu_key_to_disk(&disk_key, cpu_key);
+               ret = fixup_low_keys(trans, root, path, &disk_key, 1);
+       }
+
+       if (btrfs_leaf_free_space(root, leaf) < 0) {
+               btrfs_print_leaf(root, leaf);
+               BUG();
+       }
+out:
+       return ret;
+}
+
+/*
+ * Given a key and some data, insert an item into the tree.
+ * This does all the path init required, making room in the tree if needed.
+ */
+int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *cpu_key, void *data, u32
+                     data_size)
+{
+       int ret = 0;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       unsigned long ptr;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
+       if (!ret) {
+               leaf = path->nodes[0];
+               ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+               write_extent_buffer(leaf, data, ptr, data_size);
+               btrfs_mark_buffer_dirty(leaf);
+       }
+       btrfs_free_path(path);
+       return ret;
+}
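A hedged usage sketch (editorial, not part of this patch): the key values and payload below are made up; only the btrfs_insert_item() calling convention is taken from the code above.

/* Editorial sketch, not patch content: key/payload values are made up. */
static int example_insert(struct btrfs_trans_handle *trans,
			  struct btrfs_root *root)
{
	struct btrfs_key key;
	char data[] = "hello";			/* hypothetical payload */

	key.objectid = BTRFS_FIRST_FREE_OBJECTID;	/* first inode range */
	key.type = BTRFS_STRING_ITEM_KEY;	/* debugging-only item type */
	key.offset = 0;

	/* returns 0 on success, -EEXIST if the key already exists */
	return btrfs_insert_item(trans, root, &key, data, sizeof(data));
}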
+
+/*
+ * delete the pointer from a given node.
+ *
+ * the tree should have been previously balanced so the deletion does not
+ * empty a node.
+ */
+static int del_ptr(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                  struct btrfs_path *path, int level, int slot)
+{
+       struct extent_buffer *parent = path->nodes[level];
+       u32 nritems;
+       int ret = 0;
+       int wret;
+
+       nritems = btrfs_header_nritems(parent);
+       if (slot != nritems - 1) {
+               memmove_extent_buffer(parent,
+                             btrfs_node_key_ptr_offset(slot),
+                             btrfs_node_key_ptr_offset(slot + 1),
+                             sizeof(struct btrfs_key_ptr) *
+                             (nritems - slot - 1));
+       }
+       nritems--;
+       btrfs_set_header_nritems(parent, nritems);
+       if (nritems == 0 && parent == root->node) {
+               BUG_ON(btrfs_header_level(root->node) != 1);
+               /* just turn the root into a leaf and break */
+               btrfs_set_header_level(root->node, 0);
+       } else if (slot == 0) {
+               struct btrfs_disk_key disk_key;
+
+               btrfs_node_key(parent, &disk_key, 0);
+               wret = fixup_low_keys(trans, root, path, &disk_key, level + 1);
+               if (wret)
+                       ret = wret;
+       }
+       btrfs_mark_buffer_dirty(parent);
+       return ret;
+}
+
+/*
+ * a helper function to delete the leaf pointed to by path->slots[1] and
+ * path->nodes[1].  bytenr is the node block pointer, but since the callers
+ * already know it, it is faster to have them pass it down than to
+ * read it out of the node again.
+ *
+ * This deletes the pointer in path->nodes[1] and frees the leaf
+ * block extent.  zero is returned if it all worked out, < 0 otherwise.
+ *
+ * The path must have already been setup for deleting the leaf, including
+ * all the proper balancing.  path->nodes[1] must be locked.
+ */
+noinline int btrfs_del_leaf(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           struct btrfs_path *path, u64 bytenr)
+{
+       int ret;
+       u64 root_gen = btrfs_header_generation(path->nodes[1]);
+
+       ret = del_ptr(trans, root, path, 1, path->slots[1]);
+       if (ret)
+               return ret;
+
+       ret = btrfs_free_extent(trans, root, bytenr,
+                               btrfs_level_size(root, 0),
+                               path->nodes[1]->start,
+                               btrfs_header_owner(path->nodes[1]),
+                               root_gen, 0, 1);
+       return ret;
+}
+
+/*
+ * delete the item at the leaf level in path.  If that empties
+ * the leaf, remove it from the tree
+ */
+int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                   struct btrfs_path *path, int slot, int nr)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_item *item;
+       int last_off;
+       int dsize = 0;
+       int ret = 0;
+       int wret;
+       int i;
+       u32 nritems;
+
+       leaf = path->nodes[0];
+       last_off = btrfs_item_offset_nr(leaf, slot + nr - 1);
+
+       for (i = 0; i < nr; i++)
+               dsize += btrfs_item_size_nr(leaf, slot + i);
+
+       nritems = btrfs_header_nritems(leaf);
+
+       if (slot + nr != nritems) {
+               int data_end = leaf_data_end(root, leaf);
+
+               memmove_extent_buffer(leaf, btrfs_leaf_data(leaf) +
+                             data_end + dsize,
+                             btrfs_leaf_data(leaf) + data_end,
+                             last_off - data_end);
+
+               for (i = slot + nr; i < nritems; i++) {
+                       u32 ioff;
+
+                       item = btrfs_item_nr(leaf, i);
+                       if (!leaf->map_token) {
+                               map_extent_buffer(leaf, (unsigned long)item,
+                                       sizeof(struct btrfs_item),
+                                       &leaf->map_token, &leaf->kaddr,
+                                       &leaf->map_start, &leaf->map_len,
+                                       KM_USER1);
+                       }
+                       ioff = btrfs_item_offset(leaf, item);
+                       btrfs_set_item_offset(leaf, item, ioff + dsize);
+               }
+
+               if (leaf->map_token) {
+                       unmap_extent_buffer(leaf, leaf->map_token, KM_USER1);
+                       leaf->map_token = NULL;
+               }
+
+               memmove_extent_buffer(leaf, btrfs_item_nr_offset(slot),
+                             btrfs_item_nr_offset(slot + nr),
+                             sizeof(struct btrfs_item) *
+                             (nritems - slot - nr));
+       }
+       btrfs_set_header_nritems(leaf, nritems - nr);
+       nritems -= nr;
+
+       /* delete the leaf if we've emptied it */
+       if (nritems == 0) {
+               if (leaf == root->node) {
+                       btrfs_set_header_level(leaf, 0);
+               } else {
+                       ret = btrfs_del_leaf(trans, root, path, leaf->start);
+                       BUG_ON(ret);
+               }
+       } else {
+               int used = leaf_space_used(leaf, 0, nritems);
+               if (slot == 0) {
+                       struct btrfs_disk_key disk_key;
+
+                       btrfs_item_key(leaf, &disk_key, 0);
+                       wret = fixup_low_keys(trans, root, path,
+                                             &disk_key, 1);
+                       if (wret)
+                               ret = wret;
+               }
+
+               /* delete the leaf if it is mostly empty */
+               if (used < BTRFS_LEAF_DATA_SIZE(root) / 4) {
+                       /* push_leaf_left fixes the path.
+                        * make sure the path still points to our leaf
+                        * for possible call to del_ptr below
+                        */
+                       slot = path->slots[1];
+                       extent_buffer_get(leaf);
+
+                       wret = push_leaf_left(trans, root, path, 1, 1);
+                       if (wret < 0 && wret != -ENOSPC)
+                               ret = wret;
+
+                       if (path->nodes[0] == leaf &&
+                           btrfs_header_nritems(leaf)) {
+                               wret = push_leaf_right(trans, root, path, 1, 1);
+                               if (wret < 0 && wret != -ENOSPC)
+                                       ret = wret;
+                       }
+
+                       if (btrfs_header_nritems(leaf) == 0) {
+                               path->slots[1] = slot;
+                               ret = btrfs_del_leaf(trans, root, path,
+                                                    leaf->start);
+                               BUG_ON(ret);
+                               free_extent_buffer(leaf);
+                       } else {
+                               /* if we're still in the path, make sure
+                                * we're dirty.  Otherwise, one of the
+                                * push_leaf functions must have already
+                                * dirtied this buffer
+                                */
+                               if (path->nodes[0] == leaf)
+                                       btrfs_mark_buffer_dirty(leaf);
+                               free_extent_buffer(leaf);
+                       }
+               } else {
+                       btrfs_mark_buffer_dirty(leaf);
+               }
+       }
+       return ret;
+}
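For orientation, a hedged sketch of the usual call pattern around btrfs_del_items(): position the path with a COW search (by convention in this code, a negative ins_len asks btrfs_search_slot() to pre-balance for a delete), then remove the item at the returned slot.

/* Editorial sketch, not patch content. */
static int example_delete_one(struct btrfs_trans_handle *trans,
			      struct btrfs_root *root,
			      struct btrfs_key *key)
{
	struct btrfs_path *path;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	/* ins_len == -1: the leaf will shrink; cow == 1: we will modify it */
	ret = btrfs_search_slot(trans, root, key, path, -1, 1);
	if (ret == 0)
		ret = btrfs_del_items(trans, root, path, path->slots[0], 1);
	btrfs_free_path(path);
	return ret;	/* 1 means the exact key was not found */
}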
+
+/*
+ * search the tree again to find a leaf with lesser keys.
+ * returns 0 if it found something or 1 if there are no lesser leaves.
+ * returns < 0 on io errors.
+ *
+ * This may release the path, and so you may lose any locks held at the
+ * time you call it.
+ */
+int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path)
+{
+       struct btrfs_key key;
+       struct btrfs_disk_key found_key;
+       int ret;
+
+       btrfs_item_key_to_cpu(path->nodes[0], &key, 0);
+
+       if (key.offset > 0)
+               key.offset--;
+       else if (key.type > 0)
+               key.type--;
+       else if (key.objectid > 0)
+               key.objectid--;
+       else
+               return 1;
+
+       btrfs_release_path(root, path);
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               return ret;
+       btrfs_item_key(path->nodes[0], &found_key, 0);
+       ret = comp_keys(&found_key, &key);
+       if (ret < 0)
+               return 0;
+       return 1;
+}
+
+/*
+ * A helper function to walk down the tree starting at min_key, looking
+ * for nodes or leaves that are either in cache or have a generation of
+ * at least min_trans.  This is used by the btree defrag code and tree
+ * logging.
+ *
+ * This does not cow, but it does stuff the starting key it finds back
+ * into min_key, so you can call btrfs_search_slot with cow=1 on the
+ * key and get a writable path.
+ *
+ * This does lock as it descends, and path->keep_locks should be set
+ * to 1 by the caller.
+ *
+ * This honors path->lowest_level to prevent descent past a given level
+ * of the tree.
+ *
+ * min_trans indicates the oldest transaction that you are interested
+ * in walking through.  Any nodes or leaves older than min_trans are
+ * skipped over (without reading them).
+ *
+ * returns zero if something useful was found, < 0 on error and 1 if there
+ * was nothing in the tree that matched the search criteria.
+ */
+int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
+                        struct btrfs_key *max_key,
+                        struct btrfs_path *path, int cache_only,
+                        u64 min_trans)
+{
+       struct extent_buffer *cur;
+       struct btrfs_key found_key;
+       int slot;
+       int sret;
+       u32 nritems;
+       int level;
+       int ret = 1;
+
+       WARN_ON(!path->keep_locks);
+again:
+       cur = btrfs_lock_root_node(root);
+       level = btrfs_header_level(cur);
+       WARN_ON(path->nodes[level]);
+       path->nodes[level] = cur;
+       path->locks[level] = 1;
+
+       if (btrfs_header_generation(cur) < min_trans) {
+               ret = 1;
+               goto out;
+       }
+       while (1) {
+               nritems = btrfs_header_nritems(cur);
+               level = btrfs_header_level(cur);
+               sret = bin_search(cur, min_key, level, &slot);
+
+               /* at the lowest level, we're done, setup the path and exit */
+               if (level == path->lowest_level) {
+                       if (slot >= nritems)
+                               goto find_next_key;
+                       ret = 0;
+                       path->slots[level] = slot;
+                       btrfs_item_key_to_cpu(cur, &found_key, slot);
+                       goto out;
+               }
+               if (sret && slot > 0)
+                       slot--;
+               /*
+                * check this node pointer against the cache_only and
+                * min_trans parameters.  If it isn't in cache or is too
+                * old, skip to the next one.
+                */
+               while (slot < nritems) {
+                       u64 blockptr;
+                       u64 gen;
+                       struct extent_buffer *tmp;
+                       struct btrfs_disk_key disk_key;
+
+                       blockptr = btrfs_node_blockptr(cur, slot);
+                       gen = btrfs_node_ptr_generation(cur, slot);
+                       if (gen < min_trans) {
+                               slot++;
+                               continue;
+                       }
+                       if (!cache_only)
+                               break;
+
+                       if (max_key) {
+                               btrfs_node_key(cur, &disk_key, slot);
+                               if (comp_keys(&disk_key, max_key) >= 0) {
+                                       ret = 1;
+                                       goto out;
+                               }
+                       }
+
+                       tmp = btrfs_find_tree_block(root, blockptr,
+                                           btrfs_level_size(root, level - 1));
+
+                       if (tmp && btrfs_buffer_uptodate(tmp, gen)) {
+                               free_extent_buffer(tmp);
+                               break;
+                       }
+                       if (tmp)
+                               free_extent_buffer(tmp);
+                       slot++;
+               }
+find_next_key:
+               /*
+                * we didn't find a candidate key in this node, walk forward
+                * and find another one
+                */
+               if (slot >= nritems) {
+                       path->slots[level] = slot;
+                       sret = btrfs_find_next_key(root, path, min_key, level,
+                                                 cache_only, min_trans);
+                       if (sret == 0) {
+                               btrfs_release_path(root, path);
+                               goto again;
+                       } else {
+                               goto out;
+                       }
+               }
+               /* save our key so it can be returned to the caller */
+               btrfs_node_key_to_cpu(cur, &found_key, slot);
+               path->slots[level] = slot;
+               if (level == path->lowest_level) {
+                       ret = 0;
+                       unlock_up(path, level, 1);
+                       goto out;
+               }
+               cur = read_node_slot(root, cur, slot);
+
+               btrfs_tree_lock(cur);
+               path->locks[level - 1] = 1;
+               path->nodes[level - 1] = cur;
+               unlock_up(path, level, 1);
+       }
+out:
+       if (ret == 0)
+               memcpy(min_key, &found_key, sizeof(found_key));
+       return ret;
+}
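The comment above gives the contract; here is a hedged caller-side sketch in the spirit of the defrag code. The key-advance step is deliberately naive and labeled as such.

/* Editorial sketch, not patch content. */
static int example_walk_newer_than(struct btrfs_root *root, u64 min_trans)
{
	struct btrfs_path *path;
	struct btrfs_key min_key;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	path->keep_locks = 1;	/* required by btrfs_search_forward */

	min_key.objectid = 0;
	min_key.type = 0;
	min_key.offset = 0;
	while (1) {
		ret = btrfs_search_forward(root, &min_key, NULL, path,
					   0, min_trans);
		if (ret)	/* 1: nothing left, < 0: io error */
			break;
		/* ... use path->nodes[0] / path->slots[0] here ... */

		/* naive advance; real callers step the key more carefully */
		min_key.offset++;
		btrfs_release_path(root, path);
	}
	btrfs_free_path(path);
	return ret < 0 ? ret : 0;
}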
+
+/*
+ * this is similar to btrfs_next_leaf, but does not try to preserve
+ * and fixup the path.  It looks for and returns the next key in the
+ * tree based on the current path and the cache_only and min_trans
+ * parameters.
+ *
+ * 0 is returned if another key is found, < 0 if there are any errors
+ * and 1 is returned if there are no higher keys in the tree
+ *
+ * path->keep_locks should be set to 1 on the search made before
+ * calling this function.
+ */
+int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path,
+                       struct btrfs_key *key, int lowest_level,
+                       int cache_only, u64 min_trans)
+{
+       int level = lowest_level;
+       int slot;
+       struct extent_buffer *c;
+
+       WARN_ON(!path->keep_locks);
+       while (level < BTRFS_MAX_LEVEL) {
+               if (!path->nodes[level])
+                       return 1;
+
+               slot = path->slots[level] + 1;
+               c = path->nodes[level];
+next:
+               if (slot >= btrfs_header_nritems(c)) {
+                       level++;
+                       if (level == BTRFS_MAX_LEVEL)
+                               return 1;
+                       continue;
+               }
+               if (level == 0)
+                       btrfs_item_key_to_cpu(c, key, slot);
+               else {
+                       u64 blockptr = btrfs_node_blockptr(c, slot);
+                       u64 gen = btrfs_node_ptr_generation(c, slot);
+
+                       if (cache_only) {
+                               struct extent_buffer *cur;
+                               cur = btrfs_find_tree_block(root, blockptr,
+                                           btrfs_level_size(root, level - 1));
+                               if (!cur || !btrfs_buffer_uptodate(cur, gen)) {
+                                       slot++;
+                                       if (cur)
+                                               free_extent_buffer(cur);
+                                       goto next;
+                               }
+                               free_extent_buffer(cur);
+                       }
+                       if (gen < min_trans) {
+                               slot++;
+                               goto next;
+                       }
+                       btrfs_node_key_to_cpu(c, key, slot);
+               }
+               return 0;
+       }
+       return 1;
+}
+
+/*
+ * search the tree again to find a leaf with greater keys.
+ * returns 0 if it found something or 1 if there are no greater leaves.
+ * returns < 0 on io errors.
+ */
+int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path)
+{
+       int slot;
+       int level = 1;
+       struct extent_buffer *c;
+       struct extent_buffer *next = NULL;
+       struct btrfs_key key;
+       u32 nritems;
+       int ret;
+
+       nritems = btrfs_header_nritems(path->nodes[0]);
+       if (nritems == 0)
+               return 1;
+
+       btrfs_item_key_to_cpu(path->nodes[0], &key, nritems - 1);
+
+       btrfs_release_path(root, path);
+       path->keep_locks = 1;
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       path->keep_locks = 0;
+
+       if (ret < 0)
+               return ret;
+
+       nritems = btrfs_header_nritems(path->nodes[0]);
+       /*
+        * by releasing the path above we dropped all our locks.  A balance
+        * could have added more items next to the key that used to be
+        * at the very end of the block.  So, check again here and
+        * advance the path if there are now more items available.
+        */
+       if (nritems > 0 && path->slots[0] < nritems - 1) {
+               path->slots[0]++;
+               goto done;
+       }
+
+       while (level < BTRFS_MAX_LEVEL) {
+               if (!path->nodes[level])
+                       return 1;
+
+               slot = path->slots[level] + 1;
+               c = path->nodes[level];
+               if (slot >= btrfs_header_nritems(c)) {
+                       level++;
+                       if (level == BTRFS_MAX_LEVEL)
+                               return 1;
+                       continue;
+               }
+
+               if (next) {
+                       btrfs_tree_unlock(next);
+                       free_extent_buffer(next);
+               }
+
+               if (level == 1 && (path->locks[1] || path->skip_locking) &&
+                   path->reada)
+                       reada_for_search(root, path, level, slot, 0);
+
+               next = read_node_slot(root, c, slot);
+               if (!path->skip_locking) {
+                       WARN_ON(!btrfs_tree_locked(c));
+                       btrfs_tree_lock(next);
+               }
+               break;
+       }
+       path->slots[level] = slot;
+       while (1) {
+               level--;
+               c = path->nodes[level];
+               if (path->locks[level])
+                       btrfs_tree_unlock(c);
+               free_extent_buffer(c);
+               path->nodes[level] = next;
+               path->slots[level] = 0;
+               if (!path->skip_locking)
+                       path->locks[level] = 1;
+               if (!level)
+                       break;
+               if (level == 1 && path->locks[1] && path->reada)
+                       reada_for_search(root, path, level, slot, 0);
+               next = read_node_slot(root, next, 0);
+               if (!path->skip_locking) {
+                       WARN_ON(!btrfs_tree_locked(path->nodes[level]));
+                       btrfs_tree_lock(next);
+               }
+       }
+done:
+       unlock_up(path, 0, 1);
+       return 0;
+}
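For reference, the canonical read-only scan this function exists to support, sketched against the signatures above (names are illustrative):

/* Editorial sketch, not patch content. */
static int example_scan_from(struct btrfs_root *root,
			     struct btrfs_key *first)
{
	struct btrfs_path *path;
	struct extent_buffer *leaf;
	struct btrfs_key found;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;
	ret = btrfs_search_slot(NULL, root, first, path, 0, 0);
	if (ret < 0)
		goto out;
	while (1) {
		leaf = path->nodes[0];
		if (path->slots[0] >= btrfs_header_nritems(leaf)) {
			ret = btrfs_next_leaf(root, path);
			if (ret)	/* 1: no more leaves, < 0: error */
				break;
			continue;
		}
		btrfs_item_key_to_cpu(leaf, &found, path->slots[0]);
		/* ... inspect the item at path->slots[0] ... */
		path->slots[0]++;
	}
out:
	btrfs_free_path(path);
	return ret < 0 ? ret : 0;
}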
+
+/*
+ * this uses btrfs_prev_leaf to walk backwards in the tree, and keeps
+ * searching until it gets past min_objectid or finds an item of 'type'
+ *
+ * returns 0 if something is found, 1 if nothing was found and < 0 on error
+ */
+int btrfs_previous_item(struct btrfs_root *root,
+                       struct btrfs_path *path, u64 min_objectid,
+                       int type)
+{
+       struct btrfs_key found_key;
+       struct extent_buffer *leaf;
+       u32 nritems;
+       int ret;
+
+       while (1) {
+               if (path->slots[0] == 0) {
+                       ret = btrfs_prev_leaf(root, path);
+                       if (ret != 0)
+                               return ret;
+               } else {
+                       path->slots[0]--;
+               }
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               if (nritems == 0)
+                       return 1;
+               if (path->slots[0] == nritems)
+                       path->slots[0]--;
+
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               if (found_key.type == type)
+                       return 0;
+               if (found_key.objectid < min_objectid)
+                       break;
+               if (found_key.objectid == min_objectid &&
+                   found_key.type < type)
+                       break;
+       }
+       return 1;
+}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
new file mode 100644 (file)
index 0000000..eee060f
--- /dev/null
@@ -0,0 +1,2129 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_CTREE__
+#define __BTRFS_CTREE__
+
+#include <linux/version.h>
+#include <linux/mm.h>
+#include <linux/highmem.h>
+#include <linux/fs.h>
+#include <linux/completion.h>
+#include <linux/backing-dev.h>
+#include <linux/wait.h>
+#include <asm/kmap_types.h>
+#include "extent_io.h"
+#include "extent_map.h"
+#include "async-thread.h"
+
+struct btrfs_trans_handle;
+struct btrfs_transaction;
+extern struct kmem_cache *btrfs_trans_handle_cachep;
+extern struct kmem_cache *btrfs_transaction_cachep;
+extern struct kmem_cache *btrfs_bit_radix_cachep;
+extern struct kmem_cache *btrfs_path_cachep;
+struct btrfs_ordered_sum;
+
+#define BTRFS_MAGIC "_BHRfS_M"
+
+#define BTRFS_ACL_NOT_CACHED    ((void *)-1)
+
+#ifdef CONFIG_LOCKDEP
+# define BTRFS_MAX_LEVEL 7
+#else
+# define BTRFS_MAX_LEVEL 8
+#endif
+
+/* holds pointers to all of the tree roots */
+#define BTRFS_ROOT_TREE_OBJECTID 1ULL
+
+/* stores information about which extents are in use, and reference counts */
+#define BTRFS_EXTENT_TREE_OBJECTID 2ULL
+
+/*
+ * chunk tree stores translations from logical -> physical block numbering
+ * the super block points to the chunk tree
+ */
+#define BTRFS_CHUNK_TREE_OBJECTID 3ULL
+
+/*
+ * stores information about which areas of a given device are in use.
+ * one per device.  The tree of tree roots points to the device tree
+ */
+#define BTRFS_DEV_TREE_OBJECTID 4ULL
+
+/* one per subvolume, storing files and directories */
+#define BTRFS_FS_TREE_OBJECTID 5ULL
+
+/* directory objectid inside the root tree */
+#define BTRFS_ROOT_TREE_DIR_OBJECTID 6ULL
+
+/* holds checksums of all the data extents */
+#define BTRFS_CSUM_TREE_OBJECTID 7ULL
+
+/* orphan objectid for tracking unlinked/truncated files */
+#define BTRFS_ORPHAN_OBJECTID -5ULL
+
+/* does write ahead logging to speed up fsyncs */
+#define BTRFS_TREE_LOG_OBJECTID -6ULL
+#define BTRFS_TREE_LOG_FIXUP_OBJECTID -7ULL
+
+/* for space balancing */
+#define BTRFS_TREE_RELOC_OBJECTID -8ULL
+#define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL
+
+/*
+ * extent checksums all have this objectid
+ * this allows them to share the logging tree
+ * for fsyncs
+ */
+#define BTRFS_EXTENT_CSUM_OBJECTID -10ULL
+
+/* dummy objectid represents multiple objectids */
+#define BTRFS_MULTIPLE_OBJECTIDS -255ULL
+
+/*
+ * All files have objectids in this range.
+ */
+#define BTRFS_FIRST_FREE_OBJECTID 256ULL
+#define BTRFS_LAST_FREE_OBJECTID -256ULL
+#define BTRFS_FIRST_CHUNK_TREE_OBJECTID 256ULL
+
+
+/*
+ * the device items go into the chunk tree.  The key is in the form
+ * [ 1 BTRFS_DEV_ITEM_KEY device_id ]
+ */
+#define BTRFS_DEV_ITEMS_OBJECTID 1ULL
+
+/*
+ * we can actually store much bigger names, but let's not confuse the rest
+ * of Linux
+ */
+#define BTRFS_NAME_LEN 255
+
+/* 32 bytes in various csum fields */
+#define BTRFS_CSUM_SIZE 32
+
+/* csum types */
+#define BTRFS_CSUM_TYPE_CRC32  0
+
+/* four bytes for CRC32 */
+static int btrfs_csum_sizes[] = { 4, 0 };
+
+#define BTRFS_EMPTY_DIR_SIZE 0
+
+#define BTRFS_FT_UNKNOWN       0
+#define BTRFS_FT_REG_FILE      1
+#define BTRFS_FT_DIR           2
+#define BTRFS_FT_CHRDEV                3
+#define BTRFS_FT_BLKDEV                4
+#define BTRFS_FT_FIFO          5
+#define BTRFS_FT_SOCK          6
+#define BTRFS_FT_SYMLINK       7
+#define BTRFS_FT_XATTR         8
+#define BTRFS_FT_MAX           9
+
+/*
+ * the key defines the order in the tree, and so it also defines (optimal)
+ * block layout.  objectid corresponds to the inode number.  The type
+ * tells us things about the object, and is a kind of stream selector.
+ * so for a given inode, keys with a type of 1 might refer to the inode
+ * data, a type of 2 may point to file data in the btree and type == 3
+ * may point to extents.
+ *
+ * offset is the starting byte offset for this key in the stream.
+ *
+ * btrfs_disk_key is in disk byte order.  struct btrfs_key is always
+ * in cpu native order.  Otherwise they are identical and their sizes
+ * should be the same (ie both packed)
+ */
+struct btrfs_disk_key {
+       __le64 objectid;
+       u8 type;
+       __le64 offset;
+} __attribute__ ((__packed__));
+
+struct btrfs_key {
+       u64 objectid;
+       u8 type;
+       u64 offset;
+} __attribute__ ((__packed__));
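The tree order these keys define is objectid first, then type, then offset, which is what comp_keys()/comp_cpu_keys() in ctree.c implement; an equivalent sketch for reference:

/* Editorial sketch of the key ordering, not patch content. */
static inline int example_key_cmp(const struct btrfs_key *a,
				  const struct btrfs_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}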
+
+struct btrfs_mapping_tree {
+       struct extent_map_tree map_tree;
+};
+
+#define BTRFS_UUID_SIZE 16
+struct btrfs_dev_item {
+       /* the internal btrfs device id */
+       __le64 devid;
+
+       /* size of the device */
+       __le64 total_bytes;
+
+       /* bytes used */
+       __le64 bytes_used;
+
+       /* optimal io alignment for this device */
+       __le32 io_align;
+
+       /* optimal io width for this device */
+       __le32 io_width;
+
+       /* minimal io size for this device */
+       __le32 sector_size;
+
+       /* type and info about this device */
+       __le64 type;
+
+       /* expected generation for this device */
+       __le64 generation;
+
+       /*
+        * starting byte of this partition on the device,
+        * to allow for stripe alignment in the future
+        */
+       __le64 start_offset;
+
+       /* grouping information for allocation decisions */
+       __le32 dev_group;
+
+       /* seek speed 0-100 where 100 is fastest */
+       u8 seek_speed;
+
+       /* bandwidth 0-100 where 100 is fastest */
+       u8 bandwidth;
+
+       /* btrfs generated uuid for this device */
+       u8 uuid[BTRFS_UUID_SIZE];
+
+       /* uuid of FS who owns this device */
+       u8 fsid[BTRFS_UUID_SIZE];
+} __attribute__ ((__packed__));
+
+struct btrfs_stripe {
+       __le64 devid;
+       __le64 offset;
+       u8 dev_uuid[BTRFS_UUID_SIZE];
+} __attribute__ ((__packed__));
+
+struct btrfs_chunk {
+       /* size of this chunk in bytes */
+       __le64 length;
+
+       /* objectid of the root referencing this chunk */
+       __le64 owner;
+
+       __le64 stripe_len;
+       __le64 type;
+
+       /* optimal io alignment for this chunk */
+       __le32 io_align;
+
+       /* optimal io width for this chunk */
+       __le32 io_width;
+
+       /* minimal io size for this chunk */
+       __le32 sector_size;
+
+       /* 2^16 stripes is quite a lot; a second limit is the size of a single
+        * item in the btree
+        */
+       __le16 num_stripes;
+
+       /* sub stripes only matter for raid10 */
+       __le16 sub_stripes;
+       struct btrfs_stripe stripe;
+       /* additional stripes go here */
+} __attribute__ ((__packed__));
+
+static inline unsigned long btrfs_chunk_item_size(int num_stripes)
+{
+       BUG_ON(num_stripes == 0);
+       return sizeof(struct btrfs_chunk) +
+               sizeof(struct btrfs_stripe) * (num_stripes - 1);
+}
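One clarifying note on the num_stripes - 1 term (editorial, grounded only in the two struct definitions above):

/*
 * struct btrfs_chunk embeds its first btrfs_stripe, so a chunk with
 * num_stripes == N needs only N - 1 extra stripes appended, e.g. a
 * two-stripe chunk item occupies
 *	sizeof(struct btrfs_chunk) + sizeof(struct btrfs_stripe) bytes.
 */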
+
+#define BTRFS_FSID_SIZE 16
+#define BTRFS_HEADER_FLAG_WRITTEN (1 << 0)
+
+/*
+ * every tree block (leaf or node) starts with this header.
+ */
+struct btrfs_header {
+       /* these first four must match the super block */
+       u8 csum[BTRFS_CSUM_SIZE];
+       u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
+       __le64 bytenr; /* which block this node is supposed to live in */
+       __le64 flags;
+
+       /* allowed to be different from the super from here on down */
+       u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
+       __le64 generation;
+       __le64 owner;
+       __le32 nritems;
+       u8 level;
+} __attribute__ ((__packed__));
+
+#define BTRFS_NODEPTRS_PER_BLOCK(r) (((r)->nodesize - \
+                                     sizeof(struct btrfs_header)) / \
+                                    sizeof(struct btrfs_key_ptr))
+#define __BTRFS_LEAF_DATA_SIZE(bs) ((bs) - sizeof(struct btrfs_header))
+#define BTRFS_LEAF_DATA_SIZE(r) (__BTRFS_LEAF_DATA_SIZE(r->leafsize))
+#define BTRFS_MAX_INLINE_DATA_SIZE(r) (BTRFS_LEAF_DATA_SIZE(r) - \
+                                       sizeof(struct btrfs_item) - \
+                                       sizeof(struct btrfs_file_extent_item))
+
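A worked example of the size macros above, assuming a 4096-byte leaf for illustration (the format itself does not fix the leaf size):

/*
 * Worked example with a 4096-byte leaf (assumed, not mandated):
 *   sizeof(struct btrfs_header) = 32 + 16 + 8 + 8 + 16 + 8 + 8 + 4 + 1
 *                               = 101 bytes (packed)
 *   __BTRFS_LEAF_DATA_SIZE(4096) = 4096 - 101 = 3995 bytes
 *   BTRFS_MAX_INLINE_DATA_SIZE   = 3995 - 25 - 53 = 3917 bytes
 *     (25 = sizeof(struct btrfs_item), 53 = sizeof(struct
 *      btrfs_file_extent_item), both packed)
 */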
+#define BTRFS_SUPER_FLAG_SEEDING (1ULL << 32)
+
+/*
+ * this is a very generous portion of the super block, giving us
+ * room to translate 14 chunks with 3 stripes each.
+ */
+#define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
+#define BTRFS_LABEL_SIZE 256
+
+/*
+ * the super block basically lists the main trees of the FS
+ * it currently lacks any block count etc etc
+ */
+struct btrfs_super_block {
+       u8 csum[BTRFS_CSUM_SIZE];
+       /* the first 4 fields must match struct btrfs_header */
+       u8 fsid[BTRFS_FSID_SIZE];    /* FS specific uuid */
+       __le64 bytenr; /* this block number */
+       __le64 flags;
+
+       /* allowed to be different from the btrfs_header from here on down */
+       __le64 magic;
+       __le64 generation;
+       __le64 root;
+       __le64 chunk_root;
+       __le64 log_root;
+
+       /* this will help find the new super based on the log root */
+       __le64 log_root_transid;
+       __le64 total_bytes;
+       __le64 bytes_used;
+       __le64 root_dir_objectid;
+       __le64 num_devices;
+       __le32 sectorsize;
+       __le32 nodesize;
+       __le32 leafsize;
+       __le32 stripesize;
+       __le32 sys_chunk_array_size;
+       __le64 chunk_root_generation;
+       __le64 compat_flags;
+       __le64 compat_ro_flags;
+       __le64 incompat_flags;
+       __le16 csum_type;
+       u8 root_level;
+       u8 chunk_root_level;
+       u8 log_root_level;
+       struct btrfs_dev_item dev_item;
+
+       char label[BTRFS_LABEL_SIZE];
+
+       /* future expansion */
+       __le64 reserved[32];
+       u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
+} __attribute__ ((__packed__));
+
+/*
+ * Compat flags that we support.  If any incompat flags are set other than the
+ * ones specified below then we will fail to mount
+ */
+#define BTRFS_FEATURE_COMPAT_SUPP      0x0
+#define BTRFS_FEATURE_COMPAT_RO_SUPP   0x0
+#define BTRFS_FEATURE_INCOMPAT_SUPP    0x0
+
+/*
+ * A leaf is full of items. offset and size tell us where to find
+ * the item in the leaf (relative to the start of the data area)
+ */
+struct btrfs_item {
+       struct btrfs_disk_key key;
+       __le32 offset;
+       __le32 size;
+} __attribute__ ((__packed__));
+
+/*
+ * leaves have an item area and a data area:
+ * [item0, item1....itemN] [free space] [dataN...data1, data0]
+ *
+ * The data is separate from the items to get the keys closer together
+ * during searches.
+ */
+struct btrfs_leaf {
+       struct btrfs_header header;
+       struct btrfs_item items[];
+} __attribute__ ((__packed__));
+
+/*
+ * all non-leaf blocks are nodes, they hold only keys and pointers to
+ * other blocks
+ */
+struct btrfs_key_ptr {
+       struct btrfs_disk_key key;
+       __le64 blockptr;
+       __le64 generation;
+} __attribute__ ((__packed__));
+
+struct btrfs_node {
+       struct btrfs_header header;
+       struct btrfs_key_ptr ptrs[];
+} __attribute__ ((__packed__));
+
+/*
+ * btrfs_paths remember the path taken from the root down to the leaf.
+ * level 0 is always the leaf, and nodes[1...BTRFS_MAX_LEVEL] will point
+ * to any other levels that are present.
+ *
+ * The slots array records the index of the item or block pointer
+ * used while walking the tree.
+ */
+struct btrfs_path {
+       struct extent_buffer *nodes[BTRFS_MAX_LEVEL];
+       int slots[BTRFS_MAX_LEVEL];
+       /* if there is real range locking, this locks field will change */
+       int locks[BTRFS_MAX_LEVEL];
+       int reada;
+       /* keep some upper locks as we walk down */
+       int keep_locks;
+       int skip_locking;
+       int lowest_level;
+
+       /*
+        * set by btrfs_split_item, tells search_slot to keep all locks
+        * and to force calls to keep space in the nodes
+        */
+       int search_for_split;
+};
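A hedged lifecycle note, since every ctree.c helper above threads a path through:

/*
 * Editorial note: a path is created with btrfs_alloc_path(), filled in by
 * a search, and destroyed with btrfs_free_path(), which first drops any
 * held locks and extent_buffer references via btrfs_release_path().
 * After a successful search, nodes[0]/slots[0] identify the leaf and
 * item; nodes[i]/slots[i] record the node and key-pointer index taken at
 * level i on the way down.
 */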
+
+/*
+ * items in the extent btree are used to record the objectid of the
+ * owner of the block and the number of references
+ */
+struct btrfs_extent_item {
+       __le32 refs;
+} __attribute__ ((__packed__));
+
+struct btrfs_extent_ref {
+       __le64 root;
+       __le64 generation;
+       __le64 objectid;
+       __le32 num_refs;
+} __attribute__ ((__packed__));
+
+/* dev extents record free space on individual devices.  The owner
+ * field points back to the chunk allocation mapping tree that allocated
+ * the extent.  The chunk tree uuid field is a way to double check the owner
+ */
+struct btrfs_dev_extent {
+       __le64 chunk_tree;
+       __le64 chunk_objectid;
+       __le64 chunk_offset;
+       __le64 length;
+       u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
+} __attribute__ ((__packed__));
+
+struct btrfs_inode_ref {
+       __le64 index;
+       __le16 name_len;
+       /* name goes here */
+} __attribute__ ((__packed__));
+
+struct btrfs_timespec {
+       __le64 sec;
+       __le32 nsec;
+} __attribute__ ((__packed__));
+
+typedef enum {
+       BTRFS_COMPRESS_NONE = 0,
+       BTRFS_COMPRESS_ZLIB = 1,
+       BTRFS_COMPRESS_LAST = 2,
+} btrfs_compression_type;
+
+/* we don't understand any encryption methods right now */
+typedef enum {
+       BTRFS_ENCRYPTION_NONE = 0,
+       BTRFS_ENCRYPTION_LAST = 1,
+} btrfs_encryption_type;
+
+struct btrfs_inode_item {
+       /* nfs style generation number */
+       __le64 generation;
+       /* transid that last touched this inode */
+       __le64 transid;
+       __le64 size;
+       __le64 nbytes;
+       __le64 block_group;
+       __le32 nlink;
+       __le32 uid;
+       __le32 gid;
+       __le32 mode;
+       __le64 rdev;
+       __le64 flags;
+
+       /* modification sequence number for NFS */
+       __le64 sequence;
+
+       /*
+        * a little future expansion; for more than this we can
+        * just grow the inode item and version it
+        */
+       __le64 reserved[4];
+       struct btrfs_timespec atime;
+       struct btrfs_timespec ctime;
+       struct btrfs_timespec mtime;
+       struct btrfs_timespec otime;
+} __attribute__ ((__packed__));
+
+struct btrfs_dir_log_item {
+       __le64 end;
+} __attribute__ ((__packed__));
+
+struct btrfs_dir_item {
+       struct btrfs_disk_key location;
+       __le64 transid;
+       __le16 data_len;
+       __le16 name_len;
+       u8 type;
+} __attribute__ ((__packed__));
+
+struct btrfs_root_item {
+       struct btrfs_inode_item inode;
+       __le64 generation;
+       __le64 root_dirid;
+       __le64 bytenr;
+       __le64 byte_limit;
+       __le64 bytes_used;
+       __le64 last_snapshot;
+       __le64 flags;
+       __le32 refs;
+       struct btrfs_disk_key drop_progress;
+       u8 drop_level;
+       u8 level;
+} __attribute__ ((__packed__));
+
+/*
+ * this is used for both forward and backward root refs
+ */
+struct btrfs_root_ref {
+       __le64 dirid;
+       __le64 sequence;
+       __le16 name_len;
+} __attribute__ ((__packed__));
+
+#define BTRFS_FILE_EXTENT_INLINE 0
+#define BTRFS_FILE_EXTENT_REG 1
+#define BTRFS_FILE_EXTENT_PREALLOC 2
+
+struct btrfs_file_extent_item {
+       /*
+        * transaction id that created this extent
+        */
+       __le64 generation;
+       /*
+        * max number of bytes to hold this extent in ram
+        * when we split a compressed extent we can't know how big
+        * each of the resulting pieces will be.  So, this is
+        * an upper limit on the size of the extent in ram instead of
+        * an exact limit.
+        */
+       __le64 ram_bytes;
+
+       /*
+        * 32 bits for the various ways we might encode the data,
+        * including compression and encryption.  If any of these
+        * are set to something a given disk format doesn't understand
+        * it is treated like an incompat flag for reading and writing,
+        * but not for stat.
+        */
+       u8 compression;
+       u8 encryption;
+       __le16 other_encoding; /* spare for later use */
+
+       /* are we inline data or a real extent? */
+       u8 type;
+
+       /*
+        * disk space consumed by the extent, checksum blocks are included
+        * in these numbers
+        */
+       __le64 disk_bytenr;
+       __le64 disk_num_bytes;
+       /*
+        * the logical offset in file blocks (no csums)
+        * this extent record is for.  This allows a file extent to point
+        * into the middle of an existing extent on disk, sharing it
+        * between two snapshots (useful if some bytes in the middle of the
+        * extent have changed)
+        */
+       __le64 offset;
+       /*
+        * the logical number of file blocks (no csums included).  This
+        * always reflects the size uncompressed and without encoding.
+        */
+       __le64 num_bytes;
+
+} __attribute__ ((__packed__));
+
+struct btrfs_csum_item {
+       u8 csum;
+} __attribute__ ((__packed__));
+
+/* different types of block groups (and chunks) */
+#define BTRFS_BLOCK_GROUP_DATA     (1 << 0)
+#define BTRFS_BLOCK_GROUP_SYSTEM   (1 << 1)
+#define BTRFS_BLOCK_GROUP_METADATA (1 << 2)
+#define BTRFS_BLOCK_GROUP_RAID0    (1 << 3)
+#define BTRFS_BLOCK_GROUP_RAID1    (1 << 4)
+#define BTRFS_BLOCK_GROUP_DUP     (1 << 5)
+#define BTRFS_BLOCK_GROUP_RAID10   (1 << 6)
+
+struct btrfs_block_group_item {
+       __le64 used;
+       __le64 chunk_objectid;
+       __le64 flags;
+} __attribute__ ((__packed__));
+
+struct btrfs_space_info {
+       u64 flags;
+       u64 total_bytes;
+       u64 bytes_used;
+       u64 bytes_pinned;
+       u64 bytes_reserved;
+       u64 bytes_readonly;
+       int full;
+       int force_alloc;
+       struct list_head list;
+
+       /* for block groups in our same type */
+       struct list_head block_groups;
+       spinlock_t lock;
+       struct rw_semaphore groups_sem;
+};
+
+struct btrfs_free_space {
+       struct rb_node bytes_index;
+       struct rb_node offset_index;
+       u64 offset;
+       u64 bytes;
+};
+
+struct btrfs_block_group_cache {
+       struct btrfs_key key;
+       struct btrfs_block_group_item item;
+       spinlock_t lock;
+       struct mutex alloc_mutex;
+       struct mutex cache_mutex;
+       u64 pinned;
+       u64 reserved;
+       u64 flags;
+       int cached;
+       int ro;
+       int dirty;
+
+       struct btrfs_space_info *space_info;
+
+       /* free space cache stuff */
+       struct rb_root free_space_bytes;
+       struct rb_root free_space_offset;
+
+       /* block group cache stuff */
+       struct rb_node cache_node;
+
+       /* for block groups in the same raid type */
+       struct list_head list;
+
+       /* usage count */
+       atomic_t count;
+};
+
+struct btrfs_leaf_ref_tree {
+       struct rb_root root;
+       struct list_head list;
+       spinlock_t lock;
+};
+
+struct btrfs_device;
+struct btrfs_fs_devices;
+struct btrfs_fs_info {
+       u8 fsid[BTRFS_FSID_SIZE];
+       u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
+       struct btrfs_root *extent_root;
+       struct btrfs_root *tree_root;
+       struct btrfs_root *chunk_root;
+       struct btrfs_root *dev_root;
+       struct btrfs_root *fs_root;
+       struct btrfs_root *csum_root;
+
+       /* the log root tree is a directory of all the other log roots */
+       struct btrfs_root *log_root_tree;
+       struct radix_tree_root fs_roots_radix;
+
+       /* block group cache stuff */
+       spinlock_t block_group_cache_lock;
+       struct rb_root block_group_cache_tree;
+
+       struct extent_io_tree pinned_extents;
+       struct extent_io_tree pending_del;
+       struct extent_io_tree extent_ins;
+
+       /* logical->physical extent mapping */
+       struct btrfs_mapping_tree mapping_tree;
+
+       u64 generation;
+       u64 last_trans_committed;
+       u64 last_trans_new_blockgroup;
+       u64 open_ioctl_trans;
+       unsigned long mount_opt;
+       u64 max_extent;
+       u64 max_inline;
+       u64 alloc_start;
+       struct btrfs_transaction *running_transaction;
+       wait_queue_head_t transaction_throttle;
+       wait_queue_head_t transaction_wait;
+
+       wait_queue_head_t async_submit_wait;
+       wait_queue_head_t tree_log_wait;
+
+       struct btrfs_super_block super_copy;
+       struct btrfs_super_block super_for_commit;
+       struct block_device *__bdev;
+       struct super_block *sb;
+       struct inode *btree_inode;
+       struct backing_dev_info bdi;
+       spinlock_t hash_lock;
+       struct mutex trans_mutex;
+       struct mutex tree_log_mutex;
+       struct mutex transaction_kthread_mutex;
+       struct mutex cleaner_mutex;
+       struct mutex extent_ins_mutex;
+       struct mutex pinned_mutex;
+       struct mutex chunk_mutex;
+       struct mutex drop_mutex;
+       struct mutex volume_mutex;
+       struct mutex tree_reloc_mutex;
+       struct list_head trans_list;
+       struct list_head hashers;
+       struct list_head dead_roots;
+
+       atomic_t nr_async_submits;
+       atomic_t async_submit_draining;
+       atomic_t nr_async_bios;
+       atomic_t async_delalloc_pages;
+       atomic_t tree_log_writers;
+       atomic_t tree_log_commit;
+       unsigned long tree_log_batch;
+       u64 tree_log_transid;
+
+       /*
+        * this is used by the balancing code to wait for all the pending
+        * ordered extents
+        */
+       spinlock_t ordered_extent_lock;
+       struct list_head ordered_extents;
+       struct list_head delalloc_inodes;
+
+       /*
+        * there is a pool of worker threads for checksumming during writes
+        * and a pool for checksumming after reads.  This is because readers
+        * can run with FS locks held, and the writers may be waiting for
+        * those locks.  We don't want ordering in the pending list to cause
+        * deadlocks, and so the two are serviced separately.
+        *
+        * A third pool does submit_bio to avoid deadlocking with the other
+        * two
+        */
+       struct btrfs_workers workers;
+       struct btrfs_workers delalloc_workers;
+       struct btrfs_workers endio_workers;
+       struct btrfs_workers endio_meta_workers;
+       struct btrfs_workers endio_meta_write_workers;
+       struct btrfs_workers endio_write_workers;
+       struct btrfs_workers submit_workers;
+       /*
+        * fixup workers take dirty pages that didn't properly go through
+        * the cow mechanism and make them safe to write.  It happens
+        * for the sys_munmap function call path
+        */
+       struct btrfs_workers fixup_workers;
+       struct task_struct *transaction_kthread;
+       struct task_struct *cleaner_kthread;
+       int thread_pool_size;
+
+       /* tree relocation relocated fields */
+       struct list_head dead_reloc_roots;
+       struct btrfs_leaf_ref_tree reloc_ref_tree;
+       struct btrfs_leaf_ref_tree shared_ref_tree;
+
+       struct kobject super_kobj;
+       struct completion kobj_unregister;
+       int do_barriers;
+       int closing;
+       int log_root_recovering;
+       atomic_t throttles;
+       atomic_t throttle_gen;
+
+       u64 total_pinned;
+       struct list_head dirty_cowonly_roots;
+
+       struct btrfs_fs_devices *fs_devices;
+       struct list_head space_info;
+       spinlock_t delalloc_lock;
+       spinlock_t new_trans_lock;
+       u64 delalloc_bytes;
+       u64 last_alloc;
+       u64 last_data_alloc;
+
+       spinlock_t ref_cache_lock;
+       u64 total_ref_cache_size;
+
+       u64 avail_data_alloc_bits;
+       u64 avail_metadata_alloc_bits;
+       u64 avail_system_alloc_bits;
+       u64 data_alloc_profile;
+       u64 metadata_alloc_profile;
+       u64 system_alloc_profile;
+
+       void *bdev_holder;
+};
+
+/*
+ * in ram representation of the tree.  extent_root is used for all
+ * allocations and for the extent tree itself.
+ */
+struct btrfs_dirty_root;
+struct btrfs_root {
+       struct extent_buffer *node;
+
+       /* the node lock is held while changing the node pointer */
+       spinlock_t node_lock;
+
+       struct extent_buffer *commit_root;
+       struct btrfs_leaf_ref_tree *ref_tree;
+       struct btrfs_leaf_ref_tree ref_tree_struct;
+       struct btrfs_dirty_root *dirty_root;
+       struct btrfs_root *log_root;
+       struct btrfs_root *reloc_root;
+
+       struct btrfs_root_item root_item;
+       struct btrfs_key root_key;
+       struct btrfs_fs_info *fs_info;
+       struct extent_io_tree dirty_log_pages;
+
+       struct kobject root_kobj;
+       struct completion kobj_unregister;
+       struct mutex objectid_mutex;
+       struct mutex log_mutex;
+
+       u64 objectid;
+       u64 last_trans;
+
+       /* data allocations are done in sectorsize units */
+       u32 sectorsize;
+
+       /* node allocations are done in nodesize units */
+       u32 nodesize;
+
+       /* leaf allocations are done in leafsize units */
+       u32 leafsize;
+
+       u32 stripesize;
+
+       u32 type;
+       u64 highest_inode;
+       u64 last_inode_alloc;
+       int ref_cows;
+       int track_dirty;
+       u64 defrag_trans_start;
+       struct btrfs_key defrag_progress;
+       struct btrfs_key defrag_max;
+       int defrag_running;
+       int defrag_level;
+       char *name;
+       int in_sysfs;
+
+       /* the dirty list is only used by non-reference counted roots */
+       struct list_head dirty_list;
+
+       spinlock_t list_lock;
+       struct list_head dead_list;
+       struct list_head orphan_list;
+
+       /*
+        * right now this just gets used so that a root has its own devid
+        * for stat.  It may be used for more later
+        */
+       struct super_block anon_super;
+};
+
+/*
+ * inode items have the data typically returned from stat and store other
+ * info about object characteristics.  There is one for every file and dir in
+ * the FS
+ */
+#define BTRFS_INODE_ITEM_KEY           1
+#define BTRFS_INODE_REF_KEY            12
+#define BTRFS_XATTR_ITEM_KEY           24
+#define BTRFS_ORPHAN_ITEM_KEY          48
+/* reserve 2-15 close to the inode for later flexibility */
+
+/*
+ * dir items are the name -> inode pointers in a directory.  There is one
+ * for every name in a directory.
+ */
+#define BTRFS_DIR_LOG_ITEM_KEY  60
+#define BTRFS_DIR_LOG_INDEX_KEY 72
+#define BTRFS_DIR_ITEM_KEY     84
+#define BTRFS_DIR_INDEX_KEY    96
+/*
+ * extent data is for file data
+ */
+#define BTRFS_EXTENT_DATA_KEY  108
+
+/*
+ * extent csums are stored in a separate tree and hold csums for
+ * an entire extent on disk.
+ */
+#define BTRFS_EXTENT_CSUM_KEY  128
+
+/*
+ * root items point to tree roots.  They are typically in the root
+ * tree, used by the super block to find all the other trees
+ */
+#define BTRFS_ROOT_ITEM_KEY    132
+
+/*
+ * root backrefs tie subvols and snapshots to the directory entries that
+ * reference them
+ */
+#define BTRFS_ROOT_BACKREF_KEY 144
+
+/*
+ * root refs make a fast index for listing all of the snapshots and
+ * subvolumes referenced by a given root.  They point directly to the
+ * directory item in the root that references the subvol
+ */
+#define BTRFS_ROOT_REF_KEY     156
+
+/*
+ * extent items are in the extent map tree.  These record which blocks
+ * are used, and how many references there are to each block
+ */
+#define BTRFS_EXTENT_ITEM_KEY  168
+#define BTRFS_EXTENT_REF_KEY   180
+
+/*
+ * block groups give us hints into the extent allocation trees: which
+ * blocks are free, etc.
+ */
+#define BTRFS_BLOCK_GROUP_ITEM_KEY 192
+
+#define BTRFS_DEV_EXTENT_KEY   204
+#define BTRFS_DEV_ITEM_KEY     216
+#define BTRFS_CHUNK_ITEM_KEY   228
+
+/*
+ * string items are for debugging.  They just store a short string of
+ * data in the FS
+ */
+#define BTRFS_STRING_ITEM_KEY  253
+
+#define BTRFS_MOUNT_NODATASUM          (1 << 0)
+#define BTRFS_MOUNT_NODATACOW          (1 << 1)
+#define BTRFS_MOUNT_NOBARRIER          (1 << 2)
+#define BTRFS_MOUNT_SSD                        (1 << 3)
+#define BTRFS_MOUNT_DEGRADED           (1 << 4)
+#define BTRFS_MOUNT_COMPRESS           (1 << 5)
+
+#define btrfs_clear_opt(o, opt)                ((o) &= ~BTRFS_MOUNT_##opt)
+#define btrfs_set_opt(o, opt)          ((o) |= BTRFS_MOUNT_##opt)
+#define btrfs_test_opt(root, opt)      ((root)->fs_info->mount_opt & \
+                                        BTRFS_MOUNT_##opt)
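These expand to plain bit operations on fs_info->mount_opt; an illustrative (made-up) call site:

/*
 * Editorial example, not patch content:
 *
 *	btrfs_set_opt(info->mount_opt, SSD);
 *	if (btrfs_test_opt(root, SSD))
 *		...;
 *
 * The ## token pasting lets callers name an option without spelling out
 * the BTRFS_MOUNT_ prefix.
 */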
+/*
+ * Inode flags
+ */
+#define BTRFS_INODE_NODATASUM          (1 << 0)
+#define BTRFS_INODE_NODATACOW          (1 << 1)
+#define BTRFS_INODE_READONLY           (1 << 2)
+#define BTRFS_INODE_NOCOMPRESS         (1 << 3)
+#define BTRFS_INODE_PREALLOC           (1 << 4)
+#define btrfs_clear_flag(inode, flag)  (BTRFS_I(inode)->flags &= \
+                                        ~BTRFS_INODE_##flag)
+#define btrfs_set_flag(inode, flag)    (BTRFS_I(inode)->flags |= \
+                                        BTRFS_INODE_##flag)
+#define btrfs_test_flag(inode, flag)   (BTRFS_I(inode)->flags & \
+                                        BTRFS_INODE_##flag)
+/* some macros to generate set/get funcs for the struct fields.  This
+ * assumes there is a lefoo_to_cpu for every type, so let's make a simple
+ * one for u8:
+ */
+#define le8_to_cpu(v) (v)
+#define cpu_to_le8(v) (v)
+#define __le8 u8
+
+#define read_eb_member(eb, ptr, type, member, result) (                        \
+       read_extent_buffer(eb, (char *)(result),                        \
+                          ((unsigned long)(ptr)) +                     \
+                           offsetof(type, member),                     \
+                          sizeof(((type *)0)->member)))
+
+#define write_eb_member(eb, ptr, type, member, result) (               \
+       write_extent_buffer(eb, (char *)(result),                       \
+                          ((unsigned long)(ptr)) +                     \
+                           offsetof(type, member),                     \
+                          sizeof(((type *)0)->member)))
+
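+/*
+ * Example usage, as in btrfs_item_key() further down:
+ *
+ *     read_eb_member(eb, item, struct btrfs_item, key, disk_key);
+ *
+ * which copies item->key out of the extent buffer into the buffer
+ * disk_key points to.
+ */
+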
+#ifndef BTRFS_SETGET_FUNCS
+#define BTRFS_SETGET_FUNCS(name, type, member, bits)                   \
+u##bits btrfs_##name(struct extent_buffer *eb, type *s);               \
+void btrfs_set_##name(struct extent_buffer *eb, type *s, u##bits val);
+#endif
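+/*
+ * The guard above lets a C file predefine BTRFS_SETGET_FUNCS to emit
+ * the function bodies (struct-funcs.c does this in btrfs); by default
+ * the header only declares the accessor prototypes.
+ */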
+
+#define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)            \
+static inline u##bits btrfs_##name(struct extent_buffer *eb)           \
+{                                                                      \
+       type *p = kmap_atomic(eb->first_page, KM_USER0);                \
+       u##bits res = le##bits##_to_cpu(p->member);                     \
+       kunmap_atomic(p, KM_USER0);                                     \
+       return res;                                                     \
+}                                                                      \
+static inline void btrfs_set_##name(struct extent_buffer *eb,          \
+                                   u##bits val)                        \
+{                                                                      \
+       type *p = kmap_atomic(eb->first_page, KM_USER0);                \
+       p->member = cpu_to_le##bits(val);                               \
+       kunmap_atomic(p, KM_USER0);                                     \
+}
+
+#define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits)             \
+static inline u##bits btrfs_##name(type *s)                            \
+{                                                                      \
+       return le##bits##_to_cpu(s->member);                            \
+}                                                                      \
+static inline void btrfs_set_##name(type *s, u##bits val)              \
+{                                                                      \
+       s->member = cpu_to_le##bits(val);                               \
+}
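+
+/*
+ * For illustration: a declaration such as the one used for the root
+ * item below,
+ *
+ *     BTRFS_SETGET_STACK_FUNCS(root_level, struct btrfs_root_item, level, 8);
+ *
+ * expands to the pair of inline helpers
+ *
+ *     static inline u8 btrfs_root_level(struct btrfs_root_item *s)
+ *     { return le8_to_cpu(s->level); }
+ *     static inline void btrfs_set_root_level(struct btrfs_root_item *s,
+ *                                             u8 val)
+ *     { s->level = cpu_to_le8(val); }
+ *
+ * giving every on-disk field an endian-safe accessor.
+ */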
+
+BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_FUNCS(device_total_bytes, struct btrfs_dev_item, total_bytes, 64);
+BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
+BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
+BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
+BTRFS_SETGET_FUNCS(device_start_offset, struct btrfs_dev_item,
+                  start_offset, 64);
+BTRFS_SETGET_FUNCS(device_sector_size, struct btrfs_dev_item, sector_size, 32);
+BTRFS_SETGET_FUNCS(device_id, struct btrfs_dev_item, devid, 64);
+BTRFS_SETGET_FUNCS(device_group, struct btrfs_dev_item, dev_group, 32);
+BTRFS_SETGET_FUNCS(device_seek_speed, struct btrfs_dev_item, seek_speed, 8);
+BTRFS_SETGET_FUNCS(device_bandwidth, struct btrfs_dev_item, bandwidth, 8);
+BTRFS_SETGET_FUNCS(device_generation, struct btrfs_dev_item, generation, 64);
+
+BTRFS_SETGET_STACK_FUNCS(stack_device_type, struct btrfs_dev_item, type, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_total_bytes, struct btrfs_dev_item,
+                        total_bytes, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_bytes_used, struct btrfs_dev_item,
+                        bytes_used, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_io_align, struct btrfs_dev_item,
+                        io_align, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_device_io_width, struct btrfs_dev_item,
+                        io_width, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_device_sector_size, struct btrfs_dev_item,
+                        sector_size, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_device_id, struct btrfs_dev_item, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_device_group, struct btrfs_dev_item,
+                        dev_group, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_device_seek_speed, struct btrfs_dev_item,
+                        seek_speed, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_device_bandwidth, struct btrfs_dev_item,
+                        bandwidth, 8);
+BTRFS_SETGET_STACK_FUNCS(stack_device_generation, struct btrfs_dev_item,
+                        generation, 64);
+
+static inline char *btrfs_device_uuid(struct btrfs_dev_item *d)
+{
+       return (char *)d + offsetof(struct btrfs_dev_item, uuid);
+}
+
+static inline char *btrfs_device_fsid(struct btrfs_dev_item *d)
+{
+       return (char *)d + offsetof(struct btrfs_dev_item, fsid);
+}
+
+BTRFS_SETGET_FUNCS(chunk_length, struct btrfs_chunk, length, 64);
+BTRFS_SETGET_FUNCS(chunk_owner, struct btrfs_chunk, owner, 64);
+BTRFS_SETGET_FUNCS(chunk_stripe_len, struct btrfs_chunk, stripe_len, 64);
+BTRFS_SETGET_FUNCS(chunk_io_align, struct btrfs_chunk, io_align, 32);
+BTRFS_SETGET_FUNCS(chunk_io_width, struct btrfs_chunk, io_width, 32);
+BTRFS_SETGET_FUNCS(chunk_sector_size, struct btrfs_chunk, sector_size, 32);
+BTRFS_SETGET_FUNCS(chunk_type, struct btrfs_chunk, type, 64);
+BTRFS_SETGET_FUNCS(chunk_num_stripes, struct btrfs_chunk, num_stripes, 16);
+BTRFS_SETGET_FUNCS(chunk_sub_stripes, struct btrfs_chunk, sub_stripes, 16);
+BTRFS_SETGET_FUNCS(stripe_devid, struct btrfs_stripe, devid, 64);
+BTRFS_SETGET_FUNCS(stripe_offset, struct btrfs_stripe, offset, 64);
+
+static inline char *btrfs_stripe_dev_uuid(struct btrfs_stripe *s)
+{
+       return (char *)s + offsetof(struct btrfs_stripe, dev_uuid);
+}
+
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_length, struct btrfs_chunk, length, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_owner, struct btrfs_chunk, owner, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_stripe_len, struct btrfs_chunk,
+                        stripe_len, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_io_align, struct btrfs_chunk,
+                        io_align, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_io_width, struct btrfs_chunk,
+                        io_width, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_sector_size, struct btrfs_chunk,
+                        sector_size, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_type, struct btrfs_chunk, type, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_num_stripes, struct btrfs_chunk,
+                        num_stripes, 16);
+BTRFS_SETGET_STACK_FUNCS(stack_chunk_sub_stripes, struct btrfs_chunk,
+                        sub_stripes, 16);
+BTRFS_SETGET_STACK_FUNCS(stack_stripe_devid, struct btrfs_stripe, devid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_stripe_offset, struct btrfs_stripe, offset, 64);
+
+static inline struct btrfs_stripe *btrfs_stripe_nr(struct btrfs_chunk *c,
+                                                  int nr)
+{
+       unsigned long offset = (unsigned long)c;
+       offset += offsetof(struct btrfs_chunk, stripe);
+       offset += nr * sizeof(struct btrfs_stripe);
+       return (struct btrfs_stripe *)offset;
+}
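+
+/*
+ * Note: this arithmetic works whether 'c' is a real pointer or, as in
+ * the eb-based accessors, a byte offset within an extent buffer cast to
+ * a pointer; the *_nr helpers below feed the result back into the
+ * setget functions either way.
+ */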
+
+static inline char *btrfs_stripe_dev_uuid_nr(struct btrfs_chunk *c, int nr)
+{
+       return btrfs_stripe_dev_uuid(btrfs_stripe_nr(c, nr));
+}
+
+static inline u64 btrfs_stripe_offset_nr(struct extent_buffer *eb,
+                                        struct btrfs_chunk *c, int nr)
+{
+       return btrfs_stripe_offset(eb, btrfs_stripe_nr(c, nr));
+}
+
+static inline void btrfs_set_stripe_offset_nr(struct extent_buffer *eb,
+                                            struct btrfs_chunk *c, int nr,
+                                            u64 val)
+{
+       btrfs_set_stripe_offset(eb, btrfs_stripe_nr(c, nr), val);
+}
+
+static inline u64 btrfs_stripe_devid_nr(struct extent_buffer *eb,
+                                        struct btrfs_chunk *c, int nr)
+{
+       return btrfs_stripe_devid(eb, btrfs_stripe_nr(c, nr));
+}
+
+static inline void btrfs_set_stripe_devid_nr(struct extent_buffer *eb,
+                                            struct btrfs_chunk *c, int nr,
+                                            u64 val)
+{
+       btrfs_set_stripe_devid(eb, btrfs_stripe_nr(c, nr), val);
+}
+
+/* struct btrfs_block_group_item */
+BTRFS_SETGET_STACK_FUNCS(block_group_used, struct btrfs_block_group_item,
+                        used, 64);
+BTRFS_SETGET_FUNCS(disk_block_group_used, struct btrfs_block_group_item,
+                        used, 64);
+BTRFS_SETGET_STACK_FUNCS(block_group_chunk_objectid,
+                       struct btrfs_block_group_item, chunk_objectid, 64);
+
+BTRFS_SETGET_FUNCS(disk_block_group_chunk_objectid,
+                  struct btrfs_block_group_item, chunk_objectid, 64);
+BTRFS_SETGET_FUNCS(disk_block_group_flags,
+                  struct btrfs_block_group_item, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(block_group_flags,
+                       struct btrfs_block_group_item, flags, 64);
+
+/* struct btrfs_inode_ref */
+BTRFS_SETGET_FUNCS(inode_ref_name_len, struct btrfs_inode_ref, name_len, 16);
+BTRFS_SETGET_FUNCS(inode_ref_index, struct btrfs_inode_ref, index, 64);
+
+/* struct btrfs_inode_item */
+BTRFS_SETGET_FUNCS(inode_generation, struct btrfs_inode_item, generation, 64);
+BTRFS_SETGET_FUNCS(inode_sequence, struct btrfs_inode_item, sequence, 64);
+BTRFS_SETGET_FUNCS(inode_transid, struct btrfs_inode_item, transid, 64);
+BTRFS_SETGET_FUNCS(inode_size, struct btrfs_inode_item, size, 64);
+BTRFS_SETGET_FUNCS(inode_nbytes, struct btrfs_inode_item, nbytes, 64);
+BTRFS_SETGET_FUNCS(inode_block_group, struct btrfs_inode_item, block_group, 64);
+BTRFS_SETGET_FUNCS(inode_nlink, struct btrfs_inode_item, nlink, 32);
+BTRFS_SETGET_FUNCS(inode_uid, struct btrfs_inode_item, uid, 32);
+BTRFS_SETGET_FUNCS(inode_gid, struct btrfs_inode_item, gid, 32);
+BTRFS_SETGET_FUNCS(inode_mode, struct btrfs_inode_item, mode, 32);
+BTRFS_SETGET_FUNCS(inode_rdev, struct btrfs_inode_item, rdev, 64);
+BTRFS_SETGET_FUNCS(inode_flags, struct btrfs_inode_item, flags, 64);
+
+static inline struct btrfs_timespec *
+btrfs_inode_atime(struct btrfs_inode_item *inode_item)
+{
+       unsigned long ptr = (unsigned long)inode_item;
+       ptr += offsetof(struct btrfs_inode_item, atime);
+       return (struct btrfs_timespec *)ptr;
+}
+
+static inline struct btrfs_timespec *
+btrfs_inode_mtime(struct btrfs_inode_item *inode_item)
+{
+       unsigned long ptr = (unsigned long)inode_item;
+       ptr += offsetof(struct btrfs_inode_item, mtime);
+       return (struct btrfs_timespec *)ptr;
+}
+
+static inline struct btrfs_timespec *
+btrfs_inode_ctime(struct btrfs_inode_item *inode_item)
+{
+       unsigned long ptr = (unsigned long)inode_item;
+       ptr += offsetof(struct btrfs_inode_item, ctime);
+       return (struct btrfs_timespec *)ptr;
+}
+
+static inline struct btrfs_timespec *
+btrfs_inode_otime(struct btrfs_inode_item *inode_item)
+{
+       unsigned long ptr = (unsigned long)inode_item;
+       ptr += offsetof(struct btrfs_inode_item, otime);
+       return (struct btrfs_timespec *)ptr;
+}
+
+BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
+BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
+
+/* struct btrfs_dev_extent */
+BTRFS_SETGET_FUNCS(dev_extent_chunk_tree, struct btrfs_dev_extent,
+                  chunk_tree, 64);
+BTRFS_SETGET_FUNCS(dev_extent_chunk_objectid, struct btrfs_dev_extent,
+                  chunk_objectid, 64);
+BTRFS_SETGET_FUNCS(dev_extent_chunk_offset, struct btrfs_dev_extent,
+                  chunk_offset, 64);
+BTRFS_SETGET_FUNCS(dev_extent_length, struct btrfs_dev_extent, length, 64);
+
+static inline u8 *btrfs_dev_extent_chunk_tree_uuid(struct btrfs_dev_extent *dev)
+{
+       unsigned long ptr = offsetof(struct btrfs_dev_extent, chunk_tree_uuid);
+       return (u8 *)((unsigned long)dev + ptr);
+}
+
+/* struct btrfs_extent_ref */
+BTRFS_SETGET_FUNCS(ref_root, struct btrfs_extent_ref, root, 64);
+BTRFS_SETGET_FUNCS(ref_generation, struct btrfs_extent_ref, generation, 64);
+BTRFS_SETGET_FUNCS(ref_objectid, struct btrfs_extent_ref, objectid, 64);
+BTRFS_SETGET_FUNCS(ref_num_refs, struct btrfs_extent_ref, num_refs, 32);
+
+BTRFS_SETGET_STACK_FUNCS(stack_ref_root, struct btrfs_extent_ref, root, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_ref_generation, struct btrfs_extent_ref,
+                        generation, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_ref_objectid, struct btrfs_extent_ref,
+                        objectid, 64);
+BTRFS_SETGET_STACK_FUNCS(stack_ref_num_refs, struct btrfs_extent_ref,
+                        num_refs, 32);
+
+/* struct btrfs_extent_item */
+BTRFS_SETGET_FUNCS(extent_refs, struct btrfs_extent_item, refs, 32);
+BTRFS_SETGET_STACK_FUNCS(stack_extent_refs, struct btrfs_extent_item,
+                        refs, 32);
+
+/* struct btrfs_node */
+BTRFS_SETGET_FUNCS(key_blockptr, struct btrfs_key_ptr, blockptr, 64);
+BTRFS_SETGET_FUNCS(key_generation, struct btrfs_key_ptr, generation, 64);
+
+static inline u64 btrfs_node_blockptr(struct extent_buffer *eb, int nr)
+{
+       unsigned long ptr;
+       ptr = offsetof(struct btrfs_node, ptrs) +
+               sizeof(struct btrfs_key_ptr) * nr;
+       return btrfs_key_blockptr(eb, (struct btrfs_key_ptr *)ptr);
+}
+
+static inline void btrfs_set_node_blockptr(struct extent_buffer *eb,
+                                          int nr, u64 val)
+{
+       unsigned long ptr;
+       ptr = offsetof(struct btrfs_node, ptrs) +
+               sizeof(struct btrfs_key_ptr) * nr;
+       btrfs_set_key_blockptr(eb, (struct btrfs_key_ptr *)ptr, val);
+}
+
+static inline u64 btrfs_node_ptr_generation(struct extent_buffer *eb, int nr)
+{
+       unsigned long ptr;
+       ptr = offsetof(struct btrfs_node, ptrs) +
+               sizeof(struct btrfs_key_ptr) * nr;
+       return btrfs_key_generation(eb, (struct btrfs_key_ptr *)ptr);
+}
+
+static inline void btrfs_set_node_ptr_generation(struct extent_buffer *eb,
+                                                int nr, u64 val)
+{
+       unsigned long ptr;
+       ptr = offsetof(struct btrfs_node, ptrs) +
+               sizeof(struct btrfs_key_ptr) * nr;
+       btrfs_set_key_generation(eb, (struct btrfs_key_ptr *)ptr, val);
+}
+
+static inline unsigned long btrfs_node_key_ptr_offset(int nr)
+{
+       return offsetof(struct btrfs_node, ptrs) +
+               sizeof(struct btrfs_key_ptr) * nr;
+}
+
+void btrfs_node_key(struct extent_buffer *eb,
+                   struct btrfs_disk_key *disk_key, int nr);
+
+static inline void btrfs_set_node_key(struct extent_buffer *eb,
+                                     struct btrfs_disk_key *disk_key, int nr)
+{
+       unsigned long ptr;
+       ptr = btrfs_node_key_ptr_offset(nr);
+       write_eb_member(eb, (struct btrfs_key_ptr *)ptr,
+                      struct btrfs_key_ptr, key, disk_key);
+}
+
+/* struct btrfs_item */
+BTRFS_SETGET_FUNCS(item_offset, struct btrfs_item, offset, 32);
+BTRFS_SETGET_FUNCS(item_size, struct btrfs_item, size, 32);
+
+static inline unsigned long btrfs_item_nr_offset(int nr)
+{
+       return offsetof(struct btrfs_leaf, items) +
+               sizeof(struct btrfs_item) * nr;
+}
+
+static inline struct btrfs_item *btrfs_item_nr(struct extent_buffer *eb,
+                                              int nr)
+{
+       return (struct btrfs_item *)btrfs_item_nr_offset(nr);
+}
+
+static inline u32 btrfs_item_end(struct extent_buffer *eb,
+                                struct btrfs_item *item)
+{
+       return btrfs_item_offset(eb, item) + btrfs_item_size(eb, item);
+}
+
+static inline u32 btrfs_item_end_nr(struct extent_buffer *eb, int nr)
+{
+       return btrfs_item_end(eb, btrfs_item_nr(eb, nr));
+}
+
+static inline u32 btrfs_item_offset_nr(struct extent_buffer *eb, int nr)
+{
+       return btrfs_item_offset(eb, btrfs_item_nr(eb, nr));
+}
+
+static inline u32 btrfs_item_size_nr(struct extent_buffer *eb, int nr)
+{
+       return btrfs_item_size(eb, btrfs_item_nr(eb, nr));
+}
+
+static inline void btrfs_item_key(struct extent_buffer *eb,
+                          struct btrfs_disk_key *disk_key, int nr)
+{
+       struct btrfs_item *item = btrfs_item_nr(eb, nr);
+       read_eb_member(eb, item, struct btrfs_item, key, disk_key);
+}
+
+static inline void btrfs_set_item_key(struct extent_buffer *eb,
+                              struct btrfs_disk_key *disk_key, int nr)
+{
+       struct btrfs_item *item = btrfs_item_nr(eb, nr);
+       write_eb_member(eb, item, struct btrfs_item, key, disk_key);
+}
+
+BTRFS_SETGET_FUNCS(dir_log_end, struct btrfs_dir_log_item, end, 64);
+
+/*
+ * struct btrfs_root_ref
+ */
+BTRFS_SETGET_FUNCS(root_ref_dirid, struct btrfs_root_ref, dirid, 64);
+BTRFS_SETGET_FUNCS(root_ref_sequence, struct btrfs_root_ref, sequence, 64);
+BTRFS_SETGET_FUNCS(root_ref_name_len, struct btrfs_root_ref, name_len, 16);
+
+/* struct btrfs_dir_item */
+BTRFS_SETGET_FUNCS(dir_data_len, struct btrfs_dir_item, data_len, 16);
+BTRFS_SETGET_FUNCS(dir_type, struct btrfs_dir_item, type, 8);
+BTRFS_SETGET_FUNCS(dir_name_len, struct btrfs_dir_item, name_len, 16);
+BTRFS_SETGET_FUNCS(dir_transid, struct btrfs_dir_item, transid, 64);
+
+static inline void btrfs_dir_item_key(struct extent_buffer *eb,
+                                     struct btrfs_dir_item *item,
+                                     struct btrfs_disk_key *key)
+{
+       read_eb_member(eb, item, struct btrfs_dir_item, location, key);
+}
+
+static inline void btrfs_set_dir_item_key(struct extent_buffer *eb,
+                                         struct btrfs_dir_item *item,
+                                         struct btrfs_disk_key *key)
+{
+       write_eb_member(eb, item, struct btrfs_dir_item, location, key);
+}
+
+/* struct btrfs_disk_key */
+BTRFS_SETGET_STACK_FUNCS(disk_key_objectid, struct btrfs_disk_key,
+                        objectid, 64);
+BTRFS_SETGET_STACK_FUNCS(disk_key_offset, struct btrfs_disk_key, offset, 64);
+BTRFS_SETGET_STACK_FUNCS(disk_key_type, struct btrfs_disk_key, type, 8);
+
+static inline void btrfs_disk_key_to_cpu(struct btrfs_key *cpu,
+                                        struct btrfs_disk_key *disk)
+{
+       cpu->offset = le64_to_cpu(disk->offset);
+       cpu->type = disk->type;
+       cpu->objectid = le64_to_cpu(disk->objectid);
+}
+
+static inline void btrfs_cpu_key_to_disk(struct btrfs_disk_key *disk,
+                                        struct btrfs_key *cpu)
+{
+       disk->offset = cpu_to_le64(cpu->offset);
+       disk->type = cpu->type;
+       disk->objectid = cpu_to_le64(cpu->objectid);
+}
+
+static inline void btrfs_node_key_to_cpu(struct extent_buffer *eb,
+                                 struct btrfs_key *key, int nr)
+{
+       struct btrfs_disk_key disk_key;
+       btrfs_node_key(eb, &disk_key, nr);
+       btrfs_disk_key_to_cpu(key, &disk_key);
+}
+
+static inline void btrfs_item_key_to_cpu(struct extent_buffer *eb,
+                                 struct btrfs_key *key, int nr)
+{
+       struct btrfs_disk_key disk_key;
+       btrfs_item_key(eb, &disk_key, nr);
+       btrfs_disk_key_to_cpu(key, &disk_key);
+}
+
+static inline void btrfs_dir_item_key_to_cpu(struct extent_buffer *eb,
+                                     struct btrfs_dir_item *item,
+                                     struct btrfs_key *key)
+{
+       struct btrfs_disk_key disk_key;
+       btrfs_dir_item_key(eb, item, &disk_key);
+       btrfs_disk_key_to_cpu(key, &disk_key);
+}
+
+
+static inline u8 btrfs_key_type(struct btrfs_key *key)
+{
+       return key->type;
+}
+
+static inline void btrfs_set_key_type(struct btrfs_key *key, u8 val)
+{
+       key->type = val;
+}
+
+/* struct btrfs_header */
+BTRFS_SETGET_HEADER_FUNCS(header_bytenr, struct btrfs_header, bytenr, 64);
+BTRFS_SETGET_HEADER_FUNCS(header_generation, struct btrfs_header,
+                         generation, 64);
+BTRFS_SETGET_HEADER_FUNCS(header_owner, struct btrfs_header, owner, 64);
+BTRFS_SETGET_HEADER_FUNCS(header_nritems, struct btrfs_header, nritems, 32);
+BTRFS_SETGET_HEADER_FUNCS(header_flags, struct btrfs_header, flags, 64);
+BTRFS_SETGET_HEADER_FUNCS(header_level, struct btrfs_header, level, 8);
+
+static inline int btrfs_header_flag(struct extent_buffer *eb, u64 flag)
+{
+       return (btrfs_header_flags(eb) & flag) == flag;
+}
+
+static inline int btrfs_set_header_flag(struct extent_buffer *eb, u64 flag)
+{
+       u64 flags = btrfs_header_flags(eb);
+       btrfs_set_header_flags(eb, flags | flag);
+       return (flags & flag) == flag;
+}
+
+static inline int btrfs_clear_header_flag(struct extent_buffer *eb, u64 flag)
+{
+       u64 flags = btrfs_header_flags(eb);
+       btrfs_set_header_flags(eb, flags & ~flag);
+       return (flags & flag) == flag;
+}
+
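+/*
+ * The helpers below return the byte offset of a member within the
+ * header, cast to a pointer; they are meant to be handed to
+ * read_extent_buffer()/write_extent_buffer(), not dereferenced
+ * directly.
+ */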
+static inline u8 *btrfs_header_fsid(struct extent_buffer *eb)
+{
+       unsigned long ptr = offsetof(struct btrfs_header, fsid);
+       return (u8 *)ptr;
+}
+
+static inline u8 *btrfs_header_chunk_tree_uuid(struct extent_buffer *eb)
+{
+       unsigned long ptr = offsetof(struct btrfs_header, chunk_tree_uuid);
+       return (u8 *)ptr;
+}
+
+static inline u8 *btrfs_super_fsid(struct extent_buffer *eb)
+{
+       unsigned long ptr = offsetof(struct btrfs_super_block, fsid);
+       return (u8 *)ptr;
+}
+
+static inline u8 *btrfs_header_csum(struct extent_buffer *eb)
+{
+       unsigned long ptr = offsetof(struct btrfs_header, csum);
+       return (u8 *)ptr;
+}
+
+static inline struct btrfs_node *btrfs_buffer_node(struct extent_buffer *eb)
+{
+       return NULL;
+}
+
+static inline struct btrfs_leaf *btrfs_buffer_leaf(struct extent_buffer *eb)
+{
+       return NULL;
+}
+
+static inline struct btrfs_header *btrfs_buffer_header(struct extent_buffer *eb)
+{
+       return NULL;
+}
+
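+/* leaves sit at level 0 of the tree; everything above them is a node */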
+static inline int btrfs_is_leaf(struct extent_buffer *eb)
+{
+       return btrfs_header_level(eb) == 0;
+}
+
+/* struct btrfs_root_item */
+BTRFS_SETGET_FUNCS(disk_root_generation, struct btrfs_root_item,
+                  generation, 64);
+BTRFS_SETGET_FUNCS(disk_root_refs, struct btrfs_root_item, refs, 32);
+BTRFS_SETGET_FUNCS(disk_root_bytenr, struct btrfs_root_item, bytenr, 64);
+BTRFS_SETGET_FUNCS(disk_root_level, struct btrfs_root_item, level, 8);
+
+BTRFS_SETGET_STACK_FUNCS(root_generation, struct btrfs_root_item,
+                        generation, 64);
+BTRFS_SETGET_STACK_FUNCS(root_bytenr, struct btrfs_root_item, bytenr, 64);
+BTRFS_SETGET_STACK_FUNCS(root_level, struct btrfs_root_item, level, 8);
+BTRFS_SETGET_STACK_FUNCS(root_dirid, struct btrfs_root_item, root_dirid, 64);
+BTRFS_SETGET_STACK_FUNCS(root_refs, struct btrfs_root_item, refs, 32);
+BTRFS_SETGET_STACK_FUNCS(root_flags, struct btrfs_root_item, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(root_used, struct btrfs_root_item, bytes_used, 64);
+BTRFS_SETGET_STACK_FUNCS(root_limit, struct btrfs_root_item, byte_limit, 64);
+BTRFS_SETGET_STACK_FUNCS(root_last_snapshot, struct btrfs_root_item,
+                        last_snapshot, 64);
+
+/* struct btrfs_super_block */
+
+BTRFS_SETGET_STACK_FUNCS(super_bytenr, struct btrfs_super_block, bytenr, 64);
+BTRFS_SETGET_STACK_FUNCS(super_flags, struct btrfs_super_block, flags, 64);
+BTRFS_SETGET_STACK_FUNCS(super_generation, struct btrfs_super_block,
+                        generation, 64);
+BTRFS_SETGET_STACK_FUNCS(super_root, struct btrfs_super_block, root, 64);
+BTRFS_SETGET_STACK_FUNCS(super_sys_array_size,
+                        struct btrfs_super_block, sys_chunk_array_size, 32);
+BTRFS_SETGET_STACK_FUNCS(super_chunk_root_generation,
+                        struct btrfs_super_block, chunk_root_generation, 64);
+BTRFS_SETGET_STACK_FUNCS(super_root_level, struct btrfs_super_block,
+                        root_level, 8);
+BTRFS_SETGET_STACK_FUNCS(super_chunk_root, struct btrfs_super_block,
+                        chunk_root, 64);
+BTRFS_SETGET_STACK_FUNCS(super_chunk_root_level, struct btrfs_super_block,
+                        chunk_root_level, 8);
+BTRFS_SETGET_STACK_FUNCS(super_log_root, struct btrfs_super_block,
+                        log_root, 64);
+BTRFS_SETGET_STACK_FUNCS(super_log_root_transid, struct btrfs_super_block,
+                        log_root_transid, 64);
+BTRFS_SETGET_STACK_FUNCS(super_log_root_level, struct btrfs_super_block,
+                        log_root_level, 8);
+BTRFS_SETGET_STACK_FUNCS(super_total_bytes, struct btrfs_super_block,
+                        total_bytes, 64);
+BTRFS_SETGET_STACK_FUNCS(super_bytes_used, struct btrfs_super_block,
+                        bytes_used, 64);
+BTRFS_SETGET_STACK_FUNCS(super_sectorsize, struct btrfs_super_block,
+                        sectorsize, 32);
+BTRFS_SETGET_STACK_FUNCS(super_nodesize, struct btrfs_super_block,
+                        nodesize, 32);
+BTRFS_SETGET_STACK_FUNCS(super_leafsize, struct btrfs_super_block,
+                        leafsize, 32);
+BTRFS_SETGET_STACK_FUNCS(super_stripesize, struct btrfs_super_block,
+                        stripesize, 32);
+BTRFS_SETGET_STACK_FUNCS(super_root_dir, struct btrfs_super_block,
+                        root_dir_objectid, 64);
+BTRFS_SETGET_STACK_FUNCS(super_num_devices, struct btrfs_super_block,
+                        num_devices, 64);
+BTRFS_SETGET_STACK_FUNCS(super_compat_flags, struct btrfs_super_block,
+                        compat_flags, 64);
+BTRFS_SETGET_STACK_FUNCS(super_compat_ro_flags, struct btrfs_super_block,
+                        compat_ro_flags, 64);
+BTRFS_SETGET_STACK_FUNCS(super_incompat_flags, struct btrfs_super_block,
+                        incompat_flags, 64);
+BTRFS_SETGET_STACK_FUNCS(super_csum_type, struct btrfs_super_block,
+                        csum_type, 16);
+
+static inline int btrfs_super_csum_size(struct btrfs_super_block *s)
+{
+       int t = btrfs_super_csum_type(s);
+       BUG_ON(t >= ARRAY_SIZE(btrfs_csum_sizes));
+       return btrfs_csum_sizes[t];
+}
+
+static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
+{
+       return offsetof(struct btrfs_leaf, items);
+}
+
+/* struct btrfs_file_extent_item */
+BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
+
+static inline unsigned long
+btrfs_file_extent_inline_start(struct btrfs_file_extent_item *e)
+{
+       unsigned long offset = (unsigned long)e;
+       offset += offsetof(struct btrfs_file_extent_item, disk_bytenr);
+       return offset;
+}
+
+static inline u32 btrfs_file_extent_calc_inline_size(u32 datasize)
+{
+       return offsetof(struct btrfs_file_extent_item, disk_bytenr) + datasize;
+}
+
+BTRFS_SETGET_FUNCS(file_extent_disk_bytenr, struct btrfs_file_extent_item,
+                  disk_bytenr, 64);
+BTRFS_SETGET_FUNCS(file_extent_generation, struct btrfs_file_extent_item,
+                  generation, 64);
+BTRFS_SETGET_FUNCS(file_extent_disk_num_bytes, struct btrfs_file_extent_item,
+                  disk_num_bytes, 64);
+BTRFS_SETGET_FUNCS(file_extent_offset, struct btrfs_file_extent_item,
+                 offset, 64);
+BTRFS_SETGET_FUNCS(file_extent_num_bytes, struct btrfs_file_extent_item,
+                  num_bytes, 64);
+BTRFS_SETGET_FUNCS(file_extent_ram_bytes, struct btrfs_file_extent_item,
+                  ram_bytes, 64);
+BTRFS_SETGET_FUNCS(file_extent_compression, struct btrfs_file_extent_item,
+                  compression, 8);
+BTRFS_SETGET_FUNCS(file_extent_encryption, struct btrfs_file_extent_item,
+                  encryption, 8);
+BTRFS_SETGET_FUNCS(file_extent_other_encoding, struct btrfs_file_extent_item,
+                  other_encoding, 16);
+
+/* this returns the number of file bytes represented by the inline item.
+ * If an item is compressed, this is the uncompressed size
+ */
+static inline u32 btrfs_file_extent_inline_len(struct extent_buffer *eb,
+                                              struct btrfs_file_extent_item *e)
+{
+       return btrfs_file_extent_ram_bytes(eb, e);
+}
+
+/*
+ * this returns the number of bytes used by the item on disk, minus the
+ * size of any extent headers.  If a file is compressed on disk, this is
+ * the compressed size
+ */
+static inline u32 btrfs_file_extent_inline_item_len(struct extent_buffer *eb,
+                                                   struct btrfs_item *e)
+{
+       unsigned long offset;
+       offset = offsetof(struct btrfs_file_extent_item, disk_bytenr);
+       return btrfs_item_size(eb, e) - offset;
+}
+
+static inline struct btrfs_root *btrfs_sb(struct super_block *sb)
+{
+       return sb->s_fs_info;
+}
+
+static inline int btrfs_set_root_name(struct btrfs_root *root,
+                                     const char *name, int len)
+{
+       /* if we already have a name just free it */
+       kfree(root->name);
+
+       root->name = kmalloc(len+1, GFP_KERNEL);
+       if (!root->name)
+               return -ENOMEM;
+
+       memcpy(root->name, name, len);
+       root->name[len] = '\0';
+
+       return 0;
+}
+
+static inline u32 btrfs_level_size(struct btrfs_root *root, int level)
+{
+       if (level == 0)
+               return root->leafsize;
+       return root->nodesize;
+}
+
+/* helper macros to cast into the data area of the leaf. */
+#define btrfs_item_ptr(leaf, slot, type) \
+       ((type *)(btrfs_leaf_data(leaf) + \
+       btrfs_item_offset_nr(leaf, slot)))
+
+#define btrfs_item_ptr_offset(leaf, slot) \
+       ((unsigned long)(btrfs_leaf_data(leaf) + \
+       btrfs_item_offset_nr(leaf, slot)))
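+
+/*
+ * Typical use of btrfs_item_ptr() (see insert_with_overflow() in
+ * dir-item.c later in this patch):
+ *
+ *     ptr = btrfs_item_ptr(leaf, path->slots[0], char);
+ */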
+
+static inline struct dentry *fdentry(struct file *file)
+{
+       return file->f_path.dentry;
+}
+
+/* extent-tree.c */
+int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len);
+int btrfs_lookup_extent_ref(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, u64 bytenr,
+                           u64 num_bytes, u32 *refs);
+int btrfs_update_pinned_extents(struct btrfs_root *root,
+                               u64 bytenr, u64 num, int pin);
+int btrfs_drop_leaf_ref(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root, struct extent_buffer *leaf);
+int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, u64 objectid, u64 bytenr);
+int btrfs_extent_post_op(struct btrfs_trans_handle *trans,
+                        struct btrfs_root *root);
+int btrfs_copy_pinned(struct btrfs_root *root, struct extent_io_tree *copy);
+struct btrfs_block_group_cache *btrfs_lookup_block_group(
+                                                struct btrfs_fs_info *info,
+                                                u64 bytenr);
+u64 btrfs_find_block_group(struct btrfs_root *root,
+                          u64 search_start, u64 search_hint, int owner);
+struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
+                                            struct btrfs_root *root,
+                                            u32 blocksize, u64 parent,
+                                            u64 root_objectid,
+                                            u64 ref_generation,
+                                            int level,
+                                            u64 hint,
+                                            u64 empty_size);
+struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
+                                           struct btrfs_root *root,
+                                           u64 bytenr, u32 blocksize);
+int btrfs_alloc_extent(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root,
+                      u64 num_bytes, u64 parent, u64 min_bytes,
+                      u64 root_objectid, u64 ref_generation,
+                      u64 owner, u64 empty_size, u64 hint_byte,
+                      u64 search_end, struct btrfs_key *ins, u64 data);
+int btrfs_alloc_reserved_extent(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root, u64 parent,
+                               u64 root_objectid, u64 ref_generation,
+                               u64 owner, struct btrfs_key *ins);
+int btrfs_alloc_logged_extent(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root, u64 parent,
+                               u64 root_objectid, u64 ref_generation,
+                               u64 owner, struct btrfs_key *ins);
+int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
+                                 struct btrfs_root *root,
+                                 u64 num_bytes, u64 min_alloc_size,
+                                 u64 empty_size, u64 hint_byte,
+                                 u64 search_end, struct btrfs_key *ins,
+                                 u64 data);
+int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                 struct extent_buffer *orig_buf, struct extent_buffer *buf,
+                 u32 *nr_extents);
+int btrfs_cache_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                   struct extent_buffer *buf, u32 nr_extents);
+int btrfs_update_ref(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root, struct extent_buffer *orig_buf,
+                    struct extent_buffer *buf, int start_slot, int nr);
+int btrfs_free_extent(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root,
+                     u64 bytenr, u64 num_bytes, u64 parent,
+                     u64 root_objectid, u64 ref_generation,
+                     u64 owner_objectid, int pin);
+int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len);
+int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              struct extent_io_tree *unpin);
+int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
+                        struct btrfs_root *root,
+                        u64 bytenr, u64 num_bytes, u64 parent,
+                        u64 root_objectid, u64 ref_generation,
+                        u64 owner_objectid);
+int btrfs_update_extent_ref(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, u64 bytenr,
+                           u64 orig_parent, u64 parent,
+                           u64 root_objectid, u64 ref_generation,
+                           u64 owner_objectid);
+int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
+                                   struct btrfs_root *root);
+int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr);
+int btrfs_free_block_groups(struct btrfs_fs_info *info);
+int btrfs_read_block_groups(struct btrfs_root *root);
+int btrfs_make_block_group(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root, u64 bytes_used,
+                          u64 type, u64 chunk_objectid, u64 chunk_offset,
+                          u64 size);
+int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root, u64 group_start);
+int btrfs_relocate_block_group(struct btrfs_root *root, u64 group_start);
+int btrfs_free_reloc_root(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root);
+int btrfs_drop_dead_reloc_roots(struct btrfs_root *root);
+int btrfs_reloc_tree_cache_ref(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              struct extent_buffer *buf, u64 orig_start);
+int btrfs_add_dead_reloc_root(struct btrfs_root *root);
+int btrfs_cleanup_reloc_trees(struct btrfs_root *root);
+int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len);
+u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags);
+/* ctree.c */
+int btrfs_previous_item(struct btrfs_root *root,
+                       struct btrfs_path *path, u64 min_objectid,
+                       int type);
+int btrfs_merge_path(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root,
+                    struct btrfs_key *node_keys,
+                    u64 *nodes, int lowest_level);
+int btrfs_set_item_key_safe(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, struct btrfs_path *path,
+                           struct btrfs_key *new_key);
+struct extent_buffer *btrfs_root_node(struct btrfs_root *root);
+struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root);
+int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path,
+                       struct btrfs_key *key, int lowest_level,
+                       int cache_only, u64 min_trans);
+int btrfs_search_forward(struct btrfs_root *root, struct btrfs_key *min_key,
+                        struct btrfs_key *max_key,
+                        struct btrfs_path *path, int cache_only,
+                        u64 min_trans);
+int btrfs_cow_block(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, struct extent_buffer *buf,
+                   struct extent_buffer *parent, int parent_slot,
+                   struct extent_buffer **cow_ret, u64 prealloc_dest);
+int btrfs_copy_root(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root,
+                     struct extent_buffer *buf,
+                     struct extent_buffer **cow_ret, u64 new_root_objectid);
+int btrfs_extend_item(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_path *path, u32 data_size);
+int btrfs_truncate_item(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root,
+                       struct btrfs_path *path,
+                       u32 new_size, int from_end);
+int btrfs_split_item(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root,
+                    struct btrfs_path *path,
+                    struct btrfs_key *new_key,
+                    unsigned long split_offset);
+int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *key, struct btrfs_path *p, int
+                     ins_len, int cow);
+int btrfs_realloc_node(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root, struct extent_buffer *parent,
+                      int start_slot, int cache_only, u64 *last_ret,
+                      struct btrfs_key *progress);
+void btrfs_release_path(struct btrfs_root *root, struct btrfs_path *p);
+struct btrfs_path *btrfs_alloc_path(void);
+void btrfs_free_path(struct btrfs_path *p);
+void btrfs_init_path(struct btrfs_path *p);
+int btrfs_del_items(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                  struct btrfs_path *path, int slot, int nr);
+int btrfs_del_leaf(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           struct btrfs_path *path, u64 bytenr);
+static inline int btrfs_del_item(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                struct btrfs_path *path)
+{
+       return btrfs_del_items(trans, root, path, path->slots[0], 1);
+}
+
+int btrfs_insert_item(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *key, void *data, u32 data_size);
+int btrfs_insert_some_items(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           struct btrfs_path *path,
+                           struct btrfs_key *cpu_key, u32 *data_size,
+                           int nr);
+int btrfs_insert_empty_items(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            struct btrfs_path *path,
+                            struct btrfs_key *cpu_key, u32 *data_size, int nr);
+
+static inline int btrfs_insert_empty_item(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path,
+                                         struct btrfs_key *key,
+                                         u32 data_size)
+{
+       return btrfs_insert_empty_items(trans, root, path, key, &data_size, 1);
+}
+
+int btrfs_next_leaf(struct btrfs_root *root, struct btrfs_path *path);
+int btrfs_prev_leaf(struct btrfs_root *root, struct btrfs_path *path);
+int btrfs_leaf_free_space(struct btrfs_root *root, struct extent_buffer *leaf);
+int btrfs_drop_snapshot(struct btrfs_trans_handle *trans, struct btrfs_root
+                       *root);
+int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root,
+                       struct extent_buffer *node,
+                       struct extent_buffer *parent);
+/* root-item.c */
+int btrfs_find_root_ref(struct btrfs_root *tree_root,
+                  struct btrfs_path *path,
+                  u64 root_id, u64 ref_id);
+int btrfs_add_root_ref(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *tree_root,
+                      u64 root_id, u8 type, u64 ref_id,
+                      u64 dirid, u64 sequence,
+                      const char *name, int name_len);
+int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                  struct btrfs_key *key);
+int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *key, struct btrfs_root_item
+                     *item);
+int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *key, struct btrfs_root_item
+                     *item);
+int btrfs_find_last_root(struct btrfs_root *root, u64 objectid, struct
+                        btrfs_root_item *item, struct btrfs_key *key);
+int btrfs_search_root(struct btrfs_root *root, u64 search_start,
+                     u64 *found_objectid);
+int btrfs_find_dead_roots(struct btrfs_root *root, u64 objectid,
+                         struct btrfs_root *latest_root);
+/* dir-item.c */
+int btrfs_insert_dir_item(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, const char *name,
+                         int name_len, u64 dir,
+                         struct btrfs_key *location, u8 type, u64 index);
+struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
+                                            struct btrfs_root *root,
+                                            struct btrfs_path *path, u64 dir,
+                                            const char *name, int name_len,
+                                            int mod);
+struct btrfs_dir_item *
+btrfs_lookup_dir_index_item(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           struct btrfs_path *path, u64 dir,
+                           u64 objectid, const char *name, int name_len,
+                           int mod);
+struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_root *root,
+                             struct btrfs_path *path,
+                             const char *name, int name_len);
+int btrfs_delete_one_dir_name(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct btrfs_path *path,
+                             struct btrfs_dir_item *di);
+int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, const char *name,
+                           u16 name_len, const void *data, u16 data_len,
+                           u64 dir);
+struct btrfs_dir_item *btrfs_lookup_xattr(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path, u64 dir,
+                                         const char *name, u16 name_len,
+                                         int mod);
+
+/* orphan.c */
+int btrfs_insert_orphan_item(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root, u64 offset);
+int btrfs_del_orphan_item(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, u64 offset);
+
+/* inode-map.c */
+int btrfs_find_free_objectid(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *fs_root,
+                            u64 dirid, u64 *objectid);
+int btrfs_find_highest_inode(struct btrfs_root *fs_root, u64 *objectid);
+
+/* inode-item.c */
+int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          const char *name, int name_len,
+                          u64 inode_objectid, u64 ref_objectid, u64 index);
+int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          const char *name, int name_len,
+                          u64 inode_objectid, u64 ref_objectid, u64 *index);
+int btrfs_insert_empty_inode(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            struct btrfs_path *path, u64 objectid);
+int btrfs_lookup_inode(struct btrfs_trans_handle *trans, struct btrfs_root
+                      *root, struct btrfs_path *path,
+                      struct btrfs_key *location, int mod);
+
+/* file-item.c */
+int btrfs_del_csums(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, u64 bytenr, u64 len);
+int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
+                         struct bio *bio, u32 *dst);
+int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            u64 objectid, u64 pos,
+                            u64 disk_offset, u64 disk_num_bytes,
+                            u64 num_bytes, u64 offset, u64 ram_bytes,
+                            u8 compression, u8 encryption, u16 other_encoding);
+int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            struct btrfs_path *path, u64 objectid,
+                            u64 bytenr, int mod);
+int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          struct btrfs_ordered_sum *sums);
+int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
+                      struct bio *bio, u64 file_start, int contig);
+int btrfs_csum_file_bytes(struct btrfs_root *root, struct inode *inode,
+                         u64 start, unsigned long len);
+struct btrfs_csum_item *btrfs_lookup_csum(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path,
+                                         u64 bytenr, int cow);
+int btrfs_csum_truncate(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root, struct btrfs_path *path,
+                       u64 isize);
+int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start,
+                            u64 end, struct list_head *list);
+/* inode.c */
+
+/* RHEL and EL kernels have a patch that renames PG_checked to FsMisc */
+#if defined(ClearPageFsMisc) && !defined(ClearPageChecked)
+#define ClearPageChecked ClearPageFsMisc
+#define SetPageChecked SetPageFsMisc
+#define PageChecked PageFsMisc
+#endif
+
+struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry);
+int btrfs_set_inode_index(struct inode *dir, u64 *index);
+int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root,
+                      struct inode *dir, struct inode *inode,
+                      const char *name, int name_len);
+int btrfs_add_link(struct btrfs_trans_handle *trans,
+                  struct inode *parent_inode, struct inode *inode,
+                  const char *name, int name_len, int add_backref, u64 index);
+int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              struct inode *inode, u64 new_size,
+                              u32 min_type);
+
+int btrfs_start_delalloc_inodes(struct btrfs_root *root);
+int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end);
+int btrfs_writepages(struct address_space *mapping,
+                    struct writeback_control *wbc);
+int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *new_root, struct dentry *dentry,
+                            u64 new_dirid, u64 alloc_hint);
+int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
+                        size_t size, struct bio *bio, unsigned long bio_flags);
+
+unsigned long btrfs_force_ra(struct address_space *mapping,
+                             struct file_ra_state *ra, struct file *file,
+                             pgoff_t offset, pgoff_t last_index);
+int btrfs_check_free_space(struct btrfs_root *root, u64 num_required,
+                          int for_del);
+int btrfs_page_mkwrite(struct vm_area_struct *vma, struct page *page);
+int btrfs_readpage(struct file *file, struct page *page);
+void btrfs_delete_inode(struct inode *inode);
+void btrfs_put_inode(struct inode *inode);
+void btrfs_read_locked_inode(struct inode *inode);
+int btrfs_write_inode(struct inode *inode, int wait);
+void btrfs_dirty_inode(struct inode *inode);
+struct inode *btrfs_alloc_inode(struct super_block *sb);
+void btrfs_destroy_inode(struct inode *inode);
+int btrfs_init_cachep(void);
+void btrfs_destroy_cachep(void);
+long btrfs_ioctl_trans_end(struct file *file);
+struct inode *btrfs_ilookup(struct super_block *s, u64 objectid,
+                           struct btrfs_root *root, int wait);
+struct inode *btrfs_iget_locked(struct super_block *s, u64 objectid,
+                               struct btrfs_root *root);
+struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
+                        struct btrfs_root *root, int *is_new);
+int btrfs_commit_write(struct file *file, struct page *page,
+                      unsigned from, unsigned to);
+struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
+                                   size_t page_offset, u64 start, u64 end,
+                                   int create);
+int btrfs_update_inode(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct inode *inode);
+int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode);
+int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode);
+void btrfs_orphan_cleanup(struct btrfs_root *root);
+int btrfs_cont_expand(struct inode *inode, loff_t size);
+
+/* ioctl.c */
+long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
+
+/* file.c */
+int btrfs_sync_file(struct file *file, struct dentry *dentry, int datasync);
+int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
+                           int skip_pinned);
+int btrfs_check_file(struct btrfs_root *root, struct inode *inode);
+extern struct file_operations btrfs_file_operations;
+int btrfs_drop_extents(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root, struct inode *inode,
+                      u64 start, u64 end, u64 inline_limit, u64 *hint_block);
+int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct inode *inode, u64 start, u64 end);
+int btrfs_release_file(struct inode *inode, struct file *file);
+
+/* tree-defrag.c */
+int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root, int cache_only);
+
+/* sysfs.c */
+int btrfs_init_sysfs(void);
+void btrfs_exit_sysfs(void);
+int btrfs_sysfs_add_super(struct btrfs_fs_info *fs);
+int btrfs_sysfs_add_root(struct btrfs_root *root);
+void btrfs_sysfs_del_root(struct btrfs_root *root);
+void btrfs_sysfs_del_super(struct btrfs_fs_info *root);
+
+/* xattr.c */
+ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
+
+/* super.c */
+u64 btrfs_parse_size(char *str);
+int btrfs_parse_options(struct btrfs_root *root, char *options);
+int btrfs_sync_fs(struct super_block *sb, int wait);
+
+/* acl.c */
+int btrfs_check_acl(struct inode *inode, int mask);
+int btrfs_init_acl(struct inode *inode, struct inode *dir);
+int btrfs_acl_chmod(struct inode *inode);
+
+/* free-space-cache.c */
+int btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
+                        u64 bytenr, u64 size);
+int btrfs_add_free_space_lock(struct btrfs_block_group_cache *block_group,
+                             u64 offset, u64 bytes);
+int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
+                           u64 bytenr, u64 size);
+int btrfs_remove_free_space_lock(struct btrfs_block_group_cache *block_group,
+                                u64 offset, u64 bytes);
+void btrfs_remove_free_space_cache(struct btrfs_block_group_cache
+                                  *block_group);
+struct btrfs_free_space *btrfs_find_free_space(struct btrfs_block_group_cache
+                                              *block_group, u64 offset,
+                                              u64 bytes);
+void btrfs_dump_free_space(struct btrfs_block_group_cache *block_group,
+                          u64 bytes);
+u64 btrfs_block_group_free_space(struct btrfs_block_group_cache *block_group);
+#endif
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
new file mode 100644 (file)
index 0000000..926a0b2
--- /dev/null
@@ -0,0 +1,386 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "hash.h"
+#include "transaction.h"
+
+/*
+ * insert a name into a directory, doing overflow properly if there is a hash
+ * collision.  data_size indicates how big the inserted item should be.  On
+ * success a struct btrfs_dir_item pointer is returned; otherwise an
+ * ERR_PTR is returned.
+ *
+ * The name is not copied into the dir item; you have to do that yourself.
+ */
+static struct btrfs_dir_item *insert_with_overflow(struct btrfs_trans_handle
+                                                  *trans,
+                                                  struct btrfs_root *root,
+                                                  struct btrfs_path *path,
+                                                  struct btrfs_key *cpu_key,
+                                                  u32 data_size,
+                                                  const char *name,
+                                                  int name_len)
+{
+       int ret;
+       char *ptr;
+       struct btrfs_item *item;
+       struct extent_buffer *leaf;
+
+       ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
+       if (ret == -EEXIST) {
+               struct btrfs_dir_item *di;
+               di = btrfs_match_dir_item_name(root, path, name, name_len);
+               if (di)
+                       return ERR_PTR(-EEXIST);
+               ret = btrfs_extend_item(trans, root, path, data_size);
+               WARN_ON(ret > 0);
+       }
+       if (ret < 0)
+               return ERR_PTR(ret);
+       WARN_ON(ret > 0);
+       leaf = path->nodes[0];
+       item = btrfs_item_nr(leaf, path->slots[0]);
+       ptr = btrfs_item_ptr(leaf, path->slots[0], char);
+       BUG_ON(data_size > btrfs_item_size(leaf, item));
+       ptr += btrfs_item_size(leaf, item) - data_size;
+       return (struct btrfs_dir_item *)ptr;
+}
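+
+/*
+ * Layout sketch (illustrative): when two names hash to the same key, the
+ * single leaf item grows and holds the entries back to back, each
+ * struct btrfs_dir_item immediately followed by its name (and xattr data,
+ * if any):
+ *
+ *   [dir_item 0][name 0][dir_item 1][name 1]...
+ *
+ * insert_with_overflow() extends the item by data_size and returns a
+ * pointer to the new tail region for the caller to fill in.
+ */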
+
+/*
+ * xattrs work a lot like directories, this inserts an xattr item
+ * into the tree
+ */
+int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, const char *name,
+                           u16 name_len, const void *data, u16 data_len,
+                           u64 dir)
+{
+       int ret = 0;
+       struct btrfs_path *path;
+       struct btrfs_dir_item *dir_item;
+       unsigned long name_ptr, data_ptr;
+       struct btrfs_key key, location;
+       struct btrfs_disk_key disk_key;
+       struct extent_buffer *leaf;
+       u32 data_size;
+
+       key.objectid = dir;
+       btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
+       key.offset = btrfs_name_hash(name, name_len);
+       if (name_len + data_len + sizeof(struct btrfs_dir_item) >
+           BTRFS_LEAF_DATA_SIZE(root) - sizeof(struct btrfs_item))
+               return -ENOSPC;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       data_size = sizeof(*dir_item) + name_len + data_len;
+       dir_item = insert_with_overflow(trans, root, path, &key, data_size,
+                                       name, name_len);
+       /*
+        * FIXME: at some point we should handle xattrs that are larger than
+        * what we can fit in our leaf.  We set location to NULL because we
+        * aren't pointing at anything else; that will change if we store the
+        * xattr data in a separate inode.
+        */
+       BUG_ON(IS_ERR(dir_item));
+       memset(&location, 0, sizeof(location));
+
+       leaf = path->nodes[0];
+       btrfs_cpu_key_to_disk(&disk_key, &location);
+       btrfs_set_dir_item_key(leaf, dir_item, &disk_key);
+       btrfs_set_dir_type(leaf, dir_item, BTRFS_FT_XATTR);
+       btrfs_set_dir_name_len(leaf, dir_item, name_len);
+       btrfs_set_dir_transid(leaf, dir_item, trans->transid);
+       btrfs_set_dir_data_len(leaf, dir_item, data_len);
+       name_ptr = (unsigned long)(dir_item + 1);
+       data_ptr = (unsigned long)((char *)name_ptr + name_len);
+
+       write_extent_buffer(leaf, name, name_ptr, name_len);
+       write_extent_buffer(leaf, data, data_ptr, data_len);
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+
+       btrfs_free_path(path);
+       return ret;
+}
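+
+/*
+ * Usage sketch (hypothetical caller): storing a small extended attribute
+ * for inode 'ino' inside a running transaction; values that cannot fit in
+ * a leaf are rejected with -ENOSPC by the size check above:
+ *
+ *     ret = btrfs_insert_xattr_item(trans, root, "user.comment", 12,
+ *                                   "hello", 5, ino);
+ */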
+
+/*
+ * insert a directory item in the tree, doing all the magic for
+ * both indexes. 'dir' indicates which objectid to insert it into,
+ * 'location' is the key to stuff into the directory item, 'type' is the
+ * type of the inode we're pointing to, and 'index' is the sequence number
+ * to use for the second index (if one is created).
+ */
+int btrfs_insert_dir_item(struct btrfs_trans_handle *trans, struct btrfs_root
+                         *root, const char *name, int name_len, u64 dir,
+                         struct btrfs_key *location, u8 type, u64 index)
+{
+       int ret = 0;
+       int ret2 = 0;
+       struct btrfs_path *path;
+       struct btrfs_dir_item *dir_item;
+       struct extent_buffer *leaf;
+       unsigned long name_ptr;
+       struct btrfs_key key;
+       struct btrfs_disk_key disk_key;
+       u32 data_size;
+
+       key.objectid = dir;
+       btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
+       key.offset = btrfs_name_hash(name, name_len);
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+       data_size = sizeof(*dir_item) + name_len;
+       dir_item = insert_with_overflow(trans, root, path, &key, data_size,
+                                       name, name_len);
+       if (IS_ERR(dir_item)) {
+               ret = PTR_ERR(dir_item);
+               if (ret == -EEXIST)
+                       goto second_insert;
+               goto out;
+       }
+
+       leaf = path->nodes[0];
+       btrfs_cpu_key_to_disk(&disk_key, location);
+       btrfs_set_dir_item_key(leaf, dir_item, &disk_key);
+       btrfs_set_dir_type(leaf, dir_item, type);
+       btrfs_set_dir_data_len(leaf, dir_item, 0);
+       btrfs_set_dir_name_len(leaf, dir_item, name_len);
+       btrfs_set_dir_transid(leaf, dir_item, trans->transid);
+       name_ptr = (unsigned long)(dir_item + 1);
+
+       write_extent_buffer(leaf, name, name_ptr, name_len);
+       btrfs_mark_buffer_dirty(leaf);
+
+second_insert:
+       /* FIXME, use some real flag for selecting the extra index */
+       if (root == root->fs_info->tree_root) {
+               ret = 0;
+               goto out;
+       }
+       btrfs_release_path(root, path);
+
+       btrfs_set_key_type(&key, BTRFS_DIR_INDEX_KEY);
+       key.offset = index;
+       dir_item = insert_with_overflow(trans, root, path, &key, data_size,
+                                       name, name_len);
+       if (IS_ERR(dir_item)) {
+               ret2 = PTR_ERR(dir_item);
+               goto out;
+       }
+       leaf = path->nodes[0];
+       btrfs_cpu_key_to_disk(&disk_key, location);
+       btrfs_set_dir_item_key(leaf, dir_item, &disk_key);
+       btrfs_set_dir_type(leaf, dir_item, type);
+       btrfs_set_dir_data_len(leaf, dir_item, 0);
+       btrfs_set_dir_name_len(leaf, dir_item, name_len);
+       btrfs_set_dir_transid(leaf, dir_item, trans->transid);
+       name_ptr = (unsigned long)(dir_item + 1);
+       write_extent_buffer(leaf, name, name_ptr, name_len);
+       btrfs_mark_buffer_dirty(leaf);
+out:
+       btrfs_free_path(path);
+       if (ret)
+               return ret;
+       if (ret2)
+               return ret2;
+       return 0;
+}
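+
+/*
+ * Usage sketch (hypothetical caller): linking the name "file" to an inode
+ * creates both the hash-keyed DIR_ITEM and, for everything but the tree
+ * root, the sequence-keyed DIR_INDEX entry in one call:
+ *
+ *     ret = btrfs_insert_dir_item(trans, root, "file", 4, dir_ino,
+ *                                 &inode_key, BTRFS_FT_REG_FILE, index);
+ */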
+
+/*
+ * lookup a directory item based on name.  'dir' is the objectid
+ * we're searching in, and 'mod' tells us if the caller plans on deleting
+ * the item (use mod < 0) or changing the options (use mod > 0)
+ */
+struct btrfs_dir_item *btrfs_lookup_dir_item(struct btrfs_trans_handle *trans,
+                                            struct btrfs_root *root,
+                                            struct btrfs_path *path, u64 dir,
+                                            const char *name, int name_len,
+                                            int mod)
+{
+       int ret;
+       struct btrfs_key key;
+       int ins_len = mod < 0 ? -1 : 0;
+       int cow = mod != 0;
+       struct btrfs_key found_key;
+       struct extent_buffer *leaf;
+
+       key.objectid = dir;
+       btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
+
+       key.offset = btrfs_name_hash(name, name_len);
+
+       ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
+       if (ret < 0)
+               return ERR_PTR(ret);
+       if (ret > 0) {
+               if (path->slots[0] == 0)
+                       return NULL;
+               path->slots[0]--;
+       }
+
+       leaf = path->nodes[0];
+       btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+       if (found_key.objectid != dir ||
+           btrfs_key_type(&found_key) != BTRFS_DIR_ITEM_KEY ||
+           found_key.offset != key.offset)
+               return NULL;
+
+       return btrfs_match_dir_item_name(root, path, name, name_len);
+}
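+
+/*
+ * Usage sketch (hypothetical caller): an unlink path passes mod < 0 so the
+ * search leaves room for the tree to rebalance, then deletes the entry it
+ * found:
+ *
+ *     di = btrfs_lookup_dir_item(trans, root, path, dir_ino,
+ *                                "file", 4, -1);
+ *     if (di && !IS_ERR(di))
+ *             ret = btrfs_delete_one_dir_name(trans, root, path, di);
+ */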
+
+/*
+ * lookup a directory item based on index.  'dir' is the objectid
+ * we're searching in, and 'mod' tells us if the caller plans on deleting
+ * the item (use mod < 0) or changing the options (use mod > 0)
+ *
+ * The name is used to make sure the index really points to the name the
+ * caller was looking for.
+ */
+struct btrfs_dir_item *
+btrfs_lookup_dir_index_item(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           struct btrfs_path *path, u64 dir,
+                           u64 objectid, const char *name, int name_len,
+                           int mod)
+{
+       int ret;
+       struct btrfs_key key;
+       int ins_len = mod < 0 ? -1 : 0;
+       int cow = mod != 0;
+
+       key.objectid = dir;
+       btrfs_set_key_type(&key, BTRFS_DIR_INDEX_KEY);
+       key.offset = objectid;
+
+       ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
+       if (ret < 0)
+               return ERR_PTR(ret);
+       if (ret > 0)
+               return ERR_PTR(-ENOENT);
+       return btrfs_match_dir_item_name(root, path, name, name_len);
+}
+
+struct btrfs_dir_item *btrfs_lookup_xattr(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path, u64 dir,
+                                         const char *name, u16 name_len,
+                                         int mod)
+{
+       int ret;
+       struct btrfs_key key;
+       int ins_len = mod < 0 ? -1 : 0;
+       int cow = mod != 0;
+       struct btrfs_key found_key;
+       struct extent_buffer *leaf;
+
+       key.objectid = dir;
+       btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
+       key.offset = btrfs_name_hash(name, name_len);
+       ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
+       if (ret < 0)
+               return ERR_PTR(ret);
+       if (ret > 0) {
+               if (path->slots[0] == 0)
+                       return NULL;
+               path->slots[0]--;
+       }
+
+       leaf = path->nodes[0];
+       btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+       if (found_key.objectid != dir ||
+           btrfs_key_type(&found_key) != BTRFS_XATTR_ITEM_KEY ||
+           found_key.offset != key.offset)
+               return NULL;
+
+       return btrfs_match_dir_item_name(root, path, name, name_len);
+}
+
+/*
+ * helper function to look at the directory item pointed to by 'path'.
+ * This walks through all the entries in a dir item and finds the one
+ * matching a specific name.
+ */
+struct btrfs_dir_item *btrfs_match_dir_item_name(struct btrfs_root *root,
+                             struct btrfs_path *path,
+                             const char *name, int name_len)
+{
+       struct btrfs_dir_item *dir_item;
+       unsigned long name_ptr;
+       u32 total_len;
+       u32 cur = 0;
+       u32 this_len;
+       struct extent_buffer *leaf;
+
+       leaf = path->nodes[0];
+       dir_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dir_item);
+       total_len = btrfs_item_size_nr(leaf, path->slots[0]);
+       while (cur < total_len) {
+               this_len = sizeof(*dir_item) +
+                       btrfs_dir_name_len(leaf, dir_item) +
+                       btrfs_dir_data_len(leaf, dir_item);
+               name_ptr = (unsigned long)(dir_item + 1);
+
+               if (btrfs_dir_name_len(leaf, dir_item) == name_len &&
+                   memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)
+                       return dir_item;
+
+               cur += this_len;
+               dir_item = (struct btrfs_dir_item *)((char *)dir_item +
+                                                    this_len);
+       }
+       return NULL;
+}
+
+/*
+ * given a pointer into a directory item, delete it.  This
+ * handles items that have more than one entry in them.
+ */
+int btrfs_delete_one_dir_name(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct btrfs_path *path,
+                             struct btrfs_dir_item *di)
+{
+       struct extent_buffer *leaf;
+       u32 sub_item_len;
+       u32 item_len;
+       int ret = 0;
+
+       leaf = path->nodes[0];
+       sub_item_len = sizeof(*di) + btrfs_dir_name_len(leaf, di) +
+               btrfs_dir_data_len(leaf, di);
+       item_len = btrfs_item_size_nr(leaf, path->slots[0]);
+       if (sub_item_len == item_len) {
+               ret = btrfs_del_item(trans, root, path);
+       } else {
+               /* shift the rest of the item down over the deleted entry */
+               unsigned long ptr = (unsigned long)di;
+               unsigned long start;
+
+               start = btrfs_item_ptr_offset(leaf, path->slots[0]);
+               memmove_extent_buffer(leaf, ptr, ptr + sub_item_len,
+                       item_len - (ptr + sub_item_len - start));
+               ret = btrfs_truncate_item(trans, root, path,
+                                         item_len - sub_item_len, 1);
+       }
+       return ret;
+}
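+
+/*
+ * Shrink-case sketch (illustrative): deleting the middle of three packed
+ * entries moves the tail left over it with memmove_extent_buffer() and
+ * then truncates the item by sub_item_len:
+ *
+ *   before: [entry 0][entry 1][entry 2]
+ *   after:  [entry 0][entry 2]
+ */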
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
new file mode 100644 (file)
index 0000000..81a3138
--- /dev/null
@@ -0,0 +1,2343 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/version.h>
+#include <linux/fs.h>
+#include <linux/blkdev.h>
+#include <linux/scatterlist.h>
+#include <linux/swap.h>
+#include <linux/radix-tree.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
+#include "compat.h"
+#include "crc32c.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "btrfs_inode.h"
+#include "volumes.h"
+#include "print-tree.h"
+#include "async-thread.h"
+#include "locking.h"
+#include "ref-cache.h"
+#include "tree-log.h"
+
+static struct extent_io_ops btree_extent_io_ops;
+static void end_workqueue_fn(struct btrfs_work *work);
+
+/*
+ * end_io_wq structs are used to do processing in task context when an IO is
+ * complete.  This is used during reads to verify checksums, and it is used
+ * by writes to insert metadata for new file extents after IO is complete.
+ */
+struct end_io_wq {
+       struct bio *bio;
+       bio_end_io_t *end_io;
+       void *private;
+       struct btrfs_fs_info *info;
+       int error;
+       int metadata;
+       struct list_head list;
+       struct btrfs_work work;
+};
+
+/*
+ * async submit bios are used to offload expensive checksumming
+ * onto the worker threads.  They checksum file and metadata bios
+ * just before they are sent down the IO stack.
+ */
+struct async_submit_bio {
+       struct inode *inode;
+       struct bio *bio;
+       struct list_head list;
+       extent_submit_bio_hook_t *submit_bio_start;
+       extent_submit_bio_hook_t *submit_bio_done;
+       int rw;
+       int mirror_num;
+       unsigned long bio_flags;
+       struct btrfs_work work;
+};
+
+/*
+ * extents on the btree inode are pretty simple: there's one extent
+ * that covers the entire device
+ */
+static struct extent_map *btree_get_extent(struct inode *inode,
+               struct page *page, size_t page_offset, u64 start, u64 len,
+               int create)
+{
+       struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+       struct extent_map *em;
+       int ret;
+
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, start, len);
+       if (em) {
+               em->bdev =
+                       BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
+               spin_unlock(&em_tree->lock);
+               goto out;
+       }
+       spin_unlock(&em_tree->lock);
+
+       em = alloc_extent_map(GFP_NOFS);
+       if (!em) {
+               em = ERR_PTR(-ENOMEM);
+               goto out;
+       }
+       em->start = 0;
+       em->len = (u64)-1;
+       em->block_len = (u64)-1;
+       em->block_start = 0;
+       em->bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
+
+       spin_lock(&em_tree->lock);
+       ret = add_extent_mapping(em_tree, em);
+       if (ret == -EEXIST) {
+               u64 failed_start = em->start;
+               u64 failed_len = em->len;
+
+               free_extent_map(em);
+               em = lookup_extent_mapping(em_tree, start, len);
+               if (em) {
+                       ret = 0;
+               } else {
+                       em = lookup_extent_mapping(em_tree, failed_start,
+                                                  failed_len);
+                       ret = -EIO;
+               }
+       } else if (ret) {
+               free_extent_map(em);
+               em = NULL;
+       }
+       spin_unlock(&em_tree->lock);
+
+       if (ret)
+               em = ERR_PTR(ret);
+out:
+       return em;
+}
+
+u32 btrfs_csum_data(struct btrfs_root *root, char *data, u32 seed, size_t len)
+{
+       return btrfs_crc32c(seed, data, len);
+}
+
+void btrfs_csum_final(u32 crc, char *result)
+{
+       *(__le32 *)result = ~cpu_to_le32(crc);
+}
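+
+/*
+ * Pipeline sketch (illustrative): csum_tree_block() below runs every
+ * mapped chunk of a block through the two helpers above:
+ *
+ *     u32 crc = ~(u32)0;
+ *     crc = btrfs_csum_data(root, kaddr, crc, cur_len);
+ *     btrfs_csum_final(crc, result);
+ *
+ * btrfs_csum_final() inverts the running crc32c and stores it
+ * little-endian, which is the on-disk checksum format.
+ */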
+
+/*
+ * compute the csum for a btree block, and either verify it or write it
+ * into the csum field of the block.
+ */
+static int csum_tree_block(struct btrfs_root *root, struct extent_buffer *buf,
+                          int verify)
+{
+       u16 csum_size =
+               btrfs_super_csum_size(&root->fs_info->super_copy);
+       char *result = NULL;
+       unsigned long len;
+       unsigned long cur_len;
+       unsigned long offset = BTRFS_CSUM_SIZE;
+       char *map_token = NULL;
+       char *kaddr;
+       unsigned long map_start;
+       unsigned long map_len;
+       int err;
+       u32 crc = ~(u32)0;
+       unsigned long inline_result;
+
+       len = buf->len - offset;
+       while (len > 0) {
+               err = map_private_extent_buffer(buf, offset, 32,
+                                       &map_token, &kaddr,
+                                       &map_start, &map_len, KM_USER0);
+               if (err)
+                       return 1;
+               cur_len = min(len, map_len - (offset - map_start));
+               crc = btrfs_csum_data(root, kaddr + offset - map_start,
+                                     crc, cur_len);
+               len -= cur_len;
+               offset += cur_len;
+               unmap_extent_buffer(buf, map_token, KM_USER0);
+       }
+       if (csum_size > sizeof(inline_result)) {
+               result = kzalloc(csum_size, GFP_NOFS);
+               if (!result)
+                       return 1;
+       } else {
+               result = (char *)&inline_result;
+       }
+
+       btrfs_csum_final(crc, result);
+
+       if (verify) {
+               if (memcmp_extent_buffer(buf, result, 0, csum_size)) {
+                       u32 val;
+                       u32 found = 0;
+                       memcpy(&found, result, csum_size);
+
+                       read_extent_buffer(buf, &val, 0, csum_size);
+                       printk(KERN_INFO "btrfs: %s checksum verify failed "
+                              "on %llu wanted %X found %X level %d\n",
+                              root->fs_info->sb->s_id,
+                              (unsigned long long)buf->start, val, found,
+                              btrfs_header_level(buf));
+                       if (result != (char *)&inline_result)
+                               kfree(result);
+                       return 1;
+               }
+       } else {
+               write_extent_buffer(buf, result, 0, csum_size);
+       }
+       if (result != (char *)&inline_result)
+               kfree(result);
+       return 0;
+}
+
+/*
+ * we can't consider a given block up to date unless the transid of the
+ * block matches the transid in the parent node's pointer.  This is how we
+ * detect blocks that either didn't get written at all or got written
+ * in the wrong place.
+ */
+static int verify_parent_transid(struct extent_io_tree *io_tree,
+                                struct extent_buffer *eb, u64 parent_transid)
+{
+       int ret;
+
+       if (!parent_transid || btrfs_header_generation(eb) == parent_transid)
+               return 0;
+
+       lock_extent(io_tree, eb->start, eb->start + eb->len - 1, GFP_NOFS);
+       if (extent_buffer_uptodate(io_tree, eb) &&
+           btrfs_header_generation(eb) == parent_transid) {
+               ret = 0;
+               goto out;
+       }
+       printk(KERN_INFO "parent transid verify failed on %llu wanted %llu "
+              "found %llu\n",
+              (unsigned long long)eb->start,
+              (unsigned long long)parent_transid,
+              (unsigned long long)btrfs_header_generation(eb));
+       ret = 1;
+       clear_extent_buffer_uptodate(io_tree, eb);
+out:
+       unlock_extent(io_tree, eb->start, eb->start + eb->len - 1,
+                     GFP_NOFS);
+       return ret;
+}
+
+/*
+ * helper to read a given tree block, doing retries as required when
+ * the checksums don't match and we have alternate mirrors to try.
+ */
+static int btree_read_extent_buffer_pages(struct btrfs_root *root,
+                                         struct extent_buffer *eb,
+                                         u64 start, u64 parent_transid)
+{
+       struct extent_io_tree *io_tree;
+       int ret;
+       int num_copies = 0;
+       int mirror_num = 0;
+
+       io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
+       while (1) {
+               ret = read_extent_buffer_pages(io_tree, eb, start, 1,
+                                              btree_get_extent, mirror_num);
+               if (!ret &&
+                   !verify_parent_transid(io_tree, eb, parent_transid))
+                       return ret;
+
+               num_copies = btrfs_num_copies(&root->fs_info->mapping_tree,
+                                             eb->start, eb->len);
+               if (num_copies == 1)
+                       return ret;
+
+               mirror_num++;
+               if (mirror_num > num_copies)
+                       return ret;
+       }
+       return -EIO;
+}
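+
+/*
+ * Retry sketch (illustrative): for a tree block with two copies,
+ * mirror_num advances 0 -> 1 -> 2, so a checksum or transid failure on
+ * the default copy triggers one re-read from the alternate mirror before
+ * the last error is returned.
+ */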
+
+/*
+ * checksum a dirty tree block before IO.  This has extra checks to make sure
+ * we only fill in the checksum field in the first page of a multi-page block
+ */
+static int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
+{
+       struct extent_io_tree *tree;
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 found_start;
+       int found_level;
+       unsigned long len;
+       struct extent_buffer *eb;
+       int ret;
+
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+
+       if (page->private == EXTENT_PAGE_PRIVATE)
+               goto out;
+       if (!page->private)
+               goto out;
+       len = page->private >> 2;
+       WARN_ON(len == 0);
+
+       eb = alloc_extent_buffer(tree, start, len, page, GFP_NOFS);
+       ret = btree_read_extent_buffer_pages(root, eb, start + PAGE_CACHE_SIZE,
+                                            btrfs_header_generation(eb));
+       BUG_ON(ret);
+       found_start = btrfs_header_bytenr(eb);
+       if (found_start != start) {
+               WARN_ON(1);
+               goto err;
+       }
+       if (eb->first_page != page) {
+               WARN_ON(1);
+               goto err;
+       }
+       if (!PageUptodate(page)) {
+               WARN_ON(1);
+               goto err;
+       }
+       found_level = btrfs_header_level(eb);
+
+       csum_tree_block(root, eb, 0);
+err:
+       free_extent_buffer(eb);
+out:
+       return 0;
+}
+
+static int check_tree_block_fsid(struct btrfs_root *root,
+                                struct extent_buffer *eb)
+{
+       struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+       u8 fsid[BTRFS_UUID_SIZE];
+       int ret = 1;
+
+       read_extent_buffer(eb, fsid, (unsigned long)btrfs_header_fsid(eb),
+                          BTRFS_FSID_SIZE);
+       while (fs_devices) {
+               if (!memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE)) {
+                       ret = 0;
+                       break;
+               }
+               fs_devices = fs_devices->seed;
+       }
+       return ret;
+}
+
+static int btree_readpage_end_io_hook(struct page *page, u64 start, u64 end,
+                              struct extent_state *state)
+{
+       struct extent_io_tree *tree;
+       u64 found_start;
+       int found_level;
+       unsigned long len;
+       struct extent_buffer *eb;
+       struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+       int ret = 0;
+
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       if (page->private == EXTENT_PAGE_PRIVATE)
+               goto out;
+       if (!page->private)
+               goto out;
+
+       len = page->private >> 2;
+       WARN_ON(len == 0);
+
+       eb = alloc_extent_buffer(tree, start, len, page, GFP_NOFS);
+
+       found_start = btrfs_header_bytenr(eb);
+       if (found_start != start) {
+               printk(KERN_INFO "btrfs bad tree block start %llu %llu\n",
+                      (unsigned long long)found_start,
+                      (unsigned long long)eb->start);
+               ret = -EIO;
+               goto err;
+       }
+       if (eb->first_page != page) {
+               printk(KERN_INFO "btrfs bad first page %lu %lu\n",
+                      eb->first_page->index, page->index);
+               WARN_ON(1);
+               ret = -EIO;
+               goto err;
+       }
+       if (check_tree_block_fsid(root, eb)) {
+               printk(KERN_INFO "btrfs bad fsid on block %llu\n",
+                      (unsigned long long)eb->start);
+               ret = -EIO;
+               goto err;
+       }
+       found_level = btrfs_header_level(eb);
+
+       ret = csum_tree_block(root, eb, 1);
+       if (ret)
+               ret = -EIO;
+
+       end = min_t(u64, eb->len, PAGE_CACHE_SIZE);
+       end = eb->start + end - 1;
+err:
+       free_extent_buffer(eb);
+out:
+       return ret;
+}
+
+static void end_workqueue_bio(struct bio *bio, int err)
+{
+       struct end_io_wq *end_io_wq = bio->bi_private;
+       struct btrfs_fs_info *fs_info;
+
+       fs_info = end_io_wq->info;
+       end_io_wq->error = err;
+       end_io_wq->work.func = end_workqueue_fn;
+       end_io_wq->work.flags = 0;
+
+       if (bio->bi_rw & (1 << BIO_RW)) {
+               if (end_io_wq->metadata)
+                       btrfs_queue_worker(&fs_info->endio_meta_write_workers,
+                                          &end_io_wq->work);
+               else
+                       btrfs_queue_worker(&fs_info->endio_write_workers,
+                                          &end_io_wq->work);
+       } else {
+               if (end_io_wq->metadata)
+                       btrfs_queue_worker(&fs_info->endio_meta_workers,
+                                          &end_io_wq->work);
+               else
+                       btrfs_queue_worker(&fs_info->endio_workers,
+                                          &end_io_wq->work);
+       }
+}
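+
+/*
+ * Routing sketch (illustrative): end_workqueue_bio() fans completions out
+ * to one of four worker pools:
+ *
+ *   write + metadata -> endio_meta_write_workers
+ *   write + data     -> endio_write_workers
+ *   read  + metadata -> endio_meta_workers
+ *   read  + data     -> endio_workers
+ */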
+
+int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
+                       int metadata)
+{
+       struct end_io_wq *end_io_wq;
+       end_io_wq = kmalloc(sizeof(*end_io_wq), GFP_NOFS);
+       if (!end_io_wq)
+               return -ENOMEM;
+
+       end_io_wq->private = bio->bi_private;
+       end_io_wq->end_io = bio->bi_end_io;
+       end_io_wq->info = info;
+       end_io_wq->error = 0;
+       end_io_wq->bio = bio;
+       end_io_wq->metadata = metadata;
+
+       bio->bi_private = end_io_wq;
+       bio->bi_end_io = end_workqueue_bio;
+       return 0;
+}
+
+unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info)
+{
+       unsigned long limit = min_t(unsigned long,
+                                   info->workers.max_workers,
+                                   info->fs_devices->open_devices);
+       return 256 * limit;
+}
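+
+/*
+ * Worked example (illustrative): with 8 worker threads allowed and 4 open
+ * devices, the async limit is 256 * min(8, 4) = 1024 in-flight bios
+ * before submitters are considered congested.
+ */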
+
+int btrfs_congested_async(struct btrfs_fs_info *info, int iodone)
+{
+       return atomic_read(&info->nr_async_bios) >
+               btrfs_async_submit_limit(info);
+}
+
+static void run_one_async_start(struct btrfs_work *work)
+{
+       struct btrfs_fs_info *fs_info;
+       struct async_submit_bio *async;
+
+       async = container_of(work, struct async_submit_bio, work);
+       fs_info = BTRFS_I(async->inode)->root->fs_info;
+       async->submit_bio_start(async->inode, async->rw, async->bio,
+                              async->mirror_num, async->bio_flags);
+}
+
+static void run_one_async_done(struct btrfs_work *work)
+{
+       struct btrfs_fs_info *fs_info;
+       struct async_submit_bio *async;
+       int limit;
+
+       async = container_of(work, struct async_submit_bio, work);
+       fs_info = BTRFS_I(async->inode)->root->fs_info;
+
+       limit = btrfs_async_submit_limit(fs_info);
+       limit = limit * 2 / 3;
+
+       atomic_dec(&fs_info->nr_async_submits);
+
+       if (atomic_read(&fs_info->nr_async_submits) < limit &&
+           waitqueue_active(&fs_info->async_submit_wait))
+               wake_up(&fs_info->async_submit_wait);
+
+       async->submit_bio_done(async->inode, async->rw, async->bio,
+                              async->mirror_num, async->bio_flags);
+}
+
+static void run_one_async_free(struct btrfs_work *work)
+{
+       struct async_submit_bio *async;
+
+       async = container_of(work, struct async_submit_bio, work);
+       kfree(async);
+}
+
+int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct inode *inode,
+                       int rw, struct bio *bio, int mirror_num,
+                       unsigned long bio_flags,
+                       extent_submit_bio_hook_t *submit_bio_start,
+                       extent_submit_bio_hook_t *submit_bio_done)
+{
+       struct async_submit_bio *async;
+
+       async = kmalloc(sizeof(*async), GFP_NOFS);
+       if (!async)
+               return -ENOMEM;
+
+       async->inode = inode;
+       async->rw = rw;
+       async->bio = bio;
+       async->mirror_num = mirror_num;
+       async->submit_bio_start = submit_bio_start;
+       async->submit_bio_done = submit_bio_done;
+
+       async->work.func = run_one_async_start;
+       async->work.ordered_func = run_one_async_done;
+       async->work.ordered_free = run_one_async_free;
+
+       async->work.flags = 0;
+       async->bio_flags = bio_flags;
+
+       atomic_inc(&fs_info->nr_async_submits);
+       btrfs_queue_worker(&fs_info->workers, &async->work);
+#if 0
+       int limit = btrfs_async_submit_limit(fs_info);
+       if (atomic_read(&fs_info->nr_async_submits) > limit) {
+               wait_event_timeout(fs_info->async_submit_wait,
+                          (atomic_read(&fs_info->nr_async_submits) < limit),
+                          HZ/10);
+
+               wait_event_timeout(fs_info->async_submit_wait,
+                          (atomic_read(&fs_info->nr_async_bios) < limit),
+                          HZ/10);
+       }
+#endif
+       while (atomic_read(&fs_info->async_submit_draining) &&
+             atomic_read(&fs_info->nr_async_submits)) {
+               wait_event(fs_info->async_submit_wait,
+                          (atomic_read(&fs_info->nr_async_submits) == 0));
+       }
+
+       return 0;
+}
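+
+/*
+ * Ordering sketch (illustrative): the callbacks wired up above run per
+ * bio as the worker pool drains:
+ *
+ *   run_one_async_start() - checksum in the worker thread
+ *   run_one_async_done()  - call submit_bio_done() (which maps the bio)
+ *                           and wake any throttled submitters
+ *   run_one_async_free()  - kfree() the async_submit_bio
+ */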
+
+static int btree_csum_one_bio(struct bio *bio)
+{
+       struct bio_vec *bvec = bio->bi_io_vec;
+       int bio_index = 0;
+       struct btrfs_root *root;
+
+       WARN_ON(bio->bi_vcnt <= 0);
+       while (bio_index < bio->bi_vcnt) {
+               root = BTRFS_I(bvec->bv_page->mapping->host)->root;
+               csum_dirty_buffer(root, bvec->bv_page);
+               bio_index++;
+               bvec++;
+       }
+       return 0;
+}
+
+static int __btree_submit_bio_start(struct inode *inode, int rw,
+                                   struct bio *bio, int mirror_num,
+                                   unsigned long bio_flags)
+{
+       /*
+        * when we're called for a write, we're already in the async
+        * submission context.  Just checksum here; the bio is mapped
+        * later by __btree_submit_bio_done
+        */
+       btree_csum_one_bio(bio);
+       return 0;
+}
+
+static int __btree_submit_bio_done(struct inode *inode, int rw, struct bio *bio,
+                                int mirror_num, unsigned long bio_flags)
+{
+       /*
+        * when we're called for a write, we're already in the async
+        * submission context.  Just jump into btrfs_map_bio
+        */
+       return btrfs_map_bio(BTRFS_I(inode)->root, rw, bio, mirror_num, 1);
+}
+
+static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
+                                int mirror_num, unsigned long bio_flags)
+{
+       int ret;
+
+       ret = btrfs_bio_wq_end_io(BTRFS_I(inode)->root->fs_info,
+                                         bio, 1);
+       BUG_ON(ret);
+
+       if (!(rw & (1 << BIO_RW))) {
+               /*
+                * called for a read, do the setup so that checksum validation
+                * can happen in the async kernel threads
+                */
+               return btrfs_map_bio(BTRFS_I(inode)->root, rw, bio,
+                                    mirror_num, 0);
+       }
+       /*
+        * kthread helpers are used to submit writes so that checksumming
+        * can happen in parallel across all CPUs
+        */
+       return btrfs_wq_submit_bio(BTRFS_I(inode)->root->fs_info,
+                                  inode, rw, bio, mirror_num, 0,
+                                  __btree_submit_bio_start,
+                                  __btree_submit_bio_done);
+}
+
+static int btree_writepage(struct page *page, struct writeback_control *wbc)
+{
+       struct extent_io_tree *tree;
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+
+       if (current->flags & PF_MEMALLOC) {
+               redirty_page_for_writepage(wbc, page);
+               unlock_page(page);
+               return 0;
+       }
+       return extent_write_full_page(tree, page, btree_get_extent, wbc);
+}
+
+static int btree_writepages(struct address_space *mapping,
+                           struct writeback_control *wbc)
+{
+       struct extent_io_tree *tree;
+       tree = &BTRFS_I(mapping->host)->io_tree;
+       if (wbc->sync_mode == WB_SYNC_NONE) {
+               u64 num_dirty;
+               u64 start = 0;
+               unsigned long thresh = 32 * 1024 * 1024;
+
+               if (wbc->for_kupdate)
+                       return 0;
+
+               num_dirty = count_range_bits(tree, &start, (u64)-1,
+                                            thresh, EXTENT_DIRTY);
+               if (num_dirty < thresh)
+                       return 0;
+       }
+       return extent_writepages(tree, mapping, btree_get_extent, wbc);
+}
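+
+/*
+ * Threshold note (illustrative): background (WB_SYNC_NONE) writeback
+ * skips the btree inode until at least 32MB of tree blocks are dirty,
+ * so metadata goes out in large batches instead of a trickle.
+ */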
+
+static int btree_readpage(struct file *file, struct page *page)
+{
+       struct extent_io_tree *tree;
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       return extent_read_full_page(tree, page, btree_get_extent);
+}
+
+static int btree_releasepage(struct page *page, gfp_t gfp_flags)
+{
+       struct extent_io_tree *tree;
+       struct extent_map_tree *map;
+       int ret;
+
+       if (PageWriteback(page) || PageDirty(page))
+               return 0;
+
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       map = &BTRFS_I(page->mapping->host)->extent_tree;
+
+       ret = try_release_extent_state(map, tree, page, gfp_flags);
+       if (!ret)
+               return 0;
+
+       ret = try_release_extent_buffer(tree, page);
+       if (ret == 1) {
+               ClearPagePrivate(page);
+               set_page_private(page, 0);
+               page_cache_release(page);
+       }
+
+       return ret;
+}
+
+static void btree_invalidatepage(struct page *page, unsigned long offset)
+{
+       struct extent_io_tree *tree;
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       extent_invalidatepage(tree, page, offset);
+       btree_releasepage(page, GFP_NOFS);
+       if (PagePrivate(page)) {
+               printk(KERN_WARNING "btrfs warning page private not zero "
+                      "on page %llu\n", (unsigned long long)page_offset(page));
+               ClearPagePrivate(page);
+               set_page_private(page, 0);
+               page_cache_release(page);
+       }
+}
+
+#if 0
+static int btree_writepage(struct page *page, struct writeback_control *wbc)
+{
+       struct buffer_head *bh;
+       struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+       struct buffer_head *head;
+       if (!page_has_buffers(page)) {
+               create_empty_buffers(page, root->fs_info->sb->s_blocksize,
+                                       (1 << BH_Dirty)|(1 << BH_Uptodate));
+       }
+       head = page_buffers(page);
+       bh = head;
+       do {
+               if (buffer_dirty(bh))
+                       csum_tree_block(root, bh, 0);
+               bh = bh->b_this_page;
+       } while (bh != head);
+       return block_write_full_page(page, btree_get_block, wbc);
+}
+#endif
+
+static struct address_space_operations btree_aops = {
+       .readpage       = btree_readpage,
+       .writepage      = btree_writepage,
+       .writepages     = btree_writepages,
+       .releasepage    = btree_releasepage,
+       .invalidatepage = btree_invalidatepage,
+       .sync_page      = block_sync_page,
+};
+
+int readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
+                        u64 parent_transid)
+{
+       struct extent_buffer *buf = NULL;
+       struct inode *btree_inode = root->fs_info->btree_inode;
+       int ret = 0;
+
+       buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
+       if (!buf)
+               return 0;
+       read_extent_buffer_pages(&BTRFS_I(btree_inode)->io_tree,
+                                buf, 0, 0, btree_get_extent, 0);
+       free_extent_buffer(buf);
+       return ret;
+}
+
+struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
+                                           u64 bytenr, u32 blocksize)
+{
+       struct inode *btree_inode = root->fs_info->btree_inode;
+       struct extent_buffer *eb;
+       eb = find_extent_buffer(&BTRFS_I(btree_inode)->io_tree,
+                               bytenr, blocksize, GFP_NOFS);
+       return eb;
+}
+
+struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
+                                                u64 bytenr, u32 blocksize)
+{
+       struct inode *btree_inode = root->fs_info->btree_inode;
+       struct extent_buffer *eb;
+
+       eb = alloc_extent_buffer(&BTRFS_I(btree_inode)->io_tree,
+                                bytenr, blocksize, NULL, GFP_NOFS);
+       return eb;
+}
+
+int btrfs_write_tree_block(struct extent_buffer *buf)
+{
+       return btrfs_fdatawrite_range(buf->first_page->mapping, buf->start,
+                                     buf->start + buf->len - 1, WB_SYNC_ALL);
+}
+
+int btrfs_wait_tree_block_writeback(struct extent_buffer *buf)
+{
+       return btrfs_wait_on_page_writeback_range(buf->first_page->mapping,
+                                 buf->start, buf->start + buf->len - 1);
+}
+
+struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
+                                     u32 blocksize, u64 parent_transid)
+{
+       struct extent_buffer *buf = NULL;
+       struct inode *btree_inode = root->fs_info->btree_inode;
+       struct extent_io_tree *io_tree;
+       int ret;
+
+       io_tree = &BTRFS_I(btree_inode)->io_tree;
+
+       buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
+       if (!buf)
+               return NULL;
+
+       ret = btree_read_extent_buffer_pages(root, buf, 0, parent_transid);
+
+       if (ret == 0)
+               buf->flags |= EXTENT_UPTODATE;
+       else
+               WARN_ON(1);
+       return buf;
+}
+
+int clean_tree_block(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                    struct extent_buffer *buf)
+{
+       struct inode *btree_inode = root->fs_info->btree_inode;
+       if (btrfs_header_generation(buf) ==
+           root->fs_info->running_transaction->transid) {
+               WARN_ON(!btrfs_tree_locked(buf));
+               clear_extent_buffer_dirty(&BTRFS_I(btree_inode)->io_tree,
+                                         buf);
+       }
+       return 0;
+}
+
+static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
+                       u32 stripesize, struct btrfs_root *root,
+                       struct btrfs_fs_info *fs_info,
+                       u64 objectid)
+{
+       root->node = NULL;
+       root->commit_root = NULL;
+       root->ref_tree = NULL;
+       root->sectorsize = sectorsize;
+       root->nodesize = nodesize;
+       root->leafsize = leafsize;
+       root->stripesize = stripesize;
+       root->ref_cows = 0;
+       root->track_dirty = 0;
+
+       root->fs_info = fs_info;
+       root->objectid = objectid;
+       root->last_trans = 0;
+       root->highest_inode = 0;
+       root->last_inode_alloc = 0;
+       root->name = NULL;
+       root->in_sysfs = 0;
+
+       INIT_LIST_HEAD(&root->dirty_list);
+       INIT_LIST_HEAD(&root->orphan_list);
+       INIT_LIST_HEAD(&root->dead_list);
+       spin_lock_init(&root->node_lock);
+       spin_lock_init(&root->list_lock);
+       mutex_init(&root->objectid_mutex);
+       mutex_init(&root->log_mutex);
+       extent_io_tree_init(&root->dirty_log_pages,
+                            fs_info->btree_inode->i_mapping, GFP_NOFS);
+
+       btrfs_leaf_ref_tree_init(&root->ref_tree_struct);
+       root->ref_tree = &root->ref_tree_struct;
+
+       memset(&root->root_key, 0, sizeof(root->root_key));
+       memset(&root->root_item, 0, sizeof(root->root_item));
+       memset(&root->defrag_progress, 0, sizeof(root->defrag_progress));
+       memset(&root->root_kobj, 0, sizeof(root->root_kobj));
+       root->defrag_trans_start = fs_info->generation;
+       init_completion(&root->kobj_unregister);
+       root->defrag_running = 0;
+       root->defrag_level = 0;
+       root->root_key.objectid = objectid;
+       root->anon_super.s_root = NULL;
+       root->anon_super.s_dev = 0;
+       INIT_LIST_HEAD(&root->anon_super.s_list);
+       INIT_LIST_HEAD(&root->anon_super.s_instances);
+       init_rwsem(&root->anon_super.s_umount);
+
+       return 0;
+}
+
+static int find_and_setup_root(struct btrfs_root *tree_root,
+                              struct btrfs_fs_info *fs_info,
+                              u64 objectid,
+                              struct btrfs_root *root)
+{
+       int ret;
+       u32 blocksize;
+       u64 generation;
+
+       __setup_root(tree_root->nodesize, tree_root->leafsize,
+                    tree_root->sectorsize, tree_root->stripesize,
+                    root, fs_info, objectid);
+       ret = btrfs_find_last_root(tree_root, objectid,
+                                  &root->root_item, &root->root_key);
+       BUG_ON(ret);
+
+       generation = btrfs_root_generation(&root->root_item);
+       blocksize = btrfs_level_size(root, btrfs_root_level(&root->root_item));
+       root->node = read_tree_block(root, btrfs_root_bytenr(&root->root_item),
+                                    blocksize, generation);
+       BUG_ON(!root->node);
+       return 0;
+}
+
+int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
+                            struct btrfs_fs_info *fs_info)
+{
+       struct extent_buffer *eb;
+       struct btrfs_root *log_root_tree = fs_info->log_root_tree;
+       u64 start = 0;
+       u64 end = 0;
+       int ret;
+
+       if (!log_root_tree)
+               return 0;
+
+       while (1) {
+               ret = find_first_extent_bit(&log_root_tree->dirty_log_pages,
+                                   0, &start, &end, EXTENT_DIRTY);
+               if (ret)
+                       break;
+
+               clear_extent_dirty(&log_root_tree->dirty_log_pages,
+                                  start, end, GFP_NOFS);
+       }
+       eb = fs_info->log_root_tree->node;
+
+       WARN_ON(btrfs_header_level(eb) != 0);
+       WARN_ON(btrfs_header_nritems(eb) != 0);
+
+       ret = btrfs_free_reserved_extent(fs_info->tree_root,
+                               eb->start, eb->len);
+       BUG_ON(ret);
+
+       free_extent_buffer(eb);
+       kfree(fs_info->log_root_tree);
+       fs_info->log_root_tree = NULL;
+       return 0;
+}
+
+int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
+                            struct btrfs_fs_info *fs_info)
+{
+       struct btrfs_root *root;
+       struct btrfs_root *tree_root = fs_info->tree_root;
+
+       root = kzalloc(sizeof(*root), GFP_NOFS);
+       if (!root)
+               return -ENOMEM;
+
+       __setup_root(tree_root->nodesize, tree_root->leafsize,
+                    tree_root->sectorsize, tree_root->stripesize,
+                    root, fs_info, BTRFS_TREE_LOG_OBJECTID);
+
+       root->root_key.objectid = BTRFS_TREE_LOG_OBJECTID;
+       root->root_key.type = BTRFS_ROOT_ITEM_KEY;
+       root->root_key.offset = BTRFS_TREE_LOG_OBJECTID;
+       root->ref_cows = 0;
+
+       root->node = btrfs_alloc_free_block(trans, root, root->leafsize,
+                                           0, BTRFS_TREE_LOG_OBJECTID,
+                                           trans->transid, 0, 0, 0);
+
+       btrfs_set_header_nritems(root->node, 0);
+       btrfs_set_header_level(root->node, 0);
+       btrfs_set_header_bytenr(root->node, root->node->start);
+       btrfs_set_header_generation(root->node, trans->transid);
+       btrfs_set_header_owner(root->node, BTRFS_TREE_LOG_OBJECTID);
+
+       write_extent_buffer(root->node, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(root->node),
+                           BTRFS_FSID_SIZE);
+       btrfs_mark_buffer_dirty(root->node);
+       btrfs_tree_unlock(root->node);
+       fs_info->log_root_tree = root;
+       return 0;
+}
+
+struct btrfs_root *btrfs_read_fs_root_no_radix(struct btrfs_root *tree_root,
+                                              struct btrfs_key *location)
+{
+       struct btrfs_root *root;
+       struct btrfs_fs_info *fs_info = tree_root->fs_info;
+       struct btrfs_path *path;
+       struct extent_buffer *l;
+       u64 highest_inode;
+       u64 generation;
+       u32 blocksize;
+       int ret = 0;
+
+       root = kzalloc(sizeof(*root), GFP_NOFS);
+       if (!root)
+               return ERR_PTR(-ENOMEM);
+       if (location->offset == (u64)-1) {
+               ret = find_and_setup_root(tree_root, fs_info,
+                                         location->objectid, root);
+               if (ret) {
+                       kfree(root);
+                       return ERR_PTR(ret);
+               }
+               goto insert;
+       }
+
+       __setup_root(tree_root->nodesize, tree_root->leafsize,
+                    tree_root->sectorsize, tree_root->stripesize,
+                    root, fs_info, location->objectid);
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_search_slot(NULL, tree_root, location, path, 0, 0);
+       if (ret != 0) {
+               if (ret > 0)
+                       ret = -ENOENT;
+               goto out;
+       }
+       l = path->nodes[0];
+       read_extent_buffer(l, &root->root_item,
+              btrfs_item_ptr_offset(l, path->slots[0]),
+              sizeof(root->root_item));
+       memcpy(&root->root_key, location, sizeof(*location));
+       ret = 0;
+out:
+       btrfs_release_path(root, path);
+       btrfs_free_path(path);
+       if (ret) {
+               kfree(root);
+               return ERR_PTR(ret);
+       }
+       generation = btrfs_root_generation(&root->root_item);
+       blocksize = btrfs_level_size(root, btrfs_root_level(&root->root_item));
+       root->node = read_tree_block(root, btrfs_root_bytenr(&root->root_item),
+                                    blocksize, generation);
+       BUG_ON(!root->node);
+insert:
+       if (location->objectid != BTRFS_TREE_LOG_OBJECTID) {
+               root->ref_cows = 1;
+               ret = btrfs_find_highest_inode(root, &highest_inode);
+               if (ret == 0) {
+                       root->highest_inode = highest_inode;
+                       root->last_inode_alloc = highest_inode;
+               }
+       }
+       return root;
+}
+
+struct btrfs_root *btrfs_lookup_fs_root(struct btrfs_fs_info *fs_info,
+                                       u64 root_objectid)
+{
+       struct btrfs_root *root;
+
+       if (root_objectid == BTRFS_ROOT_TREE_OBJECTID)
+               return fs_info->tree_root;
+       if (root_objectid == BTRFS_EXTENT_TREE_OBJECTID)
+               return fs_info->extent_root;
+
+       root = radix_tree_lookup(&fs_info->fs_roots_radix,
+                                (unsigned long)root_objectid);
+       return root;
+}
+
+struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info,
+                                             struct btrfs_key *location)
+{
+       struct btrfs_root *root;
+       int ret;
+
+       if (location->objectid == BTRFS_ROOT_TREE_OBJECTID)
+               return fs_info->tree_root;
+       if (location->objectid == BTRFS_EXTENT_TREE_OBJECTID)
+               return fs_info->extent_root;
+       if (location->objectid == BTRFS_CHUNK_TREE_OBJECTID)
+               return fs_info->chunk_root;
+       if (location->objectid == BTRFS_DEV_TREE_OBJECTID)
+               return fs_info->dev_root;
+       if (location->objectid == BTRFS_CSUM_TREE_OBJECTID)
+               return fs_info->csum_root;
+
+       root = radix_tree_lookup(&fs_info->fs_roots_radix,
+                                (unsigned long)location->objectid);
+       if (root)
+               return root;
+
+       root = btrfs_read_fs_root_no_radix(fs_info->tree_root, location);
+       if (IS_ERR(root))
+               return root;
+
+       set_anon_super(&root->anon_super, NULL);
+
+       ret = radix_tree_insert(&fs_info->fs_roots_radix,
+                               (unsigned long)root->root_key.objectid,
+                               root);
+       if (ret) {
+               free_extent_buffer(root->node);
+               kfree(root);
+               return ERR_PTR(ret);
+       }
+       if (!(fs_info->sb->s_flags & MS_RDONLY)) {
+               ret = btrfs_find_dead_roots(fs_info->tree_root,
+                                           root->root_key.objectid, root);
+               BUG_ON(ret);
+               btrfs_orphan_cleanup(root);
+       }
+       return root;
+}
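+
+/*
+ * Usage sketch (hypothetical caller): resolving a subvolume root by tree
+ * id; well-known roots short-circuit, anything else is read from disk
+ * once and then cached in fs_roots_radix:
+ *
+ *     key.objectid = subvol_id;
+ *     key.type = BTRFS_ROOT_ITEM_KEY;
+ *     key.offset = (u64)-1;
+ *     root = btrfs_read_fs_root_no_name(fs_info, &key);
+ */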
+
+struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
+                                     struct btrfs_key *location,
+                                     const char *name, int namelen)
+{
+       struct btrfs_root *root;
+       int ret;
+
+       root = btrfs_read_fs_root_no_name(fs_info, location);
+       if (!root)
+               return NULL;
+
+       if (root->in_sysfs)
+               return root;
+
+       ret = btrfs_set_root_name(root, name, namelen);
+       if (ret) {
+               free_extent_buffer(root->node);
+               kfree(root);
+               return ERR_PTR(ret);
+       }
+#if 0
+       ret = btrfs_sysfs_add_root(root);
+       if (ret) {
+               free_extent_buffer(root->node);
+               kfree(root->name);
+               kfree(root);
+               return ERR_PTR(ret);
+       }
+#endif
+       root->in_sysfs = 1;
+       return root;
+}
+
+static int btrfs_congested_fn(void *congested_data, int bdi_bits)
+{
+       struct btrfs_fs_info *info = (struct btrfs_fs_info *)congested_data;
+       int ret = 0;
+       struct list_head *cur;
+       struct btrfs_device *device;
+       struct backing_dev_info *bdi;
+#if 0
+       if ((bdi_bits & (1 << BDI_write_congested)) &&
+           btrfs_congested_async(info, 0))
+               return 1;
+#endif
+       list_for_each(cur, &info->fs_devices->devices) {
+               device = list_entry(cur, struct btrfs_device, dev_list);
+               if (!device->bdev)
+                       continue;
+               bdi = blk_get_backing_dev_info(device->bdev);
+               if (bdi && bdi_congested(bdi, bdi_bits)) {
+                       ret = 1;
+                       break;
+               }
+       }
+       return ret;
+}
+
+/*
+ * this unplugs every device on the box, and it is only used when page
+ * is null
+ */
+static void __unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
+{
+       struct list_head *cur;
+       struct btrfs_device *device;
+       struct btrfs_fs_info *info;
+
+       info = (struct btrfs_fs_info *)bdi->unplug_io_data;
+       list_for_each(cur, &info->fs_devices->devices) {
+               device = list_entry(cur, struct btrfs_device, dev_list);
+               if (!device->bdev)
+                       continue;
+
+               bdi = blk_get_backing_dev_info(device->bdev);
+               if (bdi->unplug_io_fn)
+                       bdi->unplug_io_fn(bdi, page);
+       }
+}
+
+static void btrfs_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
+{
+       struct inode *inode;
+       struct extent_map_tree *em_tree;
+       struct extent_map *em;
+       struct address_space *mapping;
+       u64 offset;
+
+       /*
+        * the generic O_DIRECT read code passes a NULL page; for now the
+        * "1 ||" keeps every caller on the simple unplug-everything path
+        */
+       if (1 || !page) {
+               __unplug_io_fn(bdi, page);
+               return;
+       }
+
+       /*
+        * page->mapping may change at any time.  Get a consistent copy
+        * and use that for everything below
+        */
+       smp_mb();
+       mapping = page->mapping;
+       if (!mapping)
+               return;
+
+       inode = mapping->host;
+
+       /*
+        * don't do the expensive searching for a small number of
+        * devices
+        */
+       if (BTRFS_I(inode)->root->fs_info->fs_devices->open_devices <= 2) {
+               __unplug_io_fn(bdi, page);
+               return;
+       }
+
+       offset = page_offset(page);
+
+       em_tree = &BTRFS_I(inode)->extent_tree;
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, offset, PAGE_CACHE_SIZE);
+       spin_unlock(&em_tree->lock);
+       if (!em) {
+               __unplug_io_fn(bdi, page);
+               return;
+       }
+
+       if (em->block_start >= EXTENT_MAP_LAST_BYTE) {
+               free_extent_map(em);
+               __unplug_io_fn(bdi, page);
+               return;
+       }
+       offset = offset - em->start;
+       btrfs_unplug_page(&BTRFS_I(inode)->root->fs_info->mapping_tree,
+                         em->block_start + offset, page);
+       free_extent_map(em);
+}
+
+static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
+{
+       bdi_init(bdi);
+       bdi->ra_pages   = default_backing_dev_info.ra_pages;
+       bdi->state              = 0;
+       bdi->capabilities       = default_backing_dev_info.capabilities;
+       bdi->unplug_io_fn       = btrfs_unplug_io_fn;
+       bdi->unplug_io_data     = info;
+       bdi->congested_fn       = btrfs_congested_fn;
+       bdi->congested_data     = info;
+       return 0;
+}
+
+static int bio_ready_for_csum(struct bio *bio)
+{
+       u64 length = 0;
+       u64 buf_len = 0;
+       u64 start = 0;
+       struct page *page;
+       struct extent_io_tree *io_tree = NULL;
+       struct btrfs_fs_info *info = NULL;
+       struct bio_vec *bvec;
+       int i;
+       int ret;
+
+       bio_for_each_segment(bvec, bio, i) {
+               page = bvec->bv_page;
+               if (page->private == EXTENT_PAGE_PRIVATE) {
+                       length += bvec->bv_len;
+                       continue;
+               }
+               if (!page->private) {
+                       length += bvec->bv_len;
+                       continue;
+               }
+               length = bvec->bv_len;
+               buf_len = page->private >> 2;
+               start = page_offset(page) + bvec->bv_offset;
+               io_tree = &BTRFS_I(page->mapping->host)->io_tree;
+               info = BTRFS_I(page->mapping->host)->root->fs_info;
+       }
+       /* are we fully contained in this bio? */
+       if (buf_len <= length)
+               return 1;
+
+       ret = extent_range_uptodate(io_tree, start + length,
+                                   start + buf_len - 1);
+       return ret;
+}
+
+/*
+ * called by the kthread helper functions to finally call the bio end_io
+ * functions.  This is where read checksum verification actually happens
+ */
+static void end_workqueue_fn(struct btrfs_work *work)
+{
+       struct bio *bio;
+       struct end_io_wq *end_io_wq;
+       struct btrfs_fs_info *fs_info;
+       int error;
+
+       end_io_wq = container_of(work, struct end_io_wq, work);
+       bio = end_io_wq->bio;
+       fs_info = end_io_wq->info;
+
+       /* metadata bio reads are special because the whole tree block must
+        * be checksummed at once.  This makes sure the entire block is in
+        * RAM and up to date before trying to verify things.  For
+        * blocksize <= pagesize, it is basically a noop.
+        */
+       if (!(bio->bi_rw & (1 << BIO_RW)) && end_io_wq->metadata &&
+           !bio_ready_for_csum(bio)) {
+               btrfs_queue_worker(&fs_info->endio_meta_workers,
+                                  &end_io_wq->work);
+               return;
+       }
+       error = end_io_wq->error;
+       bio->bi_private = end_io_wq->private;
+       bio->bi_end_io = end_io_wq->end_io;
+       kfree(end_io_wq);
+       bio_endio(bio, error);
+}
+
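+/*
+ * background thread that deletes old snapshots under the cleaner_mutex,
+ * sleeping (or entering the refrigerator when frozen) between passes
+ * until the filesystem starts closing
+ */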
+static int cleaner_kthread(void *arg)
+{
+       struct btrfs_root *root = arg;
+
+       do {
+               smp_mb();
+               if (root->fs_info->closing)
+                       break;
+
+               vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
+               mutex_lock(&root->fs_info->cleaner_mutex);
+               btrfs_clean_old_snapshots(root);
+               mutex_unlock(&root->fs_info->cleaner_mutex);
+
+               if (freezing(current)) {
+                       refrigerator();
+               } else {
+                       smp_mb();
+                       if (root->fs_info->closing)
+                               break;
+                       set_current_state(TASK_INTERRUPTIBLE);
+                       schedule();
+                       __set_current_state(TASK_RUNNING);
+               }
+       } while (!kthread_should_stop());
+       return 0;
+}
+
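+/*
+ * background thread that commits the running transaction once it is at
+ * least 30 seconds old; while a younger transaction is open it checks
+ * back every 5 seconds
+ */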
+static int transaction_kthread(void *arg)
+{
+       struct btrfs_root *root = arg;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_transaction *cur;
+       unsigned long now;
+       unsigned long delay;
+       int ret;
+
+       do {
+               smp_mb();
+               if (root->fs_info->closing)
+                       break;
+
+               delay = HZ * 30;
+               vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
+               mutex_lock(&root->fs_info->transaction_kthread_mutex);
+
+               if (root->fs_info->total_ref_cache_size > 20 * 1024 * 1024) {
+                       printk(KERN_INFO "btrfs: total reference cache "
+                              "size %llu\n",
+                              root->fs_info->total_ref_cache_size);
+               }
+
+               mutex_lock(&root->fs_info->trans_mutex);
+               cur = root->fs_info->running_transaction;
+               if (!cur) {
+                       mutex_unlock(&root->fs_info->trans_mutex);
+                       goto sleep;
+               }
+
+               now = get_seconds();
+               if (now < cur->start_time || now - cur->start_time < 30) {
+                       mutex_unlock(&root->fs_info->trans_mutex);
+                       delay = HZ * 5;
+                       goto sleep;
+               }
+               mutex_unlock(&root->fs_info->trans_mutex);
+               trans = btrfs_start_transaction(root, 1);
+               ret = btrfs_commit_transaction(trans, root);
+sleep:
+               wake_up_process(root->fs_info->cleaner_kthread);
+               mutex_unlock(&root->fs_info->transaction_kthread_mutex);
+
+               if (freezing(current)) {
+                       refrigerator();
+               } else {
+                       if (root->fs_info->closing)
+                               break;
+                       set_current_state(TASK_INTERRUPTIBLE);
+                       schedule_timeout(delay);
+                       __set_current_state(TASK_RUNNING);
+               }
+       } while (!kthread_should_stop());
+       return 0;
+}
+
+struct btrfs_root *open_ctree(struct super_block *sb,
+                             struct btrfs_fs_devices *fs_devices,
+                             char *options)
+{
+       u32 sectorsize;
+       u32 nodesize;
+       u32 leafsize;
+       u32 blocksize;
+       u32 stripesize;
+       u64 generation;
+       u64 features;
+       struct btrfs_key location;
+       struct buffer_head *bh;
+       struct btrfs_root *extent_root = kzalloc(sizeof(struct btrfs_root),
+                                                GFP_NOFS);
+       struct btrfs_root *csum_root = kzalloc(sizeof(struct btrfs_root),
+                                                GFP_NOFS);
+       struct btrfs_root *tree_root = kzalloc(sizeof(struct btrfs_root),
+                                              GFP_NOFS);
+       struct btrfs_fs_info *fs_info = kzalloc(sizeof(*fs_info),
+                                               GFP_NOFS);
+       struct btrfs_root *chunk_root = kzalloc(sizeof(struct btrfs_root),
+                                               GFP_NOFS);
+       struct btrfs_root *dev_root = kzalloc(sizeof(struct btrfs_root),
+                                             GFP_NOFS);
+       struct btrfs_root *log_tree_root;
+
+       int ret;
+       int err = -EINVAL;
+
+       struct btrfs_super_block *disk_super;
+
+       if (!extent_root || !tree_root || !fs_info ||
+           !chunk_root || !dev_root || !csum_root) {
+               err = -ENOMEM;
+               goto fail;
+       }
+       INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_NOFS);
+       INIT_LIST_HEAD(&fs_info->trans_list);
+       INIT_LIST_HEAD(&fs_info->dead_roots);
+       INIT_LIST_HEAD(&fs_info->hashers);
+       INIT_LIST_HEAD(&fs_info->delalloc_inodes);
+       spin_lock_init(&fs_info->hash_lock);
+       spin_lock_init(&fs_info->delalloc_lock);
+       spin_lock_init(&fs_info->new_trans_lock);
+       spin_lock_init(&fs_info->ref_cache_lock);
+
+       init_completion(&fs_info->kobj_unregister);
+       fs_info->tree_root = tree_root;
+       fs_info->extent_root = extent_root;
+       fs_info->csum_root = csum_root;
+       fs_info->chunk_root = chunk_root;
+       fs_info->dev_root = dev_root;
+       fs_info->fs_devices = fs_devices;
+       INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
+       INIT_LIST_HEAD(&fs_info->space_info);
+       btrfs_mapping_init(&fs_info->mapping_tree);
+       atomic_set(&fs_info->nr_async_submits, 0);
+       atomic_set(&fs_info->async_delalloc_pages, 0);
+       atomic_set(&fs_info->async_submit_draining, 0);
+       atomic_set(&fs_info->nr_async_bios, 0);
+       atomic_set(&fs_info->throttles, 0);
+       atomic_set(&fs_info->throttle_gen, 0);
+       fs_info->sb = sb;
+       fs_info->max_extent = (u64)-1;
+       fs_info->max_inline = 8192 * 1024;
+       setup_bdi(fs_info, &fs_info->bdi);
+       fs_info->btree_inode = new_inode(sb);
+       fs_info->btree_inode->i_ino = 1;
+       fs_info->btree_inode->i_nlink = 1;
+
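+       /*
+        * scale the worker pools with the cpu count, but cap them at 8
+        * threads so big machines don't spawn a flood of workers
+        */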
+       fs_info->thread_pool_size = min_t(unsigned long,
+                                         num_online_cpus() + 2, 8);
+
+       INIT_LIST_HEAD(&fs_info->ordered_extents);
+       spin_lock_init(&fs_info->ordered_extent_lock);
+
+       sb->s_blocksize = 4096;
+       sb->s_blocksize_bits = blksize_bits(4096);
+
+       /*
+        * we set the i_size on the btree inode to the max possible offset
+        * (OFFSET_MAX); the real end of the address space is determined
+        * by all of the devices in the system
+        */
+       fs_info->btree_inode->i_size = OFFSET_MAX;
+       fs_info->btree_inode->i_mapping->a_ops = &btree_aops;
+       fs_info->btree_inode->i_mapping->backing_dev_info = &fs_info->bdi;
+
+       extent_io_tree_init(&BTRFS_I(fs_info->btree_inode)->io_tree,
+                            fs_info->btree_inode->i_mapping,
+                            GFP_NOFS);
+       extent_map_tree_init(&BTRFS_I(fs_info->btree_inode)->extent_tree,
+                            GFP_NOFS);
+
+       BTRFS_I(fs_info->btree_inode)->io_tree.ops = &btree_extent_io_ops;
+
+       spin_lock_init(&fs_info->block_group_cache_lock);
+       fs_info->block_group_cache_tree.rb_node = NULL;
+
+       extent_io_tree_init(&fs_info->pinned_extents,
+                            fs_info->btree_inode->i_mapping, GFP_NOFS);
+       extent_io_tree_init(&fs_info->pending_del,
+                            fs_info->btree_inode->i_mapping, GFP_NOFS);
+       extent_io_tree_init(&fs_info->extent_ins,
+                            fs_info->btree_inode->i_mapping, GFP_NOFS);
+       fs_info->do_barriers = 1;
+
+       INIT_LIST_HEAD(&fs_info->dead_reloc_roots);
+       btrfs_leaf_ref_tree_init(&fs_info->reloc_ref_tree);
+       btrfs_leaf_ref_tree_init(&fs_info->shared_ref_tree);
+
+       BTRFS_I(fs_info->btree_inode)->root = tree_root;
+       memset(&BTRFS_I(fs_info->btree_inode)->location, 0,
+              sizeof(struct btrfs_key));
+       insert_inode_hash(fs_info->btree_inode);
+
+       mutex_init(&fs_info->trans_mutex);
+       mutex_init(&fs_info->tree_log_mutex);
+       mutex_init(&fs_info->drop_mutex);
+       mutex_init(&fs_info->extent_ins_mutex);
+       mutex_init(&fs_info->pinned_mutex);
+       mutex_init(&fs_info->chunk_mutex);
+       mutex_init(&fs_info->transaction_kthread_mutex);
+       mutex_init(&fs_info->cleaner_mutex);
+       mutex_init(&fs_info->volume_mutex);
+       mutex_init(&fs_info->tree_reloc_mutex);
+       init_waitqueue_head(&fs_info->transaction_throttle);
+       init_waitqueue_head(&fs_info->transaction_wait);
+       init_waitqueue_head(&fs_info->async_submit_wait);
+       init_waitqueue_head(&fs_info->tree_log_wait);
+       atomic_set(&fs_info->tree_log_commit, 0);
+       atomic_set(&fs_info->tree_log_writers, 0);
+       fs_info->tree_log_transid = 0;
+
+       __setup_root(4096, 4096, 4096, 4096, tree_root,
+                    fs_info, BTRFS_ROOT_TREE_OBJECTID);
+
+       bh = btrfs_read_dev_super(fs_devices->latest_bdev);
+       if (!bh)
+               goto fail_iput;
+
+       memcpy(&fs_info->super_copy, bh->b_data, sizeof(fs_info->super_copy));
+       memcpy(&fs_info->super_for_commit, &fs_info->super_copy,
+              sizeof(fs_info->super_for_commit));
+       brelse(bh);
+
+       memcpy(fs_info->fsid, fs_info->super_copy.fsid, BTRFS_FSID_SIZE);
+
+       disk_super = &fs_info->super_copy;
+       if (!btrfs_super_root(disk_super))
+               goto fail_iput;
+
+       ret = btrfs_parse_options(tree_root, options);
+       if (ret) {
+               err = ret;
+               goto fail_iput;
+       }
+
+       features = btrfs_super_incompat_flags(disk_super) &
+               ~BTRFS_FEATURE_INCOMPAT_SUPP;
+       if (features) {
+               printk(KERN_ERR "BTRFS: couldn't mount because of "
+                      "unsupported optional features (%Lx).\n",
+                      features);
+               err = -EINVAL;
+               goto fail_iput;
+       }
+
+       features = btrfs_super_compat_ro_flags(disk_super) &
+               ~BTRFS_FEATURE_COMPAT_RO_SUPP;
+       if (!(sb->s_flags & MS_RDONLY) && features) {
+               printk(KERN_ERR "BTRFS: couldn't mount RDWR because of "
+                      "unsupported option features (%Lx).\n",
+                      features);
+               err = -EINVAL;
+               goto fail_iput;
+       }
+
+       /*
+        * we need to start all the end_io workers up front because the
+        * queue work function gets called at interrupt time, and so the
+        * pool cannot be grown dynamically from that context.
+        */
+       btrfs_init_workers(&fs_info->workers, "worker",
+                          fs_info->thread_pool_size);
+
+       btrfs_init_workers(&fs_info->delalloc_workers, "delalloc",
+                          fs_info->thread_pool_size);
+
+       btrfs_init_workers(&fs_info->submit_workers, "submit",
+                          min_t(u64, fs_devices->num_devices,
+                          fs_info->thread_pool_size));
+
+       /* a higher idle thresh on the submit workers makes it much more
+        * likely that bios will be sent down in a sane order to the
+        * devices
+        */
+       fs_info->submit_workers.idle_thresh = 64;
+
+       fs_info->workers.idle_thresh = 16;
+       fs_info->workers.ordered = 1;
+
+       fs_info->delalloc_workers.idle_thresh = 2;
+       fs_info->delalloc_workers.ordered = 1;
+
+       btrfs_init_workers(&fs_info->fixup_workers, "fixup", 1);
+       btrfs_init_workers(&fs_info->endio_workers, "endio",
+                          fs_info->thread_pool_size);
+       btrfs_init_workers(&fs_info->endio_meta_workers, "endio-meta",
+                          fs_info->thread_pool_size);
+       btrfs_init_workers(&fs_info->endio_meta_write_workers,
+                          "endio-meta-write", fs_info->thread_pool_size);
+       btrfs_init_workers(&fs_info->endio_write_workers, "endio-write",
+                          fs_info->thread_pool_size);
+
+       /*
+        * endios are largely parallel and should have a very
+        * low idle thresh
+        */
+       fs_info->endio_workers.idle_thresh = 4;
+       fs_info->endio_write_workers.idle_thresh = 64;
+       fs_info->endio_meta_write_workers.idle_thresh = 64;
+
+       btrfs_start_workers(&fs_info->workers, 1);
+       btrfs_start_workers(&fs_info->submit_workers, 1);
+       btrfs_start_workers(&fs_info->delalloc_workers, 1);
+       btrfs_start_workers(&fs_info->fixup_workers, 1);
+       btrfs_start_workers(&fs_info->endio_workers, fs_info->thread_pool_size);
+       btrfs_start_workers(&fs_info->endio_meta_workers,
+                           fs_info->thread_pool_size);
+       btrfs_start_workers(&fs_info->endio_meta_write_workers,
+                           fs_info->thread_pool_size);
+       btrfs_start_workers(&fs_info->endio_write_workers,
+                           fs_info->thread_pool_size);
+
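+       /*
+        * scale readahead with the number of devices, with a 4MB floor,
+        * since a multi-device filesystem can service larger readahead
+        */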
+       fs_info->bdi.ra_pages *= btrfs_super_num_devices(disk_super);
+       fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages,
+                                   4 * 1024 * 1024 / PAGE_CACHE_SIZE);
+
+       nodesize = btrfs_super_nodesize(disk_super);
+       leafsize = btrfs_super_leafsize(disk_super);
+       sectorsize = btrfs_super_sectorsize(disk_super);
+       stripesize = btrfs_super_stripesize(disk_super);
+       tree_root->nodesize = nodesize;
+       tree_root->leafsize = leafsize;
+       tree_root->sectorsize = sectorsize;
+       tree_root->stripesize = stripesize;
+
+       sb->s_blocksize = sectorsize;
+       sb->s_blocksize_bits = blksize_bits(sectorsize);
+
+       if (strncmp((char *)(&disk_super->magic), BTRFS_MAGIC,
+                   sizeof(disk_super->magic))) {
+               printk(KERN_INFO "btrfs: valid FS not found on %s\n", sb->s_id);
+               goto fail_sb_buffer;
+       }
+
+       mutex_lock(&fs_info->chunk_mutex);
+       ret = btrfs_read_sys_array(tree_root);
+       mutex_unlock(&fs_info->chunk_mutex);
+       if (ret) {
+               printk(KERN_WARNING "btrfs: failed to read the system "
+                      "array on %s\n", sb->s_id);
+               goto fail_sys_array;
+       }
+
+       blocksize = btrfs_level_size(tree_root,
+                                    btrfs_super_chunk_root_level(disk_super));
+       generation = btrfs_super_chunk_root_generation(disk_super);
+
+       __setup_root(nodesize, leafsize, sectorsize, stripesize,
+                    chunk_root, fs_info, BTRFS_CHUNK_TREE_OBJECTID);
+
+       chunk_root->node = read_tree_block(chunk_root,
+                                          btrfs_super_chunk_root(disk_super),
+                                          blocksize, generation);
+       BUG_ON(!chunk_root->node);
+
+       read_extent_buffer(chunk_root->node, fs_info->chunk_tree_uuid,
+          (unsigned long)btrfs_header_chunk_tree_uuid(chunk_root->node),
+          BTRFS_UUID_SIZE);
+
+       mutex_lock(&fs_info->chunk_mutex);
+       ret = btrfs_read_chunk_tree(chunk_root);
+       mutex_unlock(&fs_info->chunk_mutex);
+       if (ret) {
+               printk(KERN_WARNING "btrfs: failed to read chunk tree on %s\n",
+                      sb->s_id);
+               goto fail_chunk_root;
+       }
+
+       btrfs_close_extra_devices(fs_devices);
+
+       blocksize = btrfs_level_size(tree_root,
+                                    btrfs_super_root_level(disk_super));
+       generation = btrfs_super_generation(disk_super);
+
+       tree_root->node = read_tree_block(tree_root,
+                                         btrfs_super_root(disk_super),
+                                         blocksize, generation);
+       if (!tree_root->node)
+               goto fail_chunk_root;
+
+       ret = find_and_setup_root(tree_root, fs_info,
+                                 BTRFS_EXTENT_TREE_OBJECTID, extent_root);
+       if (ret)
+               goto fail_tree_root;
+       extent_root->track_dirty = 1;
+
+       ret = find_and_setup_root(tree_root, fs_info,
+                                 BTRFS_DEV_TREE_OBJECTID, dev_root);
+       dev_root->track_dirty = 1;
+
+       if (ret)
+               goto fail_extent_root;
+
+       ret = find_and_setup_root(tree_root, fs_info,
+                                 BTRFS_CSUM_TREE_OBJECTID, csum_root);
+       if (ret)
+               goto fail_extent_root;
+
+       csum_root->track_dirty = 1;
+
+       btrfs_read_block_groups(extent_root);
+
+       fs_info->generation = generation;
+       fs_info->last_trans_committed = generation;
+       fs_info->data_alloc_profile = (u64)-1;
+       fs_info->metadata_alloc_profile = (u64)-1;
+       fs_info->system_alloc_profile = fs_info->metadata_alloc_profile;
+       fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
+                                              "btrfs-cleaner");
+       if (!fs_info->cleaner_kthread)
+               goto fail_csum_root;
+
+       fs_info->transaction_kthread = kthread_run(transaction_kthread,
+                                                  tree_root,
+                                                  "btrfs-transaction");
+       if (!fs_info->transaction_kthread)
+               goto fail_cleaner;
+
+       if (btrfs_super_log_root(disk_super) != 0) {
+               u64 bytenr = btrfs_super_log_root(disk_super);
+
+               if (fs_devices->rw_devices == 0) {
+                       printk(KERN_WARNING "Btrfs log replay required "
+                              "on RO media\n");
+                       err = -EIO;
+                       goto fail_trans_kthread;
+               }
+               blocksize =
+                    btrfs_level_size(tree_root,
+                                     btrfs_super_log_root_level(disk_super));
+
+               log_tree_root = kzalloc(sizeof(struct btrfs_root),
+                                       GFP_NOFS);
+
+               __setup_root(nodesize, leafsize, sectorsize, stripesize,
+                            log_tree_root, fs_info, BTRFS_TREE_LOG_OBJECTID);
+
+               log_tree_root->node = read_tree_block(tree_root, bytenr,
+                                                     blocksize,
+                                                     generation + 1);
+               ret = btrfs_recover_log_trees(log_tree_root);
+               BUG_ON(ret);
+
+               if (sb->s_flags & MS_RDONLY) {
+                       ret =  btrfs_commit_super(tree_root);
+                       BUG_ON(ret);
+               }
+       }
+
+       if (!(sb->s_flags & MS_RDONLY)) {
+               ret = btrfs_cleanup_reloc_trees(tree_root);
+               BUG_ON(ret);
+       }
+
+       location.objectid = BTRFS_FS_TREE_OBJECTID;
+       location.type = BTRFS_ROOT_ITEM_KEY;
+       location.offset = (u64)-1;
+
+       fs_info->fs_root = btrfs_read_fs_root_no_name(fs_info, &location);
+       if (!fs_info->fs_root)
+               goto fail_trans_kthread;
+       return tree_root;
+
+fail_trans_kthread:
+       kthread_stop(fs_info->transaction_kthread);
+fail_cleaner:
+       kthread_stop(fs_info->cleaner_kthread);
+
+       /*
+        * make sure we're done with the btree inode before we stop our
+        * kthreads
+        */
+       filemap_write_and_wait(fs_info->btree_inode->i_mapping);
+       invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+
+fail_csum_root:
+       free_extent_buffer(csum_root->node);
+fail_extent_root:
+       free_extent_buffer(extent_root->node);
+fail_tree_root:
+       free_extent_buffer(tree_root->node);
+fail_chunk_root:
+       free_extent_buffer(chunk_root->node);
+fail_sys_array:
+       free_extent_buffer(dev_root->node);
+fail_sb_buffer:
+       btrfs_stop_workers(&fs_info->fixup_workers);
+       btrfs_stop_workers(&fs_info->delalloc_workers);
+       btrfs_stop_workers(&fs_info->workers);
+       btrfs_stop_workers(&fs_info->endio_workers);
+       btrfs_stop_workers(&fs_info->endio_meta_workers);
+       btrfs_stop_workers(&fs_info->endio_meta_write_workers);
+       btrfs_stop_workers(&fs_info->endio_write_workers);
+       btrfs_stop_workers(&fs_info->submit_workers);
+fail_iput:
+       invalidate_inode_pages2(fs_info->btree_inode->i_mapping);
+       iput(fs_info->btree_inode);
+fail:
+       btrfs_close_devices(fs_info->fs_devices);
+       btrfs_mapping_tree_free(&fs_info->mapping_tree);
+
+       kfree(extent_root);
+       kfree(tree_root);
+       bdi_destroy(&fs_info->bdi);
+       kfree(fs_info);
+       kfree(chunk_root);
+       kfree(dev_root);
+       kfree(csum_root);
+       return ERR_PTR(err);
+}
+
+static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
+{
+       char b[BDEVNAME_SIZE];
+
+       if (uptodate) {
+               set_buffer_uptodate(bh);
+       } else {
+               if (!buffer_eopnotsupp(bh) && printk_ratelimit()) {
+                       printk(KERN_WARNING "lost page write due to "
+                                       "I/O error on %s\n",
+                                      bdevname(bh->b_bdev, b));
+               }
+               /* note, we don't set_buffer_write_io_error because we have
+                * our own ways of dealing with the IO errors
+                */
+               clear_buffer_uptodate(bh);
+       }
+       unlock_buffer(bh);
+       put_bh(bh);
+}
+
+struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
+{
+       struct buffer_head *bh;
+       struct buffer_head *latest = NULL;
+       struct btrfs_super_block *super;
+       int i;
+       u64 transid = 0;
+       u64 bytenr;
+
+       /* we would like to check all the supers, but that would make
+        * a btrfs mount succeed after a mkfs from a different FS.
+        * For now only the first super is checked; scanning the later
+        * mirrors (up to BTRFS_SUPER_MIRROR_MAX) would need a special
+        * mount option
+        */
+       for (i = 0; i < 1; i++) {
+               bytenr = btrfs_sb_offset(i);
+               if (bytenr + 4096 >= i_size_read(bdev->bd_inode))
+                       break;
+               bh = __bread(bdev, bytenr / 4096, 4096);
+               if (!bh)
+                       continue;
+
+               super = (struct btrfs_super_block *)bh->b_data;
+               if (btrfs_super_bytenr(super) != bytenr ||
+                   strncmp((char *)(&super->magic), BTRFS_MAGIC,
+                           sizeof(super->magic))) {
+                       brelse(bh);
+                       continue;
+               }
+
+               if (!latest || btrfs_super_generation(super) > transid) {
+                       brelse(latest);
+                       latest = bh;
+                       transid = btrfs_super_generation(super);
+               } else {
+                       brelse(bh);
+               }
+       }
+       return latest;
+}
+
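+/*
+ * write (wait == 0) or wait on (wait == 1) all the super block copies
+ * for one device; the two phases let the caller submit the supers for
+ * every device before blocking on any of them
+ */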
+static int write_dev_supers(struct btrfs_device *device,
+                           struct btrfs_super_block *sb,
+                           int do_barriers, int wait, int max_mirrors)
+{
+       struct buffer_head *bh;
+       int i;
+       int ret;
+       int errors = 0;
+       u32 crc;
+       u64 bytenr;
+       int last_barrier = 0;
+
+       if (max_mirrors == 0)
+               max_mirrors = BTRFS_SUPER_MIRROR_MAX;
+
+       /* make sure only the last submit_bh does a barrier */
+       if (do_barriers) {
+               for (i = 0; i < max_mirrors; i++) {
+                       bytenr = btrfs_sb_offset(i);
+                       if (bytenr + BTRFS_SUPER_INFO_SIZE >=
+                           device->total_bytes)
+                               break;
+                       last_barrier = i;
+               }
+       }
+
+       for (i = 0; i < max_mirrors; i++) {
+               bytenr = btrfs_sb_offset(i);
+               if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->total_bytes)
+                       break;
+
+               if (wait) {
+                       bh = __find_get_block(device->bdev, bytenr / 4096,
+                                             BTRFS_SUPER_INFO_SIZE);
+                       BUG_ON(!bh);
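+                       /*
+                        * this early brelse appears to drop the extra
+                        * reference left over from __getblk() in the
+                        * submission pass; the buffer stays in the page
+                        * cache, so the waits below are still safe
+                        */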
+                       brelse(bh);
+                       wait_on_buffer(bh);
+                       if (buffer_uptodate(bh)) {
+                               brelse(bh);
+                               continue;
+                       }
+               } else {
+                       btrfs_set_super_bytenr(sb, bytenr);
+
+                       crc = ~(u32)0;
+                       crc = btrfs_csum_data(NULL, (char *)sb +
+                                             BTRFS_CSUM_SIZE, crc,
+                                             BTRFS_SUPER_INFO_SIZE -
+                                             BTRFS_CSUM_SIZE);
+                       btrfs_csum_final(crc, sb->csum);
+
+                       bh = __getblk(device->bdev, bytenr / 4096,
+                                     BTRFS_SUPER_INFO_SIZE);
+                       memcpy(bh->b_data, sb, BTRFS_SUPER_INFO_SIZE);
+
+                       set_buffer_uptodate(bh);
+                       get_bh(bh);
+                       lock_buffer(bh);
+                       bh->b_end_io = btrfs_end_buffer_write_sync;
+               }
+
+               if (i == last_barrier && do_barriers && device->barriers) {
+                       ret = submit_bh(WRITE_BARRIER, bh);
+                       if (ret == -EOPNOTSUPP) {
+                               printk("btrfs: disabling barriers on dev %s\n",
+                                      device->name);
+                               set_buffer_uptodate(bh);
+                               device->barriers = 0;
+                               get_bh(bh);
+                               lock_buffer(bh);
+                               ret = submit_bh(WRITE, bh);
+                       }
+               } else {
+                       ret = submit_bh(WRITE, bh);
+               }
+
+               if (!ret && wait) {
+                       wait_on_buffer(bh);
+                       if (!buffer_uptodate(bh))
+                               errors++;
+               } else if (ret) {
+                       errors++;
+               }
+               if (wait)
+                       brelse(bh);
+       }
+       return errors < i ? 0 : -1;
+}
+
+int write_all_supers(struct btrfs_root *root, int max_mirrors)
+{
+       struct list_head *cur;
+       struct list_head *head = &root->fs_info->fs_devices->devices;
+       struct btrfs_device *dev;
+       struct btrfs_super_block *sb;
+       struct btrfs_dev_item *dev_item;
+       int ret;
+       int do_barriers;
+       int max_errors;
+       int total_errors = 0;
+       u64 flags;
+
+       max_errors = btrfs_super_num_devices(&root->fs_info->super_copy) - 1;
+       do_barriers = !btrfs_test_opt(root, NOBARRIER);
+
+       sb = &root->fs_info->super_for_commit;
+       dev_item = &sb->dev_item;
+       list_for_each(cur, head) {
+               dev = list_entry(cur, struct btrfs_device, dev_list);
+               if (!dev->bdev) {
+                       total_errors++;
+                       continue;
+               }
+               if (!dev->in_fs_metadata || !dev->writeable)
+                       continue;
+
+               btrfs_set_stack_device_generation(dev_item, 0);
+               btrfs_set_stack_device_type(dev_item, dev->type);
+               btrfs_set_stack_device_id(dev_item, dev->devid);
+               btrfs_set_stack_device_total_bytes(dev_item, dev->total_bytes);
+               btrfs_set_stack_device_bytes_used(dev_item, dev->bytes_used);
+               btrfs_set_stack_device_io_align(dev_item, dev->io_align);
+               btrfs_set_stack_device_io_width(dev_item, dev->io_width);
+               btrfs_set_stack_device_sector_size(dev_item, dev->sector_size);
+               memcpy(dev_item->uuid, dev->uuid, BTRFS_UUID_SIZE);
+               memcpy(dev_item->fsid, dev->fs_devices->fsid, BTRFS_UUID_SIZE);
+
+               flags = btrfs_super_flags(sb);
+               btrfs_set_super_flags(sb, flags | BTRFS_HEADER_FLAG_WRITTEN);
+
+               ret = write_dev_supers(dev, sb, do_barriers, 0, max_mirrors);
+               if (ret)
+                       total_errors++;
+       }
+       if (total_errors > max_errors) {
+               printk(KERN_ERR "btrfs: %d errors while writing supers\n",
+                      total_errors);
+               BUG();
+       }
+
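+       /* second pass: wait on the supers we submitted above */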
+       total_errors = 0;
+       list_for_each(cur, head) {
+               dev = list_entry(cur, struct btrfs_device, dev_list);
+               if (!dev->bdev)
+                       continue;
+               if (!dev->in_fs_metadata || !dev->writeable)
+                       continue;
+
+               ret = write_dev_supers(dev, sb, do_barriers, 1, max_mirrors);
+               if (ret)
+                       total_errors++;
+       }
+       if (total_errors > max_errors) {
+               printk(KERN_ERR "btrfs: %d errors while writing supers\n",
+                      total_errors);
+               BUG();
+       }
+       return 0;
+}
+
+int write_ctree_super(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root, int max_mirrors)
+{
+       int ret;
+
+       ret = write_all_supers(root, max_mirrors);
+       return ret;
+}
+
+int btrfs_free_fs_root(struct btrfs_fs_info *fs_info, struct btrfs_root *root)
+{
+       radix_tree_delete(&fs_info->fs_roots_radix,
+                         (unsigned long)root->root_key.objectid);
+       if (root->anon_super.s_dev) {
+               down_write(&root->anon_super.s_umount);
+               kill_anon_super(&root->anon_super);
+       }
+       if (root->node)
+               free_extent_buffer(root->node);
+       if (root->commit_root)
+               free_extent_buffer(root->commit_root);
+       kfree(root->name);
+       kfree(root);
+       return 0;
+}
+
+static int del_fs_roots(struct btrfs_fs_info *fs_info)
+{
+       int ret;
+       struct btrfs_root *gang[8];
+       int i;
+
+       while (1) {
+               ret = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+                                            (void **)gang, 0,
+                                            ARRAY_SIZE(gang));
+               if (!ret)
+                       break;
+               for (i = 0; i < ret; i++)
+                       btrfs_free_fs_root(fs_info, gang[i]);
+       }
+       return 0;
+}
+
+int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info)
+{
+       u64 root_objectid = 0;
+       struct btrfs_root *gang[8];
+       int i;
+       int ret;
+
+       while (1) {
+               ret = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
+                                            (void **)gang, root_objectid,
+                                            ARRAY_SIZE(gang));
+               if (!ret)
+                       break;
+               for (i = 0; i < ret; i++) {
+                       root_objectid = gang[i]->root_key.objectid;
+                       ret = btrfs_find_dead_roots(fs_info->tree_root,
+                                                   root_objectid, gang[i]);
+                       BUG_ON(ret);
+                       btrfs_orphan_cleanup(gang[i]);
+               }
+               root_objectid++;
+       }
+       return 0;
+}
+
+int btrfs_commit_super(struct btrfs_root *root)
+{
+       struct btrfs_trans_handle *trans;
+       int ret;
+
+       mutex_lock(&root->fs_info->cleaner_mutex);
+       btrfs_clean_old_snapshots(root);
+       mutex_unlock(&root->fs_info->cleaner_mutex);
+       trans = btrfs_start_transaction(root, 1);
+       ret = btrfs_commit_transaction(trans, root);
+       BUG_ON(ret);
+       /* run commit again to drop the original snapshot */
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_commit_transaction(trans, root);
+       ret = btrfs_write_and_wait_transaction(NULL, root);
+       BUG_ON(ret);
+
+       ret = write_ctree_super(NULL, root, 0);
+       return ret;
+}
+
+int close_ctree(struct btrfs_root *root)
+{
+       struct btrfs_fs_info *fs_info = root->fs_info;
+       int ret;
+
+       fs_info->closing = 1;
+       smp_mb();
+
+       kthread_stop(root->fs_info->transaction_kthread);
+       kthread_stop(root->fs_info->cleaner_kthread);
+
+       if (!(fs_info->sb->s_flags & MS_RDONLY)) {
+               ret =  btrfs_commit_super(root);
+               if (ret)
+                       printk(KERN_ERR "btrfs: commit super ret %d\n", ret);
+       }
+
+       if (fs_info->delalloc_bytes) {
+               printk(KERN_INFO "btrfs: at unmount delalloc count %llu\n",
+                      (unsigned long long)fs_info->delalloc_bytes);
+       }
+       if (fs_info->total_ref_cache_size) {
+               printk(KERN_INFO "btrfs: at umount reference cache size %llu\n",
+                      (unsigned long long)fs_info->total_ref_cache_size);
+       }
+
+       if (fs_info->extent_root->node)
+               free_extent_buffer(fs_info->extent_root->node);
+
+       if (fs_info->tree_root->node)
+               free_extent_buffer(fs_info->tree_root->node);
+
+       if (root->fs_info->chunk_root->node)
+               free_extent_buffer(root->fs_info->chunk_root->node);
+
+       if (root->fs_info->dev_root->node)
+               free_extent_buffer(root->fs_info->dev_root->node);
+
+       if (root->fs_info->csum_root->node)
+               free_extent_buffer(root->fs_info->csum_root->node);
+
+       btrfs_free_block_groups(root->fs_info);
+
+       del_fs_roots(fs_info);
+
+       iput(fs_info->btree_inode);
+
+       btrfs_stop_workers(&fs_info->fixup_workers);
+       btrfs_stop_workers(&fs_info->delalloc_workers);
+       btrfs_stop_workers(&fs_info->workers);
+       btrfs_stop_workers(&fs_info->endio_workers);
+       btrfs_stop_workers(&fs_info->endio_meta_workers);
+       btrfs_stop_workers(&fs_info->endio_meta_write_workers);
+       btrfs_stop_workers(&fs_info->endio_write_workers);
+       btrfs_stop_workers(&fs_info->submit_workers);
+
+#if 0
+       while (!list_empty(&fs_info->hashers)) {
+               struct btrfs_hasher *hasher;
+               hasher = list_entry(fs_info->hashers.next, struct btrfs_hasher,
+                                   hashers);
+               list_del(&hasher->hashers);
+               crypto_free_hash(&fs_info->hash_tfm);
+               kfree(hasher);
+       }
+#endif
+       btrfs_close_devices(fs_info->fs_devices);
+       btrfs_mapping_tree_free(&fs_info->mapping_tree);
+
+       bdi_destroy(&fs_info->bdi);
+
+       kfree(fs_info->extent_root);
+       kfree(fs_info->tree_root);
+       kfree(fs_info->chunk_root);
+       kfree(fs_info->dev_root);
+       kfree(fs_info->csum_root);
+       return 0;
+}
+
+int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid)
+{
+       int ret;
+       struct inode *btree_inode = buf->first_page->mapping->host;
+
+       ret = extent_buffer_uptodate(&BTRFS_I(btree_inode)->io_tree, buf);
+       if (!ret)
+               return ret;
+
+       ret = verify_parent_transid(&BTRFS_I(btree_inode)->io_tree, buf,
+                                   parent_transid);
+       return !ret;
+}
+
+int btrfs_set_buffer_uptodate(struct extent_buffer *buf)
+{
+       struct inode *btree_inode = buf->first_page->mapping->host;
+       return set_extent_buffer_uptodate(&BTRFS_I(btree_inode)->io_tree,
+                                         buf);
+}
+
+void btrfs_mark_buffer_dirty(struct extent_buffer *buf)
+{
+       struct btrfs_root *root = BTRFS_I(buf->first_page->mapping->host)->root;
+       u64 transid = btrfs_header_generation(buf);
+       struct inode *btree_inode = root->fs_info->btree_inode;
+
+       WARN_ON(!btrfs_tree_locked(buf));
+       if (transid != root->fs_info->generation) {
+               printk(KERN_CRIT "btrfs transid mismatch buffer %llu, "
+                      "found %llu running %llu\n",
+                       (unsigned long long)buf->start,
+                       (unsigned long long)transid,
+                       (unsigned long long)root->fs_info->generation);
+               WARN_ON(1);
+       }
+       set_extent_buffer_dirty(&BTRFS_I(btree_inode)->io_tree, buf);
+}
+
+void btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr)
+{
+       /*
+        * looks as though older kernels can get into trouble with
+        * this code; they end up stuck in balance_dirty_pages() forever
+        */
+       struct extent_io_tree *tree;
+       u64 num_dirty;
+       u64 start = 0;
+       unsigned long thresh = 32 * 1024 * 1024;
+       tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
+
+       if (current_is_pdflush() || current->flags & PF_MEMALLOC)
+               return;
+
+       num_dirty = count_range_bits(tree, &start, (u64)-1,
+                                    thresh, EXTENT_DIRTY);
+       if (num_dirty > thresh) {
+               balance_dirty_pages_ratelimited_nr(
+                                  root->fs_info->btree_inode->i_mapping, 1);
+       }
+       return;
+}
+
+int btrfs_read_buffer(struct extent_buffer *buf, u64 parent_transid)
+{
+       struct btrfs_root *root = BTRFS_I(buf->first_page->mapping->host)->root;
+       int ret;
+       ret = btree_read_extent_buffer_pages(root, buf, 0, parent_transid);
+       if (ret == 0)
+               buf->flags |= EXTENT_UPTODATE;
+       return ret;
+}
+
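+/*
+ * used as the write_cache_pages lock hook: take the tree lock on the
+ * extent buffer backing the page and flag its header written before
+ * the page itself is locked for writeback
+ */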
+int btree_lock_page_hook(struct page *page)
+{
+       struct inode *inode = page->mapping->host;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       struct extent_buffer *eb;
+       unsigned long len;
+       u64 bytenr = page_offset(page);
+
+       if (page->private == EXTENT_PAGE_PRIVATE)
+               goto out;
+
+       len = page->private >> 2;
+       eb = find_extent_buffer(io_tree, bytenr, len, GFP_NOFS);
+       if (!eb)
+               goto out;
+
+       btrfs_tree_lock(eb);
+       spin_lock(&root->fs_info->hash_lock);
+       btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
+       spin_unlock(&root->fs_info->hash_lock);
+       btrfs_tree_unlock(eb);
+       free_extent_buffer(eb);
+out:
+       lock_page(page);
+       return 0;
+}
+
+static struct extent_io_ops btree_extent_io_ops = {
+       .write_cache_pages_lock_hook = btree_lock_page_hook,
+       .readpage_end_io_hook = btree_readpage_end_io_hook,
+       .submit_bio_hook = btree_submit_bio_hook,
+       /* note we're sharing with inode.c for the merge bio hook */
+       .merge_bio_hook = btrfs_merge_bio_hook,
+};
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
new file mode 100644 (file)
index 0000000..c0ff404
--- /dev/null
@@ -0,0 +1,102 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __DISKIO__
+#define __DISKIO__
+
+#define BTRFS_SUPER_INFO_OFFSET (64 * 1024)
+#define BTRFS_SUPER_INFO_SIZE 4096
+
+#define BTRFS_SUPER_MIRROR_MAX  3
+#define BTRFS_SUPER_MIRROR_SHIFT 12
+
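+/*
+ * super block mirrors live at exponentially increasing offsets:
+ * mirror 0 at 64K, then 16K << (12 * mirror), i.e. 64M, 256G, ...
+ */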
+static inline u64 btrfs_sb_offset(int mirror)
+{
+       u64 start = 16 * 1024;
+       if (mirror)
+               return start << (BTRFS_SUPER_MIRROR_SHIFT * mirror);
+       return BTRFS_SUPER_INFO_OFFSET;
+}
+
+struct btrfs_device;
+struct btrfs_fs_devices;
+
+struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
+                                     u32 blocksize, u64 parent_transid);
+int readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
+                        u64 parent_transid);
+struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
+                                                  u64 bytenr, u32 blocksize);
+int clean_tree_block(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root, struct extent_buffer *buf);
+struct btrfs_root *open_ctree(struct super_block *sb,
+                             struct btrfs_fs_devices *fs_devices,
+                             char *options);
+int close_ctree(struct btrfs_root *root);
+int write_ctree_super(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root, int max_mirrors);
+struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
+int btrfs_commit_super(struct btrfs_root *root);
+struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
+                                           u64 bytenr, u32 blocksize);
+struct btrfs_root *btrfs_lookup_fs_root(struct btrfs_fs_info *fs_info,
+                                       u64 root_objectid);
+struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
+                                     struct btrfs_key *location,
+                                     const char *name, int namelen);
+struct btrfs_root *btrfs_read_fs_root_no_radix(struct btrfs_root *tree_root,
+                                              struct btrfs_key *location);
+struct btrfs_root *btrfs_read_fs_root_no_name(struct btrfs_fs_info *fs_info,
+                                             struct btrfs_key *location);
+int btrfs_cleanup_fs_roots(struct btrfs_fs_info *fs_info);
+int btrfs_insert_dev_radix(struct btrfs_root *root,
+                          struct block_device *bdev,
+                          u64 device_id,
+                          u64 block_start,
+                          u64 num_blocks);
+void btrfs_btree_balance_dirty(struct btrfs_root *root, unsigned long nr);
+int btrfs_free_fs_root(struct btrfs_fs_info *fs_info, struct btrfs_root *root);
+void btrfs_mark_buffer_dirty(struct extent_buffer *buf);
+int btrfs_buffer_uptodate(struct extent_buffer *buf, u64 parent_transid);
+int btrfs_set_buffer_uptodate(struct extent_buffer *buf);
+int wait_on_tree_block_writeback(struct btrfs_root *root,
+                                struct extent_buffer *buf);
+int btrfs_read_buffer(struct extent_buffer *buf, u64 parent_transid);
+u32 btrfs_csum_data(struct btrfs_root *root, char *data, u32 seed, size_t len);
+void btrfs_csum_final(u32 crc, char *result);
+int btrfs_open_device(struct btrfs_device *dev);
+int btrfs_verify_block_csum(struct btrfs_root *root,
+                           struct extent_buffer *buf);
+int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
+                       int metadata);
+int btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct inode *inode,
+                       int rw, struct bio *bio, int mirror_num,
+                       unsigned long bio_flags,
+                       extent_submit_bio_hook_t *submit_bio_start,
+                       extent_submit_bio_hook_t *submit_bio_done);
+
+int btrfs_congested_async(struct btrfs_fs_info *info, int iodone);
+unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info);
+int btrfs_write_tree_block(struct extent_buffer *buf);
+int btrfs_wait_tree_block_writeback(struct extent_buffer *buf);
+int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
+                            struct btrfs_fs_info *fs_info);
+int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
+                            struct btrfs_fs_info *fs_info);
+int btree_lock_page_hook(struct page *page);
+#endif
diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
new file mode 100644 (file)
index 0000000..85315d2
--- /dev/null
@@ -0,0 +1,203 @@
+#include <linux/fs.h>
+#include <linux/types.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "btrfs_inode.h"
+#include "print-tree.h"
+#include "export.h"
+#include "compat.h"
+
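+/*
+ * file handle lengths are expressed in units of 32-bit words, which is
+ * why the sizes below divide struct btrfs_fid offsets by 4
+ */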
+#define BTRFS_FID_SIZE_NON_CONNECTABLE (offsetof(struct btrfs_fid, \
+                                                parent_objectid) / 4)
+#define BTRFS_FID_SIZE_CONNECTABLE (offsetof(struct btrfs_fid, \
+                                            parent_root_objectid) / 4)
+#define BTRFS_FID_SIZE_CONNECTABLE_ROOT (sizeof(struct btrfs_fid) / 4)
+
+static int btrfs_encode_fh(struct dentry *dentry, u32 *fh, int *max_len,
+                          int connectable)
+{
+       struct btrfs_fid *fid = (struct btrfs_fid *)fh;
+       struct inode *inode = dentry->d_inode;
+       int len = *max_len;
+       int type;
+
+       if ((len < BTRFS_FID_SIZE_NON_CONNECTABLE) ||
+           (connectable && len < BTRFS_FID_SIZE_CONNECTABLE))
+               return 255;
+
+       len  = BTRFS_FID_SIZE_NON_CONNECTABLE;
+       type = FILEID_BTRFS_WITHOUT_PARENT;
+
+       fid->objectid = BTRFS_I(inode)->location.objectid;
+       fid->root_objectid = BTRFS_I(inode)->root->objectid;
+       fid->gen = inode->i_generation;
+
+       if (connectable && !S_ISDIR(inode->i_mode)) {
+               struct inode *parent;
+               u64 parent_root_id;
+
+               spin_lock(&dentry->d_lock);
+
+               parent = dentry->d_parent->d_inode;
+               fid->parent_objectid = BTRFS_I(parent)->location.objectid;
+               fid->parent_gen = parent->i_generation;
+               parent_root_id = BTRFS_I(parent)->root->objectid;
+
+               spin_unlock(&dentry->d_lock);
+
+               if (parent_root_id != fid->root_objectid) {
+                       fid->parent_root_objectid = parent_root_id;
+                       len = BTRFS_FID_SIZE_CONNECTABLE_ROOT;
+                       type = FILEID_BTRFS_WITH_PARENT_ROOT;
+               } else {
+                       len = BTRFS_FID_SIZE_CONNECTABLE;
+                       type = FILEID_BTRFS_WITH_PARENT;
+               }
+       }
+
+       *max_len = len;
+       return type;
+}
+
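+/*
+ * turn an (objectid, root, generation) triple back into a dentry: look
+ * up the subvolume root, then the inode inside it, and treat handles
+ * whose generation no longer matches as stale
+ */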
+static struct dentry *btrfs_get_dentry(struct super_block *sb, u64 objectid,
+                                      u64 root_objectid, u32 generation)
+{
+       struct btrfs_root *root;
+       struct inode *inode;
+       struct btrfs_key key;
+
+       key.objectid = root_objectid;
+       btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
+       key.offset = (u64)-1;
+
+       root = btrfs_read_fs_root_no_name(btrfs_sb(sb)->fs_info, &key);
+       if (IS_ERR(root))
+               return ERR_CAST(root);
+
+       key.objectid = objectid;
+       btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
+       key.offset = 0;
+
+       inode = btrfs_iget(sb, &key, root, NULL);
+       if (IS_ERR(inode))
+               return (void *)inode;
+
+       if (generation != inode->i_generation) {
+               iput(inode);
+               return ERR_PTR(-ESTALE);
+       }
+
+       return d_obtain_alias(inode);
+}
+
+static struct dentry *btrfs_fh_to_parent(struct super_block *sb, struct fid *fh,
+                                        int fh_len, int fh_type)
+{
+       struct btrfs_fid *fid = (struct btrfs_fid *) fh;
+       u64 objectid, root_objectid;
+       u32 generation;
+
+       if (fh_type == FILEID_BTRFS_WITH_PARENT) {
+               if (fh_len !=  BTRFS_FID_SIZE_CONNECTABLE)
+                       return NULL;
+               root_objectid = fid->root_objectid;
+       } else if (fh_type == FILEID_BTRFS_WITH_PARENT_ROOT) {
+               if (fh_len != BTRFS_FID_SIZE_CONNECTABLE_ROOT)
+                       return NULL;
+               root_objectid = fid->parent_root_objectid;
+       } else
+               return NULL;
+
+       objectid = fid->parent_objectid;
+       generation = fid->parent_gen;
+
+       return btrfs_get_dentry(sb, objectid, root_objectid, generation);
+}
+
+static struct dentry *btrfs_fh_to_dentry(struct super_block *sb, struct fid *fh,
+                                        int fh_len, int fh_type)
+{
+       struct btrfs_fid *fid = (struct btrfs_fid *) fh;
+       u64 objectid, root_objectid;
+       u32 generation;
+
+       if ((fh_type != FILEID_BTRFS_WITH_PARENT ||
+            fh_len != BTRFS_FID_SIZE_CONNECTABLE) &&
+           (fh_type != FILEID_BTRFS_WITH_PARENT_ROOT ||
+            fh_len != BTRFS_FID_SIZE_CONNECTABLE_ROOT) &&
+           (fh_type != FILEID_BTRFS_WITHOUT_PARENT ||
+            fh_len != BTRFS_FID_SIZE_NON_CONNECTABLE))
+               return NULL;
+
+       objectid = fid->objectid;
+       root_objectid = fid->root_objectid;
+       generation = fid->gen;
+
+       return btrfs_get_dentry(sb, objectid, root_objectid, generation);
+}
+
+static struct dentry *btrfs_get_parent(struct dentry *child)
+{
+       struct inode *dir = child->d_inode;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       struct btrfs_key key;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       int slot;
+       u64 objectid;
+       int ret;
+
+       path = btrfs_alloc_path();
+
+       key.objectid = dir->i_ino;
+       btrfs_set_key_type(&key, BTRFS_INODE_REF_KEY);
+       key.offset = (u64)-1;
+
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0) {
+               /* Error */
+               btrfs_free_path(path);
+               return ERR_PTR(ret);
+       }
+       leaf = path->nodes[0];
+       slot = path->slots[0];
+       if (ret) {
+               /* btrfs_search_slot() returns the slot where we'd want to
+                  insert a backref for parent inode #0xFFFFFFFFFFFFFFFF.
+                  The _real_ backref, telling us what the parent inode
+                  _actually_ is, will be in the slot _before_ the one
+                  that btrfs_search_slot() returns. */
+               if (!slot) {
+                       /* Unless there is _no_ key in the tree before... */
+                       btrfs_free_path(path);
+                       return ERR_PTR(-EIO);
+               }
+               slot--;
+       }
+
+       btrfs_item_key_to_cpu(leaf, &key, slot);
+       btrfs_free_path(path);
+
+       if (key.objectid != dir->i_ino || key.type != BTRFS_INODE_REF_KEY)
+               return ERR_PTR(-EINVAL);
+
+       objectid = key.offset;
+
+       /* If we are already at the root of a subvol, return the real root */
+       if (objectid == dir->i_ino)
+               return dget(dir->i_sb->s_root);
+
+       /* Build a new key for the inode item */
+       key.objectid = objectid;
+       btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
+       key.offset = 0;
+
+       return d_obtain_alias(btrfs_iget(root->fs_info->sb, &key, root, NULL));
+}
+
+const struct export_operations btrfs_export_ops = {
+       .encode_fh      = btrfs_encode_fh,
+       .fh_to_dentry   = btrfs_fh_to_dentry,
+       .fh_to_parent   = btrfs_fh_to_parent,
+       .get_parent     = btrfs_get_parent,
+};
diff --git a/fs/btrfs/export.h b/fs/btrfs/export.h
new file mode 100644 (file)
index 0000000..074348a
--- /dev/null
@@ -0,0 +1,19 @@
+#ifndef BTRFS_EXPORT_H
+#define BTRFS_EXPORT_H
+
+#include <linux/exportfs.h>
+
+extern const struct export_operations btrfs_export_ops;
+
+struct btrfs_fid {
+       u64 objectid;
+       u64 root_objectid;
+       u32 gen;
+
+       u64 parent_objectid;
+       u32 parent_gen;
+
+       u64 parent_root_objectid;
+} __attribute__ ((packed));
+
+#endif
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
new file mode 100644 (file)
index 0000000..293da65
--- /dev/null
@@ -0,0 +1,5986 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+#include <linux/sched.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
+#include <linux/blkdev.h>
+#include <linux/version.h>
+#include "compat.h"
+#include "hash.h"
+#include "crc32c.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "print-tree.h"
+#include "transaction.h"
+#include "volumes.h"
+#include "locking.h"
+#include "ref-cache.h"
+#include "compat.h"
+
+#define PENDING_EXTENT_INSERT 0
+#define PENDING_EXTENT_DELETE 1
+#define PENDING_BACKREF_UPDATE 2
+
+struct pending_extent_op {
+       int type;
+       u64 bytenr;
+       u64 num_bytes;
+       u64 parent;
+       u64 orig_parent;
+       u64 generation;
+       u64 orig_generation;
+       int level;
+       struct list_head list;
+       int del;
+};
+
+static int finish_current_insert(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *extent_root, int all);
+static int del_pending_extents(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *extent_root, int all);
+static int pin_down_bytes(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root,
+                         u64 bytenr, u64 num_bytes, int is_data);
+static int update_block_group(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             u64 bytenr, u64 num_bytes, int alloc,
+                             int mark_free);
+
+static int block_group_bits(struct btrfs_block_group_cache *cache, u64 bits)
+{
+       return (cache->flags & bits) == bits;
+}
+
+/*
+ * this adds the block group to the fs_info rb tree for the block group
+ * cache
+ */
+static int btrfs_add_block_group_cache(struct btrfs_fs_info *info,
+                               struct btrfs_block_group_cache *block_group)
+{
+       struct rb_node **p;
+       struct rb_node *parent = NULL;
+       struct btrfs_block_group_cache *cache;
+
+       spin_lock(&info->block_group_cache_lock);
+       p = &info->block_group_cache_tree.rb_node;
+
+       while (*p) {
+               parent = *p;
+               cache = rb_entry(parent, struct btrfs_block_group_cache,
+                                cache_node);
+               if (block_group->key.objectid < cache->key.objectid) {
+                       p = &(*p)->rb_left;
+               } else if (block_group->key.objectid > cache->key.objectid) {
+                       p = &(*p)->rb_right;
+               } else {
+                       spin_unlock(&info->block_group_cache_lock);
+                       return -EEXIST;
+               }
+       }
+
+       rb_link_node(&block_group->cache_node, parent, p);
+       rb_insert_color(&block_group->cache_node,
+                       &info->block_group_cache_tree);
+       spin_unlock(&info->block_group_cache_lock);
+
+       return 0;
+}
+
+/*
+ * This will return the block group at or after bytenr if contains is 0, else
+ * it will return the block group that contains the bytenr
+ */
+static struct btrfs_block_group_cache *
+block_group_cache_tree_search(struct btrfs_fs_info *info, u64 bytenr,
+                             int contains)
+{
+       struct btrfs_block_group_cache *cache, *ret = NULL;
+       struct rb_node *n;
+       u64 end, start;
+
+       spin_lock(&info->block_group_cache_lock);
+       n = info->block_group_cache_tree.rb_node;
+
+       while (n) {
+               cache = rb_entry(n, struct btrfs_block_group_cache,
+                                cache_node);
+               end = cache->key.objectid + cache->key.offset - 1;
+               start = cache->key.objectid;
+
+               if (bytenr < start) {
+                       if (!contains && (!ret || start < ret->key.objectid))
+                               ret = cache;
+                       n = n->rb_left;
+               } else if (bytenr > start) {
+                       if (contains && bytenr <= end) {
+                               ret = cache;
+                               break;
+                       }
+                       n = n->rb_right;
+               } else {
+                       ret = cache;
+                       break;
+               }
+       }
+       if (ret)
+               atomic_inc(&ret->count);
+       spin_unlock(&info->block_group_cache_lock);
+
+       return ret;
+}
+
+/*
+ * this is only called by cache_block_group.  since we could have freed
+ * extents, we need to check the pinned_extents for any extents that can't
+ * be used yet, as their free space won't be released until the transaction
+ * commits.
+ */
+static int add_new_free_space(struct btrfs_block_group_cache *block_group,
+                             struct btrfs_fs_info *info, u64 start, u64 end)
+{
+       u64 extent_start, extent_end, size;
+       int ret;
+
+       mutex_lock(&info->pinned_mutex);
+       while (start < end) {
+               ret = find_first_extent_bit(&info->pinned_extents, start,
+                                           &extent_start, &extent_end,
+                                           EXTENT_DIRTY);
+               if (ret)
+                       break;
+
+               if (extent_start == start) {
+                       start = extent_end + 1;
+               } else if (extent_start > start && extent_start < end) {
+                       size = extent_start - start;
+                       ret = btrfs_add_free_space(block_group, start,
+                                                  size);
+                       BUG_ON(ret);
+                       start = extent_end + 1;
+               } else {
+                       break;
+               }
+       }
+
+       if (start < end) {
+               size = end - start;
+               ret = btrfs_add_free_space(block_group, start, size);
+               BUG_ON(ret);
+       }
+       mutex_unlock(&info->pinned_mutex);
+
+       return 0;
+}
+
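+/*
+ * the superblock mirrors live inside block groups; carve their byte
+ * ranges back out of the free space cache so the allocator never hands
+ * them out
+ */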
+static int remove_sb_from_cache(struct btrfs_root *root,
+                               struct btrfs_block_group_cache *cache)
+{
+       u64 bytenr;
+       u64 *logical;
+       int stripe_len;
+       int i, nr, ret;
+
+       for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+               bytenr = btrfs_sb_offset(i);
+               ret = btrfs_rmap_block(&root->fs_info->mapping_tree,
+                                      cache->key.objectid, bytenr, 0,
+                                      &logical, &nr, &stripe_len);
+               BUG_ON(ret);
+               while (nr--) {
+                       btrfs_remove_free_space(cache, logical[nr],
+                                               stripe_len);
+               }
+               kfree(logical);
+       }
+       return 0;
+}
+
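+/*
+ * build the free space cache for a block group by walking the extent
+ * tree and recording the gaps between allocated extents
+ */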
+static int cache_block_group(struct btrfs_root *root,
+                            struct btrfs_block_group_cache *block_group)
+{
+       struct btrfs_path *path;
+       int ret = 0;
+       struct btrfs_key key;
+       struct extent_buffer *leaf;
+       int slot;
+       u64 last;
+
+       if (!block_group)
+               return 0;
+
+       root = root->fs_info->extent_root;
+
+       if (block_group->cached)
+               return 0;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       path->reada = 2;
+       /*
+        * we get into deadlocks with paths held by callers of this function.
+        * since the alloc_mutex is protecting things right now, just
+        * skip the locking here
+        */
+       path->skip_locking = 1;
+       last = max_t(u64, block_group->key.objectid, BTRFS_SUPER_INFO_OFFSET);
+       key.objectid = last;
+       key.offset = 0;
+       btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto err;
+
+       while (1) {
+               leaf = path->nodes[0];
+               slot = path->slots[0];
+               if (slot >= btrfs_header_nritems(leaf)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret < 0)
+                               goto err;
+                       if (ret == 0)
+                               continue;
+                       else
+                               break;
+               }
+               btrfs_item_key_to_cpu(leaf, &key, slot);
+               if (key.objectid < block_group->key.objectid)
+                       goto next;
+
+               if (key.objectid >= block_group->key.objectid +
+                   block_group->key.offset)
+                       break;
+
+               if (btrfs_key_type(&key) == BTRFS_EXTENT_ITEM_KEY) {
+                       add_new_free_space(block_group, root->fs_info, last,
+                                          key.objectid);
+
+                       last = key.objectid + key.offset;
+               }
+next:
+               path->slots[0]++;
+       }
+
+       add_new_free_space(block_group, root->fs_info, last,
+                          block_group->key.objectid +
+                          block_group->key.offset);
+
+       remove_sb_from_cache(root, block_group);
+       block_group->cached = 1;
+       ret = 0;
+err:
+       btrfs_free_path(path);
+       return ret;
+}
+
+/*
+ * return the block group that starts at or after bytenr
+ */
+static struct btrfs_block_group_cache *
+btrfs_lookup_first_block_group(struct btrfs_fs_info *info, u64 bytenr)
+{
+       struct btrfs_block_group_cache *cache;
+
+       cache = block_group_cache_tree_search(info, bytenr, 0);
+
+       return cache;
+}
+
+/*
+ * return the block group that contains the given bytenr
+ */
+struct btrfs_block_group_cache *btrfs_lookup_block_group(
+                                                struct btrfs_fs_info *info,
+                                                u64 bytenr)
+{
+       struct btrfs_block_group_cache *cache;
+
+       cache = block_group_cache_tree_search(info, bytenr, 1);
+
+       return cache;
+}
+
+static inline void put_block_group(struct btrfs_block_group_cache *cache)
+{
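+       /* drops the reference taken in block_group_cache_tree_search() */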
+       if (atomic_dec_and_test(&cache->count))
+               kfree(cache);
+}
+
+static struct btrfs_space_info *__find_space_info(struct btrfs_fs_info *info,
+                                                 u64 flags)
+{
+       struct list_head *head = &info->space_info;
+       struct list_head *cur;
+       struct btrfs_space_info *found;
+       list_for_each(cur, head) {
+               found = list_entry(cur, struct btrfs_space_info, list);
+               if (found->flags == flags)
+                       return found;
+       }
+       return NULL;
+}
+
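+/*
+ * returns num * factor / 10; e.g. div_factor(1048576, 9) == 943718.
+ * callers pass factor = 9 to ask whether a block group is under 90%
+ * full (worked example added editorially, not in the original patch)
+ */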
+static u64 div_factor(u64 num, int factor)
+{
+       if (factor == 10)
+               return num;
+       num *= factor;
+       do_div(num, 10);
+       return num;
+}
+
+u64 btrfs_find_block_group(struct btrfs_root *root,
+                          u64 search_start, u64 search_hint, int owner)
+{
+       struct btrfs_block_group_cache *cache;
+       u64 used;
+       u64 last = max(search_hint, search_start);
+       u64 group_start = 0;
+       int full_search = 0;
+       int factor = 9;
+       int wrapped = 0;
+again:
+       while (1) {
+               cache = btrfs_lookup_first_block_group(root->fs_info, last);
+               if (!cache)
+                       break;
+
+               spin_lock(&cache->lock);
+               last = cache->key.objectid + cache->key.offset;
+               used = btrfs_block_group_used(&cache->item);
+
+               if ((full_search || !cache->ro) &&
+                   block_group_bits(cache, BTRFS_BLOCK_GROUP_METADATA)) {
+                       if (used + cache->pinned + cache->reserved <
+                           div_factor(cache->key.offset, factor)) {
+                               group_start = cache->key.objectid;
+                               spin_unlock(&cache->lock);
+                               put_block_group(cache);
+                               goto found;
+                       }
+               }
+               spin_unlock(&cache->lock);
+               put_block_group(cache);
+               cond_resched();
+       }
+       if (!wrapped) {
+               last = search_start;
+               wrapped = 1;
+               goto again;
+       }
+       if (!full_search && factor < 10) {
+               last = search_start;
+               full_search = 1;
+               factor = 10;
+               goto again;
+       }
+found:
+       return group_start;
+}
+
+/* simple helper to search for an existing extent at a given offset */
+int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
+{
+       int ret;
+       struct btrfs_key key;
+       struct btrfs_path *path;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       key.objectid = start;
+       key.offset = len;
+       btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
+       ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
+                               0, 0);
+       btrfs_free_path(path);
+       return ret;
+}
+
+/*
+ * Back reference rules.  Back refs have three main goals:
+ *
+ * 1) Differentiate between all holders of references to an extent so that
+ *    when a reference is dropped we can make sure it was a valid reference
+ *    before freeing the extent.
+ *
+ * 2) Provide enough information to quickly find the holders of an extent
+ *    if we notice a given block is corrupted or bad.
+ *
+ * 3) Make it easy to migrate blocks for FS shrinking or storage pool
+ *    maintenance.  This is actually the same as #2, but with a slightly
+ *    different use case.
+ *
+ * File extents can be referenced by:
+ *
+ * - multiple snapshots, subvolumes, or different generations in one subvol
+ * - different files inside a single subvolume
+ * - different offsets inside a file (bookend extents in file.c)
+ *
+ * The extent ref structure has fields for:
+ *
+ * - objectid of the subvolume root
+ * - generation number of the tree holding the reference
+ * - objectid of the file holding the reference
+ * - number of references held by the parent node (always 1 for tree blocks)
+ *
+ * A btree leaf may hold multiple references to a file extent. In most cases,
+ * these references are from the same file and the corresponding offsets inside
+ * the file are close together.
+ *
+ * When a file extent is allocated the fields are filled in:
+ *     (root_key.objectid, trans->transid, inode objectid, 1)
+ *
+ * When a leaf is cow'd, new references are added for every file extent found
+ * in the leaf.  It looks similar to the create case, but trans->transid will
+ * be different when the block is cow'd.
+ *
+ *     (root_key.objectid, trans->transid, inode objectid,
+ *      number of references in the leaf)
+ *
+ * When a file extent is removed either during snapshot deletion or
+ * file truncation, we find the corresponding back reference and check
+ * the following fields:
+ *
+ *     (btrfs_header_owner(leaf), btrfs_header_generation(leaf),
+ *      inode objectid)
+ *
+ * Btree extents can be referenced by:
+ *
+ * - Different subvolumes
+ * - Different generations of the same subvolume
+ *
+ * When a tree block is created, back references are inserted:
+ *
+ * (root->root_key.objectid, trans->transid, level, 1)
+ *
+ * When a tree block is cow'd, new back references are added for all the
+ * blocks it points to. If the tree block isn't in a reference-counted root,
+ * the old back references are removed. These new back references are of
+ * the form (trans->transid will have increased since creation):
+ *
+ * (root->root_key.objectid, trans->transid, level, 1)
+ *
+ * When a backref is being deleted, the following fields are checked:
+ *
+ * if backref was for a tree root:
+ *     (btrfs_header_owner(itself), btrfs_header_generation(itself), level)
+ * else
+ *     (btrfs_header_owner(parent), btrfs_header_generation(parent), level)
+ *
+ * Back reference key composition:
+ *
+ * The key objectid corresponds to the first byte in the extent, the key
+ * type is set to BTRFS_EXTENT_REF_KEY, and the key offset is the first
+ * byte of the parent extent. If an extent is a tree root, the key offset
+ * is set to the key objectid.
+ */
+
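+/*
+ * editorial sketch (not part of the original patch): how a back
+ * reference key is composed under the rules above.  bytenr, parent and
+ * is_tree_root stand in for the caller's values.
+ */
+#if 0
+static void example_backref_key(struct btrfs_key *key, u64 bytenr,
+                               u64 parent, int is_tree_root)
+{
+       key->objectid = bytenr;         /* first byte of the extent */
+       key->type = BTRFS_EXTENT_REF_KEY;
+       /* a tree root's backref key points back at the extent itself */
+       key->offset = is_tree_root ? bytenr : parent;
+}
+#endif
+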
+static noinline int lookup_extent_backref(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path,
+                                         u64 bytenr, u64 parent,
+                                         u64 ref_root, u64 ref_generation,
+                                         u64 owner_objectid, int del)
+{
+       struct btrfs_key key;
+       struct btrfs_extent_ref *ref;
+       struct extent_buffer *leaf;
+       u64 ref_objectid;
+       int ret;
+
+       key.objectid = bytenr;
+       key.type = BTRFS_EXTENT_REF_KEY;
+       key.offset = parent;
+
+       ret = btrfs_search_slot(trans, root, &key, path, del ? -1 : 0, 1);
+       if (ret < 0)
+               goto out;
+       if (ret > 0) {
+               ret = -ENOENT;
+               goto out;
+       }
+
+       leaf = path->nodes[0];
+       ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_ref);
+       ref_objectid = btrfs_ref_objectid(leaf, ref);
+       if (btrfs_ref_root(leaf, ref) != ref_root ||
+           btrfs_ref_generation(leaf, ref) != ref_generation ||
+           (ref_objectid != owner_objectid &&
+            ref_objectid != BTRFS_MULTIPLE_OBJECTIDS)) {
+               ret = -EIO;
+               WARN_ON(1);
+               goto out;
+       }
+       ret = 0;
+out:
+       return ret;
+}
+
+/*
+ * updates all the backrefs that are pending on update_list for the
+ * extent_root
+ */
+static noinline int update_backrefs(struct btrfs_trans_handle *trans,
+                                   struct btrfs_root *extent_root,
+                                   struct btrfs_path *path,
+                                   struct list_head *update_list)
+{
+       struct btrfs_key key;
+       struct btrfs_extent_ref *ref;
+       struct btrfs_fs_info *info = extent_root->fs_info;
+       struct pending_extent_op *op;
+       struct extent_buffer *leaf;
+       int ret = 0;
+       struct list_head *cur = update_list->next;
+       u64 ref_objectid;
+       u64 ref_root = extent_root->root_key.objectid;
+
+       op = list_entry(cur, struct pending_extent_op, list);
+
+search:
+       key.objectid = op->bytenr;
+       key.type = BTRFS_EXTENT_REF_KEY;
+       key.offset = op->orig_parent;
+
+       ret = btrfs_search_slot(trans, extent_root, &key, path, 0, 1);
+       BUG_ON(ret);
+
+       leaf = path->nodes[0];
+
+loop:
+       ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_ref);
+
+       ref_objectid = btrfs_ref_objectid(leaf, ref);
+
+       if (btrfs_ref_root(leaf, ref) != ref_root ||
+           btrfs_ref_generation(leaf, ref) != op->orig_generation ||
+           (ref_objectid != op->level &&
+            ref_objectid != BTRFS_MULTIPLE_OBJECTIDS)) {
+               printk(KERN_ERR "btrfs couldn't find %llu, parent %llu, "
+                      "root %llu, owner %u\n",
+                      (unsigned long long)op->bytenr,
+                      (unsigned long long)op->orig_parent,
+                      (unsigned long long)ref_root, op->level);
+               btrfs_print_leaf(extent_root, leaf);
+               BUG();
+       }
+
+       key.objectid = op->bytenr;
+       key.offset = op->parent;
+       key.type = BTRFS_EXTENT_REF_KEY;
+       ret = btrfs_set_item_key_safe(trans, extent_root, path, &key);
+       BUG_ON(ret);
+       ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_ref);
+       btrfs_set_ref_generation(leaf, ref, op->generation);
+
+       cur = cur->next;
+
+       list_del_init(&op->list);
+       unlock_extent(&info->extent_ins, op->bytenr,
+                     op->bytenr + op->num_bytes - 1, GFP_NOFS);
+       kfree(op);
+
+       if (cur == update_list) {
+               btrfs_mark_buffer_dirty(path->nodes[0]);
+               btrfs_release_path(extent_root, path);
+               goto out;
+       }
+
+       op = list_entry(cur, struct pending_extent_op, list);
+
+       path->slots[0]++;
+       while (path->slots[0] < btrfs_header_nritems(leaf)) {
+               btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+               if (key.objectid == op->bytenr &&
+                   key.type == BTRFS_EXTENT_REF_KEY)
+                       goto loop;
+               path->slots[0]++;
+       }
+
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+       btrfs_release_path(extent_root, path);
+       goto search;
+
+out:
+       return 0;
+}
+
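+/*
+ * batch-insert the extent items and backrefs queued on insert_list.
+ * keys for both items of each pending op are built up front and fed
+ * to btrfs_insert_some_items() so that as many items as possible go
+ * into each leaf
+ */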
+static noinline int insert_extents(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *extent_root,
+                                  struct btrfs_path *path,
+                                  struct list_head *insert_list, int nr)
+{
+       struct btrfs_key *keys;
+       u32 *data_size;
+       struct pending_extent_op *op;
+       struct extent_buffer *leaf;
+       struct list_head *cur = insert_list->next;
+       struct btrfs_fs_info *info = extent_root->fs_info;
+       u64 ref_root = extent_root->root_key.objectid;
+       int i = 0, last = 0, ret;
+       int total = nr * 2;
+
+       if (!nr)
+               return 0;
+
+       keys = kzalloc(total * sizeof(struct btrfs_key), GFP_NOFS);
+       if (!keys)
+               return -ENOMEM;
+
+       data_size = kzalloc(total * sizeof(u32), GFP_NOFS);
+       if (!data_size) {
+               kfree(keys);
+               return -ENOMEM;
+       }
+
+       list_for_each_entry(op, insert_list, list) {
+               keys[i].objectid = op->bytenr;
+               keys[i].offset = op->num_bytes;
+               keys[i].type = BTRFS_EXTENT_ITEM_KEY;
+               data_size[i] = sizeof(struct btrfs_extent_item);
+               i++;
+
+               keys[i].objectid = op->bytenr;
+               keys[i].offset = op->parent;
+               keys[i].type = BTRFS_EXTENT_REF_KEY;
+               data_size[i] = sizeof(struct btrfs_extent_ref);
+               i++;
+       }
+
+       op = list_entry(cur, struct pending_extent_op, list);
+       i = 0;
+       while (i < total) {
+               int c;
+               ret = btrfs_insert_some_items(trans, extent_root, path,
+                                             keys+i, data_size+i, total-i);
+               BUG_ON(ret < 0);
+
+               if (last && ret > 1)
+                       BUG();
+
+               leaf = path->nodes[0];
+               for (c = 0; c < ret; c++) {
+                       int ref_first = keys[i].type == BTRFS_EXTENT_REF_KEY;
+
+                       /*
+                        * if the first item we inserted was a backref, then
+                        * the EXTENT_ITEM will be the odd c's, else it will
+                        * be the even c's
+                        */
+                       if ((ref_first && (c % 2)) ||
+                           (!ref_first && !(c % 2))) {
+                               struct btrfs_extent_item *itm;
+
+                               itm = btrfs_item_ptr(leaf, path->slots[0] + c,
+                                                    struct btrfs_extent_item);
+                               btrfs_set_extent_refs(path->nodes[0], itm, 1);
+                               op->del++;
+                       } else {
+                               struct btrfs_extent_ref *ref;
+
+                               ref = btrfs_item_ptr(leaf, path->slots[0] + c,
+                                                    struct btrfs_extent_ref);
+                               btrfs_set_ref_root(leaf, ref, ref_root);
+                               btrfs_set_ref_generation(leaf, ref,
+                                                        op->generation);
+                               btrfs_set_ref_objectid(leaf, ref, op->level);
+                               btrfs_set_ref_num_refs(leaf, ref, 1);
+                               op->del++;
+                       }
+
+                       /*
+                        * we use del to see when it's ok to free up the
+                        * pending_extent_op.  In the case where we insert the
+                        * last item on the list early in order to help with
+                        * batching, we must not free the extent op until we
+                        * actually insert the extent_item
+                        */
+                       if (op->del == 2) {
+                               unlock_extent(&info->extent_ins, op->bytenr,
+                                             op->bytenr + op->num_bytes - 1,
+                                             GFP_NOFS);
+                               cur = cur->next;
+                               list_del_init(&op->list);
+                               kfree(op);
+                               if (cur != insert_list)
+                                       op = list_entry(cur,
+                                               struct pending_extent_op,
+                                               list);
+                       }
+               }
+               btrfs_mark_buffer_dirty(leaf);
+               btrfs_release_path(extent_root, path);
+
+               /*
+                * Ok, backrefs and items usually go right next to each other,
+                * but if we could only insert 1 item that means that we
+                * inserted on the end of a leaf, and we have no idea what may
+                * be on the next leaf so we just play it safe.  In order to
+                * try and help this case we insert the last thing on our
+                * insert list so hopefully it will end up being the last
+                * thing on the leaf and everything else will be before it,
+                * which will let us insert a whole bunch of items at the same
+                * time.
+                */
+               if (ret == 1 && !last && (i + ret < total)) {
+                       /*
+                        * last: where we will pick up the next time around
+                        * i: our current key to insert, will be total - 1
+                        * cur: the current op we are screwing with
+                        * op: duh
+                        */
+                       last = i + ret;
+                       i = total - 1;
+                       cur = insert_list->prev;
+                       op = list_entry(cur, struct pending_extent_op, list);
+               } else if (last) {
+                       /*
+                        * ok we successfully inserted the last item on the
+                        * list, lets reset everything
+                        *
+                        * i: our current key to insert, so where we left off
+                        *    last time
+                        * last: done with this
+                        * cur: the op we are messing with
+                        * op: duh
+                        * total: since we inserted the last key, we need to
+                        *        decrement total so we don't overflow
+                        */
+                       i = last;
+                       last = 0;
+                       total--;
+                       if (i < total) {
+                               cur = insert_list->next;
+                               op = list_entry(cur, struct pending_extent_op,
+                                               list);
+                       }
+               } else {
+                       i += ret;
+               }
+
+               cond_resched();
+       }
+       ret = 0;
+       kfree(keys);
+       kfree(data_size);
+       return ret;
+}
+
+static noinline int insert_extent_backref(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path,
+                                         u64 bytenr, u64 parent,
+                                         u64 ref_root, u64 ref_generation,
+                                         u64 owner_objectid)
+{
+       struct btrfs_key key;
+       struct extent_buffer *leaf;
+       struct btrfs_extent_ref *ref;
+       u32 num_refs;
+       int ret;
+
+       key.objectid = bytenr;
+       key.type = BTRFS_EXTENT_REF_KEY;
+       key.offset = parent;
+
+       ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*ref));
+       if (ret == 0) {
+               leaf = path->nodes[0];
+               ref = btrfs_item_ptr(leaf, path->slots[0],
+                                    struct btrfs_extent_ref);
+               btrfs_set_ref_root(leaf, ref, ref_root);
+               btrfs_set_ref_generation(leaf, ref, ref_generation);
+               btrfs_set_ref_objectid(leaf, ref, owner_objectid);
+               btrfs_set_ref_num_refs(leaf, ref, 1);
+       } else if (ret == -EEXIST) {
+               u64 existing_owner;
+               BUG_ON(owner_objectid < BTRFS_FIRST_FREE_OBJECTID);
+               leaf = path->nodes[0];
+               ref = btrfs_item_ptr(leaf, path->slots[0],
+                                    struct btrfs_extent_ref);
+               if (btrfs_ref_root(leaf, ref) != ref_root ||
+                   btrfs_ref_generation(leaf, ref) != ref_generation) {
+                       ret = -EIO;
+                       WARN_ON(1);
+                       goto out;
+               }
+
+               num_refs = btrfs_ref_num_refs(leaf, ref);
+               BUG_ON(num_refs == 0);
+               btrfs_set_ref_num_refs(leaf, ref, num_refs + 1);
+
+               existing_owner = btrfs_ref_objectid(leaf, ref);
+               if (existing_owner != owner_objectid &&
+                   existing_owner != BTRFS_MULTIPLE_OBJECTIDS) {
+                       btrfs_set_ref_objectid(leaf, ref,
+                                       BTRFS_MULTIPLE_OBJECTIDS);
+               }
+               ret = 0;
+       } else {
+               goto out;
+       }
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+       btrfs_release_path(root, path);
+       return ret;
+}
+
+static noinline int remove_extent_backref(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_extent_ref *ref;
+       u32 num_refs;
+       int ret = 0;
+
+       leaf = path->nodes[0];
+       ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_extent_ref);
+       num_refs = btrfs_ref_num_refs(leaf, ref);
+       BUG_ON(num_refs == 0);
+       num_refs -= 1;
+       if (num_refs == 0) {
+               ret = btrfs_del_item(trans, root, path);
+       } else {
+               btrfs_set_ref_num_refs(leaf, ref, num_refs);
+               btrfs_mark_buffer_dirty(leaf);
+       }
+       btrfs_release_path(root, path);
+       return ret;
+}
+
+#ifdef BIO_RW_DISCARD
+static void btrfs_issue_discard(struct block_device *bdev,
+                               u64 start, u64 len)
+{
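+       /* blkdev_issue_discard() works in 512-byte sectors, hence the >> 9 */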
+       blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_KERNEL);
+}
+#endif
+
+static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
+                               u64 num_bytes)
+{
+#ifdef BIO_RW_DISCARD
+       int ret;
+       u64 map_length = num_bytes;
+       struct btrfs_multi_bio *multi = NULL;
+
+       /* Tell the block device(s) that the sectors can be discarded */
+       ret = btrfs_map_block(&root->fs_info->mapping_tree, READ,
+                             bytenr, &map_length, &multi, 0);
+       if (!ret) {
+               struct btrfs_bio_stripe *stripe = multi->stripes;
+               int i;
+
+               if (map_length > num_bytes)
+                       map_length = num_bytes;
+
+               for (i = 0; i < multi->num_stripes; i++, stripe++) {
+                       btrfs_issue_discard(stripe->dev->bdev,
+                                           stripe->physical,
+                                           map_length);
+               }
+               kfree(multi);
+       }
+
+       return ret;
+#else
+       return 0;
+#endif
+}
+
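+/*
+ * process the pending deletions on del_list: drop each backref, and
+ * once an extent's ref count hits zero delete the extent item as well,
+ * batching adjacent extent/backref pairs into a single btrfs_del_items()
+ * call where possible
+ */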
+static noinline int free_extents(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *extent_root,
+                                struct list_head *del_list)
+{
+       struct btrfs_fs_info *info = extent_root->fs_info;
+       struct btrfs_path *path;
+       struct btrfs_key key, found_key;
+       struct extent_buffer *leaf;
+       struct list_head *cur;
+       struct pending_extent_op *op;
+       struct btrfs_extent_item *ei;
+       int ret, num_to_del, extent_slot = 0, found_extent = 0;
+       u32 refs;
+       u64 bytes_freed = 0;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+       path->reada = 1;
+
+search:
+       /* search for the backref for the current ref we want to delete */
+       cur = del_list->next;
+       op = list_entry(cur, struct pending_extent_op, list);
+       ret = lookup_extent_backref(trans, extent_root, path, op->bytenr,
+                                   op->orig_parent,
+                                   extent_root->root_key.objectid,
+                                   op->orig_generation, op->level, 1);
+       if (ret) {
+               printk(KERN_ERR "btrfs unable to find backref byte nr %llu "
+                      "root %llu gen %llu owner %u\n",
+                      (unsigned long long)op->bytenr,
+                      (unsigned long long)extent_root->root_key.objectid,
+                      (unsigned long long)op->orig_generation, op->level);
+               btrfs_print_leaf(extent_root, path->nodes[0]);
+               WARN_ON(1);
+               goto out;
+       }
+
+       extent_slot = path->slots[0];
+       num_to_del = 1;
+       found_extent = 0;
+
+       /*
+        * if we aren't the first item on the leaf we can move back one and see
+        * if our ref is right next to our extent item
+        */
+       if (likely(extent_slot)) {
+               extent_slot--;
+               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                     extent_slot);
+               if (found_key.objectid == op->bytenr &&
+                   found_key.type == BTRFS_EXTENT_ITEM_KEY &&
+                   found_key.offset == op->num_bytes) {
+                       num_to_del++;
+                       found_extent = 1;
+               }
+       }
+
+       /*
+        * if we didn't find the extent we need to delete the backref and then
+        * search for the extent item key so we can update its ref count
+        */
+       if (!found_extent) {
+               key.objectid = op->bytenr;
+               key.type = BTRFS_EXTENT_ITEM_KEY;
+               key.offset = op->num_bytes;
+
+               ret = remove_extent_backref(trans, extent_root, path);
+               BUG_ON(ret);
+               btrfs_release_path(extent_root, path);
+               ret = btrfs_search_slot(trans, extent_root, &key, path, -1, 1);
+               BUG_ON(ret);
+               extent_slot = path->slots[0];
+       }
+
+       /* this is where we update the ref count for the extent */
+       leaf = path->nodes[0];
+       ei = btrfs_item_ptr(leaf, extent_slot, struct btrfs_extent_item);
+       refs = btrfs_extent_refs(leaf, ei);
+       BUG_ON(refs == 0);
+       refs--;
+       btrfs_set_extent_refs(leaf, ei, refs);
+
+       btrfs_mark_buffer_dirty(leaf);
+
+       /*
+        * This extent needs deleting.  The reason cur_slot is extent_slot +
+        * num_to_del is because extent_slot points to the slot where the extent
+        * is, and if the backref was not right next to the extent we will be
+        * deleting at least 1 item, and will want to start searching at the
+        * slot directly next to extent_slot.  However if we did find the
+        * backref next to the extent item then we will be deleting at least 2
+        * items and will want to start searching directly after the ref slot
+        */
+       if (!refs) {
+               struct list_head *pos, *n, *end;
+               int cur_slot = extent_slot+num_to_del;
+               u64 super_used;
+               u64 root_used;
+
+               path->slots[0] = extent_slot;
+               bytes_freed = op->num_bytes;
+
+               mutex_lock(&info->pinned_mutex);
+               ret = pin_down_bytes(trans, extent_root, op->bytenr,
+                                    op->num_bytes, op->level >=
+                                    BTRFS_FIRST_FREE_OBJECTID);
+               mutex_unlock(&info->pinned_mutex);
+               BUG_ON(ret < 0);
+               op->del = ret;
+
+               /*
+                * we need to see if we can delete multiple things at once, so
+                * start looping through the list of extents we want to
+                * delete and see if their extent/backref pairs are right next
+                * to each other and the extents only have 1 ref
+                */
+               for (pos = cur->next; pos != del_list; pos = pos->next) {
+                       struct pending_extent_op *tmp;
+
+                       tmp = list_entry(pos, struct pending_extent_op, list);
+
+                       /* we only want to delete extent+ref at this stage */
+                       if (cur_slot >= btrfs_header_nritems(leaf) - 1)
+                               break;
+
+                       btrfs_item_key_to_cpu(leaf, &found_key, cur_slot);
+                       if (found_key.objectid != tmp->bytenr ||
+                           found_key.type != BTRFS_EXTENT_ITEM_KEY ||
+                           found_key.offset != tmp->num_bytes)
+                               break;
+
+                       /* check to make sure this extent only has one ref */
+                       ei = btrfs_item_ptr(leaf, cur_slot,
+                                           struct btrfs_extent_item);
+                       if (btrfs_extent_refs(leaf, ei) != 1)
+                               break;
+
+                       btrfs_item_key_to_cpu(leaf, &found_key, cur_slot+1);
+                       if (found_key.objectid != tmp->bytenr ||
+                           found_key.type != BTRFS_EXTENT_REF_KEY ||
+                           found_key.offset != tmp->orig_parent)
+                               break;
+
+                       /*
+                        * the ref is right next to the extent, we can set the
+                        * ref count to 0 since we will delete them both now
+                        */
+                       btrfs_set_extent_refs(leaf, ei, 0);
+
+                       /* pin down the bytes for this extent */
+                       mutex_lock(&info->pinned_mutex);
+                       ret = pin_down_bytes(trans, extent_root, tmp->bytenr,
+                                            tmp->num_bytes, tmp->level >=
+                                            BTRFS_FIRST_FREE_OBJECTID);
+                       mutex_unlock(&info->pinned_mutex);
+                       BUG_ON(ret < 0);
+
+                       /*
+                        * use the del field to record whether we still need to
+                        * free up the extent when we delete the item.
+                        */
+                       tmp->del = ret;
+                       bytes_freed += tmp->num_bytes;
+
+                       num_to_del += 2;
+                       cur_slot += 2;
+               }
+               end = pos;
+
+               /* update the free space counters */
+               spin_lock(&info->delalloc_lock);
+               super_used = btrfs_super_bytes_used(&info->super_copy);
+               btrfs_set_super_bytes_used(&info->super_copy,
+                                          super_used - bytes_freed);
+
+               root_used = btrfs_root_used(&extent_root->root_item);
+               btrfs_set_root_used(&extent_root->root_item,
+                                   root_used - bytes_freed);
+               spin_unlock(&info->delalloc_lock);
+
+               /* delete the items */
+               ret = btrfs_del_items(trans, extent_root, path,
+                                     path->slots[0], num_to_del);
+               BUG_ON(ret);
+
+               /*
+                * loop through the extents we deleted and do the cleanup work
+                * on them
+                */
+               for (pos = cur, n = pos->next; pos != end;
+                    pos = n, n = pos->next) {
+                       struct pending_extent_op *tmp;
+                       tmp = list_entry(pos, struct pending_extent_op, list);
+
+                       /*
+                        * remember tmp->del tells us whether or not we pinned
+                        * down the extent
+                        */
+                       ret = update_block_group(trans, extent_root,
+                                                tmp->bytenr, tmp->num_bytes, 0,
+                                                tmp->del);
+                       BUG_ON(ret);
+
+                       list_del_init(&tmp->list);
+                       unlock_extent(&info->extent_ins, tmp->bytenr,
+                                     tmp->bytenr + tmp->num_bytes - 1,
+                                     GFP_NOFS);
+                       kfree(tmp);
+               }
+       } else if (refs && found_extent) {
+               /*
+                * the ref and extent were right next to each other, but the
+                * extent still has a ref, so just free the backref and keep
+                * going
+                */
+               ret = remove_extent_backref(trans, extent_root, path);
+               BUG_ON(ret);
+
+               list_del_init(&op->list);
+               unlock_extent(&info->extent_ins, op->bytenr,
+                             op->bytenr + op->num_bytes - 1, GFP_NOFS);
+               kfree(op);
+       } else {
+               /*
+                * the extent has multiple refs and the backref we were looking
+                * for was not right next to it, so just unlock and move on
+                * to the next one; we're good to go
+                */
+               list_del_init(&op->list);
+               unlock_extent(&info->extent_ins, op->bytenr,
+                             op->bytenr + op->num_bytes - 1, GFP_NOFS);
+               kfree(op);
+       }
+
+       btrfs_release_path(extent_root, path);
+       if (!list_empty(del_list))
+               goto search;
+
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
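+/*
+ * move a backref from orig_parent/orig_generation to parent/
+ * ref_generation.  updates against the extent root itself are queued
+ * as pending ops on fs_info->extent_ins; everything else is applied
+ * immediately
+ */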
+static int __btrfs_update_extent_ref(struct btrfs_trans_handle *trans,
+                                    struct btrfs_root *root, u64 bytenr,
+                                    u64 orig_parent, u64 parent,
+                                    u64 orig_root, u64 ref_root,
+                                    u64 orig_generation, u64 ref_generation,
+                                    u64 owner_objectid)
+{
+       int ret;
+       struct btrfs_root *extent_root = root->fs_info->extent_root;
+       struct btrfs_path *path;
+
+       if (root == root->fs_info->extent_root) {
+               struct pending_extent_op *extent_op;
+               u64 num_bytes;
+
+               BUG_ON(owner_objectid >= BTRFS_MAX_LEVEL);
+               num_bytes = btrfs_level_size(root, (int)owner_objectid);
+               mutex_lock(&root->fs_info->extent_ins_mutex);
+               if (test_range_bit(&root->fs_info->extent_ins, bytenr,
+                               bytenr + num_bytes - 1, EXTENT_WRITEBACK, 0)) {
+                       u64 priv;
+                       ret = get_state_private(&root->fs_info->extent_ins,
+                                               bytenr, &priv);
+                       BUG_ON(ret);
+                       extent_op = (struct pending_extent_op *)
+                                                       (unsigned long)priv;
+                       BUG_ON(extent_op->parent != orig_parent);
+                       BUG_ON(extent_op->generation != orig_generation);
+
+                       extent_op->parent = parent;
+                       extent_op->generation = ref_generation;
+               } else {
+                       extent_op = kmalloc(sizeof(*extent_op), GFP_NOFS);
+                       BUG_ON(!extent_op);
+
+                       extent_op->type = PENDING_BACKREF_UPDATE;
+                       extent_op->bytenr = bytenr;
+                       extent_op->num_bytes = num_bytes;
+                       extent_op->parent = parent;
+                       extent_op->orig_parent = orig_parent;
+                       extent_op->generation = ref_generation;
+                       extent_op->orig_generation = orig_generation;
+                       extent_op->level = (int)owner_objectid;
+                       INIT_LIST_HEAD(&extent_op->list);
+                       extent_op->del = 0;
+
+                       set_extent_bits(&root->fs_info->extent_ins,
+                                       bytenr, bytenr + num_bytes - 1,
+                                       EXTENT_WRITEBACK, GFP_NOFS);
+                       set_state_private(&root->fs_info->extent_ins,
+                                         bytenr, (unsigned long)extent_op);
+               }
+               mutex_unlock(&root->fs_info->extent_ins_mutex);
+               return 0;
+       }
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+       ret = lookup_extent_backref(trans, extent_root, path,
+                                   bytenr, orig_parent, orig_root,
+                                   orig_generation, owner_objectid, 1);
+       if (ret)
+               goto out;
+       ret = remove_extent_backref(trans, extent_root, path);
+       if (ret)
+               goto out;
+       ret = insert_extent_backref(trans, extent_root, path, bytenr,
+                                   parent, ref_root, ref_generation,
+                                   owner_objectid);
+       BUG_ON(ret);
+       finish_current_insert(trans, extent_root, 0);
+       del_pending_extents(trans, extent_root, 0);
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_update_extent_ref(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, u64 bytenr,
+                           u64 orig_parent, u64 parent,
+                           u64 ref_root, u64 ref_generation,
+                           u64 owner_objectid)
+{
+       int ret;
+       if (ref_root == BTRFS_TREE_LOG_OBJECTID &&
+           owner_objectid < BTRFS_FIRST_FREE_OBJECTID)
+               return 0;
+       ret = __btrfs_update_extent_ref(trans, root, bytenr, orig_parent,
+                                       parent, ref_root, ref_root,
+                                       ref_generation, ref_generation,
+                                       owner_objectid);
+       return ret;
+}
+
+static int __btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
+                                 struct btrfs_root *root, u64 bytenr,
+                                 u64 orig_parent, u64 parent,
+                                 u64 orig_root, u64 ref_root,
+                                 u64 orig_generation, u64 ref_generation,
+                                 u64 owner_objectid)
+{
+       struct btrfs_path *path;
+       int ret;
+       struct btrfs_key key;
+       struct extent_buffer *l;
+       struct btrfs_extent_item *item;
+       u32 refs;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       path->reada = 1;
+       key.objectid = bytenr;
+       key.type = BTRFS_EXTENT_ITEM_KEY;
+       key.offset = (u64)-1;
+
+       ret = btrfs_search_slot(trans, root->fs_info->extent_root, &key, path,
+                               0, 1);
+       if (ret < 0) {
+               /* don't leak the path on error */
+               btrfs_free_path(path);
+               return ret;
+       }
+       BUG_ON(ret == 0 || path->slots[0] == 0);
+
+       path->slots[0]--;
+       l = path->nodes[0];
+
+       btrfs_item_key_to_cpu(l, &key, path->slots[0]);
+       if (key.objectid != bytenr) {
+               btrfs_print_leaf(root->fs_info->extent_root, path->nodes[0]);
+               printk(KERN_ERR "btrfs wanted %llu found %llu\n",
+                      (unsigned long long)bytenr,
+                      (unsigned long long)key.objectid);
+               BUG();
+       }
+       BUG_ON(key.type != BTRFS_EXTENT_ITEM_KEY);
+
+       item = btrfs_item_ptr(l, path->slots[0], struct btrfs_extent_item);
+       refs = btrfs_extent_refs(l, item);
+       btrfs_set_extent_refs(l, item, refs + 1);
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+
+       btrfs_release_path(root->fs_info->extent_root, path);
+
+       path->reada = 1;
+       ret = insert_extent_backref(trans, root->fs_info->extent_root,
+                                   path, bytenr, parent,
+                                   ref_root, ref_generation,
+                                   owner_objectid);
+       BUG_ON(ret);
+       finish_current_insert(trans, root->fs_info->extent_root, 0);
+       del_pending_extents(trans, root->fs_info->extent_root, 0);
+
+       btrfs_free_path(path);
+       return 0;
+}
+
+int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
+                        struct btrfs_root *root,
+                        u64 bytenr, u64 num_bytes, u64 parent,
+                        u64 ref_root, u64 ref_generation,
+                        u64 owner_objectid)
+{
+       int ret;
+       if (ref_root == BTRFS_TREE_LOG_OBJECTID &&
+           owner_objectid < BTRFS_FIRST_FREE_OBJECTID)
+               return 0;
+       ret = __btrfs_inc_extent_ref(trans, root, bytenr, 0, parent,
+                                    0, ref_root, 0, ref_generation,
+                                    owner_objectid);
+       return ret;
+}
+
+int btrfs_extent_post_op(struct btrfs_trans_handle *trans,
+                        struct btrfs_root *root)
+{
+       finish_current_insert(trans, root->fs_info->extent_root, 1);
+       del_pending_extents(trans, root->fs_info->extent_root, 1);
+       return 0;
+}
+
+int btrfs_lookup_extent_ref(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root, u64 bytenr,
+                           u64 num_bytes, u32 *refs)
+{
+       struct btrfs_path *path;
+       int ret;
+       struct btrfs_key key;
+       struct extent_buffer *l;
+       struct btrfs_extent_item *item;
+
+       WARN_ON(num_bytes < root->sectorsize);
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       path->reada = 1;
+       key.objectid = bytenr;
+       key.offset = num_bytes;
+       btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
+       ret = btrfs_search_slot(trans, root->fs_info->extent_root, &key, path,
+                               0, 0);
+       if (ret < 0)
+               goto out;
+       if (ret != 0) {
+               btrfs_print_leaf(root, path->nodes[0]);
+               printk(KERN_INFO "btrfs failed to find block number %llu\n",
+                      (unsigned long long)bytenr);
+               BUG();
+       }
+       l = path->nodes[0];
+       item = btrfs_item_ptr(l, path->slots[0], struct btrfs_extent_item);
+       *refs = btrfs_extent_refs(l, item);
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_cross_ref_exist(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, u64 objectid, u64 bytenr)
+{
+       struct btrfs_root *extent_root = root->fs_info->extent_root;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_extent_ref *ref_item;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       u64 ref_root;
+       u64 last_snapshot;
+       u32 nritems;
+       int ret;
+
+       key.objectid = bytenr;
+       key.offset = (u64)-1;
+       key.type = BTRFS_EXTENT_ITEM_KEY;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
+       if (ret < 0)
+               goto out;
+       BUG_ON(ret == 0);
+
+       ret = -ENOENT;
+       if (path->slots[0] == 0)
+               goto out;
+
+       path->slots[0]--;
+       leaf = path->nodes[0];
+       btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+       if (found_key.objectid != bytenr ||
+           found_key.type != BTRFS_EXTENT_ITEM_KEY)
+               goto out;
+
+       last_snapshot = btrfs_root_last_snapshot(&root->root_item);
+       while (1) {
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               if (path->slots[0] >= nritems) {
+                       ret = btrfs_next_leaf(extent_root, path);
+                       if (ret < 0)
+                               goto out;
+                       if (ret == 0)
+                               continue;
+                       break;
+               }
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               if (found_key.objectid != bytenr)
+                       break;
+
+               if (found_key.type != BTRFS_EXTENT_REF_KEY) {
+                       path->slots[0]++;
+                       continue;
+               }
+
+               ref_item = btrfs_item_ptr(leaf, path->slots[0],
+                                         struct btrfs_extent_ref);
+               ref_root = btrfs_ref_root(leaf, ref_item);
+               if ((ref_root != root->root_key.objectid &&
+                    ref_root != BTRFS_TREE_LOG_OBJECTID) ||
+                    objectid != btrfs_ref_objectid(leaf, ref_item)) {
+                       ret = 1;
+                       goto out;
+               }
+               if (btrfs_ref_generation(leaf, ref_item) <= last_snapshot) {
+                       ret = 1;
+                       goto out;
+               }
+
+               path->slots[0]++;
+       }
+       ret = 0;
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_cache_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                   struct extent_buffer *buf, u32 nr_extents)
+{
+       struct btrfs_key key;
+       struct btrfs_file_extent_item *fi;
+       u64 root_gen;
+       u32 nritems;
+       int i;
+       int level;
+       int ret = 0;
+       int shared = 0;
+
+       if (!root->ref_cows)
+               return 0;
+
+       if (root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID) {
+               shared = 0;
+               root_gen = root->root_key.offset;
+       } else {
+               shared = 1;
+               root_gen = trans->transid - 1;
+       }
+
+       level = btrfs_header_level(buf);
+       nritems = btrfs_header_nritems(buf);
+
+       if (level == 0) {
+               struct btrfs_leaf_ref *ref;
+               struct btrfs_extent_info *info;
+
+               ref = btrfs_alloc_leaf_ref(root, nr_extents);
+               if (!ref) {
+                       ret = -ENOMEM;
+                       goto out;
+               }
+
+               ref->root_gen = root_gen;
+               ref->bytenr = buf->start;
+               ref->owner = btrfs_header_owner(buf);
+               ref->generation = btrfs_header_generation(buf);
+               ref->nritems = nr_extents;
+               info = ref->extents;
+
+               for (i = 0; nr_extents > 0 && i < nritems; i++) {
+                       u64 disk_bytenr;
+                       btrfs_item_key_to_cpu(buf, &key, i);
+                       if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
+                               continue;
+                       fi = btrfs_item_ptr(buf, i,
+                                           struct btrfs_file_extent_item);
+                       if (btrfs_file_extent_type(buf, fi) ==
+                           BTRFS_FILE_EXTENT_INLINE)
+                               continue;
+                       disk_bytenr = btrfs_file_extent_disk_bytenr(buf, fi);
+                       if (disk_bytenr == 0)
+                               continue;
+
+                       info->bytenr = disk_bytenr;
+                       info->num_bytes =
+                               btrfs_file_extent_disk_num_bytes(buf, fi);
+                       info->objectid = key.objectid;
+                       info->offset = key.offset;
+                       info++;
+               }
+
+               ret = btrfs_add_leaf_ref(root, ref, shared);
+               if (ret == -EEXIST && shared) {
+                       struct btrfs_leaf_ref *old;
+                       old = btrfs_lookup_leaf_ref(root, ref->bytenr);
+                       BUG_ON(!old);
+                       btrfs_remove_leaf_ref(root, old);
+                       btrfs_free_leaf_ref(root, old);
+                       ret = btrfs_add_leaf_ref(root, ref, shared);
+               }
+               WARN_ON(ret);
+               btrfs_free_leaf_ref(root, ref);
+       }
+out:
+       return ret;
+}
+
+int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                 struct extent_buffer *orig_buf, struct extent_buffer *buf,
+                 u32 *nr_extents)
+{
+       u64 bytenr;
+       u64 ref_root;
+       u64 orig_root;
+       u64 ref_generation;
+       u64 orig_generation;
+       u32 nritems;
+       u32 nr_file_extents = 0;
+       struct btrfs_key key;
+       struct btrfs_file_extent_item *fi;
+       int i;
+       int level;
+       int ret = 0;
+       int faili = 0;
+       int (*process_func)(struct btrfs_trans_handle *, struct btrfs_root *,
+                           u64, u64, u64, u64, u64, u64, u64, u64);
+
+       ref_root = btrfs_header_owner(buf);
+       ref_generation = btrfs_header_generation(buf);
+       orig_root = btrfs_header_owner(orig_buf);
+       orig_generation = btrfs_header_generation(orig_buf);
+
+       nritems = btrfs_header_nritems(buf);
+       level = btrfs_header_level(buf);
+
+       if (root->ref_cows) {
+               process_func = __btrfs_inc_extent_ref;
+       } else {
+               if (level == 0 &&
+                   root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+                       goto out;
+               if (level != 0 &&
+                   root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+                       goto out;
+               process_func = __btrfs_update_extent_ref;
+       }
+
+       for (i = 0; i < nritems; i++) {
+               cond_resched();
+               if (level == 0) {
+                       btrfs_item_key_to_cpu(buf, &key, i);
+                       if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
+                               continue;
+                       fi = btrfs_item_ptr(buf, i,
+                                           struct btrfs_file_extent_item);
+                       if (btrfs_file_extent_type(buf, fi) ==
+                           BTRFS_FILE_EXTENT_INLINE)
+                               continue;
+                       bytenr = btrfs_file_extent_disk_bytenr(buf, fi);
+                       if (bytenr == 0)
+                               continue;
+
+                       nr_file_extents++;
+
+                       ret = process_func(trans, root, bytenr,
+                                          orig_buf->start, buf->start,
+                                          orig_root, ref_root,
+                                          orig_generation, ref_generation,
+                                          key.objectid);
+
+                       if (ret) {
+                               faili = i;
+                               WARN_ON(1);
+                               goto fail;
+                       }
+               } else {
+                       bytenr = btrfs_node_blockptr(buf, i);
+                       ret = process_func(trans, root, bytenr,
+                                          orig_buf->start, buf->start,
+                                          orig_root, ref_root,
+                                          orig_generation, ref_generation,
+                                          level - 1);
+                       if (ret) {
+                               faili = i;
+                               WARN_ON(1);
+                               goto fail;
+                       }
+               }
+       }
+out:
+       if (nr_extents) {
+               if (level == 0)
+                       *nr_extents = nr_file_extents;
+               else
+                       *nr_extents = nritems;
+       }
+       return 0;
+fail:
+       WARN_ON(1);
+       return ret;
+}
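+
+/*
+ * Editor's note (annotation, not part of the original patch):
+ * btrfs_inc_ref() is run after a block has been COWed.  For a leaf it
+ * walks every file extent item, for a node every child block pointer,
+ * and calls process_func on each: __btrfs_inc_extent_ref when the root
+ * tracks reference counts (ref_cows), otherwise __btrfs_update_extent_ref
+ * to move the backref from (orig_buf, orig_root, orig_generation) over
+ * to the new parent buffer.
+ */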
+
+int btrfs_update_ref(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root, struct extent_buffer *orig_buf,
+                    struct extent_buffer *buf, int start_slot, int nr)
+{
+       u64 bytenr;
+       u64 ref_root;
+       u64 orig_root;
+       u64 ref_generation;
+       u64 orig_generation;
+       struct btrfs_key key;
+       struct btrfs_file_extent_item *fi;
+       int i;
+       int ret;
+       int slot;
+       int level;
+
+       BUG_ON(start_slot < 0);
+       BUG_ON(start_slot + nr > btrfs_header_nritems(buf));
+
+       ref_root = btrfs_header_owner(buf);
+       ref_generation = btrfs_header_generation(buf);
+       orig_root = btrfs_header_owner(orig_buf);
+       orig_generation = btrfs_header_generation(orig_buf);
+       level = btrfs_header_level(buf);
+
+       if (!root->ref_cows) {
+               if (level == 0 &&
+                   root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+                       return 0;
+               if (level != 0 &&
+                   root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID)
+                       return 0;
+       }
+
+       for (i = 0, slot = start_slot; i < nr; i++, slot++) {
+               cond_resched();
+               if (level == 0) {
+                       btrfs_item_key_to_cpu(buf, &key, slot);
+                       if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
+                               continue;
+                       fi = btrfs_item_ptr(buf, slot,
+                                           struct btrfs_file_extent_item);
+                       if (btrfs_file_extent_type(buf, fi) ==
+                           BTRFS_FILE_EXTENT_INLINE)
+                               continue;
+                       bytenr = btrfs_file_extent_disk_bytenr(buf, fi);
+                       if (bytenr == 0)
+                               continue;
+                       ret = __btrfs_update_extent_ref(trans, root, bytenr,
+                                           orig_buf->start, buf->start,
+                                           orig_root, ref_root,
+                                           orig_generation, ref_generation,
+                                           key.objectid);
+                       if (ret)
+                               goto fail;
+               } else {
+                       bytenr = btrfs_node_blockptr(buf, slot);
+                       ret = __btrfs_update_extent_ref(trans, root, bytenr,
+                                           orig_buf->start, buf->start,
+                                           orig_root, ref_root,
+                                           orig_generation, ref_generation,
+                                           level - 1);
+                       if (ret)
+                               goto fail;
+               }
+       }
+       return 0;
+fail:
+       WARN_ON(1);
+       return -1;
+}
+
+static int write_one_cache_group(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                struct btrfs_path *path,
+                                struct btrfs_block_group_cache *cache)
+{
+       int ret;
+       int pending_ret;
+       struct btrfs_root *extent_root = root->fs_info->extent_root;
+       unsigned long bi;
+       struct extent_buffer *leaf;
+
+       ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
+       if (ret < 0)
+               goto fail;
+       BUG_ON(ret);
+
+       leaf = path->nodes[0];
+       bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
+       write_extent_buffer(leaf, &cache->item, bi, sizeof(cache->item));
+       btrfs_mark_buffer_dirty(leaf);
+       btrfs_release_path(extent_root, path);
+fail:
+       finish_current_insert(trans, extent_root, 0);
+       pending_ret = del_pending_extents(trans, extent_root, 0);
+       if (ret)
+               return ret;
+       if (pending_ret)
+               return pending_ret;
+       return 0;
+}
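+
+/*
+ * Editor's note: the function above shows the usual btrfs "update an
+ * item in place" idiom: btrfs_search_slot() with cow == 1 to find and
+ * COW the leaf, btrfs_item_ptr_offset() to locate the item bytes,
+ * write_extent_buffer() to overwrite them, then
+ * btrfs_mark_buffer_dirty() so the change goes out with the
+ * transaction.
+ */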
+
+int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root)
+{
+       struct btrfs_block_group_cache *cache, *entry;
+       struct rb_node *n;
+       int err = 0;
+       int werr = 0;
+       struct btrfs_path *path;
+       u64 last = 0;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       while (1) {
+               cache = NULL;
+               spin_lock(&root->fs_info->block_group_cache_lock);
+               for (n = rb_first(&root->fs_info->block_group_cache_tree);
+                    n; n = rb_next(n)) {
+                       entry = rb_entry(n, struct btrfs_block_group_cache,
+                                        cache_node);
+                       if (entry->dirty) {
+                               cache = entry;
+                               break;
+                       }
+               }
+               spin_unlock(&root->fs_info->block_group_cache_lock);
+
+               if (!cache)
+                       break;
+
+               cache->dirty = 0;
+               last += cache->key.offset;
+
+               err = write_one_cache_group(trans, root,
+                                           path, cache);
+               /*
+                * if we fail to write the cache group, we want
+                * to keep it marked dirty in hopes that a later
+                * write will work
+                */
+               if (err) {
+                       /* re-mark the group dirty so a later pass retries */
+                       cache->dirty = 1;
+                       werr = err;
+                       continue;
+               }
+       }
+       btrfs_free_path(path);
+       return werr;
+}
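+
+/*
+ * Editor's note: each pass re-walks the rb-tree from rb_first() and
+ * takes the first dirty group it finds.  The scan restarts every time
+ * because block_group_cache_lock is a spinlock and cannot be held
+ * across write_one_cache_group(), which may block while searching and
+ * COWing the extent tree.
+ */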
+
+int btrfs_extent_readonly(struct btrfs_root *root, u64 bytenr)
+{
+       struct btrfs_block_group_cache *block_group;
+       int readonly = 0;
+
+       block_group = btrfs_lookup_block_group(root->fs_info, bytenr);
+       if (!block_group || block_group->ro)
+               readonly = 1;
+       if (block_group)
+               put_block_group(block_group);
+       return readonly;
+}
+
+static int update_space_info(struct btrfs_fs_info *info, u64 flags,
+                            u64 total_bytes, u64 bytes_used,
+                            struct btrfs_space_info **space_info)
+{
+       struct btrfs_space_info *found;
+
+       found = __find_space_info(info, flags);
+       if (found) {
+               spin_lock(&found->lock);
+               found->total_bytes += total_bytes;
+               found->bytes_used += bytes_used;
+               found->full = 0;
+               spin_unlock(&found->lock);
+               *space_info = found;
+               return 0;
+       }
+       found = kzalloc(sizeof(*found), GFP_NOFS);
+       if (!found)
+               return -ENOMEM;
+
+       list_add(&found->list, &info->space_info);
+       INIT_LIST_HEAD(&found->block_groups);
+       init_rwsem(&found->groups_sem);
+       spin_lock_init(&found->lock);
+       found->flags = flags;
+       found->total_bytes = total_bytes;
+       found->bytes_used = bytes_used;
+       found->bytes_pinned = 0;
+       found->bytes_reserved = 0;
+       found->bytes_readonly = 0;
+       found->full = 0;
+       found->force_alloc = 0;
+       *space_info = found;
+       return 0;
+}
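+
+/*
+ * Editor's note: update_space_info() is a find-or-create helper.  When
+ * a space_info for these flags already exists, the byte counts are
+ * accumulated under its lock; otherwise a zeroed record is allocated
+ * and linked into info->space_info.  do_chunk_alloc() below relies on
+ * the create path by calling it with total_bytes == 0 and
+ * bytes_used == 0.
+ */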
+
+static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
+{
+       u64 extra_flags = flags & (BTRFS_BLOCK_GROUP_RAID0 |
+                                  BTRFS_BLOCK_GROUP_RAID1 |
+                                  BTRFS_BLOCK_GROUP_RAID10 |
+                                  BTRFS_BLOCK_GROUP_DUP);
+       if (extra_flags) {
+               if (flags & BTRFS_BLOCK_GROUP_DATA)
+                       fs_info->avail_data_alloc_bits |= extra_flags;
+               if (flags & BTRFS_BLOCK_GROUP_METADATA)
+                       fs_info->avail_metadata_alloc_bits |= extra_flags;
+               if (flags & BTRFS_BLOCK_GROUP_SYSTEM)
+                       fs_info->avail_system_alloc_bits |= extra_flags;
+       }
+}
+
+static void set_block_group_readonly(struct btrfs_block_group_cache *cache)
+{
+       spin_lock(&cache->space_info->lock);
+       spin_lock(&cache->lock);
+       if (!cache->ro) {
+               cache->space_info->bytes_readonly += cache->key.offset -
+                                       btrfs_block_group_used(&cache->item);
+               cache->ro = 1;
+       }
+       spin_unlock(&cache->lock);
+       spin_unlock(&cache->space_info->lock);
+}
+
+u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags)
+{
+       u64 num_devices = root->fs_info->fs_devices->rw_devices;
+
+       if (num_devices == 1)
+               flags &= ~(BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID0);
+       if (num_devices < 4)
+               flags &= ~BTRFS_BLOCK_GROUP_RAID10;
+
+       if ((flags & BTRFS_BLOCK_GROUP_DUP) &&
+           (flags & (BTRFS_BLOCK_GROUP_RAID1 |
+                     BTRFS_BLOCK_GROUP_RAID10))) {
+               flags &= ~BTRFS_BLOCK_GROUP_DUP;
+       }
+
+       if ((flags & BTRFS_BLOCK_GROUP_RAID1) &&
+           (flags & BTRFS_BLOCK_GROUP_RAID10)) {
+               flags &= ~BTRFS_BLOCK_GROUP_RAID1;
+       }
+
+       if ((flags & BTRFS_BLOCK_GROUP_RAID0) &&
+           (flags & (BTRFS_BLOCK_GROUP_RAID1 |
+                     BTRFS_BLOCK_GROUP_RAID10 |
+                     BTRFS_BLOCK_GROUP_DUP)))
+               flags &= ~BTRFS_BLOCK_GROUP_RAID0;
+       return flags;
+}
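+
+/*
+ * Editor's note: a worked example of the reduction above (hypothetical
+ * values, assuming a filesystem with a single rw device):
+ *
+ *   u64 f = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_RAID1 |
+ *           BTRFS_BLOCK_GROUP_DUP;
+ *   f = btrfs_reduce_alloc_profile(root, f);
+ *
+ * num_devices == 1 clears the RAID1/RAID0 bits first, so the later
+ * DUP-vs-RAID1/RAID10 check no longer fires and f comes back as
+ * BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_DUP.
+ */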
+
+static int do_chunk_alloc(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *extent_root, u64 alloc_bytes,
+                         u64 flags, int force)
+{
+       struct btrfs_space_info *space_info;
+       u64 thresh;
+       int ret = 0;
+
+       mutex_lock(&extent_root->fs_info->chunk_mutex);
+
+       flags = btrfs_reduce_alloc_profile(extent_root, flags);
+
+       space_info = __find_space_info(extent_root->fs_info, flags);
+       if (!space_info) {
+               ret = update_space_info(extent_root->fs_info, flags,
+                                       0, 0, &space_info);
+               BUG_ON(ret);
+       }
+       BUG_ON(!space_info);
+
+       spin_lock(&space_info->lock);
+       if (space_info->force_alloc) {
+               force = 1;
+               space_info->force_alloc = 0;
+       }
+       if (space_info->full) {
+               spin_unlock(&space_info->lock);
+               goto out;
+       }
+
+       thresh = space_info->total_bytes - space_info->bytes_readonly;
+       thresh = div_factor(thresh, 6);
+       if (!force &&
+          (space_info->bytes_used + space_info->bytes_pinned +
+           space_info->bytes_reserved + alloc_bytes) < thresh) {
+               spin_unlock(&space_info->lock);
+               goto out;
+       }
+       spin_unlock(&space_info->lock);
+
+       ret = btrfs_alloc_chunk(trans, extent_root, flags);
+       if (ret)
+               space_info->full = 1;
+out:
+       mutex_unlock(&extent_root->fs_info->chunk_mutex);
+       return ret;
+}
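+
+/*
+ * Editor's note: div_factor(x, 6) works out to x * 6 / 10, so the
+ * threshold check above only allocates a new chunk once
+ * used + pinned + reserved plus the requested bytes would cross ~60%
+ * of the writable space.  A hypothetical example: with 100GB of
+ * non-readonly space and no force flag, requests are skipped until
+ * that sum reaches 60GB.
+ */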
+
+static int update_block_group(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             u64 bytenr, u64 num_bytes, int alloc,
+                             int mark_free)
+{
+       struct btrfs_block_group_cache *cache;
+       struct btrfs_fs_info *info = root->fs_info;
+       u64 total = num_bytes;
+       u64 old_val;
+       u64 byte_in_group;
+
+       while (total) {
+               cache = btrfs_lookup_block_group(info, bytenr);
+               if (!cache)
+                       return -1;
+               byte_in_group = bytenr - cache->key.objectid;
+               WARN_ON(byte_in_group > cache->key.offset);
+
+               spin_lock(&cache->space_info->lock);
+               spin_lock(&cache->lock);
+               cache->dirty = 1;
+               old_val = btrfs_block_group_used(&cache->item);
+               num_bytes = min(total, cache->key.offset - byte_in_group);
+               if (alloc) {
+                       old_val += num_bytes;
+                       cache->space_info->bytes_used += num_bytes;
+                       if (cache->ro)
+                               cache->space_info->bytes_readonly -= num_bytes;
+                       btrfs_set_block_group_used(&cache->item, old_val);
+                       spin_unlock(&cache->lock);
+                       spin_unlock(&cache->space_info->lock);
+               } else {
+                       old_val -= num_bytes;
+                       cache->space_info->bytes_used -= num_bytes;
+                       if (cache->ro)
+                               cache->space_info->bytes_readonly += num_bytes;
+                       btrfs_set_block_group_used(&cache->item, old_val);
+                       spin_unlock(&cache->lock);
+                       spin_unlock(&cache->space_info->lock);
+                       if (mark_free) {
+                               int ret;
+
+                               ret = btrfs_discard_extent(root, bytenr,
+                                                          num_bytes);
+                               WARN_ON(ret);
+
+                               ret = btrfs_add_free_space(cache, bytenr,
+                                                          num_bytes);
+                               WARN_ON(ret);
+                       }
+               }
+               put_block_group(cache);
+               total -= num_bytes;
+               bytenr += num_bytes;
+       }
+       return 0;
+}
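+
+/*
+ * Editor's note: the while (total) loop above lets a single update
+ * span several block groups: num_bytes is clamped to the end of the
+ * current group with min(), the group's used/readonly accounting is
+ * adjusted under both the space_info and group locks, and bytenr
+ * advances until the whole range is consumed.
+ */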
+
+static u64 first_logical_byte(struct btrfs_root *root, u64 search_start)
+{
+       struct btrfs_block_group_cache *cache;
+       u64 bytenr;
+
+       cache = btrfs_lookup_first_block_group(root->fs_info, search_start);
+       if (!cache)
+               return 0;
+
+       bytenr = cache->key.objectid;
+       put_block_group(cache);
+
+       return bytenr;
+}
+
+int btrfs_update_pinned_extents(struct btrfs_root *root,
+                               u64 bytenr, u64 num, int pin)
+{
+       u64 len;
+       struct btrfs_block_group_cache *cache;
+       struct btrfs_fs_info *fs_info = root->fs_info;
+
+       WARN_ON(!mutex_is_locked(&root->fs_info->pinned_mutex));
+       if (pin) {
+               set_extent_dirty(&fs_info->pinned_extents,
+                               bytenr, bytenr + num - 1, GFP_NOFS);
+       } else {
+               clear_extent_dirty(&fs_info->pinned_extents,
+                               bytenr, bytenr + num - 1, GFP_NOFS);
+       }
+       while (num > 0) {
+               cache = btrfs_lookup_block_group(fs_info, bytenr);
+               BUG_ON(!cache);
+               len = min(num, cache->key.offset -
+                         (bytenr - cache->key.objectid));
+               if (pin) {
+                       spin_lock(&cache->space_info->lock);
+                       spin_lock(&cache->lock);
+                       cache->pinned += len;
+                       cache->space_info->bytes_pinned += len;
+                       spin_unlock(&cache->lock);
+                       spin_unlock(&cache->space_info->lock);
+                       fs_info->total_pinned += len;
+               } else {
+                       spin_lock(&cache->space_info->lock);
+                       spin_lock(&cache->lock);
+                       cache->pinned -= len;
+                       cache->space_info->bytes_pinned -= len;
+                       spin_unlock(&cache->lock);
+                       spin_unlock(&cache->space_info->lock);
+                       fs_info->total_pinned -= len;
+                       if (cache->cached)
+                               btrfs_add_free_space(cache, bytenr, len);
+               }
+               put_block_group(cache);
+               bytenr += len;
+               num -= len;
+       }
+       return 0;
+}
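+
+/*
+ * Editor's note: pinned accounting lives in two places that must stay
+ * in sync: bits in the fs-wide pinned_extents io tree (set or cleared
+ * up front) and per-group / per-space_info byte counters updated under
+ * their spinlocks.  On unpin, the bytes go straight back into the free
+ * space cache if the group has already been cached.
+ */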
+
+static int update_reserved_extents(struct btrfs_root *root,
+                                  u64 bytenr, u64 num, int reserve)
+{
+       u64 len;
+       struct btrfs_block_group_cache *cache;
+       struct btrfs_fs_info *fs_info = root->fs_info;
+
+       while (num > 0) {
+               cache = btrfs_lookup_block_group(fs_info, bytenr);
+               BUG_ON(!cache);
+               len = min(num, cache->key.offset -
+                         (bytenr - cache->key.objectid));
+
+               spin_lock(&cache->space_info->lock);
+               spin_lock(&cache->lock);
+               if (reserve) {
+                       cache->reserved += len;
+                       cache->space_info->bytes_reserved += len;
+               } else {
+                       cache->reserved -= len;
+                       cache->space_info->bytes_reserved -= len;
+               }
+               spin_unlock(&cache->lock);
+               spin_unlock(&cache->space_info->lock);
+               put_block_group(cache);
+               bytenr += len;
+               num -= len;
+       }
+       return 0;
+}
+
+int btrfs_copy_pinned(struct btrfs_root *root, struct extent_io_tree *copy)
+{
+       u64 last = 0;
+       u64 start;
+       u64 end;
+       struct extent_io_tree *pinned_extents = &root->fs_info->pinned_extents;
+       int ret;
+
+       mutex_lock(&root->fs_info->pinned_mutex);
+       while (1) {
+               ret = find_first_extent_bit(pinned_extents, last,
+                                           &start, &end, EXTENT_DIRTY);
+               if (ret)
+                       break;
+               set_extent_dirty(copy, start, end, GFP_NOFS);
+               last = end + 1;
+       }
+       mutex_unlock(&root->fs_info->pinned_mutex);
+       return 0;
+}
+
+int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              struct extent_io_tree *unpin)
+{
+       u64 start;
+       u64 end;
+       int ret;
+
+       mutex_lock(&root->fs_info->pinned_mutex);
+       while (1) {
+               ret = find_first_extent_bit(unpin, 0, &start, &end,
+                                           EXTENT_DIRTY);
+               if (ret)
+                       break;
+
+               ret = btrfs_discard_extent(root, start, end + 1 - start);
+
+               btrfs_update_pinned_extents(root, start, end + 1 - start, 0);
+               clear_extent_dirty(unpin, start, end, GFP_NOFS);
+
+               if (need_resched()) {
+                       mutex_unlock(&root->fs_info->pinned_mutex);
+                       cond_resched();
+                       mutex_lock(&root->fs_info->pinned_mutex);
+               }
+       }
+       mutex_unlock(&root->fs_info->pinned_mutex);
+       return ret;
+}
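+
+/*
+ * Editor's note: this commit-time loop discards and unpins one dirty
+ * range at a time.  pinned_mutex is dropped around cond_resched() so
+ * that tasks waiting to pin or unpin extents are not starved while a
+ * long unpin pass runs.
+ */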
+
+static int finish_current_insert(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *extent_root, int all)
+{
+       u64 start;
+       u64 end;
+       u64 priv;
+       u64 search = 0;
+       u64 skipped = 0;
+       struct btrfs_fs_info *info = extent_root->fs_info;
+       struct btrfs_path *path;
+       struct pending_extent_op *extent_op, *tmp;
+       struct list_head insert_list, update_list;
+       int ret;
+       int num_inserts = 0, max_inserts;
+
+       path = btrfs_alloc_path();
+       INIT_LIST_HEAD(&insert_list);
+       INIT_LIST_HEAD(&update_list);
+
+       max_inserts = extent_root->leafsize /
+               (2 * sizeof(struct btrfs_key) + 2 * sizeof(struct btrfs_item) +
+                sizeof(struct btrfs_extent_ref) +
+                sizeof(struct btrfs_extent_item));
+again:
+       mutex_lock(&info->extent_ins_mutex);
+       while (1) {
+               ret = find_first_extent_bit(&info->extent_ins, search, &start,
+                                           &end, EXTENT_WRITEBACK);
+               if (ret) {
+                       if (skipped && all && !num_inserts) {
+                               skipped = 0;
+                               search = 0;
+                               continue;
+                       }
+                       mutex_unlock(&info->extent_ins_mutex);
+                       break;
+               }
+
+               ret = try_lock_extent(&info->extent_ins, start, end, GFP_NOFS);
+               if (!ret) {
+                       skipped = 1;
+                       search = end + 1;
+                       if (need_resched()) {
+                               mutex_unlock(&info->extent_ins_mutex);
+                               cond_resched();
+                               mutex_lock(&info->extent_ins_mutex);
+                       }
+                       continue;
+               }
+
+               ret = get_state_private(&info->extent_ins, start, &priv);
+               BUG_ON(ret);
+               extent_op = (struct pending_extent_op *)(unsigned long) priv;
+
+               if (extent_op->type == PENDING_EXTENT_INSERT) {
+                       num_inserts++;
+                       list_add_tail(&extent_op->list, &insert_list);
+                       search = end + 1;
+                       if (num_inserts == max_inserts) {
+                               mutex_unlock(&info->extent_ins_mutex);
+                               break;
+                       }
+               } else if (extent_op->type == PENDING_BACKREF_UPDATE) {
+                       list_add_tail(&extent_op->list, &update_list);
+                       search = end + 1;
+               } else {
+                       BUG();
+               }
+       }
+
+       /*
+        * process the update list: clear the writeback bit for each entry,
+        * and if somebody marked an entry for deletion just unlock it and
+        * be done with it; free_extents() will handle the rest
+        */
+       mutex_lock(&info->extent_ins_mutex);
+       list_for_each_entry_safe(extent_op, tmp, &update_list, list) {
+               clear_extent_bits(&info->extent_ins, extent_op->bytenr,
+                                 extent_op->bytenr + extent_op->num_bytes - 1,
+                                 EXTENT_WRITEBACK, GFP_NOFS);
+               if (extent_op->del) {
+                       list_del_init(&extent_op->list);
+                       unlock_extent(&info->extent_ins, extent_op->bytenr,
+                                     extent_op->bytenr + extent_op->num_bytes
+                                     - 1, GFP_NOFS);
+                       kfree(extent_op);
+               }
+       }
+       mutex_unlock(&info->extent_ins_mutex);
+
+       /*
+        * if we still have things left on the update list, go ahead and update
+        * everything
+        */
+       if (!list_empty(&update_list)) {
+               ret = update_backrefs(trans, extent_root, path, &update_list);
+               BUG_ON(ret);
+       }
+
+       /*
+        * if no inserts need to be done, but we skipped some extents and
+        * need to make sure everything is cleaned up, then reset the state
+        * and go back to the beginning
+        */
+       if (!num_inserts && all && skipped) {
+               search = 0;
+               skipped = 0;
+               INIT_LIST_HEAD(&update_list);
+               INIT_LIST_HEAD(&insert_list);
+               goto again;
+       } else if (!num_inserts) {
+               goto out;
+       }
+
+       /*
+        * process the insert extents list.  Again, if we are deleting this
+        * extent, then just unlock it, pin down the bytes if need be, and be
+        * done with it.  That saves us from having to actually insert the
+        * extent into the tree and then subsequently come along and delete it
+        */
+       mutex_lock(&info->extent_ins_mutex);
+       list_for_each_entry_safe(extent_op, tmp, &insert_list, list) {
+               clear_extent_bits(&info->extent_ins, extent_op->bytenr,
+                                 extent_op->bytenr + extent_op->num_bytes - 1,
+                                 EXTENT_WRITEBACK, GFP_NOFS);
+               if (extent_op->del) {
+                       u64 used;
+                       list_del_init(&extent_op->list);
+                       unlock_extent(&info->extent_ins, extent_op->bytenr,
+                                     extent_op->bytenr + extent_op->num_bytes
+                                     - 1, GFP_NOFS);
+
+                       mutex_lock(&extent_root->fs_info->pinned_mutex);
+                       ret = pin_down_bytes(trans, extent_root,
+                                            extent_op->bytenr,
+                                            extent_op->num_bytes, 0);
+                       mutex_unlock(&extent_root->fs_info->pinned_mutex);
+
+                       spin_lock(&info->delalloc_lock);
+                       used = btrfs_super_bytes_used(&info->super_copy);
+                       btrfs_set_super_bytes_used(&info->super_copy,
+                                       used - extent_op->num_bytes);
+                       used = btrfs_root_used(&extent_root->root_item);
+                       btrfs_set_root_used(&extent_root->root_item,
+                                       used - extent_op->num_bytes);
+                       spin_unlock(&info->delalloc_lock);
+
+                       ret = update_block_group(trans, extent_root,
+                                                extent_op->bytenr,
+                                                extent_op->num_bytes,
+                                                0, ret > 0);
+                       BUG_ON(ret);
+                       kfree(extent_op);
+                       num_inserts--;
+               }
+       }
+       mutex_unlock(&info->extent_ins_mutex);
+
+       ret = insert_extents(trans, extent_root, path, &insert_list,
+                            num_inserts);
+       BUG_ON(ret);
+
+       /*
+        * if we broke out of the loop in order to insert stuff because we hit
+        * the maximum number of inserts at a time we can handle, then loop
+        * back and pick up where we left off
+        */
+       if (num_inserts == max_inserts) {
+               INIT_LIST_HEAD(&insert_list);
+               INIT_LIST_HEAD(&update_list);
+               num_inserts = 0;
+               goto again;
+       }
+
+       /*
+        * again, if we need to make absolutely sure there are no more pending
+        * extent operations left and we know that we skipped some, go back to
+        * the beginning and do it all again
+        */
+       if (all && skipped) {
+               INIT_LIST_HEAD(&insert_list);
+               INIT_LIST_HEAD(&update_list);
+               search = 0;
+               skipped = 0;
+               num_inserts = 0;
+               goto again;
+       }
+out:
+       btrfs_free_path(path);
+       return 0;
+}
+
+static int pin_down_bytes(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root,
+                         u64 bytenr, u64 num_bytes, int is_data)
+{
+       int err = 0;
+       struct extent_buffer *buf;
+
+       if (is_data)
+               goto pinit;
+
+       buf = btrfs_find_tree_block(root, bytenr, num_bytes);
+       if (!buf)
+               goto pinit;
+
+       /* we can reuse a block if it hasn't been written
+        * and it is from this transaction.  We can't
+        * reuse anything from the tree log root because
+        * it has tiny sub-transactions.
+        */
+       if (btrfs_buffer_uptodate(buf, 0) &&
+           btrfs_try_tree_lock(buf)) {
+               u64 header_owner = btrfs_header_owner(buf);
+               u64 header_transid = btrfs_header_generation(buf);
+               if (header_owner != BTRFS_TREE_LOG_OBJECTID &&
+                   header_owner != BTRFS_TREE_RELOC_OBJECTID &&
+                   header_transid == trans->transid &&
+                   !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
+                       clean_tree_block(NULL, root, buf);
+                       btrfs_tree_unlock(buf);
+                       free_extent_buffer(buf);
+                       return 1;
+               }
+               btrfs_tree_unlock(buf);
+       }
+       free_extent_buffer(buf);
+pinit:
+       btrfs_update_pinned_extents(root, bytenr, num_bytes, 1);
+
+       BUG_ON(err < 0);
+       return 0;
+}
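+
+/*
+ * Editor's note: the fast path above is an optimization for metadata
+ * blocks that were allocated and freed inside the same transaction and
+ * never written: they can be cleaned and reused immediately (return 1,
+ * and the caller treats the space as free) instead of staying pinned
+ * until the transaction commits.  Tree log and reloc blocks are
+ * excluded because the log commits in tiny sub-transaction units.
+ */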
+
+/*
+ * remove an extent from the root, returns 0 on success
+ */
+static int __free_extent(struct btrfs_trans_handle *trans,
+                        struct btrfs_root *root,
+                        u64 bytenr, u64 num_bytes, u64 parent,
+                        u64 root_objectid, u64 ref_generation,
+                        u64 owner_objectid, int pin, int mark_free)
+{
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct btrfs_fs_info *info = root->fs_info;
+       struct btrfs_root *extent_root = info->extent_root;
+       struct extent_buffer *leaf;
+       int ret;
+       int extent_slot = 0;
+       int found_extent = 0;
+       int num_to_del = 1;
+       struct btrfs_extent_item *ei;
+       u32 refs;
+
+       key.objectid = bytenr;
+       btrfs_set_key_type(&key, BTRFS_EXTENT_ITEM_KEY);
+       key.offset = num_bytes;
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       path->reada = 1;
+       ret = lookup_extent_backref(trans, extent_root, path,
+                                   bytenr, parent, root_objectid,
+                                   ref_generation, owner_objectid, 1);
+       if (ret == 0) {
+               struct btrfs_key found_key;
+               extent_slot = path->slots[0];
+               while (extent_slot > 0) {
+                       extent_slot--;
+                       btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                             extent_slot);
+                       if (found_key.objectid != bytenr)
+                               break;
+                       if (found_key.type == BTRFS_EXTENT_ITEM_KEY &&
+                           found_key.offset == num_bytes) {
+                               found_extent = 1;
+                               break;
+                       }
+                       if (path->slots[0] - extent_slot > 5)
+                               break;
+               }
+               if (!found_extent) {
+                       ret = remove_extent_backref(trans, extent_root, path);
+                       BUG_ON(ret);
+                       btrfs_release_path(extent_root, path);
+                       ret = btrfs_search_slot(trans, extent_root,
+                                               &key, path, -1, 1);
+                       if (ret) {
+                               printk(KERN_ERR "umm, got %d back from search"
+                                      ", was looking for %llu\n", ret,
+                                      (unsigned long long)bytenr);
+                               btrfs_print_leaf(extent_root, path->nodes[0]);
+                       }
+                       BUG_ON(ret);
+                       extent_slot = path->slots[0];
+               }
+       } else {
+               btrfs_print_leaf(extent_root, path->nodes[0]);
+               WARN_ON(1);
+               printk(KERN_ERR "btrfs unable to find ref byte nr %llu "
+                      "root %llu gen %llu owner %llu\n",
+                      (unsigned long long)bytenr,
+                      (unsigned long long)root_objectid,
+                      (unsigned long long)ref_generation,
+                      (unsigned long long)owner_objectid);
+       }
+
+       leaf = path->nodes[0];
+       ei = btrfs_item_ptr(leaf, extent_slot,
+                           struct btrfs_extent_item);
+       refs = btrfs_extent_refs(leaf, ei);
+       BUG_ON(refs == 0);
+       refs -= 1;
+       btrfs_set_extent_refs(leaf, ei, refs);
+
+       btrfs_mark_buffer_dirty(leaf);
+
+       if (refs == 0 && found_extent && path->slots[0] == extent_slot + 1) {
+               struct btrfs_extent_ref *ref;
+               ref = btrfs_item_ptr(leaf, path->slots[0],
+                                    struct btrfs_extent_ref);
+               BUG_ON(btrfs_ref_num_refs(leaf, ref) != 1);
+               /* if the back ref and the extent are next to each other
+                * they get deleted below in one shot
+                */
+               path->slots[0] = extent_slot;
+               num_to_del = 2;
+       } else if (found_extent) {
+               /* otherwise delete the extent back ref */
+               ret = remove_extent_backref(trans, extent_root, path);
+               BUG_ON(ret);
+               /* if refs are 0, we need to setup the path for deletion */
+               if (refs == 0) {
+                       btrfs_release_path(extent_root, path);
+                       ret = btrfs_search_slot(trans, extent_root, &key, path,
+                                               -1, 1);
+                       BUG_ON(ret);
+               }
+       }
+
+       if (refs == 0) {
+               u64 super_used;
+               u64 root_used;
+
+               if (pin) {
+                       mutex_lock(&root->fs_info->pinned_mutex);
+                       ret = pin_down_bytes(trans, root, bytenr, num_bytes,
+                               owner_objectid >= BTRFS_FIRST_FREE_OBJECTID);
+                       mutex_unlock(&root->fs_info->pinned_mutex);
+                       if (ret > 0)
+                               mark_free = 1;
+                       BUG_ON(ret < 0);
+               }
+               /* block accounting for super block */
+               spin_lock(&info->delalloc_lock);
+               super_used = btrfs_super_bytes_used(&info->super_copy);
+               btrfs_set_super_bytes_used(&info->super_copy,
+                                          super_used - num_bytes);
+
+               /* block accounting for root item */
+               root_used = btrfs_root_used(&root->root_item);
+               btrfs_set_root_used(&root->root_item,
+                                          root_used - num_bytes);
+               spin_unlock(&info->delalloc_lock);
+               ret = btrfs_del_items(trans, extent_root, path, path->slots[0],
+                                     num_to_del);
+               BUG_ON(ret);
+               btrfs_release_path(extent_root, path);
+
+               if (owner_objectid >= BTRFS_FIRST_FREE_OBJECTID) {
+                       ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
+                       BUG_ON(ret);
+               }
+
+               ret = update_block_group(trans, root, bytenr, num_bytes, 0,
+                                        mark_free);
+               BUG_ON(ret);
+       }
+       btrfs_free_path(path);
+       finish_current_insert(trans, extent_root, 0);
+       return ret;
+}
+
+/*
+ * find all the blocks marked as pending in the radix tree and remove
+ * them from the extent map
+ */
+static int del_pending_extents(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *extent_root, int all)
+{
+       int ret;
+       int err = 0;
+       u64 start;
+       u64 end;
+       u64 priv;
+       u64 search = 0;
+       int nr = 0, skipped = 0;
+       struct extent_io_tree *pending_del;
+       struct extent_io_tree *extent_ins;
+       struct pending_extent_op *extent_op;
+       struct btrfs_fs_info *info = extent_root->fs_info;
+       struct list_head delete_list;
+
+       INIT_LIST_HEAD(&delete_list);
+       extent_ins = &extent_root->fs_info->extent_ins;
+       pending_del = &extent_root->fs_info->pending_del;
+
+again:
+       mutex_lock(&info->extent_ins_mutex);
+       while (1) {
+               ret = find_first_extent_bit(pending_del, search, &start, &end,
+                                           EXTENT_WRITEBACK);
+               if (ret) {
+                       if (all && skipped && !nr) {
+                               search = 0;
+                               continue;
+                       }
+                       mutex_unlock(&info->extent_ins_mutex);
+                       break;
+               }
+
+               ret = try_lock_extent(extent_ins, start, end, GFP_NOFS);
+               if (!ret) {
+                       search = end + 1;
+                       skipped = 1;
+
+                       if (need_resched()) {
+                               mutex_unlock(&info->extent_ins_mutex);
+                               cond_resched();
+                               mutex_lock(&info->extent_ins_mutex);
+                       }
+
+                       continue;
+               }
+               BUG_ON(ret < 0);
+
+               ret = get_state_private(pending_del, start, &priv);
+               BUG_ON(ret);
+               extent_op = (struct pending_extent_op *)(unsigned long)priv;
+
+               clear_extent_bits(pending_del, start, end, EXTENT_WRITEBACK,
+                                 GFP_NOFS);
+               if (!test_range_bit(extent_ins, start, end,
+                                   EXTENT_WRITEBACK, 0)) {
+                       list_add_tail(&extent_op->list, &delete_list);
+                       nr++;
+               } else {
+                       kfree(extent_op);
+
+                       ret = get_state_private(&info->extent_ins, start,
+                                               &priv);
+                       BUG_ON(ret);
+                       extent_op = (struct pending_extent_op *)
+                                               (unsigned long)priv;
+
+                       clear_extent_bits(&info->extent_ins, start, end,
+                                         EXTENT_WRITEBACK, GFP_NOFS);
+
+                       if (extent_op->type == PENDING_BACKREF_UPDATE) {
+                               list_add_tail(&extent_op->list, &delete_list);
+                               search = end + 1;
+                               nr++;
+                               continue;
+                       }
+
+                       mutex_lock(&extent_root->fs_info->pinned_mutex);
+                       ret = pin_down_bytes(trans, extent_root, start,
+                                            end + 1 - start, 0);
+                       mutex_unlock(&extent_root->fs_info->pinned_mutex);
+
+                       ret = update_block_group(trans, extent_root, start,
+                                               end + 1 - start, 0, ret > 0);
+
+                       unlock_extent(extent_ins, start, end, GFP_NOFS);
+                       BUG_ON(ret);
+                       kfree(extent_op);
+               }
+               if (ret)
+                       err = ret;
+
+               search = end + 1;
+
+               if (need_resched()) {
+                       mutex_unlock(&info->extent_ins_mutex);
+                       cond_resched();
+                       mutex_lock(&info->extent_ins_mutex);
+               }
+       }
+
+       if (nr) {
+               ret = free_extents(trans, extent_root, &delete_list);
+               BUG_ON(ret);
+       }
+
+       if (all && skipped) {
+               INIT_LIST_HEAD(&delete_list);
+               search = 0;
+               nr = 0;
+               goto again;
+       }
+
+       return err;
+}
+
+/*
+ * remove an extent from the root, returns 0 on success
+ */
+static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              u64 bytenr, u64 num_bytes, u64 parent,
+                              u64 root_objectid, u64 ref_generation,
+                              u64 owner_objectid, int pin)
+{
+       struct btrfs_root *extent_root = root->fs_info->extent_root;
+       int pending_ret;
+       int ret;
+
+       WARN_ON(num_bytes < root->sectorsize);
+       if (root == extent_root) {
+               struct pending_extent_op *extent_op = NULL;
+
+               mutex_lock(&root->fs_info->extent_ins_mutex);
+               if (test_range_bit(&root->fs_info->extent_ins, bytenr,
+                               bytenr + num_bytes - 1, EXTENT_WRITEBACK, 0)) {
+                       u64 priv;
+                       ret = get_state_private(&root->fs_info->extent_ins,
+                                               bytenr, &priv);
+                       BUG_ON(ret);
+                       extent_op = (struct pending_extent_op *)
+                                               (unsigned long)priv;
+
+                       extent_op->del = 1;
+                       if (extent_op->type == PENDING_EXTENT_INSERT) {
+                               mutex_unlock(&root->fs_info->extent_ins_mutex);
+                               return 0;
+                       }
+               }
+
+               if (extent_op) {
+                       ref_generation = extent_op->orig_generation;
+                       parent = extent_op->orig_parent;
+               }
+
+               extent_op = kmalloc(sizeof(*extent_op), GFP_NOFS);
+               BUG_ON(!extent_op);
+
+               extent_op->type = PENDING_EXTENT_DELETE;
+               extent_op->bytenr = bytenr;
+               extent_op->num_bytes = num_bytes;
+               extent_op->parent = parent;
+               extent_op->orig_parent = parent;
+               extent_op->generation = ref_generation;
+               extent_op->orig_generation = ref_generation;
+               extent_op->level = (int)owner_objectid;
+               INIT_LIST_HEAD(&extent_op->list);
+               extent_op->del = 0;
+
+               set_extent_bits(&root->fs_info->pending_del,
+                               bytenr, bytenr + num_bytes - 1,
+                               EXTENT_WRITEBACK, GFP_NOFS);
+               set_state_private(&root->fs_info->pending_del,
+                                 bytenr, (unsigned long)extent_op);
+               mutex_unlock(&root->fs_info->extent_ins_mutex);
+               return 0;
+       }
+       /* if this is metadata, always pin it */
+       if (owner_objectid < BTRFS_FIRST_FREE_OBJECTID) {
+               if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+                       struct btrfs_block_group_cache *cache;
+
+                       /* btrfs_free_reserved_extent */
+                       cache = btrfs_lookup_block_group(root->fs_info, bytenr);
+                       BUG_ON(!cache);
+                       btrfs_add_free_space(cache, bytenr, num_bytes);
+                       put_block_group(cache);
+                       update_reserved_extents(root, bytenr, num_bytes, 0);
+                       return 0;
+               }
+               pin = 1;
+       }
+
+       /* for data, pin if an earlier transaction has committed this ref */
+       if (ref_generation != trans->transid)
+               pin = 1;
+
+       ret = __free_extent(trans, root, bytenr, num_bytes, parent,
+                           root_objectid, ref_generation,
+                           owner_objectid, pin, pin == 0);
+
+       finish_current_insert(trans, root->fs_info->extent_root, 0);
+       pending_ret = del_pending_extents(trans, root->fs_info->extent_root, 0);
+       return ret ? ret : pending_ret;
+}
+
+int btrfs_free_extent(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root,
+                     u64 bytenr, u64 num_bytes, u64 parent,
+                     u64 root_objectid, u64 ref_generation,
+                     u64 owner_objectid, int pin)
+{
+       int ret;
+
+       ret = __btrfs_free_extent(trans, root, bytenr, num_bytes, parent,
+                                 root_objectid, ref_generation,
+                                 owner_objectid, pin);
+       return ret;
+}
+
+static u64 stripe_align(struct btrfs_root *root, u64 val)
+{
+       u64 mask = ((u64)root->stripesize - 1);
+       u64 ret = (val + mask) & ~mask;
+       return ret;
+}
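+
+/*
+ * Editor's note: the mask trick above rounds val up to the next
+ * stripesize boundary and assumes stripesize is a power of two.  A
+ * hypothetical example with a 64K stripe:
+ *
+ *   stripe_align(root, 130 * 1024) == 192 * 1024
+ *   (mask = 0xffff; (130K + 65535) & ~0xffff rounds up to 3 * 64K)
+ */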
+
+/*
+ * walks the btree of allocated extents and finds a hole of a given size.
+ * The key ins is changed to record the hole:
+ * ins->objectid == block start
+ * ins->flags = BTRFS_EXTENT_ITEM_KEY
+ * ins->offset == number of blocks
+ * Any available blocks before search_start are skipped.
+ */
+static noinline int find_free_extent(struct btrfs_trans_handle *trans,
+                                    struct btrfs_root *orig_root,
+                                    u64 num_bytes, u64 empty_size,
+                                    u64 search_start, u64 search_end,
+                                    u64 hint_byte, struct btrfs_key *ins,
+                                    u64 exclude_start, u64 exclude_nr,
+                                    int data)
+{
+       int ret = 0;
+       struct btrfs_root *root = orig_root->fs_info->extent_root;
+       u64 total_needed = num_bytes;
+       u64 *last_ptr = NULL;
+       u64 last_wanted = 0;
+       struct btrfs_block_group_cache *block_group = NULL;
+       int chunk_alloc_done = 0;
+       int empty_cluster = 2 * 1024 * 1024;
+       int allowed_chunk_alloc = 0;
+       struct list_head *head = NULL, *cur = NULL;
+       int loop = 0;
+       int extra_loop = 0;
+       struct btrfs_space_info *space_info;
+
+       WARN_ON(num_bytes < root->sectorsize);
+       btrfs_set_key_type(ins, BTRFS_EXTENT_ITEM_KEY);
+       ins->objectid = 0;
+       ins->offset = 0;
+
+       if (orig_root->ref_cows || empty_size)
+               allowed_chunk_alloc = 1;
+
+       if (data & BTRFS_BLOCK_GROUP_METADATA) {
+               last_ptr = &root->fs_info->last_alloc;
+               empty_cluster = 64 * 1024;
+       }
+
+       if ((data & BTRFS_BLOCK_GROUP_DATA) && btrfs_test_opt(root, SSD))
+               last_ptr = &root->fs_info->last_data_alloc;
+
+       if (last_ptr) {
+               if (*last_ptr) {
+                       hint_byte = *last_ptr;
+                       last_wanted = *last_ptr;
+               } else
+                       empty_size += empty_cluster;
+       } else {
+               empty_cluster = 0;
+       }
+       search_start = max(search_start, first_logical_byte(root, 0));
+       search_start = max(search_start, hint_byte);
+
+       if (last_wanted && search_start != last_wanted) {
+               last_wanted = 0;
+               empty_size += empty_cluster;
+       }
+
+       total_needed += empty_size;
+       block_group = btrfs_lookup_block_group(root->fs_info, search_start);
+       if (!block_group)
+               block_group = btrfs_lookup_first_block_group(root->fs_info,
+                                                            search_start);
+       space_info = __find_space_info(root->fs_info, data);
+
+       down_read(&space_info->groups_sem);
+       while (1) {
+               struct btrfs_free_space *free_space;
+               /*
+                * the only way this happens is if our hint points to a block
+                * group that's not of the proper type; while looping this
+                * should never happen
+                */
+               if (empty_size)
+                       extra_loop = 1;
+
+               if (!block_group)
+                       goto new_group_no_lock;
+
+               if (unlikely(!block_group->cached)) {
+                       mutex_lock(&block_group->cache_mutex);
+                       ret = cache_block_group(root, block_group);
+                       mutex_unlock(&block_group->cache_mutex);
+                       if (ret)
+                               break;
+               }
+
+               mutex_lock(&block_group->alloc_mutex);
+               if (unlikely(!block_group_bits(block_group, data)))
+                       goto new_group;
+
+               if (unlikely(block_group->ro))
+                       goto new_group;
+
+               free_space = btrfs_find_free_space(block_group, search_start,
+                                                  total_needed);
+               if (free_space) {
+                       u64 start = block_group->key.objectid;
+                       u64 end = block_group->key.objectid +
+                               block_group->key.offset;
+
+                       search_start = stripe_align(root, free_space->offset);
+
+                       /* move on to the next group */
+                       if (search_start + num_bytes >= search_end)
+                               goto new_group;
+
+                       /* move on to the next group */
+                       if (search_start + num_bytes > end)
+                               goto new_group;
+
+                       if (last_wanted && search_start != last_wanted) {
+                               total_needed += empty_cluster;
+                               empty_size += empty_cluster;
+                               last_wanted = 0;
+                               /*
+                                * if search_start is still in this block group
+                                * then we just re-search this block group
+                                */
+                               if (search_start >= start &&
+                                   search_start < end) {
+                                       mutex_unlock(&block_group->alloc_mutex);
+                                       continue;
+                               }
+
+                               /* else we go to the next block group */
+                               goto new_group;
+                       }
+
+                       if (exclude_nr > 0 &&
+                           (search_start + num_bytes > exclude_start &&
+                            search_start < exclude_start + exclude_nr)) {
+                               search_start = exclude_start + exclude_nr;
+                               /*
+                                * if search_start is still in this block group
+                                * then we just re-search this block group
+                                */
+                               if (search_start >= start &&
+                                   search_start < end) {
+                                       mutex_unlock(&block_group->alloc_mutex);
+                                       last_wanted = 0;
+                                       continue;
+                               }
+
+                               /* else we go to the next block group */
+                               goto new_group;
+                       }
+
+                       ins->objectid = search_start;
+                       ins->offset = num_bytes;
+
+                       btrfs_remove_free_space_lock(block_group, search_start,
+                                                    num_bytes);
+                       /* we are all good, let's return */
+                       mutex_unlock(&block_group->alloc_mutex);
+                       break;
+               }
+new_group:
+               mutex_unlock(&block_group->alloc_mutex);
+               put_block_group(block_group);
+               block_group = NULL;
+new_group_no_lock:
+               /* don't try to compare new allocations against the
+                * last allocation any more
+                */
+               last_wanted = 0;
+
+               /*
+                * Here's how this works.
+                * loop == 0: we were searching a block group via a hint
+                *              and didn't find anything, so we start at
+                *              the head of the block groups and keep searching
+                * loop == 1: we're searching through all of the block groups;
+                *              if we hit the head again we have searched
+                *              all of the block groups for this space and
+                *              need to try an allocation; if we can't,
+                *              error out.
+                * loop == 2: we allocated more space and are looping through
+                *              all of the block groups again.
+                */
+               if (loop == 0) {
+                       head = &space_info->block_groups;
+                       cur = head->next;
+                       loop++;
+               } else if (loop == 1 && cur == head) {
+                       int keep_going;
+
+                       /* at this point we give up on the empty_size
+                        * allocations and just try to allocate the min
+                        * space.
+                        *
+                        * The extra_loop field was set if an empty_size
+                        * allocation was attempted above, and if it is set
+                        * we need to try the loop again without
+                        * the additional empty_size.
+                        */
+                       total_needed -= empty_size;
+                       empty_size = 0;
+                       keep_going = extra_loop;
+                       loop++;
+
+                       if (allowed_chunk_alloc && !chunk_alloc_done) {
+                               up_read(&space_info->groups_sem);
+                               ret = do_chunk_alloc(trans, root, num_bytes +
+                                                    2 * 1024 * 1024, data, 1);
+                               down_read(&space_info->groups_sem);
+                               if (ret < 0)
+                                       goto loop_check;
+                               head = &space_info->block_groups;
+                               /*
+                                * we've allocated a new chunk, keep
+                                * trying
+                                */
+                               keep_going = 1;
+                               chunk_alloc_done = 1;
+                       } else if (!allowed_chunk_alloc) {
+                               space_info->force_alloc = 1;
+                       }
+loop_check:
+                       if (keep_going) {
+                               cur = head->next;
+                               extra_loop = 0;
+                       } else {
+                               break;
+                       }
+               } else if (cur == head) {
+                       break;
+               }
+
+               block_group = list_entry(cur, struct btrfs_block_group_cache,
+                                        list);
+               atomic_inc(&block_group->count);
+
+               search_start = block_group->key.objectid;
+               cur = cur->next;
+       }
+
+       /* we found what we needed */
+       if (ins->objectid) {
+               if (!(data & BTRFS_BLOCK_GROUP_DATA))
+                       trans->block_group = block_group->key.objectid;
+
+               if (last_ptr)
+                       *last_ptr = ins->objectid + ins->offset;
+               ret = 0;
+       } else if (!ret) {
+               printk(KERN_ERR "btrfs searching for %llu bytes, "
+                      "num_bytes %llu, loop %d, allowed_alloc %d\n",
+                      (unsigned long long)total_needed,
+                      (unsigned long long)num_bytes,
+                      loop, allowed_chunk_alloc);
+               ret = -ENOSPC;
+       }
+       if (block_group)
+               put_block_group(block_group);
+
+       up_read(&space_info->groups_sem);
+       return ret;
+}
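+
+/*
+ * Editor's note: last_ptr implements a simple allocation cluster.  When
+ * a previous allocation location exists, the search starts there in the
+ * hope of extending a contiguous run; whenever that hint cannot be
+ * honoured, empty_cluster (64K for metadata, 2MB otherwise) is added to
+ * the request so the allocator lands in a hole big enough to keep
+ * future allocations together.
+ */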
+
+static void dump_space_info(struct btrfs_space_info *info, u64 bytes)
+{
+       struct btrfs_block_group_cache *cache;
+       struct list_head *l;
+
+       printk(KERN_INFO "space_info has %llu free, is %sfull\n",
+              (unsigned long long)(info->total_bytes - info->bytes_used -
+                                   info->bytes_pinned - info->bytes_reserved),
+              (info->full) ? "" : "not ");
+
+       down_read(&info->groups_sem);
+       list_for_each(l, &info->block_groups) {
+               cache = list_entry(l, struct btrfs_block_group_cache, list);
+               spin_lock(&cache->lock);
+               printk(KERN_INFO "block group %llu has %llu bytes, %llu used "
+                      "%llu pinned %llu reserved\n",
+                      (unsigned long long)cache->key.objectid,
+                      (unsigned long long)cache->key.offset,
+                      (unsigned long long)btrfs_block_group_used(&cache->item),
+                      (unsigned long long)cache->pinned,
+                      (unsigned long long)cache->reserved);
+               btrfs_dump_free_space(cache, bytes);
+               spin_unlock(&cache->lock);
+       }
+       up_read(&info->groups_sem);
+}
+
+static int __btrfs_reserve_extent(struct btrfs_trans_handle *trans,
+                                 struct btrfs_root *root,
+                                 u64 num_bytes, u64 min_alloc_size,
+                                 u64 empty_size, u64 hint_byte,
+                                 u64 search_end, struct btrfs_key *ins,
+                                 u64 data)
+{
+       int ret;
+       u64 search_start = 0;
+       u64 alloc_profile;
+       struct btrfs_fs_info *info = root->fs_info;
+
+       if (data) {
+               alloc_profile = info->avail_data_alloc_bits &
+                       info->data_alloc_profile;
+               data = BTRFS_BLOCK_GROUP_DATA | alloc_profile;
+       } else if (root == root->fs_info->chunk_root) {
+               alloc_profile = info->avail_system_alloc_bits &
+                       info->system_alloc_profile;
+               data = BTRFS_BLOCK_GROUP_SYSTEM | alloc_profile;
+       } else {
+               alloc_profile = info->avail_metadata_alloc_bits &
+                       info->metadata_alloc_profile;
+               data = BTRFS_BLOCK_GROUP_METADATA | alloc_profile;
+       }
+again:
+       data = btrfs_reduce_alloc_profile(root, data);
+       /*
+        * the only place that sets empty_size is btrfs_realloc_node, which
+        * is not called recursively on allocations
+        */
+       if (empty_size || root->ref_cows) {
+               if (!(data & BTRFS_BLOCK_GROUP_METADATA)) {
+                       ret = do_chunk_alloc(trans, root->fs_info->extent_root,
+                                    2 * 1024 * 1024,
+                                    BTRFS_BLOCK_GROUP_METADATA |
+                                    (info->metadata_alloc_profile &
+                                     info->avail_metadata_alloc_bits), 0);
+               }
+               ret = do_chunk_alloc(trans, root->fs_info->extent_root,
+                                    num_bytes + 2 * 1024 * 1024, data, 0);
+       }
+
+       WARN_ON(num_bytes < root->sectorsize);
+       ret = find_free_extent(trans, root, num_bytes, empty_size,
+                              search_start, search_end, hint_byte, ins,
+                              trans->alloc_exclude_start,
+                              trans->alloc_exclude_nr, data);
+
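+       /*
+        * Worked example for the fallback below (numbers illustrative):
+        * with a 4K sectorsize, a failed 1MiB request is retried at
+        * 512KiB, then 256KiB, and so on, never dropping below
+        * min_alloc_size; each retry also forces a chunk allocation of
+        * the halved size before trying again.
+        */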
+       if (ret == -ENOSPC && num_bytes > min_alloc_size) {
+               num_bytes = num_bytes >> 1;
+               num_bytes = num_bytes & ~(root->sectorsize - 1);
+               num_bytes = max(num_bytes, min_alloc_size);
+               do_chunk_alloc(trans, root->fs_info->extent_root,
+                              num_bytes, data, 1);
+               goto again;
+       }
+       if (ret) {
+               struct btrfs_space_info *sinfo;
+
+               sinfo = __find_space_info(root->fs_info, data);
+               printk(KERN_ERR "btrfs allocation failed flags %llu, "
+                      "wanted %llu\n", (unsigned long long)data,
+                      (unsigned long long)num_bytes);
+               dump_space_info(sinfo, num_bytes);
+               BUG();
+       }
+
+       return ret;
+}
+
+int btrfs_free_reserved_extent(struct btrfs_root *root, u64 start, u64 len)
+{
+       struct btrfs_block_group_cache *cache;
+       int ret = 0;
+
+       cache = btrfs_lookup_block_group(root->fs_info, start);
+       if (!cache) {
+               printk(KERN_ERR "Unable to find block group for %llu\n",
+                      (unsigned long long)start);
+               return -ENOSPC;
+       }
+
+       ret = btrfs_discard_extent(root, start, len);
+
+       btrfs_add_free_space(cache, start, len);
+       put_block_group(cache);
+       update_reserved_extents(root, start, len, 0);
+
+       return ret;
+}
+
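+/*
+ * Illustrative use of the reserve/free pair (a sketch only, not a
+ * caller in this change; "abandoned" stands in for whatever condition
+ * makes the caller give the space back):
+ *
+ *	struct btrfs_key ins;
+ *	ret = btrfs_reserve_extent(trans, root, num_bytes, root->sectorsize,
+ *				   0, 0, (u64)-1, &ins, 1);
+ *	if (!ret && abandoned)
+ *		btrfs_free_reserved_extent(root, ins.objectid, ins.offset);
+ */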
+int btrfs_reserve_extent(struct btrfs_trans_handle *trans,
+                                 struct btrfs_root *root,
+                                 u64 num_bytes, u64 min_alloc_size,
+                                 u64 empty_size, u64 hint_byte,
+                                 u64 search_end, struct btrfs_key *ins,
+                                 u64 data)
+{
+       int ret;
+       ret = __btrfs_reserve_extent(trans, root, num_bytes, min_alloc_size,
+                                    empty_size, hint_byte, search_end, ins,
+                                    data);
+       update_reserved_extents(root, ins->objectid, ins->offset, 1);
+       return ret;
+}
+
+static int __btrfs_alloc_reserved_extent(struct btrfs_trans_handle *trans,
+                                        struct btrfs_root *root, u64 parent,
+                                        u64 root_objectid, u64 ref_generation,
+                                        u64 owner, struct btrfs_key *ins)
+{
+       int ret;
+       int pending_ret;
+       u64 super_used;
+       u64 root_used;
+       u64 num_bytes = ins->offset;
+       u32 sizes[2];
+       struct btrfs_fs_info *info = root->fs_info;
+       struct btrfs_root *extent_root = info->extent_root;
+       struct btrfs_extent_item *extent_item;
+       struct btrfs_extent_ref *ref;
+       struct btrfs_path *path;
+       struct btrfs_key keys[2];
+
+       if (parent == 0)
+               parent = ins->objectid;
+
+       /* block accounting for super block */
+       spin_lock(&info->delalloc_lock);
+       super_used = btrfs_super_bytes_used(&info->super_copy);
+       btrfs_set_super_bytes_used(&info->super_copy, super_used + num_bytes);
+
+       /* block accounting for root item */
+       root_used = btrfs_root_used(&root->root_item);
+       btrfs_set_root_used(&root->root_item, root_used + num_bytes);
+       spin_unlock(&info->delalloc_lock);
+
+       if (root == extent_root) {
+               struct pending_extent_op *extent_op;
+
+               extent_op = kmalloc(sizeof(*extent_op), GFP_NOFS);
+               BUG_ON(!extent_op);
+
+               extent_op->type = PENDING_EXTENT_INSERT;
+               extent_op->bytenr = ins->objectid;
+               extent_op->num_bytes = ins->offset;
+               extent_op->parent = parent;
+               extent_op->orig_parent = 0;
+               extent_op->generation = ref_generation;
+               extent_op->orig_generation = 0;
+               extent_op->level = (int)owner;
+               INIT_LIST_HEAD(&extent_op->list);
+               extent_op->del = 0;
+
+               mutex_lock(&root->fs_info->extent_ins_mutex);
+               set_extent_bits(&root->fs_info->extent_ins, ins->objectid,
+                               ins->objectid + ins->offset - 1,
+                               EXTENT_WRITEBACK, GFP_NOFS);
+               set_state_private(&root->fs_info->extent_ins,
+                                 ins->objectid, (unsigned long)extent_op);
+               mutex_unlock(&root->fs_info->extent_ins_mutex);
+               goto update_block;
+       }
+
+       memcpy(&keys[0], ins, sizeof(*ins));
+       keys[1].objectid = ins->objectid;
+       keys[1].type = BTRFS_EXTENT_REF_KEY;
+       keys[1].offset = parent;
+       sizes[0] = sizeof(*extent_item);
+       sizes[1] = sizeof(*ref);
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       ret = btrfs_insert_empty_items(trans, extent_root, path, keys,
+                                      sizes, 2);
+       BUG_ON(ret);
+
+       extent_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                    struct btrfs_extent_item);
+       btrfs_set_extent_refs(path->nodes[0], extent_item, 1);
+       ref = btrfs_item_ptr(path->nodes[0], path->slots[0] + 1,
+                            struct btrfs_extent_ref);
+
+       btrfs_set_ref_root(path->nodes[0], ref, root_objectid);
+       btrfs_set_ref_generation(path->nodes[0], ref, ref_generation);
+       btrfs_set_ref_objectid(path->nodes[0], ref, owner);
+       btrfs_set_ref_num_refs(path->nodes[0], ref, 1);
+
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+
+       trans->alloc_exclude_start = 0;
+       trans->alloc_exclude_nr = 0;
+       btrfs_free_path(path);
+       finish_current_insert(trans, extent_root, 0);
+       pending_ret = del_pending_extents(trans, extent_root, 0);
+
+       if (ret)
+               goto out;
+       if (pending_ret) {
+               ret = pending_ret;
+               goto out;
+       }
+
+update_block:
+       ret = update_block_group(trans, root, ins->objectid,
+                                ins->offset, 1, 0);
+       if (ret) {
+               printk(KERN_ERR "btrfs update block group failed for %llu "
+                      "%llu\n", (unsigned long long)ins->objectid,
+                      (unsigned long long)ins->offset);
+               BUG();
+       }
+out:
+       return ret;
+}
+
+int btrfs_alloc_reserved_extent(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root, u64 parent,
+                               u64 root_objectid, u64 ref_generation,
+                               u64 owner, struct btrfs_key *ins)
+{
+       int ret;
+
+       if (root_objectid == BTRFS_TREE_LOG_OBJECTID)
+               return 0;
+       ret = __btrfs_alloc_reserved_extent(trans, root, parent, root_objectid,
+                                           ref_generation, owner, ins);
+       update_reserved_extents(root, ins->objectid, ins->offset, 0);
+       return ret;
+}
+
+/*
+ * this is used by the tree logging recovery code.  It records that
+ * an extent has been allocated and makes sure to clear the free
+ * space cache bits as well
+ */
+int btrfs_alloc_logged_extent(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root, u64 parent,
+                               u64 root_objectid, u64 ref_generation,
+                               u64 owner, struct btrfs_key *ins)
+{
+       int ret;
+       struct btrfs_block_group_cache *block_group;
+
+       block_group = btrfs_lookup_block_group(root->fs_info, ins->objectid);
+       mutex_lock(&block_group->cache_mutex);
+       cache_block_group(root, block_group);
+       mutex_unlock(&block_group->cache_mutex);
+
+       ret = btrfs_remove_free_space(block_group, ins->objectid,
+                                     ins->offset);
+       BUG_ON(ret);
+       put_block_group(block_group);
+       ret = __btrfs_alloc_reserved_extent(trans, root, parent, root_objectid,
+                                           ref_generation, owner, ins);
+       return ret;
+}
+
+/*
+ * finds a free extent and does all the dirty work required for the
+ * allocation.  It returns the key for the extent through ins; callers
+ * that also need a tree buffer for the block use btrfs_alloc_free_block().
+ *
+ * returns 0 if everything worked, non-zero otherwise.
+ */
+int btrfs_alloc_extent(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root,
+                      u64 num_bytes, u64 parent, u64 min_alloc_size,
+                      u64 root_objectid, u64 ref_generation,
+                      u64 owner_objectid, u64 empty_size, u64 hint_byte,
+                      u64 search_end, struct btrfs_key *ins, u64 data)
+{
+       int ret;
+
+       ret = __btrfs_reserve_extent(trans, root, num_bytes,
+                                    min_alloc_size, empty_size, hint_byte,
+                                    search_end, ins, data);
+       BUG_ON(ret);
+       if (root_objectid != BTRFS_TREE_LOG_OBJECTID) {
+               ret = __btrfs_alloc_reserved_extent(trans, root, parent,
+                                       root_objectid, ref_generation,
+                                       owner_objectid, ins);
+               BUG_ON(ret);
+
+       } else {
+               update_reserved_extents(root, ins->objectid, ins->offset, 1);
+       }
+       return ret;
+}
+
+struct extent_buffer *btrfs_init_new_buffer(struct btrfs_trans_handle *trans,
+                                           struct btrfs_root *root,
+                                           u64 bytenr, u32 blocksize)
+{
+       struct extent_buffer *buf;
+
+       buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
+       if (!buf)
+               return ERR_PTR(-ENOMEM);
+       btrfs_set_header_generation(buf, trans->transid);
+       btrfs_tree_lock(buf);
+       clean_tree_block(trans, root, buf);
+       btrfs_set_buffer_uptodate(buf);
+       if (root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID) {
+               set_extent_dirty(&root->dirty_log_pages, buf->start,
+                        buf->start + buf->len - 1, GFP_NOFS);
+       } else {
+               set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
+                        buf->start + buf->len - 1, GFP_NOFS);
+       }
+       trans->blocks_used++;
+       return buf;
+}
+
+/*
+ * helper function to allocate a block for a given tree.
+ * returns the tree buffer or an ERR_PTR on failure.
+ */
+struct extent_buffer *btrfs_alloc_free_block(struct btrfs_trans_handle *trans,
+                                            struct btrfs_root *root,
+                                            u32 blocksize, u64 parent,
+                                            u64 root_objectid,
+                                            u64 ref_generation,
+                                            int level,
+                                            u64 hint,
+                                            u64 empty_size)
+{
+       struct btrfs_key ins;
+       int ret;
+       struct extent_buffer *buf;
+
+       ret = btrfs_alloc_extent(trans, root, blocksize, parent, blocksize,
+                                root_objectid, ref_generation, level,
+                                empty_size, hint, (u64)-1, &ins, 0);
+       if (ret) {
+               BUG_ON(ret > 0);
+               return ERR_PTR(ret);
+       }
+
+       buf = btrfs_init_new_buffer(trans, root, ins.objectid, blocksize);
+       return buf;
+}
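+
+/*
+ * Sketch of a typical btrfs_alloc_free_block() call (parameter values
+ * are assumptions for illustration, mirroring the signature above):
+ *
+ *	buf = btrfs_alloc_free_block(trans, root, root->nodesize,
+ *				     parent->start, root->root_key.objectid,
+ *				     trans->transid, level, parent->start, 0);
+ *	if (IS_ERR(buf))
+ *		return PTR_ERR(buf);
+ */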
+
+int btrfs_drop_leaf_ref(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root, struct extent_buffer *leaf)
+{
+       u64 leaf_owner;
+       u64 leaf_generation;
+       struct btrfs_key key;
+       struct btrfs_file_extent_item *fi;
+       int i;
+       int nritems;
+       int ret;
+
+       BUG_ON(!btrfs_is_leaf(leaf));
+       nritems = btrfs_header_nritems(leaf);
+       leaf_owner = btrfs_header_owner(leaf);
+       leaf_generation = btrfs_header_generation(leaf);
+
+       for (i = 0; i < nritems; i++) {
+               u64 disk_bytenr;
+               cond_resched();
+
+               btrfs_item_key_to_cpu(leaf, &key, i);
+               if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
+                       continue;
+               fi = btrfs_item_ptr(leaf, i, struct btrfs_file_extent_item);
+               if (btrfs_file_extent_type(leaf, fi) ==
+                   BTRFS_FILE_EXTENT_INLINE)
+                       continue;
+               /*
+                * FIXME make sure to insert a trans record that
+                * repeats the snapshot del on crash
+                */
+               disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+               if (disk_bytenr == 0)
+                       continue;
+
+               ret = __btrfs_free_extent(trans, root, disk_bytenr,
+                               btrfs_file_extent_disk_num_bytes(leaf, fi),
+                               leaf->start, leaf_owner, leaf_generation,
+                               key.objectid, 0);
+               BUG_ON(ret);
+
+               atomic_inc(&root->fs_info->throttle_gen);
+               wake_up(&root->fs_info->transaction_throttle);
+               cond_resched();
+       }
+       return 0;
+}
+
+static noinline int cache_drop_leaf_ref(struct btrfs_trans_handle *trans,
+                                       struct btrfs_root *root,
+                                       struct btrfs_leaf_ref *ref)
+{
+       int i;
+       int ret;
+       struct btrfs_extent_info *info = ref->extents;
+
+       for (i = 0; i < ref->nritems; i++) {
+               ret = __btrfs_free_extent(trans, root, info->bytenr,
+                                         info->num_bytes, ref->bytenr,
+                                         ref->owner, ref->generation,
+                                         info->objectid, 0);
+
+               atomic_inc(&root->fs_info->throttle_gen);
+               wake_up(&root->fs_info->transaction_throttle);
+               cond_resched();
+
+               BUG_ON(ret);
+               info++;
+       }
+
+       return 0;
+}
+
+static int drop_snap_lookup_refcount(struct btrfs_root *root, u64 start,
+                                    u64 len, u32 *refs)
+{
+       int ret;
+
+       ret = btrfs_lookup_extent_ref(NULL, root, start, len, refs);
+       BUG_ON(ret);
+
+#if 0 /* some debugging code in case we see problems here */
+       /* if the ref count is one, it won't get increased again.  But
+        * if the ref count is > 1, someone may be decreasing it at
+        * the same time we are.
+        */
+       if (*refs != 1) {
+               struct extent_buffer *eb = NULL;
+               eb = btrfs_find_create_tree_block(root, start, len);
+               if (eb)
+                       btrfs_tree_lock(eb);
+
+               mutex_lock(&root->fs_info->alloc_mutex);
+               ret = lookup_extent_ref(NULL, root, start, len, refs);
+               BUG_ON(ret);
+               mutex_unlock(&root->fs_info->alloc_mutex);
+
+               if (eb) {
+                       btrfs_tree_unlock(eb);
+                       free_extent_buffer(eb);
+               }
+               if (*refs == 1) {
+                       printk(KERN_ERR "btrfs block %llu went down to one "
+                              "during drop_snap\n", (unsigned long long)start);
+               }
+
+       }
+#endif
+
+       cond_resched();
+       return ret;
+}
+
+/*
+ * helper function for drop_snapshot.  This walks down the tree dropping ref
+ * counts as it goes.
+ */
+static noinline int walk_down_tree(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root,
+                                  struct btrfs_path *path, int *level)
+{
+       u64 root_owner;
+       u64 root_gen;
+       u64 bytenr;
+       u64 ptr_gen;
+       struct extent_buffer *next;
+       struct extent_buffer *cur;
+       struct extent_buffer *parent;
+       struct btrfs_leaf_ref *ref;
+       u32 blocksize;
+       int ret;
+       u32 refs;
+
+       WARN_ON(*level < 0);
+       WARN_ON(*level >= BTRFS_MAX_LEVEL);
+       ret = drop_snap_lookup_refcount(root, path->nodes[*level]->start,
+                               path->nodes[*level]->len, &refs);
+       BUG_ON(ret);
+       if (refs > 1)
+               goto out;
+
+       /*
+        * walk down to the last node level and free all the leaves
+        */
+       while (*level >= 0) {
+               WARN_ON(*level < 0);
+               WARN_ON(*level >= BTRFS_MAX_LEVEL);
+               cur = path->nodes[*level];
+
+               WARN_ON(btrfs_header_level(cur) != *level);
+
+               if (path->slots[*level] >=
+                   btrfs_header_nritems(cur))
+                       break;
+               if (*level == 0) {
+                       ret = btrfs_drop_leaf_ref(trans, root, cur);
+                       BUG_ON(ret);
+                       break;
+               }
+               bytenr = btrfs_node_blockptr(cur, path->slots[*level]);
+               ptr_gen = btrfs_node_ptr_generation(cur, path->slots[*level]);
+               blocksize = btrfs_level_size(root, *level - 1);
+
+               ret = drop_snap_lookup_refcount(root, bytenr, blocksize, &refs);
+               BUG_ON(ret);
+               if (refs != 1) {
+                       parent = path->nodes[*level];
+                       root_owner = btrfs_header_owner(parent);
+                       root_gen = btrfs_header_generation(parent);
+                       path->slots[*level]++;
+
+                       ret = __btrfs_free_extent(trans, root, bytenr,
+                                               blocksize, parent->start,
+                                               root_owner, root_gen,
+                                               *level - 1, 1);
+                       BUG_ON(ret);
+
+                       atomic_inc(&root->fs_info->throttle_gen);
+                       wake_up(&root->fs_info->transaction_throttle);
+                       cond_resched();
+
+                       continue;
+               }
+               /*
+                * at this point, we have a single ref, and since the
+                * only place referencing this extent is a dead root,
+                * the reference count should never go higher.
+                * So, we don't need to check it again.
+                */
+               if (*level == 1) {
+                       ref = btrfs_lookup_leaf_ref(root, bytenr);
+                       if (ref && ref->generation != ptr_gen) {
+                               btrfs_free_leaf_ref(root, ref);
+                               ref = NULL;
+                       }
+                       if (ref) {
+                               ret = cache_drop_leaf_ref(trans, root, ref);
+                               BUG_ON(ret);
+                               btrfs_remove_leaf_ref(root, ref);
+                               btrfs_free_leaf_ref(root, ref);
+                               *level = 0;
+                               break;
+                       }
+               }
+               next = btrfs_find_tree_block(root, bytenr, blocksize);
+               if (!next || !btrfs_buffer_uptodate(next, ptr_gen)) {
+                       free_extent_buffer(next);
+
+                       next = read_tree_block(root, bytenr, blocksize,
+                                              ptr_gen);
+                       cond_resched();
+#if 0
+                       /*
+                        * this is a debugging check and can go away;
+                        * the ref should never go all the way down to 1
+                        * at this point.
+                        */
+                       ret = lookup_extent_ref(NULL, root, bytenr, blocksize,
+                                               &refs);
+                       BUG_ON(ret);
+                       WARN_ON(refs != 1);
+#endif
+               }
+               WARN_ON(*level <= 0);
+               if (path->nodes[*level-1])
+                       free_extent_buffer(path->nodes[*level-1]);
+               path->nodes[*level-1] = next;
+               *level = btrfs_header_level(next);
+               path->slots[*level] = 0;
+               cond_resched();
+       }
+out:
+       WARN_ON(*level < 0);
+       WARN_ON(*level >= BTRFS_MAX_LEVEL);
+
+       if (path->nodes[*level] == root->node) {
+               parent = path->nodes[*level];
+               bytenr = path->nodes[*level]->start;
+       } else {
+               parent = path->nodes[*level + 1];
+               bytenr = btrfs_node_blockptr(parent, path->slots[*level + 1]);
+       }
+
+       blocksize = btrfs_level_size(root, *level);
+       root_owner = btrfs_header_owner(parent);
+       root_gen = btrfs_header_generation(parent);
+
+       ret = __btrfs_free_extent(trans, root, bytenr, blocksize,
+                                 parent->start, root_owner, root_gen,
+                                 *level, 1);
+       free_extent_buffer(path->nodes[*level]);
+       path->nodes[*level] = NULL;
+       *level += 1;
+       BUG_ON(ret);
+
+       cond_resched();
+       return 0;
+}
+
+/*
+ * helper function for drop_subtree; this function is similar to
+ * walk_down_tree. The main difference is that it checks reference
+ * counts while tree blocks are locked.
+ */
+static noinline int walk_down_subtree(struct btrfs_trans_handle *trans,
+                                     struct btrfs_root *root,
+                                     struct btrfs_path *path, int *level)
+{
+       struct extent_buffer *next;
+       struct extent_buffer *cur;
+       struct extent_buffer *parent;
+       u64 bytenr;
+       u64 ptr_gen;
+       u32 blocksize;
+       u32 refs;
+       int ret;
+
+       cur = path->nodes[*level];
+       ret = btrfs_lookup_extent_ref(trans, root, cur->start, cur->len,
+                                     &refs);
+       BUG_ON(ret);
+       if (refs > 1)
+               goto out;
+
+       while (*level >= 0) {
+               cur = path->nodes[*level];
+               if (*level == 0) {
+                       ret = btrfs_drop_leaf_ref(trans, root, cur);
+                       BUG_ON(ret);
+                       clean_tree_block(trans, root, cur);
+                       break;
+               }
+               if (path->slots[*level] >= btrfs_header_nritems(cur)) {
+                       clean_tree_block(trans, root, cur);
+                       break;
+               }
+
+               bytenr = btrfs_node_blockptr(cur, path->slots[*level]);
+               blocksize = btrfs_level_size(root, *level - 1);
+               ptr_gen = btrfs_node_ptr_generation(cur, path->slots[*level]);
+
+               next = read_tree_block(root, bytenr, blocksize, ptr_gen);
+               btrfs_tree_lock(next);
+
+               ret = btrfs_lookup_extent_ref(trans, root, bytenr, blocksize,
+                                             &refs);
+               BUG_ON(ret);
+               if (refs > 1) {
+                       parent = path->nodes[*level];
+                       ret = btrfs_free_extent(trans, root, bytenr,
+                                       blocksize, parent->start,
+                                       btrfs_header_owner(parent),
+                                       btrfs_header_generation(parent),
+                                       *level - 1, 1);
+                       BUG_ON(ret);
+                       path->slots[*level]++;
+                       btrfs_tree_unlock(next);
+                       free_extent_buffer(next);
+                       continue;
+               }
+
+               *level = btrfs_header_level(next);
+               path->nodes[*level] = next;
+               path->slots[*level] = 0;
+               path->locks[*level] = 1;
+               cond_resched();
+       }
+out:
+       parent = path->nodes[*level + 1];
+       bytenr = path->nodes[*level]->start;
+       blocksize = path->nodes[*level]->len;
+
+       ret = btrfs_free_extent(trans, root, bytenr, blocksize,
+                       parent->start, btrfs_header_owner(parent),
+                       btrfs_header_generation(parent), *level, 1);
+       BUG_ON(ret);
+
+       if (path->locks[*level]) {
+               btrfs_tree_unlock(path->nodes[*level]);
+               path->locks[*level] = 0;
+       }
+       free_extent_buffer(path->nodes[*level]);
+       path->nodes[*level] = NULL;
+       *level += 1;
+       cond_resched();
+       return 0;
+}
+
+/*
+ * helper for dropping snapshots.  This walks back up the tree in the path
+ * to find the first node higher up where we haven't yet gone through
+ * all the slots
+ */
+static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                struct btrfs_path *path,
+                                int *level, int max_level)
+{
+       u64 root_owner;
+       u64 root_gen;
+       struct btrfs_root_item *root_item = &root->root_item;
+       int i;
+       int slot;
+       int ret;
+
+       for (i = *level; i < max_level && path->nodes[i]; i++) {
+               slot = path->slots[i];
+               if (slot < btrfs_header_nritems(path->nodes[i]) - 1) {
+                       struct extent_buffer *node;
+                       struct btrfs_disk_key disk_key;
+                       node = path->nodes[i];
+                       path->slots[i]++;
+                       *level = i;
+                       WARN_ON(*level == 0);
+                       btrfs_node_key(node, &disk_key, path->slots[i]);
+                       memcpy(&root_item->drop_progress,
+                              &disk_key, sizeof(disk_key));
+                       root_item->drop_level = i;
+                       return 0;
+               } else {
+                       struct extent_buffer *parent;
+                       if (path->nodes[*level] == root->node)
+                               parent = path->nodes[*level];
+                       else
+                               parent = path->nodes[*level + 1];
+
+                       root_owner = btrfs_header_owner(parent);
+                       root_gen = btrfs_header_generation(parent);
+
+                       clean_tree_block(trans, root, path->nodes[*level]);
+                       ret = btrfs_free_extent(trans, root,
+                                               path->nodes[*level]->start,
+                                               path->nodes[*level]->len,
+                                               parent->start, root_owner,
+                                               root_gen, *level, 1);
+                       BUG_ON(ret);
+                       if (path->locks[*level]) {
+                               btrfs_tree_unlock(path->nodes[*level]);
+                               path->locks[*level] = 0;
+                       }
+                       free_extent_buffer(path->nodes[*level]);
+                       path->nodes[*level] = NULL;
+                       *level = i + 1;
+               }
+       }
+       return 1;
+}
+
+/*
+ * drop the reference count on the tree rooted at 'root'.  This traverses
+ * the tree freeing any blocks that have a ref count of zero after being
+ * decremented.
+ */
+int btrfs_drop_snapshot(struct btrfs_trans_handle *trans, struct btrfs_root
+                       *root)
+{
+       int ret = 0;
+       int wret;
+       int level;
+       struct btrfs_path *path;
+       int i;
+       int orig_level;
+       struct btrfs_root_item *root_item = &root->root_item;
+
+       WARN_ON(!mutex_is_locked(&root->fs_info->drop_mutex));
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       level = btrfs_header_level(root->node);
+       orig_level = level;
+       if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
+               path->nodes[level] = root->node;
+               extent_buffer_get(root->node);
+               path->slots[level] = 0;
+       } else {
+               struct btrfs_key key;
+               struct btrfs_disk_key found_key;
+               struct extent_buffer *node;
+
+               btrfs_disk_key_to_cpu(&key, &root_item->drop_progress);
+               level = root_item->drop_level;
+               path->lowest_level = level;
+               wret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+               if (wret < 0) {
+                       ret = wret;
+                       goto out;
+               }
+               node = path->nodes[level];
+               btrfs_node_key(node, &found_key, path->slots[level]);
+               WARN_ON(memcmp(&found_key, &root_item->drop_progress,
+                              sizeof(found_key)));
+               /*
+                * unlock our path; this is safe because only this
+                * function is allowed to delete this snapshot
+                */
+               for (i = 0; i < BTRFS_MAX_LEVEL; i++) {
+                       if (path->nodes[i] && path->locks[i]) {
+                               path->locks[i] = 0;
+                               btrfs_tree_unlock(path->nodes[i]);
+                       }
+               }
+       }
+       while (1) {
+               wret = walk_down_tree(trans, root, path, &level);
+               if (wret > 0)
+                       break;
+               if (wret < 0)
+                       ret = wret;
+
+               wret = walk_up_tree(trans, root, path, &level,
+                                   BTRFS_MAX_LEVEL);
+               if (wret > 0)
+                       break;
+               if (wret < 0)
+                       ret = wret;
+               if (trans->transaction->in_commit) {
+                       ret = -EAGAIN;
+                       break;
+               }
+               atomic_inc(&root->fs_info->throttle_gen);
+               wake_up(&root->fs_info->transaction_throttle);
+       }
+       for (i = 0; i <= orig_level; i++) {
+               if (path->nodes[i]) {
+                       free_extent_buffer(path->nodes[i]);
+                       path->nodes[i] = NULL;
+               }
+       }
+out:
+       btrfs_free_path(path);
+       return ret;
+}
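+
+/*
+ * Illustrative caller loop (a sketch, not part of this change): since
+ * the function above returns -EAGAIN when a commit is pending, and the
+ * resume point is persisted in root_item->drop_progress, deletion can
+ * be driven, with fs_info->drop_mutex held, roughly like this:
+ *
+ *	do {
+ *		trans = btrfs_start_transaction(root, 1);
+ *		ret = btrfs_drop_snapshot(trans, snap_root);
+ *		btrfs_end_transaction(trans, root);
+ *	} while (ret == -EAGAIN);
+ */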
+
+int btrfs_drop_subtree(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root,
+                       struct extent_buffer *node,
+                       struct extent_buffer *parent)
+{
+       struct btrfs_path *path;
+       int level;
+       int parent_level;
+       int ret = 0;
+       int wret;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       BUG_ON(!btrfs_tree_locked(parent));
+       parent_level = btrfs_header_level(parent);
+       extent_buffer_get(parent);
+       path->nodes[parent_level] = parent;
+       path->slots[parent_level] = btrfs_header_nritems(parent);
+
+       BUG_ON(!btrfs_tree_locked(node));
+       level = btrfs_header_level(node);
+       extent_buffer_get(node);
+       path->nodes[level] = node;
+       path->slots[level] = 0;
+
+       while (1) {
+               wret = walk_down_subtree(trans, root, path, &level);
+               if (wret < 0)
+                       ret = wret;
+               if (wret != 0)
+                       break;
+
+               wret = walk_up_tree(trans, root, path, &level, parent_level);
+               if (wret < 0)
+                       ret = wret;
+               if (wret != 0)
+                       break;
+       }
+
+       btrfs_free_path(path);
+       return ret;
+}
+
+static unsigned long calc_ra(unsigned long start, unsigned long last,
+                            unsigned long nr)
+{
+       return min(last, start + nr - 1);
+}
+
+static noinline int relocate_inode_pages(struct inode *inode, u64 start,
+                                        u64 len)
+{
+       u64 page_start;
+       u64 page_end;
+       unsigned long first_index;
+       unsigned long last_index;
+       unsigned long i;
+       struct page *page;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       struct file_ra_state *ra;
+       struct btrfs_ordered_extent *ordered;
+       unsigned int total_read = 0;
+       unsigned int total_dirty = 0;
+       int ret = 0;
+
+       ra = kzalloc(sizeof(*ra), GFP_NOFS);
+       if (!ra)
+               return -ENOMEM;
+
+       mutex_lock(&inode->i_mutex);
+       first_index = start >> PAGE_CACHE_SHIFT;
+       last_index = (start + len - 1) >> PAGE_CACHE_SHIFT;
+
+       /* make sure the dirty trick played by the caller works */
+       ret = invalidate_inode_pages2_range(inode->i_mapping,
+                                           first_index, last_index);
+       if (ret)
+               goto out_unlock;
+
+       file_ra_state_init(ra, inode->i_mapping);
+
+       for (i = first_index ; i <= last_index; i++) {
+               if (total_read % ra->ra_pages == 0) {
+                       btrfs_force_ra(inode->i_mapping, ra, NULL, i,
+                                      calc_ra(i, last_index, ra->ra_pages));
+               }
+               total_read++;
+again:
+               BUG_ON(((u64)i << PAGE_CACHE_SHIFT) > i_size_read(inode));
+               page = grab_cache_page(inode->i_mapping, i);
+               if (!page) {
+                       ret = -ENOMEM;
+                       goto out_unlock;
+               }
+               if (!PageUptodate(page)) {
+                       btrfs_readpage(NULL, page);
+                       lock_page(page);
+                       if (!PageUptodate(page)) {
+                               unlock_page(page);
+                               page_cache_release(page);
+                               ret = -EIO;
+                               goto out_unlock;
+                       }
+               }
+               wait_on_page_writeback(page);
+
+               page_start = (u64)page->index << PAGE_CACHE_SHIFT;
+               page_end = page_start + PAGE_CACHE_SIZE - 1;
+               lock_extent(io_tree, page_start, page_end, GFP_NOFS);
+
+               ordered = btrfs_lookup_ordered_extent(inode, page_start);
+               if (ordered) {
+                       unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+                       unlock_page(page);
+                       page_cache_release(page);
+                       btrfs_start_ordered_extent(inode, ordered, 1);
+                       btrfs_put_ordered_extent(ordered);
+                       goto again;
+               }
+               set_page_extent_mapped(page);
+
+               if (i == first_index)
+                       set_extent_bits(io_tree, page_start, page_end,
+                                       EXTENT_BOUNDARY, GFP_NOFS);
+               btrfs_set_extent_delalloc(inode, page_start, page_end);
+
+               set_page_dirty(page);
+               total_dirty++;
+
+               unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+               unlock_page(page);
+               page_cache_release(page);
+       }
+
+out_unlock:
+       kfree(ra);
+       mutex_unlock(&inode->i_mutex);
+       balance_dirty_pages_ratelimited_nr(inode->i_mapping, total_dirty);
+       return ret;
+}
+
+static noinline int relocate_data_extent(struct inode *reloc_inode,
+                                        struct btrfs_key *extent_key,
+                                        u64 offset)
+{
+       struct btrfs_root *root = BTRFS_I(reloc_inode)->root;
+       struct extent_map_tree *em_tree = &BTRFS_I(reloc_inode)->extent_tree;
+       struct extent_map *em;
+       u64 start = extent_key->objectid - offset;
+       u64 end = start + extent_key->offset - 1;
+
+       em = alloc_extent_map(GFP_NOFS);
+       BUG_ON(!em || IS_ERR(em));
+
+       em->start = start;
+       em->len = extent_key->offset;
+       em->block_len = extent_key->offset;
+       em->block_start = extent_key->objectid;
+       em->bdev = root->fs_info->fs_devices->latest_bdev;
+       set_bit(EXTENT_FLAG_PINNED, &em->flags);
+
+       /* setup extent map to cheat btrfs_readpage */
+       lock_extent(&BTRFS_I(reloc_inode)->io_tree, start, end, GFP_NOFS);
+       while (1) {
+               int ret;
+               spin_lock(&em_tree->lock);
+               ret = add_extent_mapping(em_tree, em);
+               spin_unlock(&em_tree->lock);
+               if (ret != -EEXIST) {
+                       free_extent_map(em);
+                       break;
+               }
+               btrfs_drop_extent_cache(reloc_inode, start, end, 0);
+       }
+       unlock_extent(&BTRFS_I(reloc_inode)->io_tree, start, end, GFP_NOFS);
+
+       return relocate_inode_pages(reloc_inode, start, extent_key->offset);
+}
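+
+/*
+ * Worked example for the mapping set up above (numbers illustrative):
+ * an extent at disk bytenr 1GiB with length 16MiB, relocated with
+ * offset 256MiB, lands in the reloc inode at file position
+ * 1GiB - 256MiB = 768MiB; block_start still points at the old disk
+ * location, so btrfs_readpage() reads the existing data from there.
+ */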
+
+struct btrfs_ref_path {
+       u64 extent_start;
+       u64 nodes[BTRFS_MAX_LEVEL];
+       u64 root_objectid;
+       u64 root_generation;
+       u64 owner_objectid;
+       u32 num_refs;
+       int lowest_level;
+       int current_level;
+       int shared_level;
+
+       struct btrfs_key node_keys[BTRFS_MAX_LEVEL];
+       u64 new_nodes[BTRFS_MAX_LEVEL];
+};
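+
+/*
+ * A sketch of how __next_ref_path() (below) fills this in: level -1 is
+ * the extent itself (extent_start), nodes[0..n] record the bytenrs of
+ * the tree blocks referencing it on the way up, and root_objectid /
+ * root_generation identify the root reached at the top of the walk.
+ */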
+
+struct disk_extent {
+       u64 ram_bytes;
+       u64 disk_bytenr;
+       u64 disk_num_bytes;
+       u64 offset;
+       u64 num_bytes;
+       u8 compression;
+       u8 encryption;
+       u16 other_encoding;
+};
+
+static int is_cowonly_root(u64 root_objectid)
+{
+       if (root_objectid == BTRFS_ROOT_TREE_OBJECTID ||
+           root_objectid == BTRFS_EXTENT_TREE_OBJECTID ||
+           root_objectid == BTRFS_CHUNK_TREE_OBJECTID ||
+           root_objectid == BTRFS_DEV_TREE_OBJECTID ||
+           root_objectid == BTRFS_TREE_LOG_OBJECTID ||
+           root_objectid == BTRFS_CSUM_TREE_OBJECTID)
+               return 1;
+       return 0;
+}
+
+static noinline int __next_ref_path(struct btrfs_trans_handle *trans,
+                                   struct btrfs_root *extent_root,
+                                   struct btrfs_ref_path *ref_path,
+                                   int first_time)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_path *path;
+       struct btrfs_extent_ref *ref;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       u64 bytenr;
+       u32 nritems;
+       int level;
+       int ret = 1;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       if (first_time) {
+               ref_path->lowest_level = -1;
+               ref_path->current_level = -1;
+               ref_path->shared_level = -1;
+               goto walk_up;
+       }
+walk_down:
+       level = ref_path->current_level - 1;
+       while (level >= -1) {
+               u64 parent;
+               if (level < ref_path->lowest_level)
+                       break;
+
+               if (level >= 0)
+                       bytenr = ref_path->nodes[level];
+               else
+                       bytenr = ref_path->extent_start;
+               BUG_ON(bytenr == 0);
+
+               parent = ref_path->nodes[level + 1];
+               ref_path->nodes[level + 1] = 0;
+               ref_path->current_level = level;
+               BUG_ON(parent == 0);
+
+               key.objectid = bytenr;
+               key.offset = parent + 1;
+               key.type = BTRFS_EXTENT_REF_KEY;
+
+               ret = btrfs_search_slot(trans, extent_root, &key, path, 0, 0);
+               if (ret < 0)
+                       goto out;
+               BUG_ON(ret == 0);
+
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               if (path->slots[0] >= nritems) {
+                       ret = btrfs_next_leaf(extent_root, path);
+                       if (ret < 0)
+                               goto out;
+                       if (ret > 0)
+                               goto next;
+                       leaf = path->nodes[0];
+               }
+
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               if (found_key.objectid == bytenr &&
+                   found_key.type == BTRFS_EXTENT_REF_KEY) {
+                       if (level < ref_path->shared_level)
+                               ref_path->shared_level = level;
+                       goto found;
+               }
+next:
+               level--;
+               btrfs_release_path(extent_root, path);
+               cond_resched();
+       }
+       /* reached lowest level */
+       ret = 1;
+       goto out;
+walk_up:
+       level = ref_path->current_level;
+       while (level < BTRFS_MAX_LEVEL - 1) {
+               u64 ref_objectid;
+
+               if (level >= 0)
+                       bytenr = ref_path->nodes[level];
+               else
+                       bytenr = ref_path->extent_start;
+
+               BUG_ON(bytenr == 0);
+
+               key.objectid = bytenr;
+               key.offset = 0;
+               key.type = BTRFS_EXTENT_REF_KEY;
+
+               ret = btrfs_search_slot(trans, extent_root, &key, path, 0, 0);
+               if (ret < 0)
+                       goto out;
+
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               if (path->slots[0] >= nritems) {
+                       ret = btrfs_next_leaf(extent_root, path);
+                       if (ret < 0)
+                               goto out;
+                       if (ret > 0) {
+                               /* the extent was freed by someone */
+                               if (ref_path->lowest_level == level)
+                                       goto out;
+                               btrfs_release_path(extent_root, path);
+                               goto walk_down;
+                       }
+                       leaf = path->nodes[0];
+               }
+
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               if (found_key.objectid != bytenr ||
+                               found_key.type != BTRFS_EXTENT_REF_KEY) {
+                       /* the extent was freed by someone */
+                       if (ref_path->lowest_level == level) {
+                               ret = 1;
+                               goto out;
+                       }
+                       btrfs_release_path(extent_root, path);
+                       goto walk_down;
+               }
+found:
+               ref = btrfs_item_ptr(leaf, path->slots[0],
+                               struct btrfs_extent_ref);
+               ref_objectid = btrfs_ref_objectid(leaf, ref);
+               if (ref_objectid < BTRFS_FIRST_FREE_OBJECTID) {
+                       if (first_time) {
+                               level = (int)ref_objectid;
+                               BUG_ON(level >= BTRFS_MAX_LEVEL);
+                               ref_path->lowest_level = level;
+                               ref_path->current_level = level;
+                               ref_path->nodes[level] = bytenr;
+                       } else {
+                               WARN_ON(ref_objectid != level);
+                       }
+               } else {
+                       WARN_ON(level != -1);
+               }
+               first_time = 0;
+
+               if (ref_path->lowest_level == level) {
+                       ref_path->owner_objectid = ref_objectid;
+                       ref_path->num_refs = btrfs_ref_num_refs(leaf, ref);
+               }
+
+               /*
+                * the block is a tree root, or the block isn't in a
+                * reference counted tree.
+                */
+               if (found_key.objectid == found_key.offset ||
+                   is_cowonly_root(btrfs_ref_root(leaf, ref))) {
+                       ref_path->root_objectid = btrfs_ref_root(leaf, ref);
+                       ref_path->root_generation =
+                               btrfs_ref_generation(leaf, ref);
+                       if (level < 0) {
+                               /* special reference from the tree log */
+                               ref_path->nodes[0] = found_key.offset;
+                               ref_path->current_level = 0;
+                       }
+                       ret = 0;
+                       goto out;
+               }
+
+               level++;
+               BUG_ON(ref_path->nodes[level] != 0);
+               ref_path->nodes[level] = found_key.offset;
+               ref_path->current_level = level;
+
+               /*
+                * the reference was created in the running transaction,
+                * no need to continue walking up.
+                */
+               if (btrfs_ref_generation(leaf, ref) == trans->transid) {
+                       ref_path->root_objectid = btrfs_ref_root(leaf, ref);
+                       ref_path->root_generation =
+                               btrfs_ref_generation(leaf, ref);
+                       ret = 0;
+                       goto out;
+               }
+
+               btrfs_release_path(extent_root, path);
+               cond_resched();
+       }
+       /* reached max tree level, but no tree root found. */
+       BUG();
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+static int btrfs_first_ref_path(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *extent_root,
+                               struct btrfs_ref_path *ref_path,
+                               u64 extent_start)
+{
+       memset(ref_path, 0, sizeof(*ref_path));
+       ref_path->extent_start = extent_start;
+
+       return __next_ref_path(trans, extent_root, ref_path, 1);
+}
+
+static int btrfs_next_ref_path(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *extent_root,
+                              struct btrfs_ref_path *ref_path)
+{
+       return __next_ref_path(trans, extent_root, ref_path, 0);
+}
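+
+/*
+ * Illustrative iteration over every path referencing one extent (a
+ * sketch; return codes follow __next_ref_path(): 0 found, 1 done,
+ * negative on error):
+ *
+ *	ret = btrfs_first_ref_path(trans, extent_root, &ref_path, bytenr);
+ *	while (ret == 0) {
+ *		... process the root described by ref_path ...
+ *		ret = btrfs_next_ref_path(trans, extent_root, &ref_path);
+ *	}
+ *	if (ret < 0)
+ *		goto error;
+ */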
+
+static noinline int get_new_locations(struct inode *reloc_inode,
+                                     struct btrfs_key *extent_key,
+                                     u64 offset, int no_fragment,
+                                     struct disk_extent **extents,
+                                     int *nr_extents)
+{
+       struct btrfs_root *root = BTRFS_I(reloc_inode)->root;
+       struct btrfs_path *path;
+       struct btrfs_file_extent_item *fi;
+       struct extent_buffer *leaf;
+       struct disk_extent *exts = *extents;
+       struct btrfs_key found_key;
+       u64 cur_pos;
+       u64 last_byte;
+       u32 nritems;
+       int nr = 0;
+       int max = *nr_extents;
+       int ret;
+
+       WARN_ON(!no_fragment && *extents);
+       if (!exts) {
+               max = 1;
+               exts = kmalloc(sizeof(*exts) * max, GFP_NOFS);
+               if (!exts)
+                       return -ENOMEM;
+       }
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       cur_pos = extent_key->objectid - offset;
+       last_byte = extent_key->objectid + extent_key->offset;
+       ret = btrfs_lookup_file_extent(NULL, root, path, reloc_inode->i_ino,
+                                      cur_pos, 0);
+       if (ret < 0)
+               goto out;
+       if (ret > 0) {
+               ret = -ENOENT;
+               goto out;
+       }
+
+       while (1) {
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               if (path->slots[0] >= nritems) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret < 0)
+                               goto out;
+                       if (ret > 0)
+                               break;
+                       leaf = path->nodes[0];
+               }
+
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               if (found_key.offset != cur_pos ||
+                   found_key.type != BTRFS_EXTENT_DATA_KEY ||
+                   found_key.objectid != reloc_inode->i_ino)
+                       break;
+
+               fi = btrfs_item_ptr(leaf, path->slots[0],
+                                   struct btrfs_file_extent_item);
+               if (btrfs_file_extent_type(leaf, fi) !=
+                   BTRFS_FILE_EXTENT_REG ||
+                   btrfs_file_extent_disk_bytenr(leaf, fi) == 0)
+                       break;
+
+               if (nr == max) {
+                       struct disk_extent *old = exts;
+                       max *= 2;
+                       exts = kzalloc(sizeof(*exts) * max, GFP_NOFS);
+                       if (!exts) {
+                               /* hand old back so the out: path handles it */
+                               exts = old;
+                               ret = -ENOMEM;
+                               goto out;
+                       }
+                       memcpy(exts, old, sizeof(*exts) * nr);
+                       if (old != *extents)
+                               kfree(old);
+               }
+
+               exts[nr].disk_bytenr =
+                       btrfs_file_extent_disk_bytenr(leaf, fi);
+               exts[nr].disk_num_bytes =
+                       btrfs_file_extent_disk_num_bytes(leaf, fi);
+               exts[nr].offset = btrfs_file_extent_offset(leaf, fi);
+               exts[nr].num_bytes = btrfs_file_extent_num_bytes(leaf, fi);
+               exts[nr].ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi);
+               exts[nr].compression = btrfs_file_extent_compression(leaf, fi);
+               exts[nr].encryption = btrfs_file_extent_encryption(leaf, fi);
+               exts[nr].other_encoding = btrfs_file_extent_other_encoding(leaf,
+                                                                          fi);
+               BUG_ON(exts[nr].offset > 0);
+               BUG_ON(exts[nr].compression || exts[nr].encryption);
+               BUG_ON(exts[nr].num_bytes != exts[nr].disk_num_bytes);
+
+               cur_pos += exts[nr].num_bytes;
+               nr++;
+
+               if (cur_pos + offset >= last_byte)
+                       break;
+
+               if (no_fragment) {
+                       ret = 1;
+                       goto out;
+               }
+               path->slots[0]++;
+       }
+
+       BUG_ON(cur_pos + offset > last_byte);
+       if (cur_pos + offset < last_byte) {
+               ret = -ENOENT;
+               goto out;
+       }
+       ret = 0;
+out:
+       btrfs_free_path(path);
+       if (ret) {
+               if (exts != *extents)
+                       kfree(exts);
+       } else {
+               *extents = exts;
+               *nr_extents = nr;
+       }
+       return ret;
+}
+
+static noinline int replace_one_extent(struct btrfs_trans_handle *trans,
+                                       struct btrfs_root *root,
+                                       struct btrfs_path *path,
+                                       struct btrfs_key *extent_key,
+                                       struct btrfs_key *leaf_key,
+                                       struct btrfs_ref_path *ref_path,
+                                       struct disk_extent *new_extents,
+                                       int nr_extents)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_file_extent_item *fi;
+       struct inode *inode = NULL;
+       struct btrfs_key key;
+       u64 lock_start = 0;
+       u64 lock_end = 0;
+       u64 num_bytes;
+       u64 ext_offset;
+       u64 first_pos;
+       u32 nritems;
+       int nr_scanned = 0;
+       int extent_locked = 0;
+       int extent_type;
+       int ret;
+
+       memcpy(&key, leaf_key, sizeof(key));
+       first_pos = INT_LIMIT(loff_t) - extent_key->offset;
+       if (ref_path->owner_objectid != BTRFS_MULTIPLE_OBJECTIDS) {
+               if (key.objectid < ref_path->owner_objectid ||
+                   (key.objectid == ref_path->owner_objectid &&
+                    key.type < BTRFS_EXTENT_DATA_KEY)) {
+                       key.objectid = ref_path->owner_objectid;
+                       key.type = BTRFS_EXTENT_DATA_KEY;
+                       key.offset = 0;
+               }
+       }
+
+       while (1) {
+               ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+               if (ret < 0)
+                       goto out;
+
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+next:
+               if (extent_locked && ret > 0) {
+                       /*
+                        * the file extent item was modified by someone
+                        * before the extent got locked.
+                        */
+                       unlock_extent(&BTRFS_I(inode)->io_tree, lock_start,
+                                     lock_end, GFP_NOFS);
+                       extent_locked = 0;
+               }
+
+               if (path->slots[0] >= nritems) {
+                       if (++nr_scanned > 2)
+                               break;
+
+                       BUG_ON(extent_locked);
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret < 0)
+                               goto out;
+                       if (ret > 0)
+                               break;
+                       leaf = path->nodes[0];
+                       nritems = btrfs_header_nritems(leaf);
+               }
+
+               btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+               if (ref_path->owner_objectid != BTRFS_MULTIPLE_OBJECTIDS) {
+                       if ((key.objectid > ref_path->owner_objectid) ||
+                           (key.objectid == ref_path->owner_objectid &&
+                            key.type > BTRFS_EXTENT_DATA_KEY) ||
+                           (key.offset >= first_pos + extent_key->offset))
+                               break;
+               }
+
+               if (inode && key.objectid != inode->i_ino) {
+                       BUG_ON(extent_locked);
+                       btrfs_release_path(root, path);
+                       mutex_unlock(&inode->i_mutex);
+                       iput(inode);
+                       inode = NULL;
+                       continue;
+               }
+
+               if (key.type != BTRFS_EXTENT_DATA_KEY) {
+                       path->slots[0]++;
+                       ret = 1;
+                       goto next;
+               }
+               fi = btrfs_item_ptr(leaf, path->slots[0],
+                                   struct btrfs_file_extent_item);
+               extent_type = btrfs_file_extent_type(leaf, fi);
+               if ((extent_type != BTRFS_FILE_EXTENT_REG &&
+                    extent_type != BTRFS_FILE_EXTENT_PREALLOC) ||
+                   (btrfs_file_extent_disk_bytenr(leaf, fi) !=
+                    extent_key->objectid)) {
+                       path->slots[0]++;
+                       ret = 1;
+                       goto next;
+               }
+
+               num_bytes = btrfs_file_extent_num_bytes(leaf, fi);
+               ext_offset = btrfs_file_extent_offset(leaf, fi);
+
+               if (first_pos > key.offset - ext_offset)
+                       first_pos = key.offset - ext_offset;
+
+               if (!extent_locked) {
+                       lock_start = key.offset;
+                       lock_end = lock_start + num_bytes - 1;
+               } else {
+                       if (lock_start > key.offset ||
+                           lock_end + 1 < key.offset + num_bytes) {
+                               unlock_extent(&BTRFS_I(inode)->io_tree,
+                                             lock_start, lock_end, GFP_NOFS);
+                               extent_locked = 0;
+                       }
+               }
+
+               if (!inode) {
+                       btrfs_release_path(root, path);
+
+                       inode = btrfs_iget_locked(root->fs_info->sb,
+                                                 key.objectid, root);
+                       if (inode->i_state & I_NEW) {
+                               BTRFS_I(inode)->root = root;
+                               BTRFS_I(inode)->location.objectid =
+                                       key.objectid;
+                               BTRFS_I(inode)->location.type =
+                                       BTRFS_INODE_ITEM_KEY;
+                               BTRFS_I(inode)->location.offset = 0;
+                               btrfs_read_locked_inode(inode);
+                               unlock_new_inode(inode);
+                       }
+                       /*
+                        * some code calls btrfs_commit_transaction while
+                        * holding the i_mutex, so we can't use mutex_lock
+                        * here.
+                        */
+                       if (is_bad_inode(inode) ||
+                           !mutex_trylock(&inode->i_mutex)) {
+                               iput(inode);
+                               inode = NULL;
+                               key.offset = (u64)-1;
+                               goto skip;
+                       }
+               }
+
+               if (!extent_locked) {
+                       struct btrfs_ordered_extent *ordered;
+
+                       btrfs_release_path(root, path);
+
+                       lock_extent(&BTRFS_I(inode)->io_tree, lock_start,
+                                   lock_end, GFP_NOFS);
+                       ordered = btrfs_lookup_first_ordered_extent(inode,
+                                                                   lock_end);
+                       if (ordered &&
+                           ordered->file_offset <= lock_end &&
+                           ordered->file_offset + ordered->len > lock_start) {
+                               unlock_extent(&BTRFS_I(inode)->io_tree,
+                                             lock_start, lock_end, GFP_NOFS);
+                               btrfs_start_ordered_extent(inode, ordered, 1);
+                               btrfs_put_ordered_extent(ordered);
+                               key.offset += num_bytes;
+                               goto skip;
+                       }
+                       if (ordered)
+                               btrfs_put_ordered_extent(ordered);
+
+                       extent_locked = 1;
+                       continue;
+               }
+
+               if (nr_extents == 1) {
+                       /* update extent pointer in place */
+                       btrfs_set_file_extent_disk_bytenr(leaf, fi,
+                                               new_extents[0].disk_bytenr);
+                       btrfs_set_file_extent_disk_num_bytes(leaf, fi,
+                                               new_extents[0].disk_num_bytes);
+                       btrfs_mark_buffer_dirty(leaf);
+
+                       btrfs_drop_extent_cache(inode, key.offset,
+                                               key.offset + num_bytes - 1, 0);
+
+                       ret = btrfs_inc_extent_ref(trans, root,
+                                               new_extents[0].disk_bytenr,
+                                               new_extents[0].disk_num_bytes,
+                                               leaf->start,
+                                               root->root_key.objectid,
+                                               trans->transid,
+                                               key.objectid);
+                       BUG_ON(ret);
+
+                       ret = btrfs_free_extent(trans, root,
+                                               extent_key->objectid,
+                                               extent_key->offset,
+                                               leaf->start,
+                                               btrfs_header_owner(leaf),
+                                               btrfs_header_generation(leaf),
+                                               key.objectid, 0);
+                       BUG_ON(ret);
+
+                       btrfs_release_path(root, path);
+                       key.offset += num_bytes;
+               } else {
+                       BUG_ON(1);
+#if 0
+                       u64 alloc_hint;
+                       u64 extent_len;
+                       int i;
+                       /*
+                        * drop the old extent pointer first, then insert
+                        * the new pointers one by one
+                        */
+                       btrfs_release_path(root, path);
+                       ret = btrfs_drop_extents(trans, root, inode, key.offset,
+                                                key.offset + num_bytes,
+                                                key.offset, &alloc_hint);
+                       BUG_ON(ret);
+
+                       for (i = 0; i < nr_extents; i++) {
+                               if (ext_offset >= new_extents[i].num_bytes) {
+                                       ext_offset -= new_extents[i].num_bytes;
+                                       continue;
+                               }
+                               extent_len = min(new_extents[i].num_bytes -
+                                                ext_offset, num_bytes);
+
+                               ret = btrfs_insert_empty_item(trans, root,
+                                                             path, &key,
+                                                             sizeof(*fi));
+                               BUG_ON(ret);
+
+                               leaf = path->nodes[0];
+                               fi = btrfs_item_ptr(leaf, path->slots[0],
+                                               struct btrfs_file_extent_item);
+                               btrfs_set_file_extent_generation(leaf, fi,
+                                                       trans->transid);
+                               btrfs_set_file_extent_type(leaf, fi,
+                                                       BTRFS_FILE_EXTENT_REG);
+                               btrfs_set_file_extent_disk_bytenr(leaf, fi,
+                                               new_extents[i].disk_bytenr);
+                               btrfs_set_file_extent_disk_num_bytes(leaf, fi,
+                                               new_extents[i].disk_num_bytes);
+                               btrfs_set_file_extent_ram_bytes(leaf, fi,
+                                               new_extents[i].ram_bytes);
+
+                               btrfs_set_file_extent_compression(leaf, fi,
+                                               new_extents[i].compression);
+                               btrfs_set_file_extent_encryption(leaf, fi,
+                                               new_extents[i].encryption);
+                               btrfs_set_file_extent_other_encoding(leaf, fi,
+                                               new_extents[i].other_encoding);
+
+                               btrfs_set_file_extent_num_bytes(leaf, fi,
+                                                       extent_len);
+                               ext_offset += new_extents[i].offset;
+                               btrfs_set_file_extent_offset(leaf, fi,
+                                                       ext_offset);
+                               btrfs_mark_buffer_dirty(leaf);
+
+                               btrfs_drop_extent_cache(inode, key.offset,
+                                               key.offset + extent_len - 1, 0);
+
+                               ret = btrfs_inc_extent_ref(trans, root,
+                                               new_extents[i].disk_bytenr,
+                                               new_extents[i].disk_num_bytes,
+                                               leaf->start,
+                                               root->root_key.objectid,
+                                               trans->transid, key.objectid);
+                               BUG_ON(ret);
+                               btrfs_release_path(root, path);
+
+                               inode_add_bytes(inode, extent_len);
+
+                               ext_offset = 0;
+                               num_bytes -= extent_len;
+                               key.offset += extent_len;
+
+                               if (num_bytes == 0)
+                                       break;
+                       }
+                       BUG_ON(i >= nr_extents);
+#endif
+               }
+
+               if (extent_locked) {
+                       unlock_extent(&BTRFS_I(inode)->io_tree, lock_start,
+                                     lock_end, GFP_NOFS);
+                       extent_locked = 0;
+               }
+skip:
+               if (ref_path->owner_objectid != BTRFS_MULTIPLE_OBJECTIDS &&
+                   key.offset >= first_pos + extent_key->offset)
+                       break;
+
+               cond_resched();
+       }
+       ret = 0;
+out:
+       btrfs_release_path(root, path);
+       if (inode) {
+               mutex_unlock(&inode->i_mutex);
+               if (extent_locked) {
+                       unlock_extent(&BTRFS_I(inode)->io_tree, lock_start,
+                                     lock_end, GFP_NOFS);
+               }
+               iput(inode);
+       }
+       return ret;
+}
+
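+/*
+ * After a leaf in the reloc tree has been COWed, copy the cached leaf
+ * ref of the original block (orig_start) over to the new location
+ * (buf->start) so the ref cache stays valid for the relocated leaf.
+ */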
+int btrfs_reloc_tree_cache_ref(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              struct extent_buffer *buf, u64 orig_start)
+{
+       int level;
+       int ret;
+
+       BUG_ON(btrfs_header_generation(buf) != trans->transid);
+       BUG_ON(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID);
+
+       level = btrfs_header_level(buf);
+       if (level == 0) {
+               struct btrfs_leaf_ref *ref;
+               struct btrfs_leaf_ref *orig_ref;
+
+               orig_ref = btrfs_lookup_leaf_ref(root, orig_start);
+               if (!orig_ref)
+                       return -ENOENT;
+
+               ref = btrfs_alloc_leaf_ref(root, orig_ref->nritems);
+               if (!ref) {
+                       btrfs_free_leaf_ref(root, orig_ref);
+                       return -ENOMEM;
+               }
+
+               ref->nritems = orig_ref->nritems;
+               memcpy(ref->extents, orig_ref->extents,
+                       sizeof(ref->extents[0]) * ref->nritems);
+
+               btrfs_free_leaf_ref(root, orig_ref);
+
+               ref->root_gen = trans->transid;
+               ref->bytenr = buf->start;
+               ref->owner = btrfs_header_owner(buf);
+               ref->generation = btrfs_header_generation(buf);
+               ret = btrfs_add_leaf_ref(root, ref, 0);
+               WARN_ON(ret);
+               btrfs_free_leaf_ref(root, ref);
+       }
+       return 0;
+}
+
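+/*
+ * Drop the cached extent mappings in target_root for every regular file
+ * extent referenced by this leaf, so stale mappings from before the
+ * relocation are not used once the underlying extents have moved.
+ */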
+static noinline int invalidate_extent_cache(struct btrfs_root *root,
+                                       struct extent_buffer *leaf,
+                                       struct btrfs_block_group_cache *group,
+                                       struct btrfs_root *target_root)
+{
+       struct btrfs_key key;
+       struct inode *inode = NULL;
+       struct btrfs_file_extent_item *fi;
+       u64 num_bytes;
+       u64 skip_objectid = 0;
+       u32 nritems;
+       u32 i;
+
+       nritems = btrfs_header_nritems(leaf);
+       for (i = 0; i < nritems; i++) {
+               btrfs_item_key_to_cpu(leaf, &key, i);
+               if (key.objectid == skip_objectid ||
+                   key.type != BTRFS_EXTENT_DATA_KEY)
+                       continue;
+               fi = btrfs_item_ptr(leaf, i, struct btrfs_file_extent_item);
+               if (btrfs_file_extent_type(leaf, fi) ==
+                   BTRFS_FILE_EXTENT_INLINE)
+                       continue;
+               if (btrfs_file_extent_disk_bytenr(leaf, fi) == 0)
+                       continue;
+               if (!inode || inode->i_ino != key.objectid) {
+                       iput(inode);
+                       inode = btrfs_ilookup(target_root->fs_info->sb,
+                                             key.objectid, target_root, 1);
+               }
+               if (!inode) {
+                       skip_objectid = key.objectid;
+                       continue;
+               }
+               num_bytes = btrfs_file_extent_num_bytes(leaf, fi);
+
+               lock_extent(&BTRFS_I(inode)->io_tree, key.offset,
+                           key.offset + num_bytes - 1, GFP_NOFS);
+               btrfs_drop_extent_cache(inode, key.offset,
+                                       key.offset + num_bytes - 1, 1);
+               unlock_extent(&BTRFS_I(inode)->io_tree, key.offset,
+                             key.offset + num_bytes - 1, GFP_NOFS);
+               cond_resched();
+       }
+       iput(inode);
+       return 0;
+}
+
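+/*
+ * Rewrite every file extent pointer in 'leaf' that falls inside the
+ * block group being relocated: look up the extent's new location via
+ * the reloc inode, update the pointer and the cached leaf ref in place,
+ * then adjust the extent references to match.
+ */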
+static noinline int replace_extents_in_leaf(struct btrfs_trans_handle *trans,
+                                       struct btrfs_root *root,
+                                       struct extent_buffer *leaf,
+                                       struct btrfs_block_group_cache *group,
+                                       struct inode *reloc_inode)
+{
+       struct btrfs_key key;
+       struct btrfs_key extent_key;
+       struct btrfs_file_extent_item *fi;
+       struct btrfs_leaf_ref *ref;
+       struct disk_extent *new_extent;
+       u64 bytenr;
+       u64 num_bytes;
+       u32 nritems;
+       u32 i;
+       int ext_index;
+       int nr_extent;
+       int ret;
+
+       new_extent = kmalloc(sizeof(*new_extent), GFP_NOFS);
+       BUG_ON(!new_extent);
+
+       ref = btrfs_lookup_leaf_ref(root, leaf->start);
+       BUG_ON(!ref);
+
+       ext_index = -1;
+       nritems = btrfs_header_nritems(leaf);
+       for (i = 0; i < nritems; i++) {
+               btrfs_item_key_to_cpu(leaf, &key, i);
+               if (btrfs_key_type(&key) != BTRFS_EXTENT_DATA_KEY)
+                       continue;
+               fi = btrfs_item_ptr(leaf, i, struct btrfs_file_extent_item);
+               if (btrfs_file_extent_type(leaf, fi) ==
+                   BTRFS_FILE_EXTENT_INLINE)
+                       continue;
+               bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+               num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+               if (bytenr == 0)
+                       continue;
+
+               ext_index++;
+               if (bytenr >= group->key.objectid + group->key.offset ||
+                   bytenr + num_bytes <= group->key.objectid)
+                       continue;
+
+               extent_key.objectid = bytenr;
+               extent_key.offset = num_bytes;
+               extent_key.type = BTRFS_EXTENT_ITEM_KEY;
+               nr_extent = 1;
+               ret = get_new_locations(reloc_inode, &extent_key,
+                                       group->key.objectid, 1,
+                                       &new_extent, &nr_extent);
+               if (ret > 0)
+                       continue;
+               BUG_ON(ret < 0);
+
+               BUG_ON(ref->extents[ext_index].bytenr != bytenr);
+               BUG_ON(ref->extents[ext_index].num_bytes != num_bytes);
+               ref->extents[ext_index].bytenr = new_extent->disk_bytenr;
+               ref->extents[ext_index].num_bytes = new_extent->disk_num_bytes;
+
+               btrfs_set_file_extent_disk_bytenr(leaf, fi,
+                                               new_extent->disk_bytenr);
+               btrfs_set_file_extent_disk_num_bytes(leaf, fi,
+                                               new_extent->disk_num_bytes);
+               btrfs_mark_buffer_dirty(leaf);
+
+               ret = btrfs_inc_extent_ref(trans, root,
+                                       new_extent->disk_bytenr,
+                                       new_extent->disk_num_bytes,
+                                       leaf->start,
+                                       root->root_key.objectid,
+                                       trans->transid, key.objectid);
+               BUG_ON(ret);
+               ret = btrfs_free_extent(trans, root,
+                                       bytenr, num_bytes, leaf->start,
+                                       btrfs_header_owner(leaf),
+                                       btrfs_header_generation(leaf),
+                                       key.objectid, 0);
+               BUG_ON(ret);
+               cond_resched();
+       }
+       kfree(new_extent);
+       BUG_ON(ext_index + 1 != ref->nritems);
+       btrfs_free_leaf_ref(root, ref);
+       return 0;
+}
+
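+/*
+ * Detach the reloc tree from 'root' and put it on the dead reloc root
+ * list; its root item is updated so the tree can be dropped later by
+ * btrfs_drop_dead_reloc_roots.
+ */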
+int btrfs_free_reloc_root(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root)
+{
+       struct btrfs_root *reloc_root;
+       int ret;
+
+       if (root->reloc_root) {
+               reloc_root = root->reloc_root;
+               root->reloc_root = NULL;
+               list_add(&reloc_root->dead_list,
+                        &root->fs_info->dead_reloc_roots);
+
+               btrfs_set_root_bytenr(&reloc_root->root_item,
+                                     reloc_root->node->start);
+               btrfs_set_root_level(&reloc_root->root_item,
+                                    btrfs_header_level(reloc_root->node));
+               memset(&reloc_root->root_item.drop_progress, 0,
+                       sizeof(struct btrfs_disk_key));
+               reloc_root->root_item.drop_level = 0;
+
+               ret = btrfs_update_root(trans, root->fs_info->tree_root,
+                                       &reloc_root->root_key,
+                                       &reloc_root->root_item);
+               BUG_ON(ret);
+       }
+       return 0;
+}
+
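+/*
+ * Drop every reloc tree on the dead list and delete its root item from
+ * the tree root. Dropping restarts with a fresh transaction each time
+ * btrfs_drop_snapshot returns -EAGAIN.
+ */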
+int btrfs_drop_dead_reloc_roots(struct btrfs_root *root)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *reloc_root;
+       struct btrfs_root *prev_root = NULL;
+       struct list_head dead_roots;
+       int ret;
+       unsigned long nr;
+
+       INIT_LIST_HEAD(&dead_roots);
+       list_splice_init(&root->fs_info->dead_reloc_roots, &dead_roots);
+
+       while (!list_empty(&dead_roots)) {
+               reloc_root = list_entry(dead_roots.prev,
+                                       struct btrfs_root, dead_list);
+               list_del_init(&reloc_root->dead_list);
+
+               BUG_ON(reloc_root->commit_root != NULL);
+               while (1) {
+                       trans = btrfs_join_transaction(root, 1);
+                       BUG_ON(!trans);
+
+                       mutex_lock(&root->fs_info->drop_mutex);
+                       ret = btrfs_drop_snapshot(trans, reloc_root);
+                       if (ret != -EAGAIN)
+                               break;
+                       mutex_unlock(&root->fs_info->drop_mutex);
+
+                       nr = trans->blocks_used;
+                       ret = btrfs_end_transaction(trans, root);
+                       BUG_ON(ret);
+                       btrfs_btree_balance_dirty(root, nr);
+               }
+
+               free_extent_buffer(reloc_root->node);
+
+               ret = btrfs_del_root(trans, root->fs_info->tree_root,
+                                    &reloc_root->root_key);
+               BUG_ON(ret);
+               mutex_unlock(&root->fs_info->drop_mutex);
+
+               nr = trans->blocks_used;
+               ret = btrfs_end_transaction(trans, root);
+               BUG_ON(ret);
+               btrfs_btree_balance_dirty(root, nr);
+
+               kfree(prev_root);
+               prev_root = reloc_root;
+       }
+       if (prev_root) {
+               btrfs_remove_leaf_refs(prev_root, (u64)-1, 0);
+               kfree(prev_root);
+       }
+       return 0;
+}
+
+int btrfs_add_dead_reloc_root(struct btrfs_root *root)
+{
+       list_add(&root->dead_list, &root->fs_info->dead_reloc_roots);
+       return 0;
+}
+
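+/*
+ * Called at mount time: if dead reloc roots were left behind by a
+ * crash, commit a transaction so they are fully dropped, then run
+ * orphan cleanup on the data relocation tree.
+ */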
+int btrfs_cleanup_reloc_trees(struct btrfs_root *root)
+{
+       struct btrfs_root *reloc_root;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_key location;
+       int found;
+       int ret;
+
+       mutex_lock(&root->fs_info->tree_reloc_mutex);
+       ret = btrfs_find_dead_roots(root, BTRFS_TREE_RELOC_OBJECTID, NULL);
+       BUG_ON(ret);
+       found = !list_empty(&root->fs_info->dead_reloc_roots);
+       mutex_unlock(&root->fs_info->tree_reloc_mutex);
+
+       if (found) {
+               trans = btrfs_start_transaction(root, 1);
+               BUG_ON(!trans);
+               ret = btrfs_commit_transaction(trans, root);
+               BUG_ON(ret);
+       }
+
+       location.objectid = BTRFS_DATA_RELOC_TREE_OBJECTID;
+       location.offset = (u64)-1;
+       location.type = BTRFS_ROOT_ITEM_KEY;
+
+       reloc_root = btrfs_read_fs_root_no_name(root->fs_info, &location);
+       BUG_ON(!reloc_root);
+       btrfs_orphan_cleanup(reloc_root);
+       return 0;
+}
+
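+/*
+ * Create the reloc tree for 'root' if it does not already exist: take a
+ * snapshot of root->commit_root under BTRFS_TREE_RELOC_OBJECTID, insert
+ * the new root item and cache the result in root->reloc_root.
+ */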
+static noinline int init_reloc_tree(struct btrfs_trans_handle *trans,
+                                   struct btrfs_root *root)
+{
+       struct btrfs_root *reloc_root;
+       struct extent_buffer *eb;
+       struct btrfs_root_item *root_item;
+       struct btrfs_key root_key;
+       int ret;
+
+       BUG_ON(!root->ref_cows);
+       if (root->reloc_root)
+               return 0;
+
+       root_item = kmalloc(sizeof(*root_item), GFP_NOFS);
+       BUG_ON(!root_item);
+
+       ret = btrfs_copy_root(trans, root, root->commit_root,
+                             &eb, BTRFS_TREE_RELOC_OBJECTID);
+       BUG_ON(ret);
+
+       root_key.objectid = BTRFS_TREE_RELOC_OBJECTID;
+       root_key.offset = root->root_key.objectid;
+       root_key.type = BTRFS_ROOT_ITEM_KEY;
+
+       memcpy(root_item, &root->root_item, sizeof(*root_item));
+       btrfs_set_root_refs(root_item, 0);
+       btrfs_set_root_bytenr(root_item, eb->start);
+       btrfs_set_root_level(root_item, btrfs_header_level(eb));
+       btrfs_set_root_generation(root_item, trans->transid);
+
+       btrfs_tree_unlock(eb);
+       free_extent_buffer(eb);
+
+       ret = btrfs_insert_root(trans, root->fs_info->tree_root,
+                               &root_key, root_item);
+       BUG_ON(ret);
+       kfree(root_item);
+
+       reloc_root = btrfs_read_fs_root_no_radix(root->fs_info->tree_root,
+                                                &root_key);
+       BUG_ON(!reloc_root);
+       reloc_root->last_trans = trans->transid;
+       reloc_root->commit_root = NULL;
+       reloc_root->ref_tree = &root->fs_info->reloc_ref_tree;
+
+       root->reloc_root = reloc_root;
+       return 0;
+}
+
+/*
+ * Core function of space balance.
+ *
+ * The idea is to use reloc trees to relocate tree blocks in reference
+ * counted roots. There is one reloc tree for each subvol, and all
+ * reloc trees share the same root key objectid. Reloc trees are
+ * snapshots of the latest committed roots of subvols (root->commit_root).
+ *
+ * To relocate a tree block referenced by a subvol, there are two steps:
+ * COW the block through the subvol's reloc tree, then update the block
+ * pointer in the subvol to point to the new block. Since all reloc trees
+ * share the same root key objectid, special handling for tree blocks
+ * owned by them is easy. Once a tree block has been COWed in one reloc
+ * tree, we can use the resulting new block directly when the same block
+ * is required to COW again through other reloc trees. This way, relocated
+ * tree blocks are shared between reloc trees, so they are also shared
+ * between subvols.
+ */
+static noinline int relocate_one_path(struct btrfs_trans_handle *trans,
+                                     struct btrfs_root *root,
+                                     struct btrfs_path *path,
+                                     struct btrfs_key *first_key,
+                                     struct btrfs_ref_path *ref_path,
+                                     struct btrfs_block_group_cache *group,
+                                     struct inode *reloc_inode)
+{
+       struct btrfs_root *reloc_root;
+       struct extent_buffer *eb = NULL;
+       struct btrfs_key *keys;
+       u64 *nodes;
+       int level;
+       int shared_level;
+       int lowest_level = 0;
+       int ret;
+
+       if (ref_path->owner_objectid < BTRFS_FIRST_FREE_OBJECTID)
+               lowest_level = ref_path->owner_objectid;
+
+       if (!root->ref_cows) {
+               path->lowest_level = lowest_level;
+               ret = btrfs_search_slot(trans, root, first_key, path, 0, 1);
+               BUG_ON(ret < 0);
+               path->lowest_level = 0;
+               btrfs_release_path(root, path);
+               return 0;
+       }
+
+       mutex_lock(&root->fs_info->tree_reloc_mutex);
+       ret = init_reloc_tree(trans, root);
+       BUG_ON(ret);
+       reloc_root = root->reloc_root;
+
+       shared_level = ref_path->shared_level;
+       ref_path->shared_level = BTRFS_MAX_LEVEL - 1;
+
+       keys = ref_path->node_keys;
+       nodes = ref_path->new_nodes;
+       memset(&keys[shared_level + 1], 0,
+              sizeof(*keys) * (BTRFS_MAX_LEVEL - shared_level - 1));
+       memset(&nodes[shared_level + 1], 0,
+              sizeof(*nodes) * (BTRFS_MAX_LEVEL - shared_level - 1));
+
+       if (nodes[lowest_level] == 0) {
+               path->lowest_level = lowest_level;
+               ret = btrfs_search_slot(trans, reloc_root, first_key, path,
+                                       0, 1);
+               BUG_ON(ret);
+               for (level = lowest_level; level < BTRFS_MAX_LEVEL; level++) {
+                       eb = path->nodes[level];
+                       if (!eb || eb == reloc_root->node)
+                               break;
+                       nodes[level] = eb->start;
+                       if (level == 0)
+                               btrfs_item_key_to_cpu(eb, &keys[level], 0);
+                       else
+                               btrfs_node_key_to_cpu(eb, &keys[level], 0);
+               }
+               if (nodes[0] &&
+                   ref_path->owner_objectid >= BTRFS_FIRST_FREE_OBJECTID) {
+                       eb = path->nodes[0];
+                       ret = replace_extents_in_leaf(trans, reloc_root, eb,
+                                                     group, reloc_inode);
+                       BUG_ON(ret);
+               }
+               btrfs_release_path(reloc_root, path);
+       } else {
+               ret = btrfs_merge_path(trans, reloc_root, keys, nodes,
+                                      lowest_level);
+               BUG_ON(ret);
+       }
+
+       /*
+        * replace tree blocks in the fs tree with tree blocks in
+        * the reloc tree.
+        */
+       ret = btrfs_merge_path(trans, root, keys, nodes, lowest_level);
+       BUG_ON(ret < 0);
+
+       if (ref_path->owner_objectid >= BTRFS_FIRST_FREE_OBJECTID) {
+               ret = btrfs_search_slot(trans, reloc_root, first_key, path,
+                                       0, 0);
+               BUG_ON(ret);
+               extent_buffer_get(path->nodes[0]);
+               eb = path->nodes[0];
+               btrfs_release_path(reloc_root, path);
+               ret = invalidate_extent_cache(reloc_root, eb, group, root);
+               BUG_ON(ret);
+               free_extent_buffer(eb);
+       }
+
+       mutex_unlock(&root->fs_info->tree_reloc_mutex);
+       path->lowest_level = 0;
+       return 0;
+}
+
+static noinline int relocate_tree_block(struct btrfs_trans_handle *trans,
+                                       struct btrfs_root *root,
+                                       struct btrfs_path *path,
+                                       struct btrfs_key *first_key,
+                                       struct btrfs_ref_path *ref_path)
+{
+       int ret;
+
+       ret = relocate_one_path(trans, root, path, first_key,
+                               ref_path, NULL, NULL);
+       BUG_ON(ret);
+
+       if (root == root->fs_info->extent_root)
+               btrfs_extent_post_op(trans, root);
+
+       return 0;
+}
+
+static noinline int del_extent_zero(struct btrfs_trans_handle *trans,
+                                   struct btrfs_root *extent_root,
+                                   struct btrfs_path *path,
+                                   struct btrfs_key *extent_key)
+{
+       int ret;
+
+       ret = btrfs_search_slot(trans, extent_root, extent_key, path, -1, 1);
+       if (ret)
+               goto out;
+       ret = btrfs_del_item(trans, extent_root, path);
+out:
+       btrfs_release_path(extent_root, path);
+       return ret;
+}
+
+static noinline struct btrfs_root *read_ref_root(struct btrfs_fs_info *fs_info,
+                                               struct btrfs_ref_path *ref_path)
+{
+       struct btrfs_key root_key;
+
+       root_key.objectid = ref_path->root_objectid;
+       root_key.type = BTRFS_ROOT_ITEM_KEY;
+       if (is_cowonly_root(ref_path->root_objectid))
+               root_key.offset = 0;
+       else
+               root_key.offset = (u64)-1;
+
+       return btrfs_read_fs_root_no_name(fs_info, &root_key);
+}
+
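+/*
+ * Relocate a single extent: walk every reference path leading to it and,
+ * depending on the pass, either copy the data to its new location
+ * (pass 0), update the file extent pointers through the reloc trees
+ * (pass 1), or fall back to rewriting the pointers one extent at a time.
+ * Tree block references are handled by relocating the blocks themselves.
+ */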
+static noinline int relocate_one_extent(struct btrfs_root *extent_root,
+                                       struct btrfs_path *path,
+                                       struct btrfs_key *extent_key,
+                                       struct btrfs_block_group_cache *group,
+                                       struct inode *reloc_inode, int pass)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *found_root;
+       struct btrfs_ref_path *ref_path = NULL;
+       struct disk_extent *new_extents = NULL;
+       int nr_extents = 0;
+       int loops;
+       int ret;
+       int level;
+       struct btrfs_key first_key;
+       u64 prev_block = 0;
+
+       trans = btrfs_start_transaction(extent_root, 1);
+       BUG_ON(!trans);
+
+       if (extent_key->objectid == 0) {
+               ret = del_extent_zero(trans, extent_root, path, extent_key);
+               goto out;
+       }
+
+       ref_path = kmalloc(sizeof(*ref_path), GFP_NOFS);
+       if (!ref_path) {
+               ret = -ENOMEM;
+               goto out;
+       }
+
+       for (loops = 0; ; loops++) {
+               if (loops == 0) {
+                       ret = btrfs_first_ref_path(trans, extent_root, ref_path,
+                                                  extent_key->objectid);
+               } else {
+                       ret = btrfs_next_ref_path(trans, extent_root, ref_path);
+               }
+               if (ret < 0)
+                       goto out;
+               if (ret > 0)
+                       break;
+
+               if (ref_path->root_objectid == BTRFS_TREE_LOG_OBJECTID ||
+                   ref_path->root_objectid == BTRFS_TREE_RELOC_OBJECTID)
+                       continue;
+
+               found_root = read_ref_root(extent_root->fs_info, ref_path);
+               BUG_ON(!found_root);
+               /*
+                * for a reference counted tree, only process reference
+                * paths rooted at the latest committed root.
+                */
+               if (found_root->ref_cows &&
+                   ref_path->root_generation != found_root->root_key.offset)
+                       continue;
+
+               if (ref_path->owner_objectid >= BTRFS_FIRST_FREE_OBJECTID) {
+                       if (pass == 0) {
+                               /*
+                                * copy data extents to new locations
+                                */
+                               u64 group_start = group->key.objectid;
+                               ret = relocate_data_extent(reloc_inode,
+                                                          extent_key,
+                                                          group_start);
+                               if (ret < 0)
+                                       goto out;
+                               break;
+                       }
+                       level = 0;
+               } else {
+                       level = ref_path->owner_objectid;
+               }
+
+               if (prev_block != ref_path->nodes[level]) {
+                       struct extent_buffer *eb;
+                       u64 block_start = ref_path->nodes[level];
+                       u64 block_size = btrfs_level_size(found_root, level);
+
+                       eb = read_tree_block(found_root, block_start,
+                                            block_size, 0);
+                       btrfs_tree_lock(eb);
+                       BUG_ON(level != btrfs_header_level(eb));
+
+                       if (level == 0)
+                               btrfs_item_key_to_cpu(eb, &first_key, 0);
+                       else
+                               btrfs_node_key_to_cpu(eb, &first_key, 0);
+
+                       btrfs_tree_unlock(eb);
+                       free_extent_buffer(eb);
+                       prev_block = block_start;
+               }
+
+               btrfs_record_root_in_trans(found_root);
+               if (ref_path->owner_objectid >= BTRFS_FIRST_FREE_OBJECTID) {
+                       /*
+                        * try to update data extent references while
+                        * keeping metadata shared between snapshots.
+                        */
+                       if (pass == 1) {
+                               ret = relocate_one_path(trans, found_root,
+                                               path, &first_key, ref_path,
+                                               group, reloc_inode);
+                               if (ret < 0)
+                                       goto out;
+                               continue;
+                       }
+                       /*
+                        * use fallback method to process the remaining
+                        * references.
+                        */
+                       if (!new_extents) {
+                               u64 group_start = group->key.objectid;
+                               new_extents = kmalloc(sizeof(*new_extents),
+                                                     GFP_NOFS);
+                               nr_extents = 1;
+                               ret = get_new_locations(reloc_inode,
+                                                       extent_key,
+                                                       group_start, 1,
+                                                       &new_extents,
+                                                       &nr_extents);
+                               if (ret)
+                                       goto out;
+                       }
+                       ret = replace_one_extent(trans, found_root,
+                                               path, extent_key,
+                                               &first_key, ref_path,
+                                               new_extents, nr_extents);
+               } else {
+                       ret = relocate_tree_block(trans, found_root, path,
+                                                 &first_key, ref_path);
+               }
+               if (ret < 0)
+                       goto out;
+       }
+       ret = 0;
+out:
+       btrfs_end_transaction(trans, extent_root);
+       kfree(new_extents);
+       kfree(ref_path);
+       return ret;
+}
+
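+/*
+ * Pick the raid profile a relocated chunk should be rebuilt with, based
+ * on the number of writeable devices: mirroring degrades to duplication
+ * on a single device, while duplicated or single device chunks are
+ * upgraded to raid1/raid0 once multiple devices are available.
+ */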
+static u64 update_block_group_flags(struct btrfs_root *root, u64 flags)
+{
+       u64 num_devices;
+       u64 stripped = BTRFS_BLOCK_GROUP_RAID0 |
+               BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10;
+
+       num_devices = root->fs_info->fs_devices->rw_devices;
+       if (num_devices == 1) {
+               stripped |= BTRFS_BLOCK_GROUP_DUP;
+               stripped = flags & ~stripped;
+
+               /* turn raid0 into single device chunks */
+               if (flags & BTRFS_BLOCK_GROUP_RAID0)
+                       return stripped;
+
+               /* turn mirroring into duplication */
+               if (flags & (BTRFS_BLOCK_GROUP_RAID1 |
+                            BTRFS_BLOCK_GROUP_RAID10))
+                       return stripped | BTRFS_BLOCK_GROUP_DUP;
+               return flags;
+       } else {
+               /* they already had raid on here, just return */
+               if (flags & stripped)
+                       return flags;
+
+               stripped |= BTRFS_BLOCK_GROUP_DUP;
+               stripped = flags & ~stripped;
+
+               /* switch duplicated blocks with raid1 */
+               if (flags & BTRFS_BLOCK_GROUP_DUP)
+                       return stripped | BTRFS_BLOCK_GROUP_RAID1;
+
+               /* turn single device chunks into raid0 */
+               return stripped | BTRFS_BLOCK_GROUP_RAID0;
+       }
+       return flags;
+}
+
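+/*
+ * Make sure there is room to copy data into before shrinking: allocate
+ * a new chunk large enough to hold the used bytes of the block group
+ * being removed (plus some slack).
+ */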
+static int __alloc_chunk_for_shrink(struct btrfs_root *root,
+                    struct btrfs_block_group_cache *shrink_block_group,
+                    int force)
+{
+       struct btrfs_trans_handle *trans;
+       u64 new_alloc_flags;
+       u64 calc;
+
+       spin_lock(&shrink_block_group->lock);
+       if (btrfs_block_group_used(&shrink_block_group->item) > 0) {
+               spin_unlock(&shrink_block_group->lock);
+
+               trans = btrfs_start_transaction(root, 1);
+               spin_lock(&shrink_block_group->lock);
+
+               new_alloc_flags = update_block_group_flags(root,
+                                                  shrink_block_group->flags);
+               if (new_alloc_flags != shrink_block_group->flags) {
+                       calc =
+                            btrfs_block_group_used(&shrink_block_group->item);
+               } else {
+                       calc = shrink_block_group->key.offset;
+               }
+               spin_unlock(&shrink_block_group->lock);
+
+               do_chunk_alloc(trans, root->fs_info->extent_root,
+                              calc + 2 * 1024 * 1024, new_alloc_flags, force);
+
+               btrfs_end_transaction(trans, root);
+       } else
+               spin_unlock(&shrink_block_group->lock);
+       return 0;
+}
+
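+/*
+ * Insert a bare inode item for the relocation inode: a regular file of
+ * the given size, with compression disabled.
+ */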
+static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                u64 objectid, u64 size)
+{
+       struct btrfs_path *path;
+       struct btrfs_inode_item *item;
+       struct extent_buffer *leaf;
+       int ret;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       ret = btrfs_insert_empty_inode(trans, root, path, objectid);
+       if (ret)
+               goto out;
+
+       leaf = path->nodes[0];
+       item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_inode_item);
+       memset_extent_buffer(leaf, 0, (unsigned long)item, sizeof(*item));
+       btrfs_set_inode_generation(leaf, item, 1);
+       btrfs_set_inode_size(leaf, item, size);
+       btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
+       btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS);
+       btrfs_mark_buffer_dirty(leaf);
+       btrfs_release_path(root, path);
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
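+/*
+ * Create an orphan inode in the data relocation tree spanning the whole
+ * block group; data extents are copied into it before their original
+ * locations are freed. index_cnt remembers the block group start so
+ * file offsets map back to the old disk bytenrs.
+ */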
+static noinline struct inode *create_reloc_inode(struct btrfs_fs_info *fs_info,
+                                       struct btrfs_block_group_cache *group)
+{
+       struct inode *inode = NULL;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root;
+       struct btrfs_key root_key;
+       u64 objectid = BTRFS_FIRST_FREE_OBJECTID;
+       int err = 0;
+
+       root_key.objectid = BTRFS_DATA_RELOC_TREE_OBJECTID;
+       root_key.type = BTRFS_ROOT_ITEM_KEY;
+       root_key.offset = (u64)-1;
+       root = btrfs_read_fs_root_no_name(fs_info, &root_key);
+       if (IS_ERR(root))
+               return ERR_CAST(root);
+
+       trans = btrfs_start_transaction(root, 1);
+       BUG_ON(!trans);
+
+       err = btrfs_find_free_objectid(trans, root, objectid, &objectid);
+       if (err)
+               goto out;
+
+       err = __insert_orphan_inode(trans, root, objectid, group->key.offset);
+       BUG_ON(err);
+
+       err = btrfs_insert_file_extent(trans, root, objectid, 0, 0, 0,
+                                      group->key.offset, 0, group->key.offset,
+                                      0, 0, 0);
+       BUG_ON(err);
+
+       inode = btrfs_iget_locked(root->fs_info->sb, objectid, root);
+       if (inode->i_state & I_NEW) {
+               BTRFS_I(inode)->root = root;
+               BTRFS_I(inode)->location.objectid = objectid;
+               BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
+               BTRFS_I(inode)->location.offset = 0;
+               btrfs_read_locked_inode(inode);
+               unlock_new_inode(inode);
+               BUG_ON(is_bad_inode(inode));
+       } else {
+               BUG_ON(1);
+       }
+       BTRFS_I(inode)->index_cnt = group->key.objectid;
+
+       err = btrfs_orphan_add(trans, inode);
+out:
+       btrfs_end_transaction(trans, root);
+       if (err) {
+               if (inode)
+                       iput(inode);
+               inode = ERR_PTR(err);
+       }
+       return inode;
+}
+
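+/*
+ * Checksums were computed while the data sat in the relocation inode.
+ * Look them up by the old disk bytenr and re-point them at the ordered
+ * extent's new location before they are committed.
+ */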
+int btrfs_reloc_clone_csums(struct inode *inode, u64 file_pos, u64 len)
+{
+       struct btrfs_ordered_sum *sums;
+       struct btrfs_sector_sum *sector_sum;
+       struct btrfs_ordered_extent *ordered;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct list_head list;
+       size_t offset;
+       int ret;
+       u64 disk_bytenr;
+
+       INIT_LIST_HEAD(&list);
+
+       ordered = btrfs_lookup_ordered_extent(inode, file_pos);
+       BUG_ON(ordered->file_offset != file_pos || ordered->len != len);
+
+       disk_bytenr = file_pos + BTRFS_I(inode)->index_cnt;
+       ret = btrfs_lookup_csums_range(root->fs_info->csum_root, disk_bytenr,
+                                      disk_bytenr + len - 1, &list);
+
+       while (!list_empty(&list)) {
+               sums = list_entry(list.next, struct btrfs_ordered_sum, list);
+               list_del_init(&sums->list);
+
+               sector_sum = sums->sums;
+               sums->bytenr = ordered->start;
+
+               offset = 0;
+               while (offset < sums->len) {
+                       sector_sum->bytenr += ordered->start - disk_bytenr;
+                       sector_sum++;
+                       offset += root->sectorsize;
+               }
+
+               btrfs_add_ordered_sum(inode, ordered, sums);
+       }
+       btrfs_put_ordered_extent(ordered);
+       return 0;
+}
+
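+/*
+ * Relocate every extent in the given block group. The group is marked
+ * read only first, then extents are moved in repeated passes over the
+ * extent tree until a pass finds nothing left to relocate.
+ */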
+int btrfs_relocate_block_group(struct btrfs_root *root, u64 group_start)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_path *path;
+       struct btrfs_fs_info *info = root->fs_info;
+       struct extent_buffer *leaf;
+       struct inode *reloc_inode;
+       struct btrfs_block_group_cache *block_group;
+       struct btrfs_key key;
+       u64 skipped;
+       u64 cur_byte;
+       u64 total_found;
+       u32 nritems;
+       int ret;
+       int progress;
+       int pass = 0;
+
+       root = root->fs_info->extent_root;
+
+       block_group = btrfs_lookup_block_group(info, group_start);
+       BUG_ON(!block_group);
+
+       printk(KERN_INFO "btrfs relocating block group %llu flags %llu\n",
+              (unsigned long long)block_group->key.objectid,
+              (unsigned long long)block_group->flags);
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       reloc_inode = create_reloc_inode(info, block_group);
+       BUG_ON(IS_ERR(reloc_inode));
+
+       __alloc_chunk_for_shrink(root, block_group, 1);
+       set_block_group_readonly(block_group);
+
+       btrfs_start_delalloc_inodes(info->tree_root);
+       btrfs_wait_ordered_extents(info->tree_root, 0);
+again:
+       skipped = 0;
+       total_found = 0;
+       progress = 0;
+       key.objectid = block_group->key.objectid;
+       key.offset = 0;
+       key.type = 0;
+       cur_byte = key.objectid;
+
+       trans = btrfs_start_transaction(info->tree_root, 1);
+       btrfs_commit_transaction(trans, info->tree_root);
+
+       mutex_lock(&root->fs_info->cleaner_mutex);
+       btrfs_clean_old_snapshots(info->tree_root);
+       btrfs_remove_leaf_refs(info->tree_root, (u64)-1, 1);
+       mutex_unlock(&root->fs_info->cleaner_mutex);
+
+       while (1) {
+               ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+               if (ret < 0)
+                       goto out;
+next:
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               if (path->slots[0] >= nritems) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret < 0)
+                               goto out;
+                       if (ret == 1) {
+                               ret = 0;
+                               break;
+                       }
+                       leaf = path->nodes[0];
+                       nritems = btrfs_header_nritems(leaf);
+               }
+
+               btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+               if (key.objectid >= block_group->key.objectid +
+                   block_group->key.offset)
+                       break;
+
+               if (progress && need_resched()) {
+                       btrfs_release_path(root, path);
+                       cond_resched();
+                       progress = 0;
+                       continue;
+               }
+               progress = 1;
+
+               if (btrfs_key_type(&key) != BTRFS_EXTENT_ITEM_KEY ||
+                   key.objectid + key.offset <= cur_byte) {
+                       path->slots[0]++;
+                       goto next;
+               }
+
+               total_found++;
+               cur_byte = key.objectid + key.offset;
+               btrfs_release_path(root, path);
+
+               __alloc_chunk_for_shrink(root, block_group, 0);
+               ret = relocate_one_extent(root, path, &key, block_group,
+                                         reloc_inode, pass);
+               BUG_ON(ret < 0);
+               if (ret > 0)
+                       skipped++;
+
+               key.objectid = cur_byte;
+               key.type = 0;
+               key.offset = 0;
+       }
+
+       btrfs_release_path(root, path);
+
+       if (pass == 0) {
+               btrfs_wait_ordered_range(reloc_inode, 0, (u64)-1);
+               invalidate_mapping_pages(reloc_inode->i_mapping, 0, -1);
+       }
+
+       if (total_found > 0) {
+               printk(KERN_INFO "btrfs found %llu extents in pass %d\n",
+                      (unsigned long long)total_found, pass);
+               pass++;
+               if (total_found == skipped && pass > 2) {
+                       iput(reloc_inode);
+                       reloc_inode = create_reloc_inode(info, block_group);
+                       pass = 0;
+               }
+               goto again;
+       }
+
+       /* delete reloc_inode */
+       iput(reloc_inode);
+
+       /* unpin extents in this range */
+       trans = btrfs_start_transaction(info->tree_root, 1);
+       btrfs_commit_transaction(trans, info->tree_root);
+
+       spin_lock(&block_group->lock);
+       WARN_ON(block_group->pinned > 0);
+       WARN_ON(block_group->reserved > 0);
+       WARN_ON(btrfs_block_group_used(&block_group->item) > 0);
+       spin_unlock(&block_group->lock);
+       put_block_group(block_group);
+       ret = 0;
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
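+/* find the first block group item at or after the given key */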
+static int find_first_block_group(struct btrfs_root *root,
+               struct btrfs_path *path, struct btrfs_key *key)
+{
+       int ret = 0;
+       struct btrfs_key found_key;
+       struct extent_buffer *leaf;
+       int slot;
+
+       ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
+       if (ret < 0)
+               goto out;
+
+       while (1) {
+               slot = path->slots[0];
+               leaf = path->nodes[0];
+               if (slot >= btrfs_header_nritems(leaf)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret == 0)
+                               continue;
+                       if (ret < 0)
+                               goto out;
+                       break;
+               }
+               btrfs_item_key_to_cpu(leaf, &found_key, slot);
+
+               if (found_key.objectid >= key->objectid &&
+                   found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+                       ret = 0;
+                       goto out;
+               }
+               path->slots[0]++;
+       }
+       ret = -ENOENT;
+out:
+       return ret;
+}
+
+int btrfs_free_block_groups(struct btrfs_fs_info *info)
+{
+       struct btrfs_block_group_cache *block_group;
+       struct rb_node *n;
+
+       spin_lock(&info->block_group_cache_lock);
+       while ((n = rb_last(&info->block_group_cache_tree)) != NULL) {
+               block_group = rb_entry(n, struct btrfs_block_group_cache,
+                                      cache_node);
+               rb_erase(&block_group->cache_node,
+                        &info->block_group_cache_tree);
+               spin_unlock(&info->block_group_cache_lock);
+
+               btrfs_remove_free_space_cache(block_group);
+               down_write(&block_group->space_info->groups_sem);
+               list_del(&block_group->list);
+               up_write(&block_group->space_info->groups_sem);
+
+               WARN_ON(atomic_read(&block_group->count) != 1);
+               kfree(block_group);
+
+               spin_lock(&info->block_group_cache_lock);
+       }
+       spin_unlock(&info->block_group_cache_lock);
+       return 0;
+}
+
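+/*
+ * Read every block group item from the extent tree at mount time and
+ * build the in-memory block group cache and space info accounting.
+ */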
+int btrfs_read_block_groups(struct btrfs_root *root)
+{
+       struct btrfs_path *path;
+       int ret;
+       struct btrfs_block_group_cache *cache;
+       struct btrfs_fs_info *info = root->fs_info;
+       struct btrfs_space_info *space_info;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       struct extent_buffer *leaf;
+
+       root = info->extent_root;
+       key.objectid = 0;
+       key.offset = 0;
+       btrfs_set_key_type(&key, BTRFS_BLOCK_GROUP_ITEM_KEY);
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       while (1) {
+               ret = find_first_block_group(root, path, &key);
+               if (ret > 0) {
+                       ret = 0;
+                       goto error;
+               }
+               if (ret != 0)
+                       goto error;
+
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               cache = kzalloc(sizeof(*cache), GFP_NOFS);
+               if (!cache) {
+                       ret = -ENOMEM;
+                       break;
+               }
+
+               atomic_set(&cache->count, 1);
+               spin_lock_init(&cache->lock);
+               mutex_init(&cache->alloc_mutex);
+               mutex_init(&cache->cache_mutex);
+               INIT_LIST_HEAD(&cache->list);
+               read_extent_buffer(leaf, &cache->item,
+                                  btrfs_item_ptr_offset(leaf, path->slots[0]),
+                                  sizeof(cache->item));
+               memcpy(&cache->key, &found_key, sizeof(found_key));
+
+               key.objectid = found_key.objectid + found_key.offset;
+               btrfs_release_path(root, path);
+               cache->flags = btrfs_block_group_flags(&cache->item);
+
+               ret = update_space_info(info, cache->flags, found_key.offset,
+                                       btrfs_block_group_used(&cache->item),
+                                       &space_info);
+               BUG_ON(ret);
+               cache->space_info = space_info;
+               down_write(&space_info->groups_sem);
+               list_add_tail(&cache->list, &space_info->block_groups);
+               up_write(&space_info->groups_sem);
+
+               ret = btrfs_add_block_group_cache(root->fs_info, cache);
+               BUG_ON(ret);
+
+               set_avail_alloc_bits(root->fs_info, cache->flags);
+               if (btrfs_chunk_readonly(root, cache->key.objectid))
+                       set_block_group_readonly(cache);
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       return ret;
+}
+
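+/*
+ * Create a new block group for a freshly allocated chunk: set up the
+ * in-memory cache entry, account it in the matching space info and
+ * insert the block group item into the extent tree.
+ */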
+int btrfs_make_block_group(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root, u64 bytes_used,
+                          u64 type, u64 chunk_objectid, u64 chunk_offset,
+                          u64 size)
+{
+       int ret;
+       struct btrfs_root *extent_root;
+       struct btrfs_block_group_cache *cache;
+
+       extent_root = root->fs_info->extent_root;
+
+       root->fs_info->last_trans_new_blockgroup = trans->transid;
+
+       cache = kzalloc(sizeof(*cache), GFP_NOFS);
+       if (!cache)
+               return -ENOMEM;
+
+       cache->key.objectid = chunk_offset;
+       cache->key.offset = size;
+       cache->key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+       atomic_set(&cache->count, 1);
+       spin_lock_init(&cache->lock);
+       mutex_init(&cache->alloc_mutex);
+       mutex_init(&cache->cache_mutex);
+       INIT_LIST_HEAD(&cache->list);
+
+       btrfs_set_block_group_used(&cache->item, bytes_used);
+       btrfs_set_block_group_chunk_objectid(&cache->item, chunk_objectid);
+       cache->flags = type;
+       btrfs_set_block_group_flags(&cache->item, type);
+
+       ret = update_space_info(root->fs_info, cache->flags, size, bytes_used,
+                               &cache->space_info);
+       BUG_ON(ret);
+       down_write(&cache->space_info->groups_sem);
+       list_add_tail(&cache->list, &cache->space_info->block_groups);
+       up_write(&cache->space_info->groups_sem);
+
+       ret = btrfs_add_block_group_cache(root->fs_info, cache);
+       BUG_ON(ret);
+
+       ret = btrfs_insert_item(trans, extent_root, &cache->key, &cache->item,
+                               sizeof(cache->item));
+       BUG_ON(ret);
+
+       finish_current_insert(trans, extent_root, 0);
+       ret = del_pending_extents(trans, extent_root, 0);
+       BUG_ON(ret);
+       set_avail_alloc_bits(extent_root->fs_info, type);
+
+       return 0;
+}
+
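+/*
+ * Remove an empty, read only block group: tear down its caches and
+ * space info accounting, then delete its item from the extent tree.
+ */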
+int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root, u64 group_start)
+{
+       struct btrfs_path *path;
+       struct btrfs_block_group_cache *block_group;
+       struct btrfs_key key;
+       int ret;
+
+       root = root->fs_info->extent_root;
+
+       block_group = btrfs_lookup_block_group(root->fs_info, group_start);
+       BUG_ON(!block_group);
+       BUG_ON(!block_group->ro);
+
+       memcpy(&key, &block_group->key, sizeof(key));
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       btrfs_remove_free_space_cache(block_group);
+       rb_erase(&block_group->cache_node,
+                &root->fs_info->block_group_cache_tree);
+       down_write(&block_group->space_info->groups_sem);
+       list_del(&block_group->list);
+       up_write(&block_group->space_info->groups_sem);
+
+       spin_lock(&block_group->space_info->lock);
+       block_group->space_info->total_bytes -= block_group->key.offset;
+       block_group->space_info->bytes_readonly -= block_group->key.offset;
+       spin_unlock(&block_group->space_info->lock);
+       block_group->space_info->full = 0;
+
+       put_block_group(block_group);
+       put_block_group(block_group);
+
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret > 0)
+               ret = -EIO;
+       if (ret < 0)
+               goto out;
+
+       ret = btrfs_del_item(trans, root, path);
+out:
+       btrfs_free_path(path);
+       return ret;
+}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
new file mode 100644 (file)
index 0000000..e086d40
--- /dev/null
@@ -0,0 +1,3717 @@
+#include <linux/bitops.h>
+#include <linux/slab.h>
+#include <linux/bio.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/pagemap.h>
+#include <linux/page-flags.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/swap.h>
+#include <linux/version.h>
+#include <linux/writeback.h>
+#include <linux/pagevec.h>
+#include "extent_io.h"
+#include "extent_map.h"
+#include "compat.h"
+#include "ctree.h"
+#include "btrfs_inode.h"
+
+/* temporary define until extent_map moves out of btrfs */
+struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
+                                      unsigned long extra_flags,
+                                      void (*ctor)(void *, struct kmem_cache *,
+                                                   unsigned long));
+
+static struct kmem_cache *extent_state_cache;
+static struct kmem_cache *extent_buffer_cache;
+
+static LIST_HEAD(buffers);
+static LIST_HEAD(states);
+
+#define LEAK_DEBUG 0
+#if LEAK_DEBUG
+static DEFINE_SPINLOCK(leak_lock);
+#endif
+
+#define BUFFER_LRU_MAX 64
+
+struct tree_entry {
+       u64 start;
+       u64 end;
+       struct rb_node rb_node;
+};
+
+struct extent_page_data {
+       struct bio *bio;
+       struct extent_io_tree *tree;
+       get_extent_t *get_extent;
+
+       /* tells writepage not to lock the state bits for this range
+        * it still does the unlocking
+        */
+       int extent_locked;
+};
+
+int __init extent_io_init(void)
+{
+       extent_state_cache = btrfs_cache_create("extent_state",
+                                           sizeof(struct extent_state), 0,
+                                           NULL);
+       if (!extent_state_cache)
+               return -ENOMEM;
+
+       extent_buffer_cache = btrfs_cache_create("extent_buffers",
+                                           sizeof(struct extent_buffer), 0,
+                                           NULL);
+       if (!extent_buffer_cache)
+               goto free_state_cache;
+       return 0;
+
+free_state_cache:
+       kmem_cache_destroy(extent_state_cache);
+       return -ENOMEM;
+}
+
+void extent_io_exit(void)
+{
+       struct extent_state *state;
+       struct extent_buffer *eb;
+
+       while (!list_empty(&states)) {
+               state = list_entry(states.next, struct extent_state, leak_list);
+               printk(KERN_ERR "btrfs state leak: start %llu end %llu "
+                      "state %lu in tree %p refs %d\n",
+                      (unsigned long long)state->start,
+                      (unsigned long long)state->end,
+                      state->state, state->tree, atomic_read(&state->refs));
+               list_del(&state->leak_list);
+               kmem_cache_free(extent_state_cache, state);
+       }
+
+       while (!list_empty(&buffers)) {
+               eb = list_entry(buffers.next, struct extent_buffer, leak_list);
+               printk(KERN_ERR "btrfs buffer leak start %llu len %lu "
+                      "refs %d\n", (unsigned long long)eb->start,
+                      eb->len, atomic_read(&eb->refs));
+               list_del(&eb->leak_list);
+               kmem_cache_free(extent_buffer_cache, eb);
+       }
+       if (extent_state_cache)
+               kmem_cache_destroy(extent_state_cache);
+       if (extent_buffer_cache)
+               kmem_cache_destroy(extent_buffer_cache);
+}
+
+void extent_io_tree_init(struct extent_io_tree *tree,
+                         struct address_space *mapping, gfp_t mask)
+{
+       tree->state.rb_node = NULL;
+       tree->buffer.rb_node = NULL;
+       tree->ops = NULL;
+       tree->dirty_bytes = 0;
+       spin_lock_init(&tree->lock);
+       spin_lock_init(&tree->buffer_lock);
+       tree->mapping = mapping;
+}
+
+static struct extent_state *alloc_extent_state(gfp_t mask)
+{
+       struct extent_state *state;
+#if LEAK_DEBUG
+       unsigned long flags;
+#endif
+
+       state = kmem_cache_alloc(extent_state_cache, mask);
+       if (!state)
+               return state;
+       state->state = 0;
+       state->private = 0;
+       state->tree = NULL;
+#if LEAK_DEBUG
+       spin_lock_irqsave(&leak_lock, flags);
+       list_add(&state->leak_list, &states);
+       spin_unlock_irqrestore(&leak_lock, flags);
+#endif
+       atomic_set(&state->refs, 1);
+       init_waitqueue_head(&state->wq);
+       return state;
+}
+
+static void free_extent_state(struct extent_state *state)
+{
+       if (!state)
+               return;
+       if (atomic_dec_and_test(&state->refs)) {
+#if LEAK_DEBUG
+               unsigned long flags;
+#endif
+               WARN_ON(state->tree);
+#if LEAK_DEBUG
+               spin_lock_irqsave(&leak_lock, flags);
+               list_del(&state->leak_list);
+               spin_unlock_irqrestore(&leak_lock, flags);
+#endif
+               kmem_cache_free(extent_state_cache, state);
+       }
+}
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+                                  struct rb_node *node)
+{
+       struct rb_node **p = &root->rb_node;
+       struct rb_node *parent = NULL;
+       struct tree_entry *entry;
+
+       while (*p) {
+               parent = *p;
+               entry = rb_entry(parent, struct tree_entry, rb_node);
+
+               if (offset < entry->start)
+                       p = &(*p)->rb_left;
+               else if (offset > entry->end)
+                       p = &(*p)->rb_right;
+               else
+                       return parent;
+       }
+
+       entry = rb_entry(node, struct tree_entry, rb_node);
+       rb_link_node(node, parent, p);
+       rb_insert_color(node, root);
+       return NULL;
+}
+
+static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset,
+                                    struct rb_node **prev_ret,
+                                    struct rb_node **next_ret)
+{
+       struct rb_root *root = &tree->state;
+       struct rb_node *n = root->rb_node;
+       struct rb_node *prev = NULL;
+       struct rb_node *orig_prev = NULL;
+       struct tree_entry *entry;
+       struct tree_entry *prev_entry = NULL;
+
+       while (n) {
+               entry = rb_entry(n, struct tree_entry, rb_node);
+               prev = n;
+               prev_entry = entry;
+
+               if (offset < entry->start)
+                       n = n->rb_left;
+               else if (offset > entry->end)
+                       n = n->rb_right;
+               else
+                       return n;
+       }
+
+       if (prev_ret) {
+               orig_prev = prev;
+               while (prev && offset > prev_entry->end) {
+                       prev = rb_next(prev);
+                       prev_entry = rb_entry(prev, struct tree_entry, rb_node);
+               }
+               *prev_ret = prev;
+               prev = orig_prev;
+       }
+
+       if (next_ret) {
+               prev_entry = rb_entry(prev, struct tree_entry, rb_node);
+               while (prev && offset < prev_entry->start) {
+                       prev = rb_prev(prev);
+                       prev_entry = rb_entry(prev, struct tree_entry, rb_node);
+               }
+               *next_ret = prev;
+       }
+       return NULL;
+}
+
+static inline struct rb_node *tree_search(struct extent_io_tree *tree,
+                                         u64 offset)
+{
+       struct rb_node *prev = NULL;
+       struct rb_node *ret;
+
+       ret = __etree_search(tree, offset, &prev, NULL);
+       if (!ret)
+               return prev;
+       return ret;
+}
+
+static struct extent_buffer *buffer_tree_insert(struct extent_io_tree *tree,
+                                         u64 offset, struct rb_node *node)
+{
+       struct rb_root *root = &tree->buffer;
+       struct rb_node **p = &root->rb_node;
+       struct rb_node *parent = NULL;
+       struct extent_buffer *eb;
+
+       while (*p) {
+               parent = *p;
+               eb = rb_entry(parent, struct extent_buffer, rb_node);
+
+               if (offset < eb->start)
+                       p = &(*p)->rb_left;
+               else if (offset > eb->start)
+                       p = &(*p)->rb_right;
+               else
+                       return eb;
+       }
+
+       rb_link_node(node, parent, p);
+       rb_insert_color(node, root);
+       return NULL;
+}
+
+static struct extent_buffer *buffer_search(struct extent_io_tree *tree,
+                                          u64 offset)
+{
+       struct rb_root *root = &tree->buffer;
+       struct rb_node *n = root->rb_node;
+       struct extent_buffer *eb;
+
+       while (n) {
+               eb = rb_entry(n, struct extent_buffer, rb_node);
+               if (offset < eb->start)
+                       n = n->rb_left;
+               else if (offset > eb->start)
+                       n = n->rb_right;
+               else
+                       return eb;
+       }
+       return NULL;
+}
+
+/*
+ * utility function to look for merge candidates inside a given range.
+ * Any extents with matching state are merged together into a single
+ * extent in the tree.  Extents with EXTENT_IOBITS in their state field
+ * are not merged because the end_io handlers need to be able to do
+ * operations on them without sleeping (or doing allocations/splits).
+ *
+ * This should be called with the tree lock held.
+ */
+static int merge_state(struct extent_io_tree *tree,
+                      struct extent_state *state)
+{
+       struct extent_state *other;
+       struct rb_node *other_node;
+
+       if (state->state & (EXTENT_IOBITS | EXTENT_BOUNDARY))
+               return 0;
+
+       other_node = rb_prev(&state->rb_node);
+       if (other_node) {
+               other = rb_entry(other_node, struct extent_state, rb_node);
+               if (other->end == state->start - 1 &&
+                   other->state == state->state) {
+                       state->start = other->start;
+                       other->tree = NULL;
+                       rb_erase(&other->rb_node, &tree->state);
+                       free_extent_state(other);
+               }
+       }
+       other_node = rb_next(&state->rb_node);
+       if (other_node) {
+               other = rb_entry(other_node, struct extent_state, rb_node);
+               if (other->start == state->end + 1 &&
+                   other->state == state->state) {
+                       other->start = state->start;
+                       state->tree = NULL;
+                       rb_erase(&state->rb_node, &tree->state);
+                       free_extent_state(state);
+               }
+       }
+       return 0;
+}
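
The merge rule is easier to see in isolation. Below is a minimal stand-alone sketch (hypothetical ranges; plain userspace C, not part of the patch) of the invariant merge_state() maintains: records combine only when they are exactly adjacent and their state bits are identical, which is also why anything carrying EXTENT_IOBITS or EXTENT_BOUNDARY is never merged.

        #include <stdio.h>

        struct range {
                unsigned long long start;
                unsigned long long end;
                unsigned long state;
        };

        /* merge b into a when they touch and carry identical state bits */
        static int try_merge(struct range *a, struct range *b)
        {
                if (a->end + 1 == b->start && a->state == b->state) {
                        a->end = b->end;
                        return 1;
                }
                return 0;
        }

        int main(void)
        {
                struct range a = { 0, 4095, 0x1 };      /* e.g. only dirty */
                struct range b = { 4096, 8191, 0x1 };   /* adjacent, same bits */

                if (try_merge(&a, &b))
                        printf("merged: [%llu, %llu]\n", a.start, a.end);
                return 0;
        }
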
+
+static void set_state_cb(struct extent_io_tree *tree,
+                        struct extent_state *state,
+                        unsigned long bits)
+{
+       if (tree->ops && tree->ops->set_bit_hook) {
+               tree->ops->set_bit_hook(tree->mapping->host, state->start,
+                                       state->end, state->state, bits);
+       }
+}
+
+static void clear_state_cb(struct extent_io_tree *tree,
+                          struct extent_state *state,
+                          unsigned long bits)
+{
+       if (tree->ops && tree->ops->clear_bit_hook) {
+               tree->ops->clear_bit_hook(tree->mapping->host, state->start,
+                                         state->end, state->state, bits);
+       }
+}
+
+/*
+ * insert an extent_state struct into the tree.  'bits' are set on the
+ * struct before it is inserted.
+ *
+ * This may return -EEXIST if the extent is already there, in which case the
+ * state struct is freed.
+ *
+ * The tree lock is not taken internally.  This is a utility function and
+ * probably isn't what you want to call (see set/clear_extent_bit).
+ */
+static int insert_state(struct extent_io_tree *tree,
+                       struct extent_state *state, u64 start, u64 end,
+                       int bits)
+{
+       struct rb_node *node;
+
+       if (end < start) {
+               printk(KERN_ERR "btrfs end < start %llu %llu\n",
+                      (unsigned long long)end,
+                      (unsigned long long)start);
+               WARN_ON(1);
+       }
+       if (bits & EXTENT_DIRTY)
+               tree->dirty_bytes += end - start + 1;
+       set_state_cb(tree, state, bits);
+       state->state |= bits;
+       state->start = start;
+       state->end = end;
+       node = tree_insert(&tree->state, end, &state->rb_node);
+       if (node) {
+               struct extent_state *found;
+               found = rb_entry(node, struct extent_state, rb_node);
+               printk(KERN_ERR "btrfs found node %llu %llu on insert of "
+                      "%llu %llu\n", (unsigned long long)found->start,
+                      (unsigned long long)found->end,
+                      (unsigned long long)start, (unsigned long long)end);
+               free_extent_state(state);
+               return -EEXIST;
+       }
+       state->tree = tree;
+       merge_state(tree, state);
+       return 0;
+}
+
+/*
+ * split a given extent state struct in two, inserting the preallocated
+ * struct 'prealloc' as the newly created second half.  'split' indicates an
+ * offset inside 'orig' where it should be split.
+ *
+ * Before calling,
+ * the tree has 'orig' at [orig->start, orig->end].  After calling, there
+ * are two extent state structs in the tree:
+ * prealloc: [orig->start, split - 1]
+ * orig: [ split, orig->end ]
+ *
+ * The tree locks are not taken by this function. They need to be held
+ * by the caller.
+ */
+static int split_state(struct extent_io_tree *tree, struct extent_state *orig,
+                      struct extent_state *prealloc, u64 split)
+{
+       struct rb_node *node;
+       prealloc->start = orig->start;
+       prealloc->end = split - 1;
+       prealloc->state = orig->state;
+       orig->start = split;
+
+       node = tree_insert(&tree->state, prealloc->end, &prealloc->rb_node);
+       if (node) {
+               struct extent_state *found;
+               found = rb_entry(node, struct extent_state, rb_node);
+               free_extent_state(prealloc);
+               return -EEXIST;
+       }
+       prealloc->tree = tree;
+       return 0;
+}
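
Concretely (hypothetical offsets): splitting orig = [0, 8191] at split = 4096 leaves prealloc = [0, 4095] holding a copy of the state bits and shrinks orig to [4096, 8191]. A stand-alone model of just the arithmetic:

        #include <assert.h>

        struct range {
                unsigned long long start;
                unsigned long long end;
                unsigned long state;
        };

        static void split(struct range *orig, struct range *prealloc,
                          unsigned long long at)
        {
                prealloc->start = orig->start;
                prealloc->end = at - 1;
                prealloc->state = orig->state;  /* both halves keep the bits */
                orig->start = at;
        }

        int main(void)
        {
                struct range orig = { 0, 8191, 0x1 };
                struct range pre;

                split(&orig, &pre, 4096);
                assert(pre.start == 0 && pre.end == 4095);
                assert(orig.start == 4096 && orig.end == 8191);
                return 0;
        }
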
+
+/*
+ * utility function to clear some bits in an extent state struct.
+ * it will optionally wake up anyone waiting on this state (wake == 1), or
+ * forcibly remove the state from the tree (delete == 1).
+ *
+ * If no bits are set on the state struct after clearing things, the
+ * struct is freed and removed from the tree
+ */
+static int clear_state_bit(struct extent_io_tree *tree,
+                           struct extent_state *state, int bits, int wake,
+                           int delete)
+{
+       int ret = state->state & bits;
+
+       if ((bits & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) {
+               u64 range = state->end - state->start + 1;
+               WARN_ON(range > tree->dirty_bytes);
+               tree->dirty_bytes -= range;
+       }
+       clear_state_cb(tree, state, bits);
+       state->state &= ~bits;
+       if (wake)
+               wake_up(&state->wq);
+       if (delete || state->state == 0) {
+               if (state->tree) {
+                       clear_state_cb(tree, state, state->state);
+                       rb_erase(&state->rb_node, &tree->state);
+                       state->tree = NULL;
+                       free_extent_state(state);
+               } else {
+                       WARN_ON(1);
+               }
+       } else {
+               merge_state(tree, state);
+       }
+       return ret;
+}
+
+/*
+ * clear some bits on a range in the tree.  This may require splitting
+ * or inserting elements in the tree, so the gfp mask is used to
+ * indicate which allocations or sleeping are allowed.
+ *
+ * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove
+ * the given range from the tree regardless of state (i.e. for truncate).
+ *
+ * the range [start, end] is inclusive.
+ *
+ * This takes the tree lock, and returns < 0 on error, > 0 if any of the
+ * bits were already set, or zero if none of the bits were already set.
+ */
+int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
+                    int bits, int wake, int delete, gfp_t mask)
+{
+       struct extent_state *state;
+       struct extent_state *prealloc = NULL;
+       struct rb_node *node;
+       int err;
+       int set = 0;
+
+again:
+       if (!prealloc && (mask & __GFP_WAIT)) {
+               prealloc = alloc_extent_state(mask);
+               if (!prealloc)
+                       return -ENOMEM;
+       }
+
+       spin_lock(&tree->lock);
+       /*
+        * this search will find the extents that end after
+        * our range starts
+        */
+       node = tree_search(tree, start);
+       if (!node)
+               goto out;
+       state = rb_entry(node, struct extent_state, rb_node);
+       if (state->start > end)
+               goto out;
+       WARN_ON(state->end < start);
+
+       /*
+        *     | ---- desired range ---- |
+        *  | state | or
+        *  | ------------- state -------------- |
+        *
+        * We need to split the extent we found, and may flip
+        * bits on second half.
+        *
+        * If the extent we found extends past our range, we
+        * just split and search again.  It'll get split again
+        * the next time though.
+        *
+        * If the extent we found is inside our range, we clear
+        * the desired bit on it.
+        */
+
+       if (state->start < start) {
+               if (!prealloc) {
+                       prealloc = alloc_extent_state(GFP_ATOMIC);
+                       BUG_ON(!prealloc);
+               }
+               err = split_state(tree, state, prealloc, start);
+               BUG_ON(err == -EEXIST);
+               prealloc = NULL;
+               if (err)
+                       goto out;
+               if (state->end <= end) {
+                       start = state->end + 1;
+                       set |= clear_state_bit(tree, state, bits,
+                                       wake, delete);
+               } else {
+                       start = state->start;
+               }
+               goto search_again;
+       }
+       /*
+        * | ---- desired range ---- |
+        *                        | state |
+        * We need to split the extent, and clear the bit
+        * on the first half
+        */
+       if (state->start <= end && state->end > end) {
+               if (!prealloc) {
+                       prealloc = alloc_extent_state(GFP_ATOMIC);
+                       BUG_ON(!prealloc);
+               }
+               err = split_state(tree, state, prealloc, end + 1);
+               BUG_ON(err == -EEXIST);
+
+               if (wake)
+                       wake_up(&state->wq);
+               set |= clear_state_bit(tree, prealloc, bits,
+                                      wake, delete);
+               prealloc = NULL;
+               goto out;
+       }
+
+       start = state->end + 1;
+       set |= clear_state_bit(tree, state, bits, wake, delete);
+       goto search_again;
+
+out:
+       spin_unlock(&tree->lock);
+       if (prealloc)
+               free_extent_state(prealloc);
+
+       return set;
+
+search_again:
+       if (start > end)
+               goto out;
+       spin_unlock(&tree->lock);
+       if (mask & __GFP_WAIT)
+               cond_resched();
+       goto again;
+}
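
As a usage sketch (hypothetical caller; 'tree' is assumed to be an initialized extent_io_tree), clearing dirtiness on the first four 4 KiB pages while allowing the function to sleep and allocate:

        int was_set;

        was_set = clear_extent_bit(tree, 0, 16383, EXTENT_DIRTY,
                                   0 /* wake */, 0 /* delete */, GFP_NOFS);
        /* was_set > 0 if any byte in the range had EXTENT_DIRTY */
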
+
+static int wait_on_state(struct extent_io_tree *tree,
+                        struct extent_state *state)
+               __releases(tree->lock)
+               __acquires(tree->lock)
+{
+       DEFINE_WAIT(wait);
+       prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE);
+       spin_unlock(&tree->lock);
+       schedule();
+       spin_lock(&tree->lock);
+       finish_wait(&state->wq, &wait);
+       return 0;
+}
+
+/*
+ * waits for one or more bits to clear on a range in the state tree.
+ * The range [start, end] is inclusive.
+ * The tree lock is taken by this function
+ */
+int wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int bits)
+{
+       struct extent_state *state;
+       struct rb_node *node;
+
+       spin_lock(&tree->lock);
+again:
+       while (1) {
+               /*
+                * this search will find all the extents that end after
+                * our range starts
+                */
+               node = tree_search(tree, start);
+               if (!node)
+                       break;
+
+               state = rb_entry(node, struct extent_state, rb_node);
+
+               if (state->start > end)
+                       goto out;
+
+               if (state->state & bits) {
+                       start = state->start;
+                       atomic_inc(&state->refs);
+                       wait_on_state(tree, state);
+                       free_extent_state(state);
+                       goto again;
+               }
+               start = state->end + 1;
+
+               if (start > end)
+                       break;
+
+               if (need_resched()) {
+                       spin_unlock(&tree->lock);
+                       cond_resched();
+                       spin_lock(&tree->lock);
+               }
+       }
+out:
+       spin_unlock(&tree->lock);
+       return 0;
+}
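
A hypothetical waiter (sketch; 'tree' and 'start' assumed set up by the caller) blocking until a page-sized range finishes writeback, which is exactly what wait_on_extent_writeback() below wraps:

        wait_extent_bit(tree, start, start + PAGE_CACHE_SIZE - 1,
                        EXTENT_WRITEBACK);
        /* every state record seen in the range had EXTENT_WRITEBACK
         * clear by the time this returned */
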
+
+static void set_state_bits(struct extent_io_tree *tree,
+                          struct extent_state *state,
+                          int bits)
+{
+       if ((bits & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) {
+               u64 range = state->end - state->start + 1;
+               tree->dirty_bytes += range;
+       }
+       set_state_cb(tree, state, bits);
+       state->state |= bits;
+}
+
+/*
+ * set some bits on a range in the tree.  This may require allocations
+ * or sleeping, so the gfp mask is used to indicate what is allowed.
+ *
+ * If 'exclusive' == 1, this will fail with -EEXIST if some part of the
+ * range already has the desired bits set.  The start of the existing
+ * range is returned in failed_start in this case.
+ *
+ * [start, end] is inclusive
+ * This takes the tree lock.
+ */
+static int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
+                         int bits, int exclusive, u64 *failed_start,
+                         gfp_t mask)
+{
+       struct extent_state *state;
+       struct extent_state *prealloc = NULL;
+       struct rb_node *node;
+       int err = 0;
+       int set;
+       u64 last_start;
+       u64 last_end;
+again:
+       if (!prealloc && (mask & __GFP_WAIT)) {
+               prealloc = alloc_extent_state(mask);
+               if (!prealloc)
+                       return -ENOMEM;
+       }
+
+       spin_lock(&tree->lock);
+       /*
+        * this search will find all the extents that end after
+        * our range starts.
+        */
+       node = tree_search(tree, start);
+       if (!node) {
+               err = insert_state(tree, prealloc, start, end, bits);
+               prealloc = NULL;
+               BUG_ON(err == -EEXIST);
+               goto out;
+       }
+
+       state = rb_entry(node, struct extent_state, rb_node);
+       last_start = state->start;
+       last_end = state->end;
+
+       /*
+        * | ---- desired range ---- |
+        * | state |
+        *
+        * Just lock what we found and keep going
+        */
+       if (state->start == start && state->end <= end) {
+               set = state->state & bits;
+               if (set && exclusive) {
+                       *failed_start = state->start;
+                       err = -EEXIST;
+                       goto out;
+               }
+               set_state_bits(tree, state, bits);
+               start = state->end + 1;
+               merge_state(tree, state);
+               goto search_again;
+       }
+
+       /*
+        *     | ---- desired range ---- |
+        * | state |
+        *   or
+        * | ------------- state -------------- |
+        *
+        * We need to split the extent we found, and may flip bits on
+        * second half.
+        *
+        * If the extent we found extends past our
+        * range, we just split and search again.  It'll get split
+        * again the next time though.
+        *
+        * If the extent we found is inside our range, we set the
+        * desired bit on it.
+        */
+       if (state->start < start) {
+               set = state->state & bits;
+               if (exclusive && set) {
+                       *failed_start = start;
+                       err = -EEXIST;
+                       goto out;
+               }
+               err = split_state(tree, state, prealloc, start);
+               BUG_ON(err == -EEXIST);
+               prealloc = NULL;
+               if (err)
+                       goto out;
+               if (state->end <= end) {
+                       set_state_bits(tree, state, bits);
+                       start = state->end + 1;
+                       merge_state(tree, state);
+               } else {
+                       start = state->start;
+               }
+               goto search_again;
+       }
+       /*
+        * | ---- desired range ---- |
+        *     | state | or               | state |
+        *
+        * There's a hole, we need to insert something in it and
+        * ignore the extent we found.
+        */
+       if (state->start > start) {
+               u64 this_end;
+               if (end < last_start)
+                       this_end = end;
+               else
+                       this_end = last_start - 1;
+               err = insert_state(tree, prealloc, start, this_end,
+                                  bits);
+               prealloc = NULL;
+               BUG_ON(err == -EEXIST);
+               if (err)
+                       goto out;
+               start = this_end + 1;
+               goto search_again;
+       }
+       /*
+        * | ---- desired range ---- |
+        *                        | state |
+        * We need to split the extent, and set the bit
+        * on the first half
+        */
+       if (state->start <= end && state->end > end) {
+               set = state->state & bits;
+               if (exclusive && set) {
+                       *failed_start = start;
+                       err = -EEXIST;
+                       goto out;
+               }
+               err = split_state(tree, state, prealloc, end + 1);
+               BUG_ON(err == -EEXIST);
+
+               set_state_bits(tree, prealloc, bits);
+               merge_state(tree, prealloc);
+               prealloc = NULL;
+               goto out;
+       }
+
+       goto search_again;
+
+out:
+       spin_unlock(&tree->lock);
+       if (prealloc)
+               free_extent_state(prealloc);
+
+       return err;
+
+search_again:
+       if (start > end)
+               goto out;
+       spin_unlock(&tree->lock);
+       if (mask & __GFP_WAIT)
+               cond_resched();
+       goto again;
+}
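
The exclusive mode is the building block for the range locks below. A hedged sketch of the contract (hypothetical range; 'tree' assumed initialized): if any byte already carries EXTENT_LOCKED, the call fails and failed_start reports where the collision begins:

        u64 failed_start;
        int err;

        err = set_extent_bit(tree, 0, 8191, EXTENT_LOCKED,
                             1 /* exclusive */, &failed_start, GFP_NOFS);
        if (err == -EEXIST) {
                /* part of [0, 8191] was already locked, starting at
                 * failed_start */
        }
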
+
+/* wrappers around set/clear extent bit */
+int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
+                    gfp_t mask)
+{
+       return set_extent_bit(tree, start, end, EXTENT_DIRTY, 0, NULL,
+                             mask);
+}
+
+int set_extent_ordered(struct extent_io_tree *tree, u64 start, u64 end,
+                      gfp_t mask)
+{
+       return set_extent_bit(tree, start, end, EXTENT_ORDERED, 0, NULL, mask);
+}
+
+int set_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
+                   int bits, gfp_t mask)
+{
+       return set_extent_bit(tree, start, end, bits, 0, NULL,
+                             mask);
+}
+
+int clear_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
+                     int bits, gfp_t mask)
+{
+       return clear_extent_bit(tree, start, end, bits, 0, 0, mask);
+}
+
+int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
+                    gfp_t mask)
+{
+       return set_extent_bit(tree, start, end,
+                             EXTENT_DELALLOC | EXTENT_DIRTY,
+                             0, NULL, mask);
+}
+
+int clear_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
+                      gfp_t mask)
+{
+       return clear_extent_bit(tree, start, end,
+                               EXTENT_DIRTY | EXTENT_DELALLOC, 0, 0, mask);
+}
+
+int clear_extent_ordered(struct extent_io_tree *tree, u64 start, u64 end,
+                        gfp_t mask)
+{
+       return clear_extent_bit(tree, start, end, EXTENT_ORDERED, 1, 0, mask);
+}
+
+int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
+                    gfp_t mask)
+{
+       return set_extent_bit(tree, start, end, EXTENT_NEW, 0, NULL,
+                             mask);
+}
+
+static int clear_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
+                      gfp_t mask)
+{
+       return clear_extent_bit(tree, start, end, EXTENT_NEW, 0, 0, mask);
+}
+
+int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
+                       gfp_t mask)
+{
+       return set_extent_bit(tree, start, end, EXTENT_UPTODATE, 0, NULL,
+                             mask);
+}
+
+static int clear_extent_uptodate(struct extent_io_tree *tree, u64 start,
+                                u64 end, gfp_t mask)
+{
+       return clear_extent_bit(tree, start, end, EXTENT_UPTODATE, 0, 0, mask);
+}
+
+static int set_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end,
+                        gfp_t mask)
+{
+       return set_extent_bit(tree, start, end, EXTENT_WRITEBACK,
+                             0, NULL, mask);
+}
+
+static int clear_extent_writeback(struct extent_io_tree *tree, u64 start,
+                                 u64 end, gfp_t mask)
+{
+       return clear_extent_bit(tree, start, end, EXTENT_WRITEBACK, 1, 0, mask);
+}
+
+int wait_on_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end)
+{
+       return wait_extent_bit(tree, start, end, EXTENT_WRITEBACK);
+}
+
+/*
+ * either insert or lock the state struct between start and end; use mask
+ * to tell us whether waiting is desired.
+ */
+int lock_extent(struct extent_io_tree *tree, u64 start, u64 end, gfp_t mask)
+{
+       int err;
+       u64 failed_start;
+       while (1) {
+               err = set_extent_bit(tree, start, end, EXTENT_LOCKED, 1,
+                                    &failed_start, mask);
+               if (err == -EEXIST && (mask & __GFP_WAIT)) {
+                       wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
+                       start = failed_start;
+               } else {
+                       break;
+               }
+               WARN_ON(start > end);
+       }
+       return err;
+}
+
+int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+                   gfp_t mask)
+{
+       int err;
+       u64 failed_start;
+
+       err = set_extent_bit(tree, start, end, EXTENT_LOCKED, 1,
+                            &failed_start, mask);
+       if (err == -EEXIST) {
+               if (failed_start > start)
+                       clear_extent_bit(tree, start, failed_start - 1,
+                                        EXTENT_LOCKED, 1, 0, mask);
+               return 0;
+       }
+       return 1;
+}
+
+int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+                 gfp_t mask)
+{
+       return clear_extent_bit(tree, start, end, EXTENT_LOCKED, 1, 0, mask);
+}
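
Put together, the typical pattern (hypothetical; one page worth of bytes) is a simple bracket around whatever touches the range; try_lock_extent is the non-blocking variant, returning 1 on success and 0 after backing out any partial lock:

        lock_extent(tree, start, start + PAGE_CACHE_SIZE - 1, GFP_NOFS);
        /* ... work on the range's pages and extent state ... */
        unlock_extent(tree, start, start + PAGE_CACHE_SIZE - 1, GFP_NOFS);
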
+
+/*
+ * helper function to set pages and extents in the tree dirty
+ */
+int set_range_dirty(struct extent_io_tree *tree, u64 start, u64 end)
+{
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+       struct page *page;
+
+       while (index <= end_index) {
+               page = find_get_page(tree->mapping, index);
+               BUG_ON(!page);
+               __set_page_dirty_nobuffers(page);
+               page_cache_release(page);
+               index++;
+       }
+       set_extent_dirty(tree, start, end, GFP_NOFS);
+       return 0;
+}
+
+/*
+ * helper function to set both pages and extents in the tree writeback
+ */
+static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
+{
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+       struct page *page;
+
+       while (index <= end_index) {
+               page = find_get_page(tree->mapping, index);
+               BUG_ON(!page);
+               set_page_writeback(page);
+               page_cache_release(page);
+               index++;
+       }
+       set_extent_writeback(tree, start, end, GFP_NOFS);
+       return 0;
+}
+
+/*
+ * find the first offset in the io tree with 'bits' set. zero is
+ * returned if we find something, and *start_ret and *end_ret are
+ * set to reflect the state struct that was found.
+ *
+ * If nothing was found, 1 is returned, < 0 on error
+ */
+int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
+                         u64 *start_ret, u64 *end_ret, int bits)
+{
+       struct rb_node *node;
+       struct extent_state *state;
+       int ret = 1;
+
+       spin_lock(&tree->lock);
+       /*
+        * this search will find all the extents that end after
+        * our range starts.
+        */
+       node = tree_search(tree, start);
+       if (!node)
+               goto out;
+
+       while (1) {
+               state = rb_entry(node, struct extent_state, rb_node);
+               if (state->end >= start && (state->state & bits)) {
+                       *start_ret = state->start;
+                       *end_ret = state->end;
+                       ret = 0;
+                       break;
+               }
+               node = rb_next(node);
+               if (!node)
+                       break;
+       }
+out:
+       spin_unlock(&tree->lock);
+       return ret;
+}
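
Since each hit reports the whole state record, bumping 'start' past it walks every flagged range in order. A hypothetical scan of all dirty ranges (sketch only; 'tree' assumed initialized):

        u64 start = 0;
        u64 found_start;
        u64 found_end;

        while (!find_first_extent_bit(tree, start, &found_start,
                                      &found_end, EXTENT_DIRTY)) {
                /* [found_start, found_end] has EXTENT_DIRTY set */
                start = found_end + 1;
        }
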
+
+/* find the first state struct with 'bits' set after 'start', and
+ * return it.  tree->lock must be held.  NULL will be returned if
+ * nothing was found after 'start'
+ */
+struct extent_state *find_first_extent_bit_state(struct extent_io_tree *tree,
+                                                u64 start, int bits)
+{
+       struct rb_node *node;
+       struct extent_state *state;
+
+       /*
+        * this search will find all the extents that end after
+        * our range starts.
+        */
+       node = tree_search(tree, start);
+       if (!node)
+               goto out;
+
+       while (1) {
+               state = rb_entry(node, struct extent_state, rb_node);
+               if (state->end >= start && (state->state & bits))
+                       return state;
+
+               node = rb_next(node);
+               if (!node)
+                       break;
+       }
+out:
+       return NULL;
+}
+
+/*
+ * find a contiguous range of bytes in the file marked as delalloc, not
+ * more than 'max_bytes'.  'start' and 'end' are used to return the range.
+ *
+ * 1 is returned if we find something, 0 if nothing was in the tree
+ */
+static noinline u64 find_delalloc_range(struct extent_io_tree *tree,
+                                       u64 *start, u64 *end, u64 max_bytes)
+{
+       struct rb_node *node;
+       struct extent_state *state;
+       u64 cur_start = *start;
+       u64 found = 0;
+       u64 total_bytes = 0;
+
+       spin_lock(&tree->lock);
+
+       /*
+        * this search will find all the extents that end after
+        * our range starts.
+        */
+       node = tree_search(tree, cur_start);
+       if (!node) {
+               if (!found)
+                       *end = (u64)-1;
+               goto out;
+       }
+
+       while (1) {
+               state = rb_entry(node, struct extent_state, rb_node);
+               if (found && (state->start != cur_start ||
+                             (state->state & EXTENT_BOUNDARY))) {
+                       goto out;
+               }
+               if (!(state->state & EXTENT_DELALLOC)) {
+                       if (!found)
+                               *end = state->end;
+                       goto out;
+               }
+               if (!found)
+                       *start = state->start;
+               found++;
+               *end = state->end;
+               cur_start = state->end + 1;
+               node = rb_next(node);
+               if (!node)
+                       break;
+               total_bytes += state->end - state->start + 1;
+               if (total_bytes >= max_bytes)
+                       break;
+       }
+out:
+       spin_unlock(&tree->lock);
+       return found;
+}
+
+static noinline int __unlock_for_delalloc(struct inode *inode,
+                                         struct page *locked_page,
+                                         u64 start, u64 end)
+{
+       int ret;
+       struct page *pages[16];
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+       unsigned long nr_pages = end_index - index + 1;
+       int i;
+
+       if (index == locked_page->index && end_index == index)
+               return 0;
+
+       while (nr_pages > 0) {
+               ret = find_get_pages_contig(inode->i_mapping, index,
+                                    min_t(unsigned long, nr_pages,
+                                    ARRAY_SIZE(pages)), pages);
+               for (i = 0; i < ret; i++) {
+                       if (pages[i] != locked_page)
+                               unlock_page(pages[i]);
+                       page_cache_release(pages[i]);
+               }
+               nr_pages -= ret;
+               index += ret;
+               cond_resched();
+       }
+       return 0;
+}
+
+static noinline int lock_delalloc_pages(struct inode *inode,
+                                       struct page *locked_page,
+                                       u64 delalloc_start,
+                                       u64 delalloc_end)
+{
+       unsigned long index = delalloc_start >> PAGE_CACHE_SHIFT;
+       unsigned long start_index = index;
+       unsigned long end_index = delalloc_end >> PAGE_CACHE_SHIFT;
+       unsigned long pages_locked = 0;
+       struct page *pages[16];
+       unsigned long nrpages;
+       int ret;
+       int i;
+
+       /* the caller is responsible for locking the start index */
+       if (index == locked_page->index && index == end_index)
+               return 0;
+
+       /* skip the page at the start index */
+       nrpages = end_index - index + 1;
+       while (nrpages > 0) {
+               ret = find_get_pages_contig(inode->i_mapping, index,
+                                    min_t(unsigned long,
+                                    nrpages, ARRAY_SIZE(pages)), pages);
+               if (ret == 0) {
+                       ret = -EAGAIN;
+                       goto done;
+               }
+               /* now we have an array of pages, lock them all */
+               for (i = 0; i < ret; i++) {
+                       /*
+                        * the caller is taking responsibility for
+                        * locked_page
+                        */
+                       if (pages[i] != locked_page) {
+                               lock_page(pages[i]);
+                               if (!PageDirty(pages[i]) ||
+                                   pages[i]->mapping != inode->i_mapping) {
+                                       ret = -EAGAIN;
+                                       unlock_page(pages[i]);
+                                       page_cache_release(pages[i]);
+                                       goto done;
+                               }
+                       }
+                       page_cache_release(pages[i]);
+                       pages_locked++;
+               }
+               nrpages -= ret;
+               index += ret;
+               cond_resched();
+       }
+       ret = 0;
+done:
+       if (ret && pages_locked) {
+               __unlock_for_delalloc(inode, locked_page,
+                             delalloc_start,
+                             ((u64)(start_index + pages_locked - 1)) <<
+                             PAGE_CACHE_SHIFT);
+       }
+       return ret;
+}
+
+/*
+ * find a contiguous range of bytes in the file marked as delalloc, not
+ * more than 'max_bytes'.  'start' and 'end' are used to return the range.
+ *
+ * 1 is returned if we find something, 0 if nothing was in the tree
+ */
+static noinline u64 find_lock_delalloc_range(struct inode *inode,
+                                            struct extent_io_tree *tree,
+                                            struct page *locked_page,
+                                            u64 *start, u64 *end,
+                                            u64 max_bytes)
+{
+       u64 delalloc_start;
+       u64 delalloc_end;
+       u64 found;
+       int ret;
+       int loops = 0;
+
+again:
+       /* step one, find a bunch of delalloc bytes starting at start */
+       delalloc_start = *start;
+       delalloc_end = 0;
+       found = find_delalloc_range(tree, &delalloc_start, &delalloc_end,
+                                   max_bytes);
+       if (!found || delalloc_end <= *start) {
+               *start = delalloc_start;
+               *end = delalloc_end;
+               return found;
+       }
+
+       /*
+        * start comes from the offset of locked_page.  We have to lock
+        * pages in order, so we can't process delalloc bytes before
+        * locked_page
+        */
+       if (delalloc_start < *start)
+               delalloc_start = *start;
+
+       /*
+        * make sure to limit the number of pages we try to lock down
+        * if we're looping.
+        */
+       if (delalloc_end + 1 - delalloc_start > max_bytes && loops)
+               delalloc_end = delalloc_start + PAGE_CACHE_SIZE - 1;
+
+       /* step two, lock all the pages after the page that has start */
+       ret = lock_delalloc_pages(inode, locked_page,
+                                 delalloc_start, delalloc_end);
+       if (ret == -EAGAIN) {
+               /* some of the pages are gone, let's avoid looping by
+                * shortening the size of the delalloc range we're searching
+                */
+               if (!loops) {
+                       unsigned long offset = (*start) & (PAGE_CACHE_SIZE - 1);
+                       max_bytes = PAGE_CACHE_SIZE - offset;
+                       loops = 1;
+                       goto again;
+               } else {
+                       found = 0;
+                       goto out_failed;
+               }
+       }
+       BUG_ON(ret);
+
+       /* step three, lock the state bits for the whole range */
+       lock_extent(tree, delalloc_start, delalloc_end, GFP_NOFS);
+
+       /* then test to make sure it is all still delalloc */
+       ret = test_range_bit(tree, delalloc_start, delalloc_end,
+                            EXTENT_DELALLOC, 1);
+       if (!ret) {
+               unlock_extent(tree, delalloc_start, delalloc_end, GFP_NOFS);
+               __unlock_for_delalloc(inode, locked_page,
+                             delalloc_start, delalloc_end);
+               cond_resched();
+               goto again;
+       }
+       *start = delalloc_start;
+       *end = delalloc_end;
+out_failed:
+       return found;
+}
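
The intended caller is the writepage path: it already holds locked_page and asks for the surrounding delalloc run. A hypothetical invocation (sketch; page_start, locked_page, inode and tree assumed from the caller's context):

        u64 delalloc_start = page_start;
        u64 delalloc_end = 0;
        u64 found;

        found = find_lock_delalloc_range(inode, tree, locked_page,
                                         &delalloc_start, &delalloc_end,
                                         128 * 1024 * 1024);
        if (found) {
                /* all pages and state bits in
                 * [delalloc_start, delalloc_end] are now locked */
        }
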
+
+int extent_clear_unlock_delalloc(struct inode *inode,
+                               struct extent_io_tree *tree,
+                               u64 start, u64 end, struct page *locked_page,
+                               int unlock_pages,
+                               int clear_unlock,
+                               int clear_delalloc, int clear_dirty,
+                               int set_writeback,
+                               int end_writeback)
+{
+       int ret;
+       struct page *pages[16];
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+       unsigned long nr_pages = end_index - index + 1;
+       int i;
+       int clear_bits = 0;
+
+       if (clear_unlock)
+               clear_bits |= EXTENT_LOCKED;
+       if (clear_dirty)
+               clear_bits |= EXTENT_DIRTY;
+
+       if (clear_delalloc)
+               clear_bits |= EXTENT_DELALLOC;
+
+       clear_extent_bit(tree, start, end, clear_bits, 1, 0, GFP_NOFS);
+       if (!(unlock_pages || clear_dirty || set_writeback || end_writeback))
+               return 0;
+
+       while (nr_pages > 0) {
+               ret = find_get_pages_contig(inode->i_mapping, index,
+                                    min_t(unsigned long,
+                                    nr_pages, ARRAY_SIZE(pages)), pages);
+               for (i = 0; i < ret; i++) {
+                       if (pages[i] == locked_page) {
+                               page_cache_release(pages[i]);
+                               continue;
+                       }
+                       if (clear_dirty)
+                               clear_page_dirty_for_io(pages[i]);
+                       if (set_writeback)
+                               set_page_writeback(pages[i]);
+                       if (end_writeback)
+                               end_page_writeback(pages[i]);
+                       if (unlock_pages)
+                               unlock_page(pages[i]);
+                       page_cache_release(pages[i]);
+               }
+               nr_pages -= ret;
+               index += ret;
+               cond_resched();
+       }
+       return 0;
+}
+
+/*
+ * count the number of bytes in the tree that have a given bit(s)
+ * set.  This can be fairly slow, except for EXTENT_DIRTY which is
+ * cached.  The total number found is returned.
+ */
+u64 count_range_bits(struct extent_io_tree *tree,
+                    u64 *start, u64 search_end, u64 max_bytes,
+                    unsigned long bits)
+{
+       struct rb_node *node;
+       struct extent_state *state;
+       u64 cur_start = *start;
+       u64 total_bytes = 0;
+       int found = 0;
+
+       if (search_end <= cur_start) {
+               WARN_ON(1);
+               return 0;
+       }
+
+       spin_lock(&tree->lock);
+       if (cur_start == 0 && bits == EXTENT_DIRTY) {
+               total_bytes = tree->dirty_bytes;
+               goto out;
+       }
+       /*
+        * this search will find all the extents that end after
+        * our range starts.
+        */
+       node = tree_search(tree, cur_start);
+       if (!node)
+               goto out;
+
+       while (1) {
+               state = rb_entry(node, struct extent_state, rb_node);
+               if (state->start > search_end)
+                       break;
+               if (state->end >= cur_start && (state->state & bits)) {
+                       total_bytes += min(search_end, state->end) + 1 -
+                                      max(cur_start, state->start);
+                       if (total_bytes >= max_bytes)
+                               break;
+                       if (!found) {
+                               *start = state->start;
+                               found = 1;
+                       }
+               }
+               node = rb_next(node);
+               if (!node)
+                       break;
+       }
+out:
+       spin_unlock(&tree->lock);
+       return total_bytes;
+}
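
A hypothetical query (sketch): total the dirty bytes in the first megabyte, starting past offset zero so the cached dirty_bytes fast path is skipped; on return 'start' points at the first range that was counted:

        u64 start = 4096;
        u64 dirty;

        dirty = count_range_bits(tree, &start, 1024 * 1024 - 1,
                                 (u64)-1, EXTENT_DIRTY);
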
+
+#if 0
+/*
+ * helper function to lock both pages and extents in the tree.
+ * pages must be locked first.
+ */
+static int lock_range(struct extent_io_tree *tree, u64 start, u64 end)
+{
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+       struct page *page;
+       int err;
+
+       while (index <= end_index) {
+               page = grab_cache_page(tree->mapping, index);
+               if (!page) {
+                       err = -ENOMEM;
+                       goto failed;
+               }
+               if (IS_ERR(page)) {
+                       err = PTR_ERR(page);
+                       goto failed;
+               }
+               index++;
+       }
+       lock_extent(tree, start, end, GFP_NOFS);
+       return 0;
+
+failed:
+       /*
+        * we failed above in getting the page at 'index', so we undo here
+        * up to but not including the page at 'index'
+        */
+       end_index = index;
+       index = start >> PAGE_CACHE_SHIFT;
+       while (index < end_index) {
+               page = find_get_page(tree->mapping, index);
+               unlock_page(page);
+               page_cache_release(page);
+               index++;
+       }
+       return err;
+}
+
+/*
+ * helper function to unlock both pages and extents in the tree.
+ */
+static int unlock_range(struct extent_io_tree *tree, u64 start, u64 end)
+{
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       unsigned long end_index = end >> PAGE_CACHE_SHIFT;
+       struct page *page;
+
+       while (index <= end_index) {
+               page = find_get_page(tree->mapping, index);
+               unlock_page(page);
+               page_cache_release(page);
+               index++;
+       }
+       unlock_extent(tree, start, end, GFP_NOFS);
+       return 0;
+}
+#endif
+
+/*
+ * set the private field for a given byte offset in the tree.  If there isn't
+ * an extent_state there already, this does nothing.
+ */
+int set_state_private(struct extent_io_tree *tree, u64 start, u64 private)
+{
+       struct rb_node *node;
+       struct extent_state *state;
+       int ret = 0;
+
+       spin_lock(&tree->lock);
+       /*
+        * this search will find all the extents that end after
+        * our range starts.
+        */
+       node = tree_search(tree, start);
+       if (!node) {
+               ret = -ENOENT;
+               goto out;
+       }
+       state = rb_entry(node, struct extent_state, rb_node);
+       if (state->start != start) {
+               ret = -ENOENT;
+               goto out;
+       }
+       state->private = private;
+out:
+       spin_unlock(&tree->lock);
+       return ret;
+}
+
+int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private)
+{
+       struct rb_node *node;
+       struct extent_state *state;
+       int ret = 0;
+
+       spin_lock(&tree->lock);
+       /*
+        * this search will find all the extents that end after
+        * our range starts.
+        */
+       node = tree_search(tree, start);
+       if (!node) {
+               ret = -ENOENT;
+               goto out;
+       }
+       state = rb_entry(node, struct extent_state, rb_node);
+       if (state->start != start) {
+               ret = -ENOENT;
+               goto out;
+       }
+       *private = state->private;
+out:
+       spin_unlock(&tree->lock);
+       return ret;
+}
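
The private field is one u64 hung off the state record that begins exactly at 'start' (the read path uses it to stash per-block data such as an expected checksum). A hypothetical roundtrip, assuming a state record starts at 'start':

        u64 stored = 0;

        if (!set_state_private(tree, start, 0xdeadbeefULL))
                get_state_private(tree, start, &stored);
        /* stored == 0xdeadbeef; both calls return -ENOENT when no
         * state record begins exactly at start */
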
+
+/*
+ * searches a range in the state tree for a given mask.
+ * If 'filled' == 1, this returns 1 only if every extent in the tree
+ * has the bits set.  Otherwise, 1 is returned if any bit in the
+ * range is found set.
+ */
+int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
+                  int bits, int filled)
+{
+       struct extent_state *state = NULL;
+       struct rb_node *node;
+       int bitset = 0;
+
+       spin_lock(&tree->lock);
+       node = tree_search(tree, start);
+       while (node && start <= end) {
+               state = rb_entry(node, struct extent_state, rb_node);
+
+               if (filled && state->start > start) {
+                       bitset = 0;
+                       break;
+               }
+
+               if (state->start > end)
+                       break;
+
+               if (state->state & bits) {
+                       bitset = 1;
+                       if (!filled)
+                               break;
+               } else if (filled) {
+                       bitset = 0;
+                       break;
+               }
+               start = state->end + 1;
+               if (start > end)
+                       break;
+               node = rb_next(node);
+               if (!node) {
+                       if (filled)
+                               bitset = 0;
+                       break;
+               }
+       }
+       spin_unlock(&tree->lock);
+       return bitset;
+}
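
The filled flag flips the quantifier from "any" to "all". Given, hypothetically, [0, 4095] dirty and [4096, 8191] clean:

        int any = test_range_bit(tree, 0, 8191, EXTENT_DIRTY, 0);
        int all = test_range_bit(tree, 0, 8191, EXTENT_DIRTY, 1);
        /* with the layout above: any == 1, all == 0 */
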
+
+/*
+ * helper function to set a given page up to date if all the
+ * extents in the tree for that page are up to date
+ */
+static int check_page_uptodate(struct extent_io_tree *tree,
+                              struct page *page)
+{
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 end = start + PAGE_CACHE_SIZE - 1;
+       if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1))
+               SetPageUptodate(page);
+       return 0;
+}
+
+/*
+ * helper function to unlock a page if all the extents in the tree
+ * for that page are unlocked
+ */
+static int check_page_locked(struct extent_io_tree *tree,
+                            struct page *page)
+{
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 end = start + PAGE_CACHE_SIZE - 1;
+       if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0))
+               unlock_page(page);
+       return 0;
+}
+
+/*
+ * helper function to end page writeback if all the extents
+ * in the tree for that page are done with writeback
+ */
+static int check_page_writeback(struct extent_io_tree *tree,
+                            struct page *page)
+{
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 end = start + PAGE_CACHE_SIZE - 1;
+       if (!test_range_bit(tree, start, end, EXTENT_WRITEBACK, 0))
+               end_page_writeback(page);
+       return 0;
+}
+
+/* lots and lots of room for performance fixes in the end_bio funcs */
+
+/*
+ * after a writepage IO is done, we need to:
+ * clear the uptodate bits on error
+ * clear the writeback bits in the extent tree for this IO
+ * end_page_writeback if the page has no more pending IO
+ *
+ * Scheduling is not allowed, so the extent state tree is expected
+ * to have one and only one object corresponding to this IO.
+ */
+static void end_bio_extent_writepage(struct bio *bio, int err)
+{
+       int uptodate = err == 0;
+       struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+       struct extent_io_tree *tree;
+       u64 start;
+       u64 end;
+       int whole_page;
+       int ret;
+
+       do {
+               struct page *page = bvec->bv_page;
+               tree = &BTRFS_I(page->mapping->host)->io_tree;
+
+               start = ((u64)page->index << PAGE_CACHE_SHIFT) +
+                        bvec->bv_offset;
+               end = start + bvec->bv_len - 1;
+
+               if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE)
+                       whole_page = 1;
+               else
+                       whole_page = 0;
+
+               /* the loop walks the bio's pages from last to first;
+                * prefetch the next (earlier) page's flags before we
+                * touch it */
+               if (--bvec >= bio->bi_io_vec)
+                       prefetchw(&bvec->bv_page->flags);
+               if (tree->ops && tree->ops->writepage_end_io_hook) {
+                       ret = tree->ops->writepage_end_io_hook(page, start,
+                                                      end, NULL, uptodate);
+                       if (ret)
+                               uptodate = 0;
+               }
+
+               if (!uptodate && tree->ops &&
+                   tree->ops->writepage_io_failed_hook) {
+                       ret = tree->ops->writepage_io_failed_hook(bio, page,
+                                                        start, end, NULL);
+                       if (ret == 0) {
+                               uptodate = (err == 0);
+                               continue;
+                       }
+               }
+
+               if (!uptodate) {
+                       clear_extent_uptodate(tree, start, end, GFP_ATOMIC);
+                       ClearPageUptodate(page);
+                       SetPageError(page);
+               }
+
+               clear_extent_writeback(tree, start, end, GFP_ATOMIC);
+
+               if (whole_page)
+                       end_page_writeback(page);
+               else
+                       check_page_writeback(tree, page);
+       } while (bvec >= bio->bi_io_vec);
+
+       bio_put(bio);
+}
+
+/*
+ * after a readpage IO is done, we need to:
+ * clear the uptodate bits on error
+ * set the uptodate bits if things worked
+ * set the page up to date if all extents in the tree are uptodate
+ * clear the lock bit in the extent tree
+ * unlock the page if there are no other extents locked for it
+ *
+ * Scheduling is not allowed, so the extent state tree is expected
+ * to have one and only one object corresponding to this IO.
+ */
+static void end_bio_extent_readpage(struct bio *bio, int err)
+{
+       int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+       struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+       struct extent_io_tree *tree;
+       u64 start;
+       u64 end;
+       int whole_page;
+       int ret;
+
+       if (err)
+               uptodate = 0;
+
+       do {
+               struct page *page = bvec->bv_page;
+               tree = &BTRFS_I(page->mapping->host)->io_tree;
+
+               start = ((u64)page->index << PAGE_CACHE_SHIFT) +
+                       bvec->bv_offset;
+               end = start + bvec->bv_len - 1;
+
+               if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE)
+                       whole_page = 1;
+               else
+                       whole_page = 0;
+
+               if (--bvec >= bio->bi_io_vec)
+                       prefetchw(&bvec->bv_page->flags);
+
+               if (uptodate && tree->ops && tree->ops->readpage_end_io_hook) {
+                       ret = tree->ops->readpage_end_io_hook(page, start, end,
+                                                             NULL);
+                       if (ret)
+                               uptodate = 0;
+               }
+               if (!uptodate && tree->ops &&
+                   tree->ops->readpage_io_failed_hook) {
+                       ret = tree->ops->readpage_io_failed_hook(bio, page,
+                                                        start, end, NULL);
+                       if (ret == 0) {
+                               uptodate =
+                                       test_bit(BIO_UPTODATE, &bio->bi_flags);
+                               if (err)
+                                       uptodate = 0;
+                               continue;
+                       }
+               }
+
+               if (uptodate) {
+                       set_extent_uptodate(tree, start, end,
+                                           GFP_ATOMIC);
+               }
+               unlock_extent(tree, start, end, GFP_ATOMIC);
+
+               if (whole_page) {
+                       if (uptodate) {
+                               SetPageUptodate(page);
+                       } else {
+                               ClearPageUptodate(page);
+                               SetPageError(page);
+                       }
+                       unlock_page(page);
+               } else {
+                       if (uptodate) {
+                               check_page_uptodate(tree, page);
+                       } else {
+                               ClearPageUptodate(page);
+                               SetPageError(page);
+                       }
+                       check_page_locked(tree, page);
+               }
+       } while (bvec >= bio->bi_io_vec);
+
+       bio_put(bio);
+}
+
+/*
+ * IO done from prepare_write is pretty simple: we just unlock
+ * the structs in the extent tree when done, and set the uptodate bits
+ * as appropriate.
+ */
+static void end_bio_extent_preparewrite(struct bio *bio, int err)
+{
+       const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+       struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+       struct extent_io_tree *tree;
+       u64 start;
+       u64 end;
+
+       do {
+               struct page *page = bvec->bv_page;
+               tree = &BTRFS_I(page->mapping->host)->io_tree;
+
+               start = ((u64)page->index << PAGE_CACHE_SHIFT) +
+                       bvec->bv_offset;
+               end = start + bvec->bv_len - 1;
+
+               if (--bvec >= bio->bi_io_vec)
+                       prefetchw(&bvec->bv_page->flags);
+
+               if (uptodate) {
+                       set_extent_uptodate(tree, start, end, GFP_ATOMIC);
+               } else {
+                       ClearPageUptodate(page);
+                       SetPageError(page);
+               }
+
+               unlock_extent(tree, start, end, GFP_ATOMIC);
+
+       } while (bvec >= bio->bi_io_vec);
+
+       bio_put(bio);
+}
+
+static struct bio *
+extent_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
+                gfp_t gfp_flags)
+{
+       struct bio *bio;
+
+       bio = bio_alloc(gfp_flags, nr_vecs);
+
+       if (bio == NULL && (current->flags & PF_MEMALLOC)) {
+               while (!bio && (nr_vecs /= 2))
+                       bio = bio_alloc(gfp_flags, nr_vecs);
+       }
+
+       if (bio) {
+               bio->bi_size = 0;
+               bio->bi_bdev = bdev;
+               bio->bi_sector = first_sector;
+       }
+       return bio;
+}
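+
+/*
+ * Note the fallback above: if the caller is itself a memory allocator
+ * (PF_MEMALLOC) and bio_alloc() fails, nr_vecs is halved and the
+ * allocation retried until it succeeds or nr_vecs reaches zero.  A
+ * request for 256 vecs under pressure would retry with 128, 64, 32,
+ * 16, 8, 4, 2 and finally 1 before giving up.
+ */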
+
+static int submit_one_bio(int rw, struct bio *bio, int mirror_num,
+                         unsigned long bio_flags)
+{
+       int ret = 0;
+       struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
+       struct page *page = bvec->bv_page;
+       struct extent_io_tree *tree = bio->bi_private;
+       u64 start;
+       u64 end;
+
+       start = ((u64)page->index << PAGE_CACHE_SHIFT) + bvec->bv_offset;
+       end = start + bvec->bv_len - 1;
+
+       bio->bi_private = NULL;
+
+       bio_get(bio);
+
+       if (tree->ops && tree->ops->submit_bio_hook)
+               tree->ops->submit_bio_hook(page->mapping->host, rw, bio,
+                                          mirror_num, bio_flags);
+       else
+               submit_bio(rw, bio);
+       if (bio_flagged(bio, BIO_EOPNOTSUPP))
+               ret = -EOPNOTSUPP;
+       bio_put(bio);
+       return ret;
+}
+
+static int submit_extent_page(int rw, struct extent_io_tree *tree,
+                             struct page *page, sector_t sector,
+                             size_t size, unsigned long offset,
+                             struct block_device *bdev,
+                             struct bio **bio_ret,
+                             unsigned long max_pages,
+                             bio_end_io_t end_io_func,
+                             int mirror_num,
+                             unsigned long prev_bio_flags,
+                             unsigned long bio_flags)
+{
+       int ret = 0;
+       struct bio *bio;
+       int nr;
+       int contig = 0;
+       int this_compressed = bio_flags & EXTENT_BIO_COMPRESSED;
+       int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED;
+       size_t page_size = min_t(size_t, size, PAGE_CACHE_SIZE);
+
+       if (bio_ret && *bio_ret) {
+               bio = *bio_ret;
+               if (old_compressed)
+                       contig = bio->bi_sector == sector;
+               else
+                       contig = bio->bi_sector + (bio->bi_size >> 9) ==
+                               sector;
+
+               if (prev_bio_flags != bio_flags || !contig ||
+                   (tree->ops && tree->ops->merge_bio_hook &&
+                    tree->ops->merge_bio_hook(page, offset, page_size, bio,
+                                              bio_flags)) ||
+                   bio_add_page(bio, page, page_size, offset) < page_size) {
+                       ret = submit_one_bio(rw, bio, mirror_num,
+                                            prev_bio_flags);
+                       bio = NULL;
+               } else {
+                       return 0;
+               }
+       }
+       if (this_compressed)
+               nr = BIO_MAX_PAGES;
+       else
+               nr = bio_get_nr_vecs(bdev);
+
+       bio = extent_bio_alloc(bdev, sector, nr, GFP_NOFS | __GFP_HIGH);
+       if (!bio)
+               return -ENOMEM;
+
+       bio_add_page(bio, page, page_size, offset);
+       bio->bi_end_io = end_io_func;
+       bio->bi_private = tree;
+
+       if (bio_ret)
+               *bio_ret = bio;
+       else
+               ret = submit_one_bio(rw, bio, mirror_num, bio_flags);
+
+       return ret;
+}
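+
+/*
+ * A worked example of the contiguity test above: an uncompressed bio
+ * pending at sector 1000 that already holds 8192 bytes spans 16 512-byte
+ * sectors, so bi_sector + (bi_size >> 9) == 1016, and only a page mapped
+ * at sector 1016 can be merged into it.  Any mismatch in bio_flags, a
+ * veto from merge_bio_hook, or a failed bio_add_page() forces the pending
+ * bio to be submitted and a fresh one started.
+ */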
+
+void set_page_extent_mapped(struct page *page)
+{
+       if (!PagePrivate(page)) {
+               SetPagePrivate(page);
+               page_cache_get(page);
+               set_page_private(page, EXTENT_PAGE_PRIVATE);
+       }
+}
+
+static void set_page_extent_head(struct page *page, unsigned long len)
+{
+       set_page_private(page, EXTENT_PAGE_PRIVATE_FIRST_PAGE | len << 2);
+}
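+
+/*
+ * The head page's private word packs a marker and the buffer length
+ * (shifted left by two).  Assuming the constants from extent_io.h in this
+ * series (EXTENT_PAGE_PRIVATE_FIRST_PAGE == 3), a 16k extent buffer would
+ * be tagged as 3 | (16384 << 2) == 65539.
+ */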
+
+/*
+ * basic readpage implementation.  Locked extent state structs are inserted
+ * into the tree and removed when the IO is done (by the end_io
+ * handlers)
+ */
+static int __extent_read_full_page(struct extent_io_tree *tree,
+                                  struct page *page,
+                                  get_extent_t *get_extent,
+                                  struct bio **bio, int mirror_num,
+                                  unsigned long *bio_flags)
+{
+       struct inode *inode = page->mapping->host;
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 page_end = start + PAGE_CACHE_SIZE - 1;
+       u64 end;
+       u64 cur = start;
+       u64 extent_offset;
+       u64 last_byte = i_size_read(inode);
+       u64 block_start;
+       u64 cur_end;
+       sector_t sector;
+       struct extent_map *em;
+       struct block_device *bdev;
+       int ret;
+       int nr = 0;
+       size_t page_offset = 0;
+       size_t iosize;
+       size_t disk_io_size;
+       size_t blocksize = inode->i_sb->s_blocksize;
+       unsigned long this_bio_flag = 0;
+
+       set_page_extent_mapped(page);
+
+       end = page_end;
+       lock_extent(tree, start, end, GFP_NOFS);
+
+       if (page->index == last_byte >> PAGE_CACHE_SHIFT) {
+               char *userpage;
+               size_t zero_offset = last_byte & (PAGE_CACHE_SIZE - 1);
+
+               if (zero_offset) {
+                       iosize = PAGE_CACHE_SIZE - zero_offset;
+                       userpage = kmap_atomic(page, KM_USER0);
+                       memset(userpage + zero_offset, 0, iosize);
+                       flush_dcache_page(page);
+                       kunmap_atomic(userpage, KM_USER0);
+               }
+       }
+       while (cur <= end) {
+               if (cur >= last_byte) {
+                       char *userpage;
+                       iosize = PAGE_CACHE_SIZE - page_offset;
+                       userpage = kmap_atomic(page, KM_USER0);
+                       memset(userpage + page_offset, 0, iosize);
+                       flush_dcache_page(page);
+                       kunmap_atomic(userpage, KM_USER0);
+                       set_extent_uptodate(tree, cur, cur + iosize - 1,
+                                           GFP_NOFS);
+                       unlock_extent(tree, cur, cur + iosize - 1, GFP_NOFS);
+                       break;
+               }
+               em = get_extent(inode, page, page_offset, cur,
+                               end - cur + 1, 0);
+               if (IS_ERR(em) || !em) {
+                       SetPageError(page);
+                       unlock_extent(tree, cur, end, GFP_NOFS);
+                       break;
+               }
+               extent_offset = cur - em->start;
+               BUG_ON(extent_map_end(em) <= cur);
+               BUG_ON(end < cur);
+
+               if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
+                       this_bio_flag = EXTENT_BIO_COMPRESSED;
+
+               iosize = min(extent_map_end(em) - cur, end - cur + 1);
+               cur_end = min(extent_map_end(em) - 1, end);
+               iosize = (iosize + blocksize - 1) & ~((u64)blocksize - 1);
+               if (this_bio_flag & EXTENT_BIO_COMPRESSED) {
+                       disk_io_size = em->block_len;
+                       sector = em->block_start >> 9;
+               } else {
+                       sector = (em->block_start + extent_offset) >> 9;
+                       disk_io_size = iosize;
+               }
+               bdev = em->bdev;
+               block_start = em->block_start;
+               if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
+                       block_start = EXTENT_MAP_HOLE;
+               free_extent_map(em);
+               em = NULL;
+
+               /* we've found a hole, just zero and go on */
+               if (block_start == EXTENT_MAP_HOLE) {
+                       char *userpage;
+                       userpage = kmap_atomic(page, KM_USER0);
+                       memset(userpage + page_offset, 0, iosize);
+                       flush_dcache_page(page);
+                       kunmap_atomic(userpage, KM_USER0);
+
+                       set_extent_uptodate(tree, cur, cur + iosize - 1,
+                                           GFP_NOFS);
+                       unlock_extent(tree, cur, cur + iosize - 1, GFP_NOFS);
+                       cur = cur + iosize;
+                       page_offset += iosize;
+                       continue;
+               }
+               /* the get_extent function already copied into the page */
+               if (test_range_bit(tree, cur, cur_end, EXTENT_UPTODATE, 1)) {
+                       check_page_uptodate(tree, page);
+                       unlock_extent(tree, cur, cur + iosize - 1, GFP_NOFS);
+                       cur = cur + iosize;
+                       page_offset += iosize;
+                       continue;
+               }
+               /* we have an inline extent but it didn't get marked up
+                * to date.  Error out
+                */
+               if (block_start == EXTENT_MAP_INLINE) {
+                       SetPageError(page);
+                       unlock_extent(tree, cur, cur + iosize - 1, GFP_NOFS);
+                       cur = cur + iosize;
+                       page_offset += iosize;
+                       continue;
+               }
+
+               ret = 0;
+               if (tree->ops && tree->ops->readpage_io_hook) {
+                       ret = tree->ops->readpage_io_hook(page, cur,
+                                                         cur + iosize - 1);
+               }
+               if (!ret) {
+                       unsigned long pnr = (last_byte >> PAGE_CACHE_SHIFT) + 1;
+                       pnr -= page->index;
+                       ret = submit_extent_page(READ, tree, page,
+                                        sector, disk_io_size, page_offset,
+                                        bdev, bio, pnr,
+                                        end_bio_extent_readpage, mirror_num,
+                                        *bio_flags,
+                                        this_bio_flag);
+                       nr++;
+                       *bio_flags = this_bio_flag;
+               }
+               if (ret)
+                       SetPageError(page);
+               cur = cur + iosize;
+               page_offset += iosize;
+       }
+       if (!nr) {
+               if (!PageError(page))
+                       SetPageUptodate(page);
+               unlock_page(page);
+       }
+       return 0;
+}
+
+int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
+                           get_extent_t *get_extent)
+{
+       struct bio *bio = NULL;
+       unsigned long bio_flags = 0;
+       int ret;
+
+       ret = __extent_read_full_page(tree, page, get_extent, &bio, 0,
+                                     &bio_flags);
+       if (bio)
+               submit_one_bio(READ, bio, 0, bio_flags);
+       return ret;
+}
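+
+/*
+ * A filesystem's ->readpage op is a thin wrapper around the function
+ * above.  A sketch, modeled on the btrfs caller in inode.c (which supplies
+ * btrfs_get_extent as its get_extent_t):
+ *
+ *	static int btrfs_readpage(struct file *file, struct page *page)
+ *	{
+ *		struct extent_io_tree *tree;
+ *		tree = &BTRFS_I(page->mapping->host)->io_tree;
+ *		return extent_read_full_page(tree, page, btrfs_get_extent);
+ *	}
+ */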
+
+/*
+ * the writepage semantics are similar to regular writepage.  extent
+ * records are inserted to lock ranges in the tree, and as dirty areas
+ * are found, they are marked writeback.  Then the lock bits are removed
+ * and the end_io handler clears the writeback ranges
+ */
+static int __extent_writepage(struct page *page, struct writeback_control *wbc,
+                             void *data)
+{
+       struct inode *inode = page->mapping->host;
+       struct extent_page_data *epd = data;
+       struct extent_io_tree *tree = epd->tree;
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 delalloc_start;
+       u64 page_end = start + PAGE_CACHE_SIZE - 1;
+       u64 end;
+       u64 cur = start;
+       u64 extent_offset;
+       u64 last_byte = i_size_read(inode);
+       u64 block_start;
+       u64 iosize;
+       u64 unlock_start;
+       sector_t sector;
+       struct extent_map *em;
+       struct block_device *bdev;
+       int ret;
+       int nr = 0;
+       size_t pg_offset = 0;
+       size_t blocksize;
+       loff_t i_size = i_size_read(inode);
+       unsigned long end_index = i_size >> PAGE_CACHE_SHIFT;
+       u64 nr_delalloc;
+       u64 delalloc_end;
+       int page_started;
+       int compressed;
+       unsigned long nr_written = 0;
+
+       WARN_ON(!PageLocked(page));
+       pg_offset = i_size & (PAGE_CACHE_SIZE - 1);
+       if (page->index > end_index ||
+          (page->index == end_index && !pg_offset)) {
+               page->mapping->a_ops->invalidatepage(page, 0);
+               unlock_page(page);
+               return 0;
+       }
+
+       if (page->index == end_index) {
+               char *userpage;
+
+               userpage = kmap_atomic(page, KM_USER0);
+               memset(userpage + pg_offset, 0,
+                      PAGE_CACHE_SIZE - pg_offset);
+               kunmap_atomic(userpage, KM_USER0);
+               flush_dcache_page(page);
+       }
+       pg_offset = 0;
+
+       set_page_extent_mapped(page);
+
+       delalloc_start = start;
+       delalloc_end = 0;
+       page_started = 0;
+       if (!epd->extent_locked) {
+               while (delalloc_end < page_end) {
+                       nr_delalloc = find_lock_delalloc_range(inode, tree,
+                                                      page,
+                                                      &delalloc_start,
+                                                      &delalloc_end,
+                                                      128 * 1024 * 1024);
+                       if (nr_delalloc == 0) {
+                               delalloc_start = delalloc_end + 1;
+                               continue;
+                       }
+                       tree->ops->fill_delalloc(inode, page, delalloc_start,
+                                                delalloc_end, &page_started,
+                                                &nr_written);
+                       delalloc_start = delalloc_end + 1;
+               }
+
+               /* did the fill delalloc function already unlock and start
+                * the IO?
+                */
+               if (page_started) {
+                       ret = 0;
+                       goto update_nr_written;
+               }
+       }
+       lock_extent(tree, start, page_end, GFP_NOFS);
+
+       unlock_start = start;
+
+       if (tree->ops && tree->ops->writepage_start_hook) {
+               ret = tree->ops->writepage_start_hook(page, start,
+                                                     page_end);
+               if (ret == -EAGAIN) {
+                       unlock_extent(tree, start, page_end, GFP_NOFS);
+                       redirty_page_for_writepage(wbc, page);
+                       unlock_page(page);
+                       ret = 0;
+                       goto update_nr_written;
+               }
+       }
+
+       nr_written++;
+
+       end = page_end;
+       if (test_range_bit(tree, start, page_end, EXTENT_DELALLOC, 0))
+               printk(KERN_ERR "btrfs delalloc bits after lock_extent\n");
+
+       if (last_byte <= start) {
+               clear_extent_dirty(tree, start, page_end, GFP_NOFS);
+               unlock_extent(tree, start, page_end, GFP_NOFS);
+               if (tree->ops && tree->ops->writepage_end_io_hook)
+                       tree->ops->writepage_end_io_hook(page, start,
+                                                        page_end, NULL, 1);
+               unlock_start = page_end + 1;
+               goto done;
+       }
+
+       set_extent_uptodate(tree, start, page_end, GFP_NOFS);
+       blocksize = inode->i_sb->s_blocksize;
+
+       while (cur <= end) {
+               if (cur >= last_byte) {
+                       clear_extent_dirty(tree, cur, page_end, GFP_NOFS);
+                       unlock_extent(tree, unlock_start, page_end, GFP_NOFS);
+                       if (tree->ops && tree->ops->writepage_end_io_hook)
+                               tree->ops->writepage_end_io_hook(page, cur,
+                                                        page_end, NULL, 1);
+                       unlock_start = page_end + 1;
+                       break;
+               }
+               em = epd->get_extent(inode, page, pg_offset, cur,
+                                    end - cur + 1, 1);
+               if (IS_ERR(em) || !em) {
+                       SetPageError(page);
+                       break;
+               }
+
+               extent_offset = cur - em->start;
+               BUG_ON(extent_map_end(em) <= cur);
+               BUG_ON(end < cur);
+               iosize = min(extent_map_end(em) - cur, end - cur + 1);
+               iosize = (iosize + blocksize - 1) & ~((u64)blocksize - 1);
+               sector = (em->block_start + extent_offset) >> 9;
+               bdev = em->bdev;
+               block_start = em->block_start;
+               compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
+               free_extent_map(em);
+               em = NULL;
+
+               /*
+                * compressed and inline extents are written through other
+                * paths in the FS
+                */
+               if (compressed || block_start == EXTENT_MAP_HOLE ||
+                   block_start == EXTENT_MAP_INLINE) {
+                       clear_extent_dirty(tree, cur,
+                                          cur + iosize - 1, GFP_NOFS);
+
+                       unlock_extent(tree, unlock_start, cur + iosize - 1,
+                                     GFP_NOFS);
+
+                       /*
+                        * end_io notification does not happen here for
+                        * compressed extents
+                        */
+                       if (!compressed && tree->ops &&
+                           tree->ops->writepage_end_io_hook)
+                               tree->ops->writepage_end_io_hook(page, cur,
+                                                        cur + iosize - 1,
+                                                        NULL, 1);
+                       else if (compressed) {
+                               /* we don't want to end_page_writeback on
+                                * a compressed extent.  this happens
+                                * elsewhere
+                                */
+                               nr++;
+                       }
+
+                       cur += iosize;
+                       pg_offset += iosize;
+                       unlock_start = cur;
+                       continue;
+               }
+               /* leave this out until we have a page_mkwrite call */
+               if (0 && !test_range_bit(tree, cur, cur + iosize - 1,
+                                  EXTENT_DIRTY, 0)) {
+                       cur = cur + iosize;
+                       pg_offset += iosize;
+                       continue;
+               }
+
+               clear_extent_dirty(tree, cur, cur + iosize - 1, GFP_NOFS);
+               if (tree->ops && tree->ops->writepage_io_hook) {
+                       ret = tree->ops->writepage_io_hook(page, cur,
+                                               cur + iosize - 1);
+               } else {
+                       ret = 0;
+               }
+               if (ret) {
+                       SetPageError(page);
+               } else {
+                       unsigned long max_nr = end_index + 1;
+
+                       set_range_writeback(tree, cur, cur + iosize - 1);
+                       if (!PageWriteback(page)) {
+                               printk(KERN_ERR "btrfs warning page %lu not "
+                                      "writeback, cur %llu end %llu\n",
+                                      page->index, (unsigned long long)cur,
+                                      (unsigned long long)end);
+                       }
+
+                       ret = submit_extent_page(WRITE, tree, page, sector,
+                                                iosize, pg_offset, bdev,
+                                                &epd->bio, max_nr,
+                                                end_bio_extent_writepage,
+                                                0, 0, 0);
+                       if (ret)
+                               SetPageError(page);
+               }
+               cur = cur + iosize;
+               pg_offset += iosize;
+               nr++;
+       }
+done:
+       if (nr == 0) {
+               /* make sure the mapping tag for page dirty gets cleared */
+               set_page_writeback(page);
+               end_page_writeback(page);
+       }
+       if (unlock_start <= page_end)
+               unlock_extent(tree, unlock_start, page_end, GFP_NOFS);
+       unlock_page(page);
+
+update_nr_written:
+       wbc->nr_to_write -= nr_written;
+       if (wbc->range_cyclic || (wbc->nr_to_write > 0 &&
+           wbc->range_start == 0 && wbc->range_end == LLONG_MAX))
+               page->mapping->writeback_index = page->index + nr_written;
+       return 0;
+}
+
+/**
+ * extent_write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
+ * @tree: the extent io tree of the address space
+ * @mapping: address space structure to write
+ * @wbc: subtract the number of written pages from *@wbc->nr_to_write
+ * @writepage: function called for each page
+ * @data: data passed to writepage function
+ * @flush_fn: called to submit any pending bio before waiting on a page
+ * under writeback
+ *
+ * If a page is already under I/O, extent_write_cache_pages() skips it, even
+ * if it's dirty.  This is desirable behaviour for memory-cleaning writeback,
+ * but it is INCORRECT for data-integrity system calls such as fsync().
+ * fsync() and msync() need to guarantee that all the data which was dirty
+ * at the time the call was made gets new I/O started against it.  If
+ * wbc->sync_mode is WB_SYNC_ALL then we were called for data integrity and
+ * we must wait for existing IO to complete.
+ */
+static int extent_write_cache_pages(struct extent_io_tree *tree,
+                            struct address_space *mapping,
+                            struct writeback_control *wbc,
+                            writepage_t writepage, void *data,
+                            void (*flush_fn)(void *))
+{
+       struct backing_dev_info *bdi = mapping->backing_dev_info;
+       int ret = 0;
+       int done = 0;
+       struct pagevec pvec;
+       int nr_pages;
+       pgoff_t index;
+       pgoff_t end;            /* Inclusive */
+       int scanned = 0;
+       int range_whole = 0;
+
+       if (wbc->nonblocking && bdi_write_congested(bdi)) {
+               wbc->encountered_congestion = 1;
+               return 0;
+       }
+
+       pagevec_init(&pvec, 0);
+       if (wbc->range_cyclic) {
+               index = mapping->writeback_index; /* Start from prev offset */
+               end = -1;
+       } else {
+               index = wbc->range_start >> PAGE_CACHE_SHIFT;
+               end = wbc->range_end >> PAGE_CACHE_SHIFT;
+               if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+                       range_whole = 1;
+               scanned = 1;
+       }
+retry:
+       while (!done && (index <= end) &&
+              (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+                             PAGECACHE_TAG_DIRTY, min(end - index,
+                                 (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+               unsigned i;
+
+               scanned = 1;
+               for (i = 0; i < nr_pages; i++) {
+                       struct page *page = pvec.pages[i];
+
+                       /*
+                        * At this point we hold neither mapping->tree_lock nor
+                        * lock on the page itself: the page may be truncated or
+                        * invalidated (changing page->mapping to NULL), or even
+                        * swizzled back from swapper_space to tmpfs file
+                        * mapping
+                        */
+                       if (tree->ops && tree->ops->write_cache_pages_lock_hook)
+                               tree->ops->write_cache_pages_lock_hook(page);
+                       else
+                               lock_page(page);
+
+                       if (unlikely(page->mapping != mapping)) {
+                               unlock_page(page);
+                               continue;
+                       }
+
+                       if (!wbc->range_cyclic && page->index > end) {
+                               done = 1;
+                               unlock_page(page);
+                               continue;
+                       }
+
+                       if (wbc->sync_mode != WB_SYNC_NONE) {
+                               if (PageWriteback(page))
+                                       flush_fn(data);
+                               wait_on_page_writeback(page);
+                       }
+
+                       if (PageWriteback(page) ||
+                           !clear_page_dirty_for_io(page)) {
+                               unlock_page(page);
+                               continue;
+                       }
+
+                       ret = (*writepage)(page, wbc, data);
+
+                       if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
+                               unlock_page(page);
+                               ret = 0;
+                       }
+                       if (ret || wbc->nr_to_write <= 0)
+                               done = 1;
+                       if (wbc->nonblocking && bdi_write_congested(bdi)) {
+                               wbc->encountered_congestion = 1;
+                               done = 1;
+                       }
+               }
+               pagevec_release(&pvec);
+               cond_resched();
+       }
+       if (!scanned && !done) {
+               /*
+                * We hit the last page and there is more work to be done: wrap
+                * back to the start of the file
+                */
+               scanned = 1;
+               index = 0;
+               goto retry;
+       }
+       return ret;
+}
+
+static noinline void flush_write_bio(void *data)
+{
+       struct extent_page_data *epd = data;
+       if (epd->bio) {
+               submit_one_bio(WRITE, epd->bio, 0, 0);
+               epd->bio = NULL;
+       }
+}
+
+int extent_write_full_page(struct extent_io_tree *tree, struct page *page,
+                         get_extent_t *get_extent,
+                         struct writeback_control *wbc)
+{
+       int ret;
+       struct address_space *mapping = page->mapping;
+       struct extent_page_data epd = {
+               .bio = NULL,
+               .tree = tree,
+               .get_extent = get_extent,
+               .extent_locked = 0,
+       };
+       struct writeback_control wbc_writepages = {
+               .bdi            = wbc->bdi,
+               .sync_mode      = WB_SYNC_NONE,
+               .older_than_this = NULL,
+               .nr_to_write    = 64,
+               .range_start    = page_offset(page) + PAGE_CACHE_SIZE,
+               .range_end      = (loff_t)-1,
+       };
+
+       ret = __extent_writepage(page, wbc, &epd);
+
+       extent_write_cache_pages(tree, mapping, &wbc_writepages,
+                                __extent_writepage, &epd, flush_write_bio);
+       if (epd.bio)
+               submit_one_bio(WRITE, epd.bio, 0, 0);
+       return ret;
+}
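+
+/*
+ * Note the second writeback_control above: after writing the page it was
+ * asked for, extent_write_full_page() opportunistically writes up to 64
+ * more dirty pages starting just past that page (range_start is
+ * page_offset(page) + PAGE_CACHE_SIZE), so nearby IO can be batched into
+ * the same bio.
+ */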
+
+int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode,
+                             u64 start, u64 end, get_extent_t *get_extent,
+                             int mode)
+{
+       int ret = 0;
+       struct address_space *mapping = inode->i_mapping;
+       struct page *page;
+       unsigned long nr_pages = (end - start + PAGE_CACHE_SIZE) >>
+               PAGE_CACHE_SHIFT;
+
+       struct extent_page_data epd = {
+               .bio = NULL,
+               .tree = tree,
+               .get_extent = get_extent,
+               .extent_locked = 1,
+       };
+       struct writeback_control wbc_writepages = {
+               .bdi            = inode->i_mapping->backing_dev_info,
+               .sync_mode      = mode,
+               .older_than_this = NULL,
+               .nr_to_write    = nr_pages * 2,
+               .range_start    = start,
+               .range_end      = end + 1,
+       };
+
+       while (start <= end) {
+               page = find_get_page(mapping, start >> PAGE_CACHE_SHIFT);
+               if (clear_page_dirty_for_io(page))
+                       ret = __extent_writepage(page, &wbc_writepages, &epd);
+               else {
+                       if (tree->ops && tree->ops->writepage_end_io_hook)
+                               tree->ops->writepage_end_io_hook(page, start,
+                                                start + PAGE_CACHE_SIZE - 1,
+                                                NULL, 1);
+                       unlock_page(page);
+               }
+               page_cache_release(page);
+               start += PAGE_CACHE_SIZE;
+       }
+
+       if (epd.bio)
+               submit_one_bio(WRITE, epd.bio, 0, 0);
+       return ret;
+}
+
+int extent_writepages(struct extent_io_tree *tree,
+                     struct address_space *mapping,
+                     get_extent_t *get_extent,
+                     struct writeback_control *wbc)
+{
+       int ret = 0;
+       struct extent_page_data epd = {
+               .bio = NULL,
+               .tree = tree,
+               .get_extent = get_extent,
+               .extent_locked = 0,
+       };
+
+       ret = extent_write_cache_pages(tree, mapping, wbc,
+                                      __extent_writepage, &epd,
+                                      flush_write_bio);
+       if (epd.bio)
+               submit_one_bio(WRITE, epd.bio, 0, 0);
+       return ret;
+}
+
+int extent_readpages(struct extent_io_tree *tree,
+                    struct address_space *mapping,
+                    struct list_head *pages, unsigned nr_pages,
+                    get_extent_t get_extent)
+{
+       struct bio *bio = NULL;
+       unsigned page_idx;
+       struct pagevec pvec;
+       unsigned long bio_flags = 0;
+
+       pagevec_init(&pvec, 0);
+       for (page_idx = 0; page_idx < nr_pages; page_idx++) {
+               struct page *page = list_entry(pages->prev, struct page, lru);
+
+               prefetchw(&page->flags);
+               list_del(&page->lru);
+               /*
+                * what we want to do here is call add_to_page_cache_lru,
+                * but that isn't exported, so we reproduce it here
+                */
+               if (!add_to_page_cache(page, mapping,
+                                       page->index, GFP_KERNEL)) {
+
+                       /* open coding of lru_cache_add, also not exported */
+                       page_cache_get(page);
+                       if (!pagevec_add(&pvec, page))
+                               __pagevec_lru_add_file(&pvec);
+                       __extent_read_full_page(tree, page, get_extent,
+                                               &bio, 0, &bio_flags);
+               }
+               page_cache_release(page);
+       }
+       if (pagevec_count(&pvec))
+               __pagevec_lru_add_file(&pvec);
+       BUG_ON(!list_empty(pages));
+       if (bio)
+               submit_one_bio(READ, bio, 0, bio_flags);
+       return 0;
+}
+
+/*
+ * basic invalidatepage code.  This waits on any locked or writeback
+ * ranges corresponding to the page, and then deletes any extent state
+ * records from the tree
+ */
+int extent_invalidatepage(struct extent_io_tree *tree,
+                         struct page *page, unsigned long offset)
+{
+       u64 start = ((u64)page->index << PAGE_CACHE_SHIFT);
+       u64 end = start + PAGE_CACHE_SIZE - 1;
+       size_t blocksize = page->mapping->host->i_sb->s_blocksize;
+
+       start += (offset + blocksize - 1) & ~(blocksize - 1);
+       if (start > end)
+               return 0;
+
+       lock_extent(tree, start, end, GFP_NOFS);
+       wait_on_extent_writeback(tree, start, end);
+       clear_extent_bit(tree, start, end,
+                        EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC,
+                        1, 1, GFP_NOFS);
+       return 0;
+}
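+
+/*
+ * The round-up above keeps extent state for a block that is only partially
+ * invalidated.  For example, with a 512 byte blocksize and offset == 100,
+ * start advances by (100 + 511) & ~511 == 512, so the first block (which
+ * still holds valid data below offset 100) keeps its state and only the
+ * remainder of the page is cleared.
+ */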
+
+/*
+ * simple commit_write call: the page is marked dirty, and i_size is
+ * updated (and the inode marked dirty) if this write extends the file
+ */
+int extent_commit_write(struct extent_io_tree *tree,
+                       struct inode *inode, struct page *page,
+                       unsigned from, unsigned to)
+{
+       loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+       set_page_extent_mapped(page);
+       set_page_dirty(page);
+
+       if (pos > inode->i_size) {
+               i_size_write(inode, pos);
+               mark_inode_dirty(inode);
+       }
+       return 0;
+}
+
+int extent_prepare_write(struct extent_io_tree *tree,
+                        struct inode *inode, struct page *page,
+                        unsigned from, unsigned to, get_extent_t *get_extent)
+{
+       u64 page_start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
+       u64 block_start;
+       u64 orig_block_start;
+       u64 block_end;
+       u64 cur_end;
+       struct extent_map *em;
+       unsigned blocksize = 1 << inode->i_blkbits;
+       size_t page_offset = 0;
+       size_t block_off_start;
+       size_t block_off_end;
+       int err = 0;
+       int iocount = 0;
+       int ret = 0;
+       int isnew;
+
+       set_page_extent_mapped(page);
+
+       block_start = (page_start + from) & ~((u64)blocksize - 1);
+       block_end = (page_start + to - 1) | (blocksize - 1);
+       orig_block_start = block_start;
+
+       lock_extent(tree, page_start, page_end, GFP_NOFS);
+       while (block_start <= block_end) {
+               em = get_extent(inode, page, page_offset, block_start,
+                               block_end - block_start + 1, 1);
+               if (IS_ERR(em) || !em)
+                       goto err;
+
+               cur_end = min(block_end, extent_map_end(em) - 1);
+               block_off_start = block_start & (PAGE_CACHE_SIZE - 1);
+               block_off_end = block_off_start + blocksize;
+               isnew = clear_extent_new(tree, block_start, cur_end, GFP_NOFS);
+
+               if (!PageUptodate(page) && isnew &&
+                   (block_off_end > to || block_off_start < from)) {
+                       void *kaddr;
+
+                       kaddr = kmap_atomic(page, KM_USER0);
+                       if (block_off_end > to)
+                               memset(kaddr + to, 0, block_off_end - to);
+                       if (block_off_start < from)
+                               memset(kaddr + block_off_start, 0,
+                                      from - block_off_start);
+                       flush_dcache_page(page);
+                       kunmap_atomic(kaddr, KM_USER0);
+               }
+               if ((em->block_start != EXTENT_MAP_HOLE &&
+                    em->block_start != EXTENT_MAP_INLINE) &&
+                   !isnew && !PageUptodate(page) &&
+                   (block_off_end > to || block_off_start < from) &&
+                   !test_range_bit(tree, block_start, cur_end,
+                                   EXTENT_UPTODATE, 1)) {
+                       u64 sector;
+                       u64 extent_offset = block_start - em->start;
+                       size_t iosize;
+                       sector = (em->block_start + extent_offset) >> 9;
+                       iosize = (cur_end - block_start + blocksize) &
+                               ~((u64)blocksize - 1);
+                       /*
+                        * we've already got the extent locked, but we
+                        * need to split the state such that our end_bio
+                        * handler can clear the lock.
+                        */
+                       set_extent_bit(tree, block_start,
+                                      block_start + iosize - 1,
+                                      EXTENT_LOCKED, 0, NULL, GFP_NOFS);
+                       ret = submit_extent_page(READ, tree, page,
+                                        sector, iosize, page_offset, em->bdev,
+                                        NULL, 1,
+                                        end_bio_extent_preparewrite, 0,
+                                        0, 0);
+                       iocount++;
+                       block_start = block_start + iosize;
+               } else {
+                       set_extent_uptodate(tree, block_start, cur_end,
+                                           GFP_NOFS);
+                       unlock_extent(tree, block_start, cur_end, GFP_NOFS);
+                       block_start = cur_end + 1;
+               }
+               page_offset = block_start & (PAGE_CACHE_SIZE - 1);
+               free_extent_map(em);
+       }
+       if (iocount) {
+               wait_extent_bit(tree, orig_block_start,
+                               block_end, EXTENT_LOCKED);
+       }
+       check_page_uptodate(tree, page);
+err:
+       /* FIXME, zero out newly allocated blocks on error */
+       return err;
+}
+
+/*
+ * a helper for releasepage.  This tests for areas of the page that
+ * are locked or under IO and drops the related state bits if it is safe
+ * to drop the page.
+ */
+int try_release_extent_state(struct extent_map_tree *map,
+                            struct extent_io_tree *tree, struct page *page,
+                            gfp_t mask)
+{
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 end = start + PAGE_CACHE_SIZE - 1;
+       int ret = 1;
+
+       if (test_range_bit(tree, start, end,
+                          EXTENT_IOBITS | EXTENT_ORDERED, 0))
+               ret = 0;
+       else {
+               if ((mask & GFP_NOFS) == GFP_NOFS)
+                       mask = GFP_NOFS;
+               clear_extent_bit(tree, start, end, EXTENT_UPTODATE,
+                                1, 1, mask);
+       }
+       return ret;
+}
+
+/*
+ * a helper for releasepage.  As long as there are no locked extents
+ * in the range corresponding to the page, both state records and extent
+ * map records are removed
+ */
+int try_release_extent_mapping(struct extent_map_tree *map,
+                              struct extent_io_tree *tree, struct page *page,
+                              gfp_t mask)
+{
+       struct extent_map *em;
+       u64 start = (u64)page->index << PAGE_CACHE_SHIFT;
+       u64 end = start + PAGE_CACHE_SIZE - 1;
+
+       if ((mask & __GFP_WAIT) &&
+           page->mapping->host->i_size > 16 * 1024 * 1024) {
+               u64 len;
+               while (start <= end) {
+                       len = end - start + 1;
+                       spin_lock(&map->lock);
+                       em = lookup_extent_mapping(map, start, len);
+                       if (!em || IS_ERR(em)) {
+                               spin_unlock(&map->lock);
+                               break;
+                       }
+                       if (test_bit(EXTENT_FLAG_PINNED, &em->flags) ||
+                           em->start != start) {
+                               spin_unlock(&map->lock);
+                               free_extent_map(em);
+                               break;
+                       }
+                       if (!test_range_bit(tree, em->start,
+                                           extent_map_end(em) - 1,
+                                           EXTENT_LOCKED | EXTENT_WRITEBACK |
+                                           EXTENT_ORDERED,
+                                           0)) {
+                               remove_extent_mapping(map, em);
+                               /* once for the rb tree */
+                               free_extent_map(em);
+                       }
+                       start = extent_map_end(em);
+                       spin_unlock(&map->lock);
+
+                       /* once for us */
+                       free_extent_map(em);
+               }
+       }
+       return try_release_extent_state(map, tree, page, mask);
+}
+
+sector_t extent_bmap(struct address_space *mapping, sector_t iblock,
+               get_extent_t *get_extent)
+{
+       struct inode *inode = mapping->host;
+       u64 start = iblock << inode->i_blkbits;
+       sector_t sector = 0;
+       size_t blksize = (1 << inode->i_blkbits);
+       struct extent_map *em;
+
+       lock_extent(&BTRFS_I(inode)->io_tree, start, start + blksize - 1,
+                   GFP_NOFS);
+       em = get_extent(inode, NULL, 0, start, blksize, 0);
+       unlock_extent(&BTRFS_I(inode)->io_tree, start, start + blksize - 1,
+                     GFP_NOFS);
+       if (!em || IS_ERR(em))
+               return 0;
+
+       if (em->block_start > EXTENT_MAP_LAST_BYTE)
+               goto out;
+
+       sector = (em->block_start + start - em->start) >> inode->i_blkbits;
+out:
+       free_extent_map(em);
+       return sector;
+}
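+
+/*
+ * extent_bmap() backs the FIBMAP ioctl through the ->bmap address space
+ * op.  A sketch of the glue, modeled on the btrfs caller in inode.c:
+ *
+ *	static sector_t btrfs_bmap(struct address_space *mapping,
+ *				   sector_t iblock)
+ *	{
+ *		return extent_bmap(mapping, iblock, btrfs_get_extent);
+ *	}
+ */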
+
+static inline struct page *extent_buffer_page(struct extent_buffer *eb,
+                                             unsigned long i)
+{
+       struct page *p;
+       struct address_space *mapping;
+
+       if (i == 0)
+               return eb->first_page;
+       i += eb->start >> PAGE_CACHE_SHIFT;
+       mapping = eb->first_page->mapping;
+       if (!mapping)
+               return NULL;
+
+       /*
+        * extent_buffer_page is only called after pinning the page
+        * by increasing the reference count.  So we know the page must
+        * be in the radix tree.
+        */
+       rcu_read_lock();
+       p = radix_tree_lookup(&mapping->page_tree, i);
+       rcu_read_unlock();
+
+       return p;
+}
+
+static inline unsigned long num_extent_pages(u64 start, u64 len)
+{
+       return ((start + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT) -
+               (start >> PAGE_CACHE_SHIFT);
+}
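+
+/*
+ * Example: with 4k pages, a buffer at start == 12288 with len == 8192
+ * covers bytes 12288..20479, i.e. page indexes 3 and 4.  The formula
+ * gives ((12288 + 8192 + 4095) >> 12) - (12288 >> 12) == 5 - 3 == 2.
+ */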
+
+static struct extent_buffer *__alloc_extent_buffer(struct extent_io_tree *tree,
+                                                  u64 start,
+                                                  unsigned long len,
+                                                  gfp_t mask)
+{
+       struct extent_buffer *eb = NULL;
+#ifdef LEAK_DEBUG
+       unsigned long flags;
+#endif
+
+       eb = kmem_cache_zalloc(extent_buffer_cache, mask);
+       if (!eb)
+               return NULL;
+       eb->start = start;
+       eb->len = len;
+       mutex_init(&eb->mutex);
+#ifdef LEAK_DEBUG
+       spin_lock_irqsave(&leak_lock, flags);
+       list_add(&eb->leak_list, &buffers);
+       spin_unlock_irqrestore(&leak_lock, flags);
+#endif
+       atomic_set(&eb->refs, 1);
+
+       return eb;
+}
+
+static void __free_extent_buffer(struct extent_buffer *eb)
+{
+#ifdef LEAK_DEBUG
+       unsigned long flags;
+       spin_lock_irqsave(&leak_lock, flags);
+       list_del(&eb->leak_list);
+       spin_unlock_irqrestore(&leak_lock, flags);
+#endif
+       kmem_cache_free(extent_buffer_cache, eb);
+}
+
+struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
+                                         u64 start, unsigned long len,
+                                         struct page *page0,
+                                         gfp_t mask)
+{
+       unsigned long num_pages = num_extent_pages(start, len);
+       unsigned long i;
+       unsigned long index = start >> PAGE_CACHE_SHIFT;
+       struct extent_buffer *eb;
+       struct extent_buffer *exists = NULL;
+       struct page *p;
+       struct address_space *mapping = tree->mapping;
+       int uptodate = 1;
+
+       spin_lock(&tree->buffer_lock);
+       eb = buffer_search(tree, start);
+       if (eb) {
+               atomic_inc(&eb->refs);
+               spin_unlock(&tree->buffer_lock);
+               mark_page_accessed(eb->first_page);
+               return eb;
+       }
+       spin_unlock(&tree->buffer_lock);
+
+       eb = __alloc_extent_buffer(tree, start, len, mask);
+       if (!eb)
+               return NULL;
+
+       if (page0) {
+               eb->first_page = page0;
+               i = 1;
+               index++;
+               page_cache_get(page0);
+               mark_page_accessed(page0);
+               set_page_extent_mapped(page0);
+               set_page_extent_head(page0, len);
+               uptodate = PageUptodate(page0);
+       } else {
+               i = 0;
+       }
+       for (; i < num_pages; i++, index++) {
+               p = find_or_create_page(mapping, index, mask | __GFP_HIGHMEM);
+               if (!p) {
+                       WARN_ON(1);
+                       goto free_eb;
+               }
+               set_page_extent_mapped(p);
+               mark_page_accessed(p);
+               if (i == 0) {
+                       eb->first_page = p;
+                       set_page_extent_head(p, len);
+               } else {
+                       set_page_private(p, EXTENT_PAGE_PRIVATE);
+               }
+               if (!PageUptodate(p))
+                       uptodate = 0;
+               unlock_page(p);
+       }
+       if (uptodate)
+               eb->flags |= EXTENT_UPTODATE;
+       eb->flags |= EXTENT_BUFFER_FILLED;
+
+       spin_lock(&tree->buffer_lock);
+       exists = buffer_tree_insert(tree, start, &eb->rb_node);
+       if (exists) {
+               /* add one reference for the caller */
+               atomic_inc(&exists->refs);
+               spin_unlock(&tree->buffer_lock);
+               goto free_eb;
+       }
+       spin_unlock(&tree->buffer_lock);
+
+       /* add one reference for the tree */
+       atomic_inc(&eb->refs);
+       return eb;
+
+free_eb:
+       if (!atomic_dec_and_test(&eb->refs))
+               return exists;
+       for (index = 1; index < i; index++)
+               page_cache_release(extent_buffer_page(eb, index));
+       page_cache_release(extent_buffer_page(eb, 0));
+       __free_extent_buffer(eb);
+       return exists;
+}
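+
+/*
+ * On the racing path above: if two callers allocate the same buffer
+ * concurrently, buffer_tree_insert() hands back the one already in the
+ * tree.  The loser takes a reference on that existing buffer for its
+ * caller, then drops its own half-built eb through free_eb, releasing
+ * every page it had pinned along the way.
+ */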
+
+struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree,
+                                        u64 start, unsigned long len,
+                                         gfp_t mask)
+{
+       struct extent_buffer *eb;
+
+       spin_lock(&tree->buffer_lock);
+       eb = buffer_search(tree, start);
+       if (eb)
+               atomic_inc(&eb->refs);
+       spin_unlock(&tree->buffer_lock);
+
+       if (eb)
+               mark_page_accessed(eb->first_page);
+
+       return eb;
+}
+
+void free_extent_buffer(struct extent_buffer *eb)
+{
+       if (!eb)
+               return;
+
+       if (!atomic_dec_and_test(&eb->refs))
+               return;
+
+       WARN_ON(1);
+}
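+
+/*
+ * The WARN_ON above is deliberate: the tree holds its own reference
+ * (taken at the end of alloc_extent_buffer), so free_extent_buffer()
+ * should never see the count hit zero.  The actual teardown is expected
+ * to happen on the buffer release path, outside this hunk, once the tree
+ * reference is dropped.
+ */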
+
+int clear_extent_buffer_dirty(struct extent_io_tree *tree,
+                             struct extent_buffer *eb)
+{
+       int set;
+       unsigned long i;
+       unsigned long num_pages;
+       struct page *page;
+
+       u64 start = eb->start;
+       u64 end = start + eb->len - 1;
+
+       set = clear_extent_dirty(tree, start, end, GFP_NOFS);
+       num_pages = num_extent_pages(eb->start, eb->len);
+
+       for (i = 0; i < num_pages; i++) {
+               page = extent_buffer_page(eb, i);
+               if (!set && !PageDirty(page))
+                       continue;
+
+               lock_page(page);
+               if (i == 0)
+                       set_page_extent_head(page, eb->len);
+               else
+                       set_page_private(page, EXTENT_PAGE_PRIVATE);
+
+               /*
+                * if we're on the last page or the first page and the
+                * block isn't aligned on a page boundary, do extra checks
+                * to make sure we don't clean a page that is partially dirty
+                */
+               if ((i == 0 && (eb->start & (PAGE_CACHE_SIZE - 1))) ||
+                   ((i == num_pages - 1) &&
+                    ((eb->start + eb->len) & (PAGE_CACHE_SIZE - 1)))) {
+                       start = (u64)page->index << PAGE_CACHE_SHIFT;
+                       end  = start + PAGE_CACHE_SIZE - 1;
+                       if (test_range_bit(tree, start, end,
+                                          EXTENT_DIRTY, 0)) {
+                               unlock_page(page);
+                               continue;
+                       }
+               }
+               clear_page_dirty_for_io(page);
+               spin_lock_irq(&page->mapping->tree_lock);
+               if (!PageDirty(page)) {
+                       radix_tree_tag_clear(&page->mapping->page_tree,
+                                               page_index(page),
+                                               PAGECACHE_TAG_DIRTY);
+               }
+               spin_unlock_irq(&page->mapping->tree_lock);
+               unlock_page(page);
+       }
+       return 0;
+}
+
+int wait_on_extent_buffer_writeback(struct extent_io_tree *tree,
+                                   struct extent_buffer *eb)
+{
+       return wait_on_extent_writeback(tree, eb->start,
+                                       eb->start + eb->len - 1);
+}
+
+int set_extent_buffer_dirty(struct extent_io_tree *tree,
+                            struct extent_buffer *eb)
+{
+       unsigned long i;
+       unsigned long num_pages;
+
+       num_pages = num_extent_pages(eb->start, eb->len);
+       for (i = 0; i < num_pages; i++) {
+               struct page *page = extent_buffer_page(eb, i);
+               /* writepage may need to do something special for the
+                * first page, so we have to make sure page->private is
+                * properly set.  releasepage may drop page->private
+                * on us if the page isn't already dirty.
+                */
+               lock_page(page);
+               if (i == 0) {
+                       set_page_extent_head(page, eb->len);
+               } else if (PagePrivate(page) &&
+                          page->private != EXTENT_PAGE_PRIVATE) {
+                       set_page_extent_mapped(page);
+               }
+               __set_page_dirty_nobuffers(extent_buffer_page(eb, i));
+               set_extent_dirty(tree, page_offset(page),
+                                page_offset(page) + PAGE_CACHE_SIZE - 1,
+                                GFP_NOFS);
+               unlock_page(page);
+       }
+       return 0;
+}
+
+int clear_extent_buffer_uptodate(struct extent_io_tree *tree,
+                               struct extent_buffer *eb)
+{
+       unsigned long i;
+       struct page *page;
+       unsigned long num_pages;
+
+       num_pages = num_extent_pages(eb->start, eb->len);
+       eb->flags &= ~EXTENT_UPTODATE;
+
+       clear_extent_uptodate(tree, eb->start, eb->start + eb->len - 1,
+                             GFP_NOFS);
+       for (i = 0; i < num_pages; i++) {
+               page = extent_buffer_page(eb, i);
+               if (page)
+                       ClearPageUptodate(page);
+       }
+       return 0;
+}
+
+int set_extent_buffer_uptodate(struct extent_io_tree *tree,
+                               struct extent_buffer *eb)
+{
+       unsigned long i;
+       struct page *page;
+       unsigned long num_pages;
+
+       num_pages = num_extent_pages(eb->start, eb->len);
+
+       set_extent_uptodate(tree, eb->start, eb->start + eb->len - 1,
+                           GFP_NOFS);
+       for (i = 0; i < num_pages; i++) {
+               page = extent_buffer_page(eb, i);
+               if ((i == 0 && (eb->start & (PAGE_CACHE_SIZE - 1))) ||
+                   ((i == num_pages - 1) &&
+                    ((eb->start + eb->len) & (PAGE_CACHE_SIZE - 1)))) {
+                       check_page_uptodate(tree, page);
+                       continue;
+               }
+               SetPageUptodate(page);
+       }
+       return 0;
+}
+
+int extent_range_uptodate(struct extent_io_tree *tree,
+                         u64 start, u64 end)
+{
+       struct page *page;
+       int ret;
+       int pg_uptodate = 1;
+       int uptodate;
+       unsigned long index;
+
+       ret = test_range_bit(tree, start, end, EXTENT_UPTODATE, 1);
+       if (ret)
+               return 1;
+       while (start <= end) {
+               index = start >> PAGE_CACHE_SHIFT;
+               page = find_get_page(tree->mapping, index);
+               if (!page) {
+                       pg_uptodate = 0;
+                       break;
+               }
+               uptodate = PageUptodate(page);
+               page_cache_release(page);
+               if (!uptodate) {
+                       pg_uptodate = 0;
+                       break;
+               }
+               start += PAGE_CACHE_SIZE;
+       }
+       return pg_uptodate;
+}
+
+int extent_buffer_uptodate(struct extent_io_tree *tree,
+                          struct extent_buffer *eb)
+{
+       int ret = 0;
+       unsigned long num_pages;
+       unsigned long i;
+       struct page *page;
+       int pg_uptodate = 1;
+
+       if (eb->flags & EXTENT_UPTODATE)
+               return 1;
+
+       ret = test_range_bit(tree, eb->start, eb->start + eb->len - 1,
+                          EXTENT_UPTODATE, 1);
+       if (ret)
+               return ret;
+
+       num_pages = num_extent_pages(eb->start, eb->len);
+       for (i = 0; i < num_pages; i++) {
+               page = extent_buffer_page(eb, i);
+               if (!PageUptodate(page)) {
+                       pg_uptodate = 0;
+                       break;
+               }
+       }
+       return pg_uptodate;
+}
+
+int read_extent_buffer_pages(struct extent_io_tree *tree,
+                            struct extent_buffer *eb,
+                            u64 start, int wait,
+                            get_extent_t *get_extent, int mirror_num)
+{
+       unsigned long i;
+       unsigned long start_i;
+       struct page *page;
+       int err;
+       int ret = 0;
+       int locked_pages = 0;
+       int all_uptodate = 1;
+       int inc_all_pages = 0;
+       unsigned long num_pages;
+       struct bio *bio = NULL;
+       unsigned long bio_flags = 0;
+
+       if (eb->flags & EXTENT_UPTODATE)
+               return 0;
+
+       if (test_range_bit(tree, eb->start, eb->start + eb->len - 1,
+                          EXTENT_UPTODATE, 1)) {
+               return 0;
+       }
+
+       if (start) {
+               WARN_ON(start < eb->start);
+               start_i = (start >> PAGE_CACHE_SHIFT) -
+                       (eb->start >> PAGE_CACHE_SHIFT);
+       } else {
+               start_i = 0;
+       }
+
+       num_pages = num_extent_pages(eb->start, eb->len);
+       for (i = start_i; i < num_pages; i++) {
+               page = extent_buffer_page(eb, i);
+               if (!wait) {
+                       if (!trylock_page(page))
+                               goto unlock_exit;
+               } else {
+                       lock_page(page);
+               }
+               locked_pages++;
+               if (!PageUptodate(page))
+                       all_uptodate = 0;
+       }
+       if (all_uptodate) {
+               if (start_i == 0)
+                       eb->flags |= EXTENT_UPTODATE;
+               goto unlock_exit;
+       }
+
+       for (i = start_i; i < num_pages; i++) {
+               page = extent_buffer_page(eb, i);
+               if (inc_all_pages)
+                       page_cache_get(page);
+               if (!PageUptodate(page)) {
+                       if (start_i == 0)
+                               inc_all_pages = 1;
+                       ClearPageError(page);
+                       err = __extent_read_full_page(tree, page,
+                                                     get_extent, &bio,
+                                                     mirror_num, &bio_flags);
+                       if (err)
+                               ret = err;
+               } else {
+                       unlock_page(page);
+               }
+       }
+
+       if (bio)
+               submit_one_bio(READ, bio, mirror_num, bio_flags);
+
+       if (ret || !wait)
+               return ret;
+
+       for (i = start_i; i < num_pages; i++) {
+               page = extent_buffer_page(eb, i);
+               wait_on_page_locked(page);
+               if (!PageUptodate(page))
+                       ret = -EIO;
+       }
+
+       if (!ret)
+               eb->flags |= EXTENT_UPTODATE;
+       return ret;
+
+unlock_exit:
+       i = start_i;
+       while (locked_pages > 0) {
+               page = extent_buffer_page(eb, i);
+               i++;
+               unlock_page(page);
+               locked_pages--;
+       }
+       return ret;
+}
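
The two-pass shape above (lock every page first, then submit reads only for the pages that are not yet uptodate) is the contract callers rely on. A minimal sketch of a blocking caller; the wrapper name and the get_extent callback are purely illustrative:

    /* Hypothetical wrapper: read all pages of @eb and block until done. */
    static int read_whole_buffer(struct extent_io_tree *tree,
                                 struct extent_buffer *eb,
                                 get_extent_t *get_extent)
    {
            /* start == 0 covers every page; wait == 1 makes the final
             * loop wait_on_page_locked() and return -EIO if any page
             * is still not uptodate.
             */
            return read_extent_buffer_pages(tree, eb, 0, 1, get_extent, 0);
    }
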
+
+void read_extent_buffer(struct extent_buffer *eb, void *dstv,
+                       unsigned long start,
+                       unsigned long len)
+{
+       size_t cur;
+       size_t offset;
+       struct page *page;
+       char *kaddr;
+       char *dst = (char *)dstv;
+       size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
+
+       WARN_ON(start > eb->len);
+       WARN_ON(start + len > eb->start + eb->len);
+
+       offset = (start_offset + start) & ((unsigned long)PAGE_CACHE_SIZE - 1);
+
+       while (len > 0) {
+               page = extent_buffer_page(eb, i);
+
+               cur = min(len, (PAGE_CACHE_SIZE - offset));
+               kaddr = kmap_atomic(page, KM_USER1);
+               memcpy(dst, kaddr + offset, cur);
+               kunmap_atomic(kaddr, KM_USER1);
+
+               dst += cur;
+               len -= cur;
+               offset = 0;
+               i++;
+       }
+}
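
A hedged usage sketch of the copy-out helper (the buffer size and offset are illustrative): offsets are relative to eb->start, and the WARN_ONs above fire if the range runs past the end of the buffer.

    /* Hypothetical example: copy the first 32 bytes out of @eb. */
    static void peek_header(struct extent_buffer *eb)
    {
            u8 header[32];

            /* The pages must already be uptodate, e.g. via
             * read_extent_buffer_pages() above.
             */
            read_extent_buffer(eb, header, 0, sizeof(header));
    }
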
+
+int map_private_extent_buffer(struct extent_buffer *eb, unsigned long start,
+                              unsigned long min_len, char **token, char **map,
+                              unsigned long *map_start,
+                              unsigned long *map_len, int km)
+{
+       size_t offset = start & (PAGE_CACHE_SIZE - 1);
+       char *kaddr;
+       struct page *p;
+       size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
+       unsigned long end_i = (start_offset + start + min_len - 1) >>
+               PAGE_CACHE_SHIFT;
+
+       if (i != end_i)
+               return -EINVAL;
+
+       if (i == 0) {
+               offset = start_offset;
+               *map_start = 0;
+       } else {
+               offset = 0;
+               *map_start = ((u64)i << PAGE_CACHE_SHIFT) - start_offset;
+       }
+
+       if (start + min_len > eb->len) {
+               printk(KERN_ERR "btrfs bad mapping eb start %llu len %lu, "
+                      "wanted %lu %lu\n", (unsigned long long)eb->start,
+                      eb->len, start, min_len);
+               WARN_ON(1);
+       }
+
+       p = extent_buffer_page(eb, i);
+       kaddr = kmap_atomic(p, km);
+       *token = kaddr;
+       *map = kaddr + offset;
+       *map_len = PAGE_CACHE_SIZE - offset;
+       return 0;
+}
+
+int map_extent_buffer(struct extent_buffer *eb, unsigned long start,
+                     unsigned long min_len,
+                     char **token, char **map,
+                     unsigned long *map_start,
+                     unsigned long *map_len, int km)
+{
+       int err;
+       int save = 0;
+       if (eb->map_token) {
+               unmap_extent_buffer(eb, eb->map_token, km);
+               eb->map_token = NULL;
+               save = 1;
+               WARN_ON(!mutex_is_locked(&eb->mutex));
+       }
+       err = map_private_extent_buffer(eb, start, min_len, token, map,
+                                      map_start, map_len, km);
+       if (!err && save) {
+               eb->map_token = *token;
+               eb->kaddr = *map;
+               eb->map_start = *map_start;
+               eb->map_len = *map_len;
+       }
+       return err;
+}
+
+void unmap_extent_buffer(struct extent_buffer *eb, char *token, int km)
+{
+       kunmap_atomic(token, km);
+}
+
+int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
+                         unsigned long start,
+                         unsigned long len)
+{
+       size_t cur;
+       size_t offset;
+       struct page *page;
+       char *kaddr;
+       char *ptr = (char *)ptrv;
+       size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
+       int ret = 0;
+
+       WARN_ON(start > eb->len);
+       WARN_ON(start + len > eb->start + eb->len);
+
+       offset = (start_offset + start) & ((unsigned long)PAGE_CACHE_SIZE - 1);
+
+       while (len > 0) {
+               page = extent_buffer_page(eb, i);
+
+               cur = min(len, (PAGE_CACHE_SIZE - offset));
+
+               kaddr = kmap_atomic(page, KM_USER0);
+               ret = memcmp(ptr, kaddr + offset, cur);
+               kunmap_atomic(kaddr, KM_USER0);
+               if (ret)
+                       break;
+
+               ptr += cur;
+               len -= cur;
+               offset = 0;
+               i++;
+       }
+       return ret;
+}
+
+void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
+                        unsigned long start, unsigned long len)
+{
+       size_t cur;
+       size_t offset;
+       struct page *page;
+       char *kaddr;
+       char *src = (char *)srcv;
+       size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
+
+       WARN_ON(start > eb->len);
+       WARN_ON(start + len > eb->start + eb->len);
+
+       offset = (start_offset + start) & ((unsigned long)PAGE_CACHE_SIZE - 1);
+
+       while (len > 0) {
+               page = extent_buffer_page(eb, i);
+               WARN_ON(!PageUptodate(page));
+
+               cur = min(len, PAGE_CACHE_SIZE - offset);
+               kaddr = kmap_atomic(page, KM_USER1);
+               memcpy(kaddr + offset, src, cur);
+               kunmap_atomic(kaddr, KM_USER1);
+
+               src += cur;
+               len -= cur;
+               offset = 0;
+               i++;
+       }
+}
+
+void memset_extent_buffer(struct extent_buffer *eb, char c,
+                         unsigned long start, unsigned long len)
+{
+       size_t cur;
+       size_t offset;
+       struct page *page;
+       char *kaddr;
+       size_t start_offset = eb->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long i = (start_offset + start) >> PAGE_CACHE_SHIFT;
+
+       WARN_ON(start > eb->len);
+       WARN_ON(start + len > eb->start + eb->len);
+
+       offset = (start_offset + start) & ((unsigned long)PAGE_CACHE_SIZE - 1);
+
+       while (len > 0) {
+               page = extent_buffer_page(eb, i);
+               WARN_ON(!PageUptodate(page));
+
+               cur = min(len, PAGE_CACHE_SIZE - offset);
+               kaddr = kmap_atomic(page, KM_USER0);
+               memset(kaddr + offset, c, cur);
+               kunmap_atomic(kaddr, KM_USER0);
+
+               len -= cur;
+               offset = 0;
+               i++;
+       }
+}
+
+void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
+                       unsigned long dst_offset, unsigned long src_offset,
+                       unsigned long len)
+{
+       u64 dst_len = dst->len;
+       size_t cur;
+       size_t offset;
+       struct page *page;
+       char *kaddr;
+       size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT;
+
+       WARN_ON(src->len != dst_len);
+
+       offset = (start_offset + dst_offset) &
+               ((unsigned long)PAGE_CACHE_SIZE - 1);
+
+       while (len > 0) {
+               page = extent_buffer_page(dst, i);
+               WARN_ON(!PageUptodate(page));
+
+               cur = min(len, (unsigned long)(PAGE_CACHE_SIZE - offset));
+
+               kaddr = kmap_atomic(page, KM_USER0);
+               read_extent_buffer(src, kaddr + offset, src_offset, cur);
+               kunmap_atomic(kaddr, KM_USER0);
+
+               src_offset += cur;
+               len -= cur;
+               offset = 0;
+               i++;
+       }
+}
+
+static void move_pages(struct page *dst_page, struct page *src_page,
+                      unsigned long dst_off, unsigned long src_off,
+                      unsigned long len)
+{
+       char *dst_kaddr = kmap_atomic(dst_page, KM_USER0);
+       if (dst_page == src_page) {
+               memmove(dst_kaddr + dst_off, dst_kaddr + src_off, len);
+       } else {
+               char *src_kaddr = kmap_atomic(src_page, KM_USER1);
+               char *p = dst_kaddr + dst_off + len;
+               char *s = src_kaddr + src_off + len;
+
+               while (len--)
+                       *--p = *--s;
+
+               kunmap_atomic(src_kaddr, KM_USER1);
+       }
+       kunmap_atomic(dst_kaddr, KM_USER0);
+}
+
+static void copy_pages(struct page *dst_page, struct page *src_page,
+                      unsigned long dst_off, unsigned long src_off,
+                      unsigned long len)
+{
+       char *dst_kaddr = kmap_atomic(dst_page, KM_USER0);
+       char *src_kaddr;
+
+       if (dst_page != src_page)
+               src_kaddr = kmap_atomic(src_page, KM_USER1);
+       else
+               src_kaddr = dst_kaddr;
+
+       memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len);
+       kunmap_atomic(dst_kaddr, KM_USER0);
+       if (dst_page != src_page)
+               kunmap_atomic(src_kaddr, KM_USER1);
+}
+
+void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
+                          unsigned long src_offset, unsigned long len)
+{
+       size_t cur;
+       size_t dst_off_in_page;
+       size_t src_off_in_page;
+       size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long dst_i;
+       unsigned long src_i;
+
+       if (src_offset + len > dst->len) {
+               printk(KERN_ERR "btrfs memmove bogus src_offset %lu move "
+                      "len %lu dst len %lu\n", src_offset, len, dst->len);
+               BUG_ON(1);
+       }
+       if (dst_offset + len > dst->len) {
+               printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move "
+                      "len %lu dst len %lu\n", dst_offset, len, dst->len);
+               BUG_ON(1);
+       }
+
+       while (len > 0) {
+               dst_off_in_page = (start_offset + dst_offset) &
+                       ((unsigned long)PAGE_CACHE_SIZE - 1);
+               src_off_in_page = (start_offset + src_offset) &
+                       ((unsigned long)PAGE_CACHE_SIZE - 1);
+
+               dst_i = (start_offset + dst_offset) >> PAGE_CACHE_SHIFT;
+               src_i = (start_offset + src_offset) >> PAGE_CACHE_SHIFT;
+
+               cur = min(len, (unsigned long)(PAGE_CACHE_SIZE -
+                                              src_off_in_page));
+               cur = min_t(unsigned long, cur,
+                       (unsigned long)(PAGE_CACHE_SIZE - dst_off_in_page));
+
+               copy_pages(extent_buffer_page(dst, dst_i),
+                          extent_buffer_page(dst, src_i),
+                          dst_off_in_page, src_off_in_page, cur);
+
+               src_offset += cur;
+               dst_offset += cur;
+               len -= cur;
+       }
+}
+
+void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
+                          unsigned long src_offset, unsigned long len)
+{
+       size_t cur;
+       size_t dst_off_in_page;
+       size_t src_off_in_page;
+       unsigned long dst_end = dst_offset + len - 1;
+       unsigned long src_end = src_offset + len - 1;
+       size_t start_offset = dst->start & ((u64)PAGE_CACHE_SIZE - 1);
+       unsigned long dst_i;
+       unsigned long src_i;
+
+       if (src_offset + len > dst->len) {
+               printk(KERN_ERR "btrfs memmove bogus src_offset %lu move "
+                      "len %lu len %lu\n", src_offset, len, dst->len);
+               BUG_ON(1);
+       }
+       if (dst_offset + len > dst->len) {
+               printk(KERN_ERR "btrfs memmove bogus dst_offset %lu move "
+                      "len %lu len %lu\n", dst_offset, len, dst->len);
+               BUG_ON(1);
+       }
+       if (dst_offset < src_offset) {
+               memcpy_extent_buffer(dst, dst_offset, src_offset, len);
+               return;
+       }
+       while (len > 0) {
+               dst_i = (start_offset + dst_end) >> PAGE_CACHE_SHIFT;
+               src_i = (start_offset + src_end) >> PAGE_CACHE_SHIFT;
+
+               dst_off_in_page = (start_offset + dst_end) &
+                       ((unsigned long)PAGE_CACHE_SIZE - 1);
+               src_off_in_page = (start_offset + src_end) &
+                       ((unsigned long)PAGE_CACHE_SIZE - 1);
+
+               cur = min_t(unsigned long, len, src_off_in_page + 1);
+               cur = min(cur, dst_off_in_page + 1);
+               move_pages(extent_buffer_page(dst, dst_i),
+                          extent_buffer_page(dst, src_i),
+                          dst_off_in_page - cur + 1,
+                          src_off_in_page - cur + 1, cur);
+
+               dst_end -= cur;
+               src_end -= cur;
+               len -= cur;
+       }
+}
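
memcpy_extent_buffer() always copies forward, while memmove_extent_buffer() delegates to it only when dst sits below src and otherwise walks backward through move_pages(), mirroring the usual memcpy()/memmove() split. A sketch with illustrative offsets (the buffer must be large enough to satisfy the BUG_ON range checks above):

    static void shift_items(struct extent_buffer *eb)
    {
            /* Overlapping shift: dst > src, so the backward walk is
             * required for correctness.
             */
            memmove_extent_buffer(eb, 8, 0, 100);

            /* Disjoint ranges can take the cheaper forward copy. */
            memcpy_extent_buffer(eb, 4096, 0, 100);
    }
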
+
+int try_release_extent_buffer(struct extent_io_tree *tree, struct page *page)
+{
+       u64 start = page_offset(page);
+       struct extent_buffer *eb;
+       int ret = 1;
+       unsigned long i;
+       unsigned long num_pages;
+
+       spin_lock(&tree->buffer_lock);
+       eb = buffer_search(tree, start);
+       if (!eb)
+               goto out;
+
+       if (atomic_read(&eb->refs) > 1) {
+               ret = 0;
+               goto out;
+       }
+       /* at this point we can safely release the extent buffer */
+       num_pages = num_extent_pages(eb->start, eb->len);
+       for (i = 0; i < num_pages; i++)
+               page_cache_release(extent_buffer_page(eb, i));
+       rb_erase(&eb->rb_node, &tree->buffer);
+       __free_extent_buffer(eb);
+out:
+       spin_unlock(&tree->buffer_lock);
+       return ret;
+}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
new file mode 100644 (file)
index 0000000..c5b483a
--- /dev/null
@@ -0,0 +1,269 @@
+#ifndef __EXTENTIO__
+#define __EXTENTIO__
+
+#include <linux/rbtree.h>
+
+/* bits for the extent state */
+#define EXTENT_DIRTY 1
+#define EXTENT_WRITEBACK (1 << 1)
+#define EXTENT_UPTODATE (1 << 2)
+#define EXTENT_LOCKED (1 << 3)
+#define EXTENT_NEW (1 << 4)
+#define EXTENT_DELALLOC (1 << 5)
+#define EXTENT_DEFRAG (1 << 6)
+#define EXTENT_DEFRAG_DONE (1 << 7)
+#define EXTENT_BUFFER_FILLED (1 << 8)
+#define EXTENT_ORDERED (1 << 9)
+#define EXTENT_ORDERED_METADATA (1 << 10)
+#define EXTENT_BOUNDARY (1 << 11)
+#define EXTENT_NODATASUM (1 << 12)
+#define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
+
+/* flags for bio submission */
+#define EXTENT_BIO_COMPRESSED 1
+
+/*
+ * page->private values.  Every page that is controlled by the extent
+ * map has page->private set to EXTENT_PAGE_PRIVATE; the first page of
+ * an extent buffer is tagged with EXTENT_PAGE_PRIVATE_FIRST_PAGE.
+ */
+#define EXTENT_PAGE_PRIVATE 1
+#define EXTENT_PAGE_PRIVATE_FIRST_PAGE 3
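
EXTENT_PAGE_PRIVATE_FIRST_PAGE sets both low bits, and set_page_extent_head() in extent_io.c stores the buffer length above them, so head pages can be told apart with a simple mask. A hypothetical test helper built on that encoding:

    /* Hypothetical helper: does @page head an extent buffer? */
    static inline int page_is_extent_head(struct page *page)
    {
            return PagePrivate(page) &&
                   (page->private & EXTENT_PAGE_PRIVATE_FIRST_PAGE) ==
                            EXTENT_PAGE_PRIVATE_FIRST_PAGE;
    }
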
+
+struct extent_state;
+
+typedef int (extent_submit_bio_hook_t)(struct inode *inode, int rw,
+                                       struct bio *bio, int mirror_num,
+                                       unsigned long bio_flags);
+struct extent_io_ops {
+       int (*fill_delalloc)(struct inode *inode, struct page *locked_page,
+                            u64 start, u64 end, int *page_started,
+                            unsigned long *nr_written);
+       int (*writepage_start_hook)(struct page *page, u64 start, u64 end);
+       int (*writepage_io_hook)(struct page *page, u64 start, u64 end);
+       extent_submit_bio_hook_t *submit_bio_hook;
+       int (*merge_bio_hook)(struct page *page, unsigned long offset,
+                             size_t size, struct bio *bio,
+                             unsigned long bio_flags);
+       int (*readpage_io_hook)(struct page *page, u64 start, u64 end);
+       int (*readpage_io_failed_hook)(struct bio *bio, struct page *page,
+                                      u64 start, u64 end,
+                                      struct extent_state *state);
+       int (*writepage_io_failed_hook)(struct bio *bio, struct page *page,
+                                       u64 start, u64 end,
+                                       struct extent_state *state);
+       int (*readpage_end_io_hook)(struct page *page, u64 start, u64 end,
+                                   struct extent_state *state);
+       int (*writepage_end_io_hook)(struct page *page, u64 start, u64 end,
+                                     struct extent_state *state, int uptodate);
+       int (*set_bit_hook)(struct inode *inode, u64 start, u64 end,
+                           unsigned long old, unsigned long bits);
+       int (*clear_bit_hook)(struct inode *inode, u64 start, u64 end,
+                           unsigned long old, unsigned long bits);
+       int (*write_cache_pages_lock_hook)(struct page *page);
+};
+
+struct extent_io_tree {
+       struct rb_root state;
+       struct rb_root buffer;
+       struct address_space *mapping;
+       u64 dirty_bytes;
+       spinlock_t lock;
+       spinlock_t buffer_lock;
+       struct extent_io_ops *ops;
+};
+
+struct extent_state {
+       u64 start;
+       u64 end; /* inclusive */
+       struct rb_node rb_node;
+       struct extent_io_tree *tree;
+       wait_queue_head_t wq;
+       atomic_t refs;
+       unsigned long state;
+
+       /* for use by the FS */
+       u64 private;
+
+       struct list_head leak_list;
+};
+
+struct extent_buffer {
+       u64 start;
+       unsigned long len;
+       char *map_token;
+       char *kaddr;
+       unsigned long map_start;
+       unsigned long map_len;
+       struct page *first_page;
+       atomic_t refs;
+       int flags;
+       struct list_head leak_list;
+       struct rb_node rb_node;
+       struct mutex mutex;
+};
+
+struct extent_map_tree;
+
+static inline struct extent_state *extent_state_next(struct extent_state *state)
+{
+       struct rb_node *node;
+       node = rb_next(&state->rb_node);
+       if (!node)
+               return NULL;
+       return rb_entry(node, struct extent_state, rb_node);
+}
+
+typedef struct extent_map *(get_extent_t)(struct inode *inode,
+                                         struct page *page,
+                                         size_t page_offset,
+                                         u64 start, u64 len,
+                                         int create);
+
+void extent_io_tree_init(struct extent_io_tree *tree,
+                         struct address_space *mapping, gfp_t mask);
+int try_release_extent_mapping(struct extent_map_tree *map,
+                              struct extent_io_tree *tree, struct page *page,
+                              gfp_t mask);
+int try_release_extent_buffer(struct extent_io_tree *tree, struct page *page);
+int try_release_extent_state(struct extent_map_tree *map,
+                            struct extent_io_tree *tree, struct page *page,
+                            gfp_t mask);
+int lock_extent(struct extent_io_tree *tree, u64 start, u64 end, gfp_t mask);
+int unlock_extent(struct extent_io_tree *tree, u64 start, u64 end, gfp_t mask);
+int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
+                   gfp_t mask);
+int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
+                         get_extent_t *get_extent);
+int __init extent_io_init(void);
+void extent_io_exit(void);
+
+u64 count_range_bits(struct extent_io_tree *tree,
+                    u64 *start, u64 search_end,
+                    u64 max_bytes, unsigned long bits);
+
+int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
+                  int bits, int filled);
+int clear_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
+                     int bits, gfp_t mask);
+int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
+                    int bits, int wake, int delete, gfp_t mask);
+int set_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
+                   int bits, gfp_t mask);
+int set_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
+                       gfp_t mask);
+int set_extent_new(struct extent_io_tree *tree, u64 start, u64 end,
+                  gfp_t mask);
+int set_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
+                    gfp_t mask);
+int clear_extent_dirty(struct extent_io_tree *tree, u64 start, u64 end,
+                      gfp_t mask);
+int clear_extent_ordered(struct extent_io_tree *tree, u64 start, u64 end,
+                      gfp_t mask);
+int clear_extent_ordered_metadata(struct extent_io_tree *tree, u64 start,
+                                 u64 end, gfp_t mask);
+int set_extent_delalloc(struct extent_io_tree *tree, u64 start, u64 end,
+                    gfp_t mask);
+int set_extent_ordered(struct extent_io_tree *tree, u64 start, u64 end,
+                    gfp_t mask);
+int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
+                         u64 *start_ret, u64 *end_ret, int bits);
+struct extent_state *find_first_extent_bit_state(struct extent_io_tree *tree,
+                                                u64 start, int bits);
+int extent_invalidatepage(struct extent_io_tree *tree,
+                         struct page *page, unsigned long offset);
+int extent_write_full_page(struct extent_io_tree *tree, struct page *page,
+                         get_extent_t *get_extent,
+                         struct writeback_control *wbc);
+int extent_write_locked_range(struct extent_io_tree *tree, struct inode *inode,
+                             u64 start, u64 end, get_extent_t *get_extent,
+                             int mode);
+int extent_writepages(struct extent_io_tree *tree,
+                     struct address_space *mapping,
+                     get_extent_t *get_extent,
+                     struct writeback_control *wbc);
+int extent_readpages(struct extent_io_tree *tree,
+                    struct address_space *mapping,
+                    struct list_head *pages, unsigned nr_pages,
+                    get_extent_t get_extent);
+int extent_prepare_write(struct extent_io_tree *tree,
+                        struct inode *inode, struct page *page,
+                        unsigned from, unsigned to, get_extent_t *get_extent);
+int extent_commit_write(struct extent_io_tree *tree,
+                       struct inode *inode, struct page *page,
+                       unsigned from, unsigned to);
+sector_t extent_bmap(struct address_space *mapping, sector_t iblock,
+               get_extent_t *get_extent);
+int set_range_dirty(struct extent_io_tree *tree, u64 start, u64 end);
+int set_state_private(struct extent_io_tree *tree, u64 start, u64 private);
+int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private);
+void set_page_extent_mapped(struct page *page);
+
+struct extent_buffer *alloc_extent_buffer(struct extent_io_tree *tree,
+                                         u64 start, unsigned long len,
+                                         struct page *page0,
+                                         gfp_t mask);
+struct extent_buffer *find_extent_buffer(struct extent_io_tree *tree,
+                                        u64 start, unsigned long len,
+                                         gfp_t mask);
+void free_extent_buffer(struct extent_buffer *eb);
+int read_extent_buffer_pages(struct extent_io_tree *tree,
+                            struct extent_buffer *eb, u64 start, int wait,
+                            get_extent_t *get_extent, int mirror_num);
+
+static inline void extent_buffer_get(struct extent_buffer *eb)
+{
+       atomic_inc(&eb->refs);
+}
+
+int memcmp_extent_buffer(struct extent_buffer *eb, const void *ptrv,
+                         unsigned long start,
+                         unsigned long len);
+void read_extent_buffer(struct extent_buffer *eb, void *dst,
+                       unsigned long start,
+                       unsigned long len);
+void write_extent_buffer(struct extent_buffer *eb, const void *src,
+                        unsigned long start, unsigned long len);
+void copy_extent_buffer(struct extent_buffer *dst, struct extent_buffer *src,
+                       unsigned long dst_offset, unsigned long src_offset,
+                       unsigned long len);
+void memcpy_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
+                          unsigned long src_offset, unsigned long len);
+void memmove_extent_buffer(struct extent_buffer *dst, unsigned long dst_offset,
+                          unsigned long src_offset, unsigned long len);
+void memset_extent_buffer(struct extent_buffer *eb, char c,
+                         unsigned long start, unsigned long len);
+int wait_on_extent_buffer_writeback(struct extent_io_tree *tree,
+                                   struct extent_buffer *eb);
+int wait_on_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end);
+int wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, int bits);
+int clear_extent_buffer_dirty(struct extent_io_tree *tree,
+                             struct extent_buffer *eb);
+int set_extent_buffer_dirty(struct extent_io_tree *tree,
+                            struct extent_buffer *eb);
+int set_extent_buffer_uptodate(struct extent_io_tree *tree,
+                              struct extent_buffer *eb);
+int clear_extent_buffer_uptodate(struct extent_io_tree *tree,
+                               struct extent_buffer *eb);
+int extent_buffer_uptodate(struct extent_io_tree *tree,
+                          struct extent_buffer *eb);
+int map_extent_buffer(struct extent_buffer *eb, unsigned long offset,
+                     unsigned long min_len, char **token, char **map,
+                     unsigned long *map_start,
+                     unsigned long *map_len, int km);
+int map_private_extent_buffer(struct extent_buffer *eb, unsigned long offset,
+                     unsigned long min_len, char **token, char **map,
+                     unsigned long *map_start,
+                     unsigned long *map_len, int km);
+void unmap_extent_buffer(struct extent_buffer *eb, char *token, int km);
+int release_extent_buffer_tail_pages(struct extent_buffer *eb);
+int extent_range_uptodate(struct extent_io_tree *tree,
+                         u64 start, u64 end);
+int extent_clear_unlock_delalloc(struct inode *inode,
+                               struct extent_io_tree *tree,
+                               u64 start, u64 end, struct page *locked_page,
+                               int unlock_page,
+                               int clear_unlock,
+                               int clear_delalloc, int clear_dirty,
+                               int set_writeback,
+                               int end_writeback);
+#endif
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
new file mode 100644 (file)
index 0000000..4a83e33
--- /dev/null
@@ -0,0 +1,351 @@
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/version.h>
+#include <linux/hardirq.h>
+#include "extent_map.h"
+
+/* temporary define until extent_map moves out of btrfs */
+struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
+                                      unsigned long extra_flags,
+                                      void (*ctor)(void *, struct kmem_cache *,
+                                                   unsigned long));
+
+static struct kmem_cache *extent_map_cache;
+
+int __init extent_map_init(void)
+{
+       extent_map_cache = btrfs_cache_create("extent_map",
+                                           sizeof(struct extent_map), 0,
+                                           NULL);
+       if (!extent_map_cache)
+               return -ENOMEM;
+       return 0;
+}
+
+void extent_map_exit(void)
+{
+       if (extent_map_cache)
+               kmem_cache_destroy(extent_map_cache);
+}
+
+/**
+ * extent_map_tree_init - initialize extent map tree
+ * @tree:              tree to initialize
+ * @mask:              flags for memory allocations during tree operations
+ *
+ * Initialize the extent map tree @tree.  Should be called for each new inode
+ * or other user of the extent_map interface.
+ */
+void extent_map_tree_init(struct extent_map_tree *tree, gfp_t mask)
+{
+       tree->map.rb_node = NULL;
+       spin_lock_init(&tree->lock);
+}
+EXPORT_SYMBOL(extent_map_tree_init);
+
+/**
+ * alloc_extent_map - allocate new extent map structure
+ * @mask:      memory allocation flags
+ *
+ * Allocate a new extent_map structure.  The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_extent_map().
+ */
+struct extent_map *alloc_extent_map(gfp_t mask)
+{
+       struct extent_map *em;
+       em = kmem_cache_alloc(extent_map_cache, mask);
+       if (!em || IS_ERR(em))
+               return em;
+       em->in_tree = 0;
+       em->flags = 0;
+       atomic_set(&em->refs, 1);
+       return em;
+}
+EXPORT_SYMBOL(alloc_extent_map);
+
+/**
+ * free_extent_map - drop reference count of an extent_map
+ * @em:                extent map being released
+ *
+ * Drops the reference count on @em by one and frees the structure
+ * if the reference count hits zero.
+ */
+void free_extent_map(struct extent_map *em)
+{
+       if (!em)
+               return;
+       WARN_ON(atomic_read(&em->refs) == 0);
+       if (atomic_dec_and_test(&em->refs)) {
+               WARN_ON(em->in_tree);
+               kmem_cache_free(extent_map_cache, em);
+       }
+}
+EXPORT_SYMBOL(free_extent_map);
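
Together the two functions give the usual refcounted lifetime: the allocation comes back with one reference, and whoever drops the last one frees it. A minimal sketch, with the field values illustrative:

    /* Hypothetical example: build a hole mapping for [start, start + len). */
    static struct extent_map *make_hole_em(u64 start, u64 len)
    {
            struct extent_map *em = alloc_extent_map(GFP_NOFS);

            if (!em)
                    return NULL;
            em->start = start;
            em->len = len;
            em->block_start = EXTENT_MAP_HOLE;
            em->bdev = NULL;
            return em;      /* caller drops it with free_extent_map() */
    }
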
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 offset,
+                                  struct rb_node *node)
+{
+       struct rb_node **p = &root->rb_node;
+       struct rb_node *parent = NULL;
+       struct extent_map *entry;
+
+       while (*p) {
+               parent = *p;
+               entry = rb_entry(parent, struct extent_map, rb_node);
+
+               WARN_ON(!entry->in_tree);
+
+               if (offset < entry->start)
+                       p = &(*p)->rb_left;
+               else if (offset >= extent_map_end(entry))
+                       p = &(*p)->rb_right;
+               else
+                       return parent;
+       }
+
+       entry = rb_entry(node, struct extent_map, rb_node);
+       entry->in_tree = 1;
+       rb_link_node(node, parent, p);
+       rb_insert_color(node, root);
+       return NULL;
+}
+
+/*
+ * search through the tree for an extent_map with a given offset.  If
+ * it can't be found, try to find some neighboring extents
+ */
+static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
+                                    struct rb_node **prev_ret,
+                                    struct rb_node **next_ret)
+{
+       struct rb_node *n = root->rb_node;
+       struct rb_node *prev = NULL;
+       struct rb_node *orig_prev = NULL;
+       struct extent_map *entry;
+       struct extent_map *prev_entry = NULL;
+
+       while (n) {
+               entry = rb_entry(n, struct extent_map, rb_node);
+               prev = n;
+               prev_entry = entry;
+
+               WARN_ON(!entry->in_tree);
+
+               if (offset < entry->start)
+                       n = n->rb_left;
+               else if (offset >= extent_map_end(entry))
+                       n = n->rb_right;
+               else
+                       return n;
+       }
+
+       if (prev_ret) {
+               orig_prev = prev;
+               while (prev && offset >= extent_map_end(prev_entry)) {
+                       prev = rb_next(prev);
+                       prev_entry = rb_entry(prev, struct extent_map, rb_node);
+               }
+               *prev_ret = prev;
+               prev = orig_prev;
+       }
+
+       if (next_ret) {
+               prev_entry = rb_entry(prev, struct extent_map, rb_node);
+               while (prev && offset < prev_entry->start) {
+                       prev = rb_prev(prev);
+                       prev_entry = rb_entry(prev, struct extent_map, rb_node);
+               }
+               *next_ret = prev;
+       }
+       return NULL;
+}
+
+/*
+ * look for an offset in the tree, and if it can't be found, return
+ * the next entry we can find after 'offset'.
+ */
+static inline struct rb_node *tree_search(struct rb_root *root, u64 offset)
+{
+       struct rb_node *prev;
+       struct rb_node *ret;
+       ret = __tree_search(root, offset, &prev, NULL);
+       if (!ret)
+               return prev;
+       return ret;
+}
+
+/* check to see if two extent_map structs are adjacent and safe to merge */
+static int mergable_maps(struct extent_map *prev, struct extent_map *next)
+{
+       if (test_bit(EXTENT_FLAG_PINNED, &prev->flags))
+               return 0;
+
+       /*
+        * don't merge compressed extents, we need to know their
+        * actual size
+        */
+       if (test_bit(EXTENT_FLAG_COMPRESSED, &prev->flags))
+               return 0;
+
+       if (extent_map_end(prev) == next->start &&
+           prev->flags == next->flags &&
+           prev->bdev == next->bdev &&
+           ((next->block_start == EXTENT_MAP_HOLE &&
+             prev->block_start == EXTENT_MAP_HOLE) ||
+            (next->block_start == EXTENT_MAP_INLINE &&
+             prev->block_start == EXTENT_MAP_INLINE) ||
+            (next->block_start == EXTENT_MAP_DELALLOC &&
+             prev->block_start == EXTENT_MAP_DELALLOC) ||
+            (next->block_start < EXTENT_MAP_LAST_BYTE - 1 &&
+             next->block_start == extent_map_block_end(prev)))) {
+               return 1;
+       }
+       return 0;
+}
+
+/**
+ * add_extent_mapping - add new extent map to the extent tree
+ * @tree:      tree to insert new map in
+ * @em:                map to insert
+ *
+ * Insert @em into @tree or perform a simple forward/backward merge with
+ * existing mappings.  The extent_map struct passed in will be inserted
+ * into the tree directly, with an additional reference taken, or a
+ * reference dropped if the merge attempt was successful.
+ */
+int add_extent_mapping(struct extent_map_tree *tree,
+                      struct extent_map *em)
+{
+       int ret = 0;
+       struct extent_map *merge = NULL;
+       struct rb_node *rb;
+       struct extent_map *exist;
+
+       exist = lookup_extent_mapping(tree, em->start, em->len);
+       if (exist) {
+               free_extent_map(exist);
+               ret = -EEXIST;
+               goto out;
+       }
+       assert_spin_locked(&tree->lock);
+       rb = tree_insert(&tree->map, em->start, &em->rb_node);
+       if (rb) {
+               ret = -EEXIST;
+               free_extent_map(merge);
+               goto out;
+       }
+       atomic_inc(&em->refs);
+       if (em->start != 0) {
+               rb = rb_prev(&em->rb_node);
+               if (rb)
+                       merge = rb_entry(rb, struct extent_map, rb_node);
+               if (rb && mergable_maps(merge, em)) {
+                       em->start = merge->start;
+                       em->len += merge->len;
+                       em->block_len += merge->block_len;
+                       em->block_start = merge->block_start;
+                       merge->in_tree = 0;
+                       rb_erase(&merge->rb_node, &tree->map);
+                       free_extent_map(merge);
+               }
+       }
+       rb = rb_next(&em->rb_node);
+       if (rb)
+               merge = rb_entry(rb, struct extent_map, rb_node);
+       if (rb && mergable_maps(em, merge)) {
+               em->len += merge->len;
+               em->block_len += merge->block_len;
+               rb_erase(&merge->rb_node, &tree->map);
+               merge->in_tree = 0;
+               free_extent_map(merge);
+       }
+out:
+       return ret;
+}
+EXPORT_SYMBOL(add_extent_mapping);
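
The assert_spin_locked() above means tree->lock must be held across the call, and -EEXIST covers both the up-front overlap lookup and a raced rb-tree insert. A hypothetical caller:

    /* Hypothetical caller: insert @em, tolerating an existing mapping. */
    static int insert_em(struct extent_map_tree *tree, struct extent_map *em)
    {
            int ret;

            spin_lock(&tree->lock);
            ret = add_extent_mapping(tree, em);     /* takes its own ref */
            spin_unlock(&tree->lock);
            if (ret == -EEXIST)
                    ret = 0;        /* an overlapping mapping already exists */
            return ret;
    }
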
+
+/* simple helper to do math around the end of an extent, handling wrap */
+static u64 range_end(u64 start, u64 len)
+{
+       if (start + len < start)
+               return (u64)-1;
+       return start + len;
+}
+
+/**
+ * lookup_extent_mapping - lookup extent_map
+ * @tree:      tree to lookup in
+ * @start:     byte offset to start the search
+ * @len:       length of the lookup range
+ *
+ * Find and return the first extent_map struct in @tree that intersects the
+ * [start, start + len) range.  There may be additional objects in the tree that
+ * intersect, so check the object returned carefully to make sure that no
+ * additional lookups are needed.
+ */
+struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
+                                        u64 start, u64 len)
+{
+       struct extent_map *em;
+       struct rb_node *rb_node;
+       struct rb_node *prev = NULL;
+       struct rb_node *next = NULL;
+       u64 end = range_end(start, len);
+
+       assert_spin_locked(&tree->lock);
+       rb_node = __tree_search(&tree->map, start, &prev, &next);
+       if (!rb_node && prev) {
+               em = rb_entry(prev, struct extent_map, rb_node);
+               if (end > em->start && start < extent_map_end(em))
+                       goto found;
+       }
+       if (!rb_node && next) {
+               em = rb_entry(next, struct extent_map, rb_node);
+               if (end > em->start && start < extent_map_end(em))
+                       goto found;
+       }
+       if (!rb_node) {
+               em = NULL;
+               goto out;
+       }
+       if (IS_ERR(rb_node)) {
+               em = ERR_CAST(rb_node);
+               goto out;
+       }
+       em = rb_entry(rb_node, struct extent_map, rb_node);
+       if (end > em->start && start < extent_map_end(em))
+               goto found;
+
+       em = NULL;
+       goto out;
+
+found:
+       atomic_inc(&em->refs);
+out:
+       return em;
+}
+EXPORT_SYMBOL(lookup_extent_mapping);
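
As the kernel-doc warns, the returned map may cover only part of the queried range, and the reference taken under the lock must eventually be dropped. A hypothetical lookup under those rules:

    /* Hypothetical example: find a mapping intersecting [start, start + len). */
    static struct extent_map *get_em(struct extent_map_tree *tree,
                                     u64 start, u64 len)
    {
            struct extent_map *em;

            spin_lock(&tree->lock);
            em = lookup_extent_mapping(tree, start, len);
            spin_unlock(&tree->lock);
            /* em, if non-NULL, may cover only part of the range; check
             * em->start and extent_map_end(em) before trusting it, and
             * drop it with free_extent_map() when done.
             */
            return em;
    }
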
+
+/**
+ * remove_extent_mapping - removes an extent_map from the extent tree
+ * @tree:      extent tree to remove from
+ * @em:                extent map being removed
+ *
+ * Removes @em from @tree.  No reference counts are dropped, and no checks
+ * are done to see if the range is in use.
+ */
+int remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em)
+{
+       int ret = 0;
+
+       WARN_ON(test_bit(EXTENT_FLAG_PINNED, &em->flags));
+       assert_spin_locked(&tree->lock);
+       rb_erase(&em->rb_node, &tree->map);
+       em->in_tree = 0;
+       return ret;
+}
+EXPORT_SYMBOL(remove_extent_mapping);
diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
new file mode 100644 (file)
index 0000000..fb6eeef
--- /dev/null
@@ -0,0 +1,62 @@
+#ifndef __EXTENTMAP__
+#define __EXTENTMAP__
+
+#include <linux/rbtree.h>
+
+#define EXTENT_MAP_LAST_BYTE (u64)-4
+#define EXTENT_MAP_HOLE (u64)-3
+#define EXTENT_MAP_INLINE (u64)-2
+#define EXTENT_MAP_DELALLOC (u64)-1
+
+/* bits for the flags field */
+#define EXTENT_FLAG_PINNED 0 /* this entry not yet on disk, don't free it */
+#define EXTENT_FLAG_COMPRESSED 1
+#define EXTENT_FLAG_VACANCY 2 /* no file extent item found */
+#define EXTENT_FLAG_PREALLOC 3 /* pre-allocated extent */
+
+struct extent_map {
+       struct rb_node rb_node;
+
+       /* all of these are in bytes */
+       u64 start;
+       u64 len;
+       u64 orig_start;
+       u64 block_start;
+       u64 block_len;
+       unsigned long flags;
+       struct block_device *bdev;
+       atomic_t refs;
+       int in_tree;
+};
+
+struct extent_map_tree {
+       struct rb_root map;
+       spinlock_t lock;
+};
+
+static inline u64 extent_map_end(struct extent_map *em)
+{
+       if (em->start + em->len < em->start)
+               return (u64)-1;
+       return em->start + em->len;
+}
+
+static inline u64 extent_map_block_end(struct extent_map *em)
+{
+       if (em->block_start + em->block_len < em->block_start)
+               return (u64)-1;
+       return em->block_start + em->block_len;
+}
+
+void extent_map_tree_init(struct extent_map_tree *tree, gfp_t mask);
+struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
+                                        u64 start, u64 len);
+int add_extent_mapping(struct extent_map_tree *tree,
+                      struct extent_map *em);
+int remove_extent_mapping(struct extent_map_tree *tree, struct extent_map *em);
+
+struct extent_map *alloc_extent_map(gfp_t mask);
+void free_extent_map(struct extent_map *em);
+int __init extent_map_init(void);
+void extent_map_exit(void);
+#endif
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
new file mode 100644 (file)
index 0000000..9646524
--- /dev/null
@@ -0,0 +1,831 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/bio.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "print-tree.h"
+
+#define MAX_CSUM_ITEMS(r, size) ((((BTRFS_LEAF_DATA_SIZE(r) - \
+                                  sizeof(struct btrfs_item) * 2) / \
+                                 size) - 1))
+
+#define MAX_ORDERED_SUM_BYTES(r) ((PAGE_SIZE - \
+                                  sizeof(struct btrfs_ordered_sum)) / \
+                                  sizeof(struct btrfs_sector_sum) * \
+                                  (r)->sectorsize - (r)->sectorsize)
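
Neither macro is commented in the patch. Roughly, MAX_CSUM_ITEMS() is how many checksums fit in one leaf item with room left over for two item headers, and MAX_ORDERED_SUM_BYTES() is how many bytes of file data a single page-sized btrfs_ordered_sum allocation can describe, at one btrfs_sector_sum per sector. A hedged back-of-the-envelope, with the struct sizes approximate:

    /* Illustrative only: with 4K pages, 4K sectors and 4-byte crc32c
     * checksums, one btrfs_ordered_sum allocation holds a few hundred
     * btrfs_sector_sum entries, so MAX_ORDERED_SUM_BYTES() comes out
     * on the order of a megabyte of file data per allocation.
     */
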
+
+int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            u64 objectid, u64 pos,
+                            u64 disk_offset, u64 disk_num_bytes,
+                            u64 num_bytes, u64 offset, u64 ram_bytes,
+                            u8 compression, u8 encryption, u16 other_encoding)
+{
+       int ret = 0;
+       struct btrfs_file_extent_item *item;
+       struct btrfs_key file_key;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       file_key.objectid = objectid;
+       file_key.offset = pos;
+       btrfs_set_key_type(&file_key, BTRFS_EXTENT_DATA_KEY);
+
+       ret = btrfs_insert_empty_item(trans, root, path, &file_key,
+                                     sizeof(*item));
+       if (ret < 0)
+               goto out;
+       BUG_ON(ret);
+       leaf = path->nodes[0];
+       item = btrfs_item_ptr(leaf, path->slots[0],
+                             struct btrfs_file_extent_item);
+       btrfs_set_file_extent_disk_bytenr(leaf, item, disk_offset);
+       btrfs_set_file_extent_disk_num_bytes(leaf, item, disk_num_bytes);
+       btrfs_set_file_extent_offset(leaf, item, offset);
+       btrfs_set_file_extent_num_bytes(leaf, item, num_bytes);
+       btrfs_set_file_extent_ram_bytes(leaf, item, ram_bytes);
+       btrfs_set_file_extent_generation(leaf, item, trans->transid);
+       btrfs_set_file_extent_type(leaf, item, BTRFS_FILE_EXTENT_REG);
+       btrfs_set_file_extent_compression(leaf, item, compression);
+       btrfs_set_file_extent_encryption(leaf, item, encryption);
+       btrfs_set_file_extent_other_encoding(leaf, item, other_encoding);
+
+       btrfs_mark_buffer_dirty(leaf);
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+struct btrfs_csum_item *btrfs_lookup_csum(struct btrfs_trans_handle *trans,
+                                         struct btrfs_root *root,
+                                         struct btrfs_path *path,
+                                         u64 bytenr, int cow)
+{
+       int ret;
+       struct btrfs_key file_key;
+       struct btrfs_key found_key;
+       struct btrfs_csum_item *item;
+       struct extent_buffer *leaf;
+       u64 csum_offset = 0;
+       u16 csum_size =
+               btrfs_super_csum_size(&root->fs_info->super_copy);
+       int csums_in_item;
+
+       file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+       file_key.offset = bytenr;
+       btrfs_set_key_type(&file_key, BTRFS_EXTENT_CSUM_KEY);
+       ret = btrfs_search_slot(trans, root, &file_key, path, 0, cow);
+       if (ret < 0)
+               goto fail;
+       leaf = path->nodes[0];
+       if (ret > 0) {
+               ret = 1;
+               if (path->slots[0] == 0)
+                       goto fail;
+               path->slots[0]--;
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               if (btrfs_key_type(&found_key) != BTRFS_EXTENT_CSUM_KEY)
+                       goto fail;
+
+               csum_offset = (bytenr - found_key.offset) >>
+                               root->fs_info->sb->s_blocksize_bits;
+               csums_in_item = btrfs_item_size_nr(leaf, path->slots[0]);
+               csums_in_item /= csum_size;
+
+               if (csum_offset >= csums_in_item) {
+                       ret = -EFBIG;
+                       goto fail;
+               }
+       }
+       item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_csum_item);
+       item = (struct btrfs_csum_item *)((unsigned char *)item +
+                                         csum_offset * csum_size);
+       return item;
+fail:
+       if (ret > 0)
+               ret = -ENOENT;
+       return ERR_PTR(ret);
+}
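
The lookup above leans on checksum items being keyed by disk bytenr: when the search lands past the last matching key, the code steps back one slot and checks whether the requested bytenr still indexes inside that item. A worked example with illustrative numbers:

    /* Illustrative: with 4K blocks and 4-byte csums, an item keyed at
     * offset 1M whose payload holds 64 checksums covers disk bytes
     * [1M, 1M + 256K).  Looking up bytenr 1M + 8K gives
     * csum_offset == 2 and returns a pointer 8 bytes into the item's
     * checksum array.
     */
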
+
+int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            struct btrfs_path *path, u64 objectid,
+                            u64 offset, int mod)
+{
+       int ret;
+       struct btrfs_key file_key;
+       int ins_len = mod < 0 ? -1 : 0;
+       int cow = mod != 0;
+
+       file_key.objectid = objectid;
+       file_key.offset = offset;
+       btrfs_set_key_type(&file_key, BTRFS_EXTENT_DATA_KEY);
+       ret = btrfs_search_slot(trans, root, &file_key, path, ins_len, cow);
+       return ret;
+}
+
+int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
+                         struct bio *bio, u32 *dst)
+{
+       u32 sum;
+       struct bio_vec *bvec = bio->bi_io_vec;
+       int bio_index = 0;
+       u64 offset;
+       u64 item_start_offset = 0;
+       u64 item_last_offset = 0;
+       u64 disk_bytenr;
+       u32 diff;
+       u16 csum_size =
+               btrfs_super_csum_size(&root->fs_info->super_copy);
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_csum_item *item = NULL;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       if (bio->bi_size > PAGE_CACHE_SIZE * 8)
+               path->reada = 2;
+
+       WARN_ON(bio->bi_vcnt <= 0);
+
+       disk_bytenr = (u64)bio->bi_sector << 9;
+       while (bio_index < bio->bi_vcnt) {
+               offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+               ret = btrfs_find_ordered_sum(inode, offset, disk_bytenr, &sum);
+               if (ret == 0)
+                       goto found;
+
+               if (!item || disk_bytenr < item_start_offset ||
+                   disk_bytenr >= item_last_offset) {
+                       struct btrfs_key found_key;
+                       u32 item_size;
+
+                       if (item)
+                               btrfs_release_path(root, path);
+                       item = btrfs_lookup_csum(NULL, root->fs_info->csum_root,
+                                                path, disk_bytenr, 0);
+                       if (IS_ERR(item)) {
+                               ret = PTR_ERR(item);
+                               if (ret == -ENOENT || ret == -EFBIG)
+                                       ret = 0;
+                               sum = 0;
+                               if (BTRFS_I(inode)->root->root_key.objectid ==
+                                   BTRFS_DATA_RELOC_TREE_OBJECTID) {
+                                       set_extent_bits(io_tree, offset,
+                                               offset + bvec->bv_len - 1,
+                                               EXTENT_NODATASUM, GFP_NOFS);
+                               } else {
+                                       printk(KERN_INFO "btrfs no csum found "
+                                              "for inode %lu start %llu\n",
+                                              inode->i_ino,
+                                              (unsigned long long)offset);
+                               }
+                               item = NULL;
+                               btrfs_release_path(root, path);
+                               goto found;
+                       }
+                       btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                             path->slots[0]);
+
+                       item_start_offset = found_key.offset;
+                       item_size = btrfs_item_size_nr(path->nodes[0],
+                                                      path->slots[0]);
+                       item_last_offset = item_start_offset +
+                               (item_size / csum_size) *
+                               root->sectorsize;
+                       item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                             struct btrfs_csum_item);
+               }
+               /*
+                * this byte range must be able to fit inside
+                * a single leaf so it will also fit inside a u32
+                */
+               diff = disk_bytenr - item_start_offset;
+               diff = diff / root->sectorsize;
+               diff = diff * csum_size;
+
+               read_extent_buffer(path->nodes[0], &sum,
+                                  ((unsigned long)item) + diff,
+                                  csum_size);
+found:
+               if (dst)
+                       *dst++ = sum;
+               else
+                       set_state_private(io_tree, offset, sum);
+               disk_bytenr += bvec->bv_len;
+               bio_index++;
+               bvec++;
+       }
+       btrfs_free_path(path);
+       return 0;
+}
+
+int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
+                            struct list_head *list)
+{
+       struct btrfs_key key;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_ordered_sum *sums;
+       struct btrfs_sector_sum *sector_sum;
+       struct btrfs_csum_item *item;
+       unsigned long offset;
+       int ret;
+       size_t size;
+       u64 csum_end;
+       u16 csum_size = btrfs_super_csum_size(&root->fs_info->super_copy);
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+       key.offset = start;
+       key.type = BTRFS_EXTENT_CSUM_KEY;
+
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto fail;
+       if (ret > 0 && path->slots[0] > 0) {
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &key, path->slots[0] - 1);
+               if (key.objectid == BTRFS_EXTENT_CSUM_OBJECTID &&
+                   key.type == BTRFS_EXTENT_CSUM_KEY) {
+                       offset = (start - key.offset) >>
+                                root->fs_info->sb->s_blocksize_bits;
+                       if (offset * csum_size <
+                           btrfs_item_size_nr(leaf, path->slots[0] - 1))
+                               path->slots[0]--;
+               }
+       }
+
+       while (start <= end) {
+               leaf = path->nodes[0];
+               if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret < 0)
+                               goto fail;
+                       if (ret > 0)
+                               break;
+                       leaf = path->nodes[0];
+               }
+
+               btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+               if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+                   key.type != BTRFS_EXTENT_CSUM_KEY)
+                       break;
+
+               if (key.offset > end)
+                       break;
+
+               if (key.offset > start)
+                       start = key.offset;
+
+               size = btrfs_item_size_nr(leaf, path->slots[0]);
+               csum_end = key.offset + (size / csum_size) * root->sectorsize;
+               if (csum_end <= start) {
+                       path->slots[0]++;
+                       continue;
+               }
+
+               csum_end = min(csum_end, end + 1);
+               item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                     struct btrfs_csum_item);
+               while (start < csum_end) {
+                       size = min_t(size_t, csum_end - start,
+                                       MAX_ORDERED_SUM_BYTES(root));
+                       sums = kzalloc(btrfs_ordered_sum_size(root, size),
+                                       GFP_NOFS);
+                       BUG_ON(!sums);
+
+                       sector_sum = sums->sums;
+                       sums->bytenr = start;
+                       sums->len = size;
+
+                       offset = (start - key.offset) >>
+                               root->fs_info->sb->s_blocksize_bits;
+                       offset *= csum_size;
+
+                       while (size > 0) {
+                               read_extent_buffer(path->nodes[0],
+                                               &sector_sum->sum,
+                                               ((unsigned long)item) +
+                                               offset, csum_size);
+                               sector_sum->bytenr = start;
+
+                               size -= root->sectorsize;
+                               start += root->sectorsize;
+                               offset += csum_size;
+                               sector_sum++;
+                       }
+                       list_add_tail(&sums->list, list);
+               }
+               path->slots[0]++;
+       }
+       ret = 0;
+fail:
+       btrfs_free_path(path);
+       return ret;
+}
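+
+/*
+ * Worked example for the item math above (illustrative; assumes 4K
+ * sectors and 4-byte crc32c checksums):
+ *
+ *   key.offset = 1MiB, item size = 256 bytes
+ *   csum_end = 1MiB + (256 / 4) * 4096 = 1MiB + 256KiB
+ *   start = 1MiB + 3 * 4096
+ *   offset into item = ((start - key.offset) >> 12) * 4 = 12 bytes
+ *
+ * i.e. one csum item covers a contiguous run of sectors, and both the
+ * item's end and the per-sector offsets fall out of simple division.
+ */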
+
+int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
+                      struct bio *bio, u64 file_start, int contig)
+{
+       struct btrfs_ordered_sum *sums;
+       struct btrfs_sector_sum *sector_sum;
+       struct btrfs_ordered_extent *ordered;
+       char *data;
+       struct bio_vec *bvec = bio->bi_io_vec;
+       int bio_index = 0;
+       unsigned long total_bytes = 0;
+       unsigned long this_sum_bytes = 0;
+       u64 offset;
+       u64 disk_bytenr;
+
+       WARN_ON(bio->bi_vcnt <= 0);
+       sums = kzalloc(btrfs_ordered_sum_size(root, bio->bi_size), GFP_NOFS);
+       if (!sums)
+               return -ENOMEM;
+
+       sector_sum = sums->sums;
+       disk_bytenr = (u64)bio->bi_sector << 9;
+       sums->len = bio->bi_size;
+       INIT_LIST_HEAD(&sums->list);
+
+       if (contig)
+               offset = file_start;
+       else
+               offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+
+       ordered = btrfs_lookup_ordered_extent(inode, offset);
+       BUG_ON(!ordered);
+       sums->bytenr = ordered->start;
+
+       while (bio_index < bio->bi_vcnt) {
+               if (!contig)
+                       offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+
+               if (!contig && (offset >= ordered->file_offset + ordered->len ||
+                   offset < ordered->file_offset)) {
+                       unsigned long bytes_left;
+                       sums->len = this_sum_bytes;
+                       this_sum_bytes = 0;
+                       btrfs_add_ordered_sum(inode, ordered, sums);
+                       btrfs_put_ordered_extent(ordered);
+
+                       bytes_left = bio->bi_size - total_bytes;
+
+                       sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
+                                      GFP_NOFS);
+                       BUG_ON(!sums);
+                       sector_sum = sums->sums;
+                       sums->len = bytes_left;
+                       ordered = btrfs_lookup_ordered_extent(inode, offset);
+                       BUG_ON(!ordered);
+                       sums->bytenr = ordered->start;
+               }
+
+               data = kmap_atomic(bvec->bv_page, KM_USER0);
+               sector_sum->sum = ~(u32)0;
+               sector_sum->sum = btrfs_csum_data(root,
+                                                 data + bvec->bv_offset,
+                                                 sector_sum->sum,
+                                                 bvec->bv_len);
+               kunmap_atomic(data, KM_USER0);
+               btrfs_csum_final(sector_sum->sum,
+                                (char *)&sector_sum->sum);
+               sector_sum->bytenr = disk_bytenr;
+
+               sector_sum++;
+               bio_index++;
+               total_bytes += bvec->bv_len;
+               this_sum_bytes += bvec->bv_len;
+               disk_bytenr += bvec->bv_len;
+               offset += bvec->bv_len;
+               bvec++;
+       }
+       this_sum_bytes = 0;
+       btrfs_add_ordered_sum(inode, ordered, sums);
+       btrfs_put_ordered_extent(ordered);
+       return 0;
+}
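+
+/*
+ * Sketch of what btrfs_csum_one_bio produces (illustrative; numbers
+ * assume 4K pages/sectors): a 16K bio starting at disk byte 0x100000
+ * with four single-page bio_vecs yields one btrfs_ordered_sum holding
+ * four sector_sums, with bytenr 0x100000, 0x101000, 0x102000 and
+ * 0x103000.  Each sum is seeded with ~(u32)0 before btrfs_csum_data
+ * and folded by btrfs_csum_final, the usual crc32c convention.
+ */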
+
+/*
+ * helper function for csum removal; it expects the
+ * key to describe the csum pointed to by the path, and it expects
+ * the csum to overlap the range [bytenr, bytenr + len)
+ *
+ * The csum should not be entirely contained in the range and the
+ * range should not be entirely contained in the csum.
+ *
+ * This calls btrfs_truncate_item with the correct args based on the
+ * overlap, and fixes up the key as required.
+ */
+static noinline int truncate_one_csum(struct btrfs_trans_handle *trans,
+                                     struct btrfs_root *root,
+                                     struct btrfs_path *path,
+                                     struct btrfs_key *key,
+                                     u64 bytenr, u64 len)
+{
+       struct extent_buffer *leaf;
+       u16 csum_size =
+               btrfs_super_csum_size(&root->fs_info->super_copy);
+       u64 csum_end;
+       u64 end_byte = bytenr + len;
+       u32 blocksize_bits = root->fs_info->sb->s_blocksize_bits;
+       int ret;
+
+       leaf = path->nodes[0];
+       csum_end = btrfs_item_size_nr(leaf, path->slots[0]) / csum_size;
+       csum_end <<= root->fs_info->sb->s_blocksize_bits;
+       csum_end += key->offset;
+
+       if (key->offset < bytenr && csum_end <= end_byte) {
+               /*
+                *         [ bytenr - len ]
+                *         [   ]
+                *   [csum     ]
+                *   A simple truncate off the end of the item
+                */
+               u32 new_size = (bytenr - key->offset) >> blocksize_bits;
+               new_size *= csum_size;
+               ret = btrfs_truncate_item(trans, root, path, new_size, 1);
+               BUG_ON(ret);
+       } else if (key->offset >= bytenr && csum_end > end_byte &&
+                  end_byte > key->offset) {
+               /*
+                *         [ bytenr - len ]
+                *                 [ ]
+                *                 [csum     ]
+                * we need to truncate from the beginning of the csum
+                */
+               u32 new_size = (csum_end - end_byte) >> blocksize_bits;
+               new_size *= csum_size;
+
+               ret = btrfs_truncate_item(trans, root, path, new_size, 0);
+               BUG_ON(ret);
+
+               key->offset = end_byte;
+               ret = btrfs_set_item_key_safe(trans, root, path, key);
+               BUG_ON(ret);
+       } else {
+               BUG();
+       }
+       return 0;
+}
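+
+/*
+ * Worked examples for the two cases above (illustrative; 4K sectors,
+ * 4-byte checksums, so a 64-byte item covers 64K of data):
+ *
+ *   item at key->offset 0 covering [0, 64K), remove [48K, 80K):
+ *     first case, new_size = ((48K - 0) >> 12) * 4 = 48 bytes,
+ *     truncated from the tail.
+ *
+ *   item at key->offset 64K covering [64K, 128K), remove [48K, 80K):
+ *     second case, new_size = ((128K - 80K) >> 12) * 4 = 48 bytes,
+ *     truncated from the head, with key->offset moved up to 80K.
+ */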
+
+/*
+ * deletes the csum items from the csum tree for a given
+ * range of bytes.
+ */
+int btrfs_del_csums(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, u64 bytenr, u64 len)
+{
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       u64 end_byte = bytenr + len;
+       u64 csum_end;
+       struct extent_buffer *leaf;
+       int ret;
+       u16 csum_size =
+               btrfs_super_csum_size(&root->fs_info->super_copy);
+       int blocksize_bits = root->fs_info->sb->s_blocksize_bits;
+
+       root = root->fs_info->csum_root;
+
+       path = btrfs_alloc_path();
+
+       while (1) {
+               key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+               key.offset = end_byte - 1;
+               key.type = BTRFS_EXTENT_CSUM_KEY;
+
+               ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+               if (ret > 0) {
+                       if (path->slots[0] == 0)
+                               goto out;
+                       path->slots[0]--;
+               }
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+
+               if (key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+                   key.type != BTRFS_EXTENT_CSUM_KEY) {
+                       break;
+               }
+
+               if (key.offset >= end_byte)
+                       break;
+
+               csum_end = btrfs_item_size_nr(leaf, path->slots[0]) / csum_size;
+               csum_end <<= blocksize_bits;
+               csum_end += key.offset;
+
+               /* this csum ends before we start, we're done */
+               if (csum_end <= bytenr)
+                       break;
+
+               /* delete the entire item, it is inside our range */
+               if (key.offset >= bytenr && csum_end <= end_byte) {
+                       ret = btrfs_del_item(trans, root, path);
+                       BUG_ON(ret);
+                       if (key.offset == bytenr)
+                               break;
+               } else if (key.offset < bytenr && csum_end > end_byte) {
+                       unsigned long offset;
+                       unsigned long shift_len;
+                       unsigned long item_offset;
+                       /*
+                        *        [ bytenr - len ]
+                        *     [csum                ]
+                        *
+                        * Our bytes are in the middle of the csum,
+                        * we need to split this item and insert a new one.
+                        *
+                        * But we can't drop the path because the
+                        * csum could change, get removed, extended etc.
+                        *
+                        * The trick here is the max size of a csum item leaves
+                        * enough room in the tree block for a single
+                        * item header.  So, we split the item in place,
+                        * adding a new header pointing to the existing
+                        * bytes.  Then we loop around again and we have
+                        * a nicely formed csum item that we can neatly
+                        * truncate.
+                        */
+                       offset = (bytenr - key.offset) >> blocksize_bits;
+                       offset *= csum_size;
+
+                       shift_len = (len >> blocksize_bits) * csum_size;
+
+                       item_offset = btrfs_item_ptr_offset(leaf,
+                                                           path->slots[0]);
+
+                       memset_extent_buffer(leaf, 0, item_offset + offset,
+                                            shift_len);
+                       key.offset = bytenr;
+
+                       /*
+                        * btrfs_split_item returns -EAGAIN when the
+                        * item changed size or key
+                        */
+                       ret = btrfs_split_item(trans, root, path, &key, offset);
+                       BUG_ON(ret && ret != -EAGAIN);
+
+                       key.offset = end_byte - 1;
+               } else {
+                       ret = truncate_one_csum(trans, root, path,
+                                               &key, bytenr, len);
+                       BUG_ON(ret);
+                       if (key.offset < bytenr)
+                               break;
+               }
+               btrfs_release_path(root, path);
+       }
+out:
+       btrfs_free_path(path);
+       return 0;
+}
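+
+/*
+ * Worked example of the split-in-place trick above (illustrative; 4K
+ * sectors, 4-byte checksums): an item covering [0, 64K) with the range
+ * [16K, 32K) deleted from its middle.  offset = (16K >> 12) * 4 = 16
+ * bytes and shift_len = (16K >> 12) * 4 = 16 bytes, so 16 bytes of
+ * checksums are zeroed in the middle of the item, the item is split at
+ * that offset, and the next pass of the loop truncates each half via
+ * truncate_one_csum.
+ */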
+
+int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          struct btrfs_ordered_sum *sums)
+{
+       u64 bytenr;
+       int ret;
+       struct btrfs_key file_key;
+       struct btrfs_key found_key;
+       u64 next_offset;
+       u64 total_bytes = 0;
+       int found_next;
+       struct btrfs_path *path;
+       struct btrfs_csum_item *item;
+       struct btrfs_csum_item *item_end;
+       struct extent_buffer *leaf = NULL;
+       u64 csum_offset;
+       struct btrfs_sector_sum *sector_sum;
+       u32 nritems;
+       u32 ins_size;
+       char *eb_map;
+       char *eb_token;
+       unsigned long map_len;
+       unsigned long map_start;
+       u16 csum_size =
+               btrfs_super_csum_size(&root->fs_info->super_copy);
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       sector_sum = sums->sums;
+again:
+       next_offset = (u64)-1;
+       found_next = 0;
+       file_key.objectid = BTRFS_EXTENT_CSUM_OBJECTID;
+       file_key.offset = sector_sum->bytenr;
+       bytenr = sector_sum->bytenr;
+       btrfs_set_key_type(&file_key, BTRFS_EXTENT_CSUM_KEY);
+
+       item = btrfs_lookup_csum(trans, root, path, sector_sum->bytenr, 1);
+       if (!IS_ERR(item)) {
+               leaf = path->nodes[0];
+               ret = 0;
+               goto found;
+       }
+       ret = PTR_ERR(item);
+       if (ret == -EFBIG) {
+               u32 item_size;
+               /* we found one, but it isn't big enough yet */
+               leaf = path->nodes[0];
+               item_size = btrfs_item_size_nr(leaf, path->slots[0]);
+               if ((item_size / csum_size) >=
+                   MAX_CSUM_ITEMS(root, csum_size)) {
+                       /* already at max size, make a new one */
+                       goto insert;
+               }
+       } else {
+               int slot = path->slots[0] + 1;
+               /* we didn't find a csum item, insert one */
+               nritems = btrfs_header_nritems(path->nodes[0]);
+               if (path->slots[0] >= nritems - 1) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret == 1)
+                               found_next = 1;
+                       if (ret != 0)
+                               goto insert;
+                       slot = 0;
+               }
+               btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot);
+               if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+                   found_key.type != BTRFS_EXTENT_CSUM_KEY) {
+                       found_next = 1;
+                       goto insert;
+               }
+               next_offset = found_key.offset;
+               found_next = 1;
+               goto insert;
+       }
+
+       /*
+        * at this point, we know the tree has an item, but it isn't big
+        * enough yet to put our csum in.  Grow it
+        */
+       btrfs_release_path(root, path);
+       ret = btrfs_search_slot(trans, root, &file_key, path,
+                               csum_size, 1);
+       if (ret < 0)
+               goto fail_unlock;
+
+       if (ret > 0) {
+               if (path->slots[0] == 0)
+                       goto insert;
+               path->slots[0]--;
+       }
+
+       leaf = path->nodes[0];
+       btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+       csum_offset = (bytenr - found_key.offset) >>
+                       root->fs_info->sb->s_blocksize_bits;
+
+       if (btrfs_key_type(&found_key) != BTRFS_EXTENT_CSUM_KEY ||
+           found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID ||
+           csum_offset >= MAX_CSUM_ITEMS(root, csum_size)) {
+               goto insert;
+       }
+
+       if (csum_offset >= btrfs_item_size_nr(leaf, path->slots[0]) /
+           csum_size) {
+               u32 diff = (csum_offset + 1) * csum_size;
+
+               /*
+                * is the item big enough already?  we dropped our lock
+                * before and need to recheck
+                */
+               if (diff < btrfs_item_size_nr(leaf, path->slots[0]))
+                       goto csum;
+
+               diff = diff - btrfs_item_size_nr(leaf, path->slots[0]);
+               if (diff != csum_size)
+                       goto insert;
+
+               ret = btrfs_extend_item(trans, root, path, diff);
+               BUG_ON(ret);
+               goto csum;
+       }
+
+insert:
+       btrfs_release_path(root, path);
+       csum_offset = 0;
+       if (found_next) {
+               u64 tmp = total_bytes + root->sectorsize;
+               u64 next_sector = sector_sum->bytenr;
+               struct btrfs_sector_sum *next = sector_sum + 1;
+
+               while (tmp < sums->len) {
+                       if (next_sector + root->sectorsize != next->bytenr)
+                               break;
+                       tmp += root->sectorsize;
+                       next_sector = next->bytenr;
+                       next++;
+               }
+               tmp = min(tmp, next_offset - file_key.offset);
+               tmp >>= root->fs_info->sb->s_blocksize_bits;
+               tmp = max((u64)1, tmp);
+               tmp = min(tmp, (u64)MAX_CSUM_ITEMS(root, csum_size));
+               ins_size = csum_size * tmp;
+       } else {
+               ins_size = csum_size;
+       }
+       ret = btrfs_insert_empty_item(trans, root, path, &file_key,
+                                     ins_size);
+       if (ret < 0)
+               goto fail_unlock;
+       if (ret != 0) {
+               WARN_ON(1);
+               goto fail_unlock;
+       }
+csum:
+       leaf = path->nodes[0];
+       item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_csum_item);
+       ret = 0;
+       item = (struct btrfs_csum_item *)((unsigned char *)item +
+                                         csum_offset * csum_size);
+found:
+       item_end = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_csum_item);
+       item_end = (struct btrfs_csum_item *)((unsigned char *)item_end +
+                                     btrfs_item_size_nr(leaf, path->slots[0]));
+       eb_token = NULL;
+       cond_resched();
+next_sector:
+
+       if (!eb_token ||
+          (unsigned long)item + csum_size >= map_start + map_len) {
+               int err;
+
+               if (eb_token)
+                       unmap_extent_buffer(leaf, eb_token, KM_USER1);
+               eb_token = NULL;
+               err = map_private_extent_buffer(leaf, (unsigned long)item,
+                                               csum_size,
+                                               &eb_token, &eb_map,
+                                               &map_start, &map_len, KM_USER1);
+               if (err)
+                       eb_token = NULL;
+       }
+       if (eb_token) {
+               memcpy(eb_token + ((unsigned long)item & (PAGE_CACHE_SIZE - 1)),
+                      &sector_sum->sum, csum_size);
+       } else {
+               write_extent_buffer(leaf, &sector_sum->sum,
+                                   (unsigned long)item, csum_size);
+       }
+
+       total_bytes += root->sectorsize;
+       sector_sum++;
+       if (total_bytes < sums->len) {
+               item = (struct btrfs_csum_item *)((char *)item +
+                                                 csum_size);
+               if (item < item_end && bytenr + PAGE_CACHE_SIZE ==
+                   sector_sum->bytenr) {
+                       bytenr = sector_sum->bytenr;
+                       goto next_sector;
+               }
+       }
+       if (eb_token) {
+               unmap_extent_buffer(leaf, eb_token, KM_USER1);
+               eb_token = NULL;
+       }
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+       cond_resched();
+       if (total_bytes < sums->len) {
+               btrfs_release_path(root, path);
+               goto again;
+       }
+out:
+       btrfs_free_path(path);
+       return ret;
+
+fail_unlock:
+       goto out;
+}
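+
+/*
+ * A worked example of the extend-by-one logic above (illustrative;
+ * 4-byte checksums): if the existing item holds 10 checksums (40
+ * bytes) and csum_offset lands on slot 10, diff = (10 + 1) * 4 = 44.
+ * 44 - 40 == csum_size, so the item is grown by exactly one slot with
+ * btrfs_extend_item; any other gap means the new csum isn't directly
+ * adjacent and a fresh item is inserted instead.
+ */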
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
new file mode 100644
index 0000000..9026833
--- /dev/null
+++ b/fs/btrfs/file.c
@@ -0,0 +1,1288 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
+#include <linux/mpage.h>
+#include <linux/swap.h>
+#include <linux/writeback.h>
+#include <linux/statfs.h>
+#include <linux/compat.h>
+#include <linux/version.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "btrfs_inode.h"
+#include "ioctl.h"
+#include "print-tree.h"
+#include "tree-log.h"
+#include "locking.h"
+#include "compat.h"
+
+
+/* simple helper to fault in pages and copy.  This should go away
+ * and be replaced with calls into generic code.
+ */
+static noinline int btrfs_copy_from_user(loff_t pos, int num_pages,
+                                        int write_bytes,
+                                        struct page **prepared_pages,
+                                        const char __user *buf)
+{
+       long page_fault = 0;
+       int i;
+       int offset = pos & (PAGE_CACHE_SIZE - 1);
+
+       for (i = 0; i < num_pages && write_bytes > 0; i++, offset = 0) {
+               size_t count = min_t(size_t,
+                                    PAGE_CACHE_SIZE - offset, write_bytes);
+               struct page *page = prepared_pages[i];
+               fault_in_pages_readable(buf, count);
+
+               /* Copy data from userspace to the current page */
+               kmap(page);
+               page_fault = __copy_from_user(page_address(page) + offset,
+                                             buf, count);
+               /* Flush processor's dcache for this page */
+               flush_dcache_page(page);
+               kunmap(page);
+               buf += count;
+               write_bytes -= count;
+
+               if (page_fault)
+                       break;
+       }
+       return page_fault ? -EFAULT : 0;
+}
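+
+/*
+ * Worked example of the offset handling above (illustrative; 4K
+ * pages): for pos = 0x1234 the first page is written starting at
+ * offset 0x234, so count = min(4096 - 0x234, write_bytes); every
+ * following iteration resets offset to 0 and copies from the start
+ * of the page.
+ */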
+
+/*
+ * unlocks pages after btrfs_file_write is done with them
+ */
+static noinline void btrfs_drop_pages(struct page **pages, size_t num_pages)
+{
+       size_t i;
+       for (i = 0; i < num_pages; i++) {
+               if (!pages[i])
+                       break;
+               /* PageChecked is some magic around finding pages that
+                * have been modified without going through
+                * btrfs_set_page_dirty; clear it here.
+                */
+               ClearPageChecked(pages[i]);
+               unlock_page(pages[i]);
+               mark_page_accessed(pages[i]);
+               page_cache_release(pages[i]);
+       }
+}
+
+/*
+ * after copy_from_user, pages need to be dirtied and we need to make
+ * sure holes are created between the current EOF and the start of
+ * any next extents (if required).
+ *
+ * this also makes the decision about creating an inline extent vs
+ * doing real data extents, marking pages dirty and delalloc as required.
+ */
+static noinline int dirty_and_release_pages(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root,
+                                  struct file *file,
+                                  struct page **pages,
+                                  size_t num_pages,
+                                  loff_t pos,
+                                  size_t write_bytes)
+{
+       int err = 0;
+       int i;
+       struct inode *inode = fdentry(file)->d_inode;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       u64 hint_byte;
+       u64 num_bytes;
+       u64 start_pos;
+       u64 end_of_last_block;
+       u64 end_pos = pos + write_bytes;
+       loff_t isize = i_size_read(inode);
+
+       start_pos = pos & ~((u64)root->sectorsize - 1);
+       num_bytes = (write_bytes + pos - start_pos +
+                   root->sectorsize - 1) & ~((u64)root->sectorsize - 1);
+
+       end_of_last_block = start_pos + num_bytes - 1;
+
+       lock_extent(io_tree, start_pos, end_of_last_block, GFP_NOFS);
+       trans = btrfs_join_transaction(root, 1);
+       if (!trans) {
+               err = -ENOMEM;
+               goto out_unlock;
+       }
+       btrfs_set_trans_block_group(trans, inode);
+       hint_byte = 0;
+
+       set_extent_uptodate(io_tree, start_pos, end_of_last_block, GFP_NOFS);
+
+       /* check for reserved extents on each page, we don't want
+        * to reset the delalloc bit on things that already have
+        * extents reserved.
+        */
+       btrfs_set_extent_delalloc(inode, start_pos, end_of_last_block);
+       for (i = 0; i < num_pages; i++) {
+               struct page *p = pages[i];
+               SetPageUptodate(p);
+               ClearPageChecked(p);
+               set_page_dirty(p);
+       }
+       if (end_pos > isize) {
+               i_size_write(inode, end_pos);
+               btrfs_update_inode(trans, root, inode);
+       }
+       err = btrfs_end_transaction(trans, root);
+out_unlock:
+       unlock_extent(io_tree, start_pos, end_of_last_block, GFP_NOFS);
+       return err;
+}
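+
+/*
+ * Worked example of the rounding math above (illustrative; 4K
+ * sectors): pos = 5000, write_bytes = 100.
+ *   start_pos = 5000 & ~4095 = 4096
+ *   num_bytes = (100 + 5000 - 4096 + 4095) & ~4095 = 4096
+ *   end_of_last_block = 4096 + 4096 - 1 = 8191
+ * so the write is expanded to whole sectors before the extent state
+ * is locked and marked delalloc.
+ */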
+
+/*
+ * this drops all the extents in the cache that intersect the range
+ * [start, end].  Existing extents are split as required.
+ */
+int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end,
+                           int skip_pinned)
+{
+       struct extent_map *em;
+       struct extent_map *split = NULL;
+       struct extent_map *split2 = NULL;
+       struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+       u64 len = end - start + 1;
+       int ret;
+       int testend = 1;
+       unsigned long flags;
+       int compressed = 0;
+
+       WARN_ON(end < start);
+       if (end == (u64)-1) {
+               len = (u64)-1;
+               testend = 0;
+       }
+       while (1) {
+               if (!split)
+                       split = alloc_extent_map(GFP_NOFS);
+               if (!split2)
+                       split2 = alloc_extent_map(GFP_NOFS);
+
+               spin_lock(&em_tree->lock);
+               em = lookup_extent_mapping(em_tree, start, len);
+               if (!em) {
+                       spin_unlock(&em_tree->lock);
+                       break;
+               }
+               flags = em->flags;
+               if (skip_pinned && test_bit(EXTENT_FLAG_PINNED, &em->flags)) {
+                       spin_unlock(&em_tree->lock);
+                       if (em->start <= start &&
+                           (!testend || em->start + em->len >= start + len)) {
+                               free_extent_map(em);
+                               break;
+                       }
+                       if (start < em->start) {
+                               len = em->start - start;
+                       } else {
+                               len = start + len - (em->start + em->len);
+                               start = em->start + em->len;
+                       }
+                       free_extent_map(em);
+                       continue;
+               }
+               compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
+               clear_bit(EXTENT_FLAG_PINNED, &em->flags);
+               remove_extent_mapping(em_tree, em);
+
+               if (em->block_start < EXTENT_MAP_LAST_BYTE &&
+                   em->start < start) {
+                       split->start = em->start;
+                       split->len = start - em->start;
+                       split->orig_start = em->orig_start;
+                       split->block_start = em->block_start;
+
+                       if (compressed)
+                               split->block_len = em->block_len;
+                       else
+                               split->block_len = split->len;
+
+                       split->bdev = em->bdev;
+                       split->flags = flags;
+                       ret = add_extent_mapping(em_tree, split);
+                       BUG_ON(ret);
+                       free_extent_map(split);
+                       split = split2;
+                       split2 = NULL;
+               }
+               if (em->block_start < EXTENT_MAP_LAST_BYTE &&
+                   testend && em->start + em->len > start + len) {
+                       u64 diff = start + len - em->start;
+
+                       split->start = start + len;
+                       split->len = em->start + em->len - (start + len);
+                       split->bdev = em->bdev;
+                       split->flags = flags;
+
+                       if (compressed) {
+                               split->block_len = em->block_len;
+                               split->block_start = em->block_start;
+                               split->orig_start = em->orig_start;
+                       } else {
+                               split->block_len = split->len;
+                               split->block_start = em->block_start + diff;
+                               split->orig_start = split->start;
+                       }
+
+                       ret = add_extent_mapping(em_tree, split);
+                       BUG_ON(ret);
+                       free_extent_map(split);
+                       split = NULL;
+               }
+               spin_unlock(&em_tree->lock);
+
+               /* once for us */
+               free_extent_map(em);
+               /* once for the tree */
+               free_extent_map(em);
+       }
+       if (split)
+               free_extent_map(split);
+       if (split2)
+               free_extent_map(split2);
+       return 0;
+}
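+
+/*
+ * Worked example of the split cases above (illustrative): a cached
+ * extent map covering [0, 100K) with [40K, 60K) dropped produces two
+ * survivors: [0, 40K) keeps its original block_start, and [60K, 100K)
+ * gets block_start shifted by diff = 60K in the uncompressed case
+ * (compressed extents keep the whole on-disk range instead).
+ */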
+
+int btrfs_check_file(struct btrfs_root *root, struct inode *inode)
+{
+       return 0;
+#if 0
+       struct btrfs_path *path;
+       struct btrfs_key found_key;
+       struct extent_buffer *leaf;
+       struct btrfs_file_extent_item *extent;
+       u64 last_offset = 0;
+       int nritems;
+       int slot;
+       int found_type;
+       int ret;
+       int err = 0;
+       u64 extent_end = 0;
+
+       path = btrfs_alloc_path();
+       ret = btrfs_lookup_file_extent(NULL, root, path, inode->i_ino,
+                                      last_offset, 0);
+       while (1) {
+               nritems = btrfs_header_nritems(path->nodes[0]);
+               if (path->slots[0] >= nritems) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret)
+                               goto out;
+                       nritems = btrfs_header_nritems(path->nodes[0]);
+               }
+               slot = path->slots[0];
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &found_key, slot);
+               if (found_key.objectid != inode->i_ino)
+                       break;
+               if (found_key.type != BTRFS_EXTENT_DATA_KEY)
+                       goto out;
+
+               if (found_key.offset < last_offset) {
+                       WARN_ON(1);
+                       btrfs_print_leaf(root, leaf);
+                       printk(KERN_ERR "inode %lu found offset %llu "
+                              "expected %llu\n", inode->i_ino,
+                              (unsigned long long)found_key.offset,
+                              (unsigned long long)last_offset);
+                       err = 1;
+                       goto out;
+               }
+               extent = btrfs_item_ptr(leaf, slot,
+                                       struct btrfs_file_extent_item);
+               found_type = btrfs_file_extent_type(leaf, extent);
+               if (found_type == BTRFS_FILE_EXTENT_REG) {
+                       extent_end = found_key.offset +
+                            btrfs_file_extent_num_bytes(leaf, extent);
+               } else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+                       struct btrfs_item *item;
+                       item = btrfs_item_nr(leaf, slot);
+                       extent_end = found_key.offset +
+                            btrfs_file_extent_inline_len(leaf, extent);
+                       extent_end = (extent_end + root->sectorsize - 1) &
+                               ~((u64)root->sectorsize - 1);
+               }
+               last_offset = extent_end;
+               path->slots[0]++;
+       }
+       if (0 && last_offset < inode->i_size) {
+               WARN_ON(1);
+               btrfs_print_leaf(root, leaf);
+               printk(KERN_ERR "inode %lu found offset %llu size %llu\n",
+                      inode->i_ino, (unsigned long long)last_offset,
+                      (unsigned long long)inode->i_size);
+               err = 1;
+
+       }
+out:
+       btrfs_free_path(path);
+       return err;
+#endif
+}
+
+/*
+ * this is very complex, but the basic idea is to drop all extents
+ * in the range start - end.  hint_byte is filled in with a byte number
+ * that would be a good hint to the block allocator for this file.
+ *
+ * If an extent intersects the range but is not entirely inside the range
+ * it is either truncated or split.  Anything entirely inside the range
+ * is deleted from the tree.
+ *
+ * inline_limit is used to tell this code which offsets in the file to keep
+ * if they contain inline extents.
+ */
+noinline int btrfs_drop_extents(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root, struct inode *inode,
+                      u64 start, u64 end, u64 inline_limit, u64 *hint_byte)
+{
+       u64 extent_end = 0;
+       u64 locked_end = end;
+       u64 search_start = start;
+       u64 leaf_start;
+       u64 ram_bytes = 0;
+       u64 orig_parent = 0;
+       u64 disk_bytenr = 0;
+       u8 compression;
+       u8 encryption;
+       u16 other_encoding = 0;
+       u64 root_gen;
+       u64 root_owner;
+       struct extent_buffer *leaf;
+       struct btrfs_file_extent_item *extent;
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct btrfs_file_extent_item old;
+       int keep;
+       int slot;
+       int bookend;
+       int found_type = 0;
+       int found_extent;
+       int found_inline;
+       int recow;
+       int ret;
+
+       inline_limit = 0;
+       btrfs_drop_extent_cache(inode, start, end - 1, 0);
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+       while (1) {
+               recow = 0;
+               btrfs_release_path(root, path);
+               ret = btrfs_lookup_file_extent(trans, root, path, inode->i_ino,
+                                              search_start, -1);
+               if (ret < 0)
+                       goto out;
+               if (ret > 0) {
+                       if (path->slots[0] == 0) {
+                               ret = 0;
+                               goto out;
+                       }
+                       path->slots[0]--;
+               }
+next_slot:
+               keep = 0;
+               bookend = 0;
+               found_extent = 0;
+               found_inline = 0;
+               leaf_start = 0;
+               root_gen = 0;
+               root_owner = 0;
+               compression = 0;
+               encryption = 0;
+               extent = NULL;
+               leaf = path->nodes[0];
+               slot = path->slots[0];
+               ret = 0;
+               btrfs_item_key_to_cpu(leaf, &key, slot);
+               if (btrfs_key_type(&key) == BTRFS_EXTENT_DATA_KEY &&
+                   key.offset >= end) {
+                       goto out;
+               }
+               if (btrfs_key_type(&key) > BTRFS_EXTENT_DATA_KEY ||
+                   key.objectid != inode->i_ino) {
+                       goto out;
+               }
+               if (recow) {
+                       search_start = max(key.offset, start);
+                       continue;
+               }
+               if (btrfs_key_type(&key) == BTRFS_EXTENT_DATA_KEY) {
+                       extent = btrfs_item_ptr(leaf, slot,
+                                               struct btrfs_file_extent_item);
+                       found_type = btrfs_file_extent_type(leaf, extent);
+                       compression = btrfs_file_extent_compression(leaf,
+                                                                   extent);
+                       encryption = btrfs_file_extent_encryption(leaf,
+                                                                 extent);
+                       other_encoding = btrfs_file_extent_other_encoding(leaf,
+                                                                 extent);
+                       if (found_type == BTRFS_FILE_EXTENT_REG ||
+                           found_type == BTRFS_FILE_EXTENT_PREALLOC) {
+                               extent_end =
+                                    btrfs_file_extent_disk_bytenr(leaf,
+                                                                  extent);
+                               if (extent_end)
+                                       *hint_byte = extent_end;
+
+                               extent_end = key.offset +
+                                    btrfs_file_extent_num_bytes(leaf, extent);
+                               ram_bytes = btrfs_file_extent_ram_bytes(leaf,
+                                                               extent);
+                               found_extent = 1;
+                       } else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+                               found_inline = 1;
+                               extent_end = key.offset +
+                                    btrfs_file_extent_inline_len(leaf, extent);
+                       }
+               } else {
+                       extent_end = search_start;
+               }
+
+               /* we found nothing we can drop */
+               if ((!found_extent && !found_inline) ||
+                   search_start >= extent_end) {
+                       int nextret;
+                       u32 nritems;
+                       nritems = btrfs_header_nritems(leaf);
+                       if (slot >= nritems - 1) {
+                               nextret = btrfs_next_leaf(root, path);
+                               if (nextret)
+                                       goto out;
+                               recow = 1;
+                       } else {
+                               path->slots[0]++;
+                       }
+                       goto next_slot;
+               }
+
+               if (end <= extent_end && start >= key.offset && found_inline)
+                       *hint_byte = EXTENT_MAP_INLINE;
+
+               if (found_extent) {
+                       read_extent_buffer(leaf, &old, (unsigned long)extent,
+                                          sizeof(old));
+                       root_gen = btrfs_header_generation(leaf);
+                       root_owner = btrfs_header_owner(leaf);
+                       leaf_start = leaf->start;
+               }
+
+               if (end < extent_end && end >= key.offset) {
+                       bookend = 1;
+                       if (found_inline && start <= key.offset)
+                               keep = 1;
+               }
+
+               if (bookend && found_extent) {
+                       if (locked_end < extent_end) {
+                               ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
+                                               locked_end, extent_end - 1,
+                                               GFP_NOFS);
+                               if (!ret) {
+                                       btrfs_release_path(root, path);
+                                       lock_extent(&BTRFS_I(inode)->io_tree,
+                                               locked_end, extent_end - 1,
+                                               GFP_NOFS);
+                                       locked_end = extent_end;
+                                       continue;
+                               }
+                               locked_end = extent_end;
+                       }
+                       orig_parent = path->nodes[0]->start;
+                       disk_bytenr = le64_to_cpu(old.disk_bytenr);
+                       if (disk_bytenr != 0) {
+                               ret = btrfs_inc_extent_ref(trans, root,
+                                          disk_bytenr,
+                                          le64_to_cpu(old.disk_num_bytes),
+                                          orig_parent, root->root_key.objectid,
+                                          trans->transid, inode->i_ino);
+                               BUG_ON(ret);
+                       }
+               }
+
+               if (found_inline) {
+                       u64 mask = root->sectorsize - 1;
+                       search_start = (extent_end + mask) & ~mask;
+               } else
+                       search_start = extent_end;
+
+               /* truncate existing extent */
+               if (start > key.offset) {
+                       u64 new_num;
+                       u64 old_num;
+                       keep = 1;
+                       WARN_ON(start & (root->sectorsize - 1));
+                       if (found_extent) {
+                               new_num = start - key.offset;
+                               old_num = btrfs_file_extent_num_bytes(leaf,
+                                                                     extent);
+                               *hint_byte =
+                                       btrfs_file_extent_disk_bytenr(leaf,
+                                                                     extent);
+                               if (btrfs_file_extent_disk_bytenr(leaf,
+                                                                 extent)) {
+                                       inode_sub_bytes(inode, old_num -
+                                                       new_num);
+                               }
+                               btrfs_set_file_extent_num_bytes(leaf,
+                                                       extent, new_num);
+                               btrfs_mark_buffer_dirty(leaf);
+                       } else if (key.offset < inline_limit &&
+                                  (end > extent_end) &&
+                                  (inline_limit < extent_end)) {
+                               u32 new_size;
+                               new_size = btrfs_file_extent_calc_inline_size(
+                                                  inline_limit - key.offset);
+                               inode_sub_bytes(inode, extent_end -
+                                               inline_limit);
+                               btrfs_set_file_extent_ram_bytes(leaf, extent,
+                                                       new_size);
+                               if (!compression && !encryption) {
+                                       btrfs_truncate_item(trans, root, path,
+                                                           new_size, 1);
+                               }
+                       }
+               }
+               /* delete the entire extent */
+               if (!keep) {
+                       if (found_inline)
+                               inode_sub_bytes(inode, extent_end -
+                                               key.offset);
+                       ret = btrfs_del_item(trans, root, path);
+                       /* TODO update progress marker and return */
+                       BUG_ON(ret);
+                       extent = NULL;
+                       btrfs_release_path(root, path);
+                       /* the extent will be freed later */
+               }
+               if (bookend && found_inline && start <= key.offset) {
+                       u32 new_size;
+                       new_size = btrfs_file_extent_calc_inline_size(
+                                                  extent_end - end);
+                       inode_sub_bytes(inode, end - key.offset);
+                       btrfs_set_file_extent_ram_bytes(leaf, extent,
+                                                       new_size);
+                       if (!compression && !encryption)
+                               ret = btrfs_truncate_item(trans, root, path,
+                                                         new_size, 0);
+                       BUG_ON(ret);
+               }
+               /* create bookend, splitting the extent in two */
+               if (bookend && found_extent) {
+                       struct btrfs_key ins;
+                       ins.objectid = inode->i_ino;
+                       ins.offset = end;
+                       btrfs_set_key_type(&ins, BTRFS_EXTENT_DATA_KEY);
+
+                       btrfs_release_path(root, path);
+                       ret = btrfs_insert_empty_item(trans, root, path, &ins,
+                                                     sizeof(*extent));
+                       BUG_ON(ret);
+
+                       leaf = path->nodes[0];
+                       extent = btrfs_item_ptr(leaf, path->slots[0],
+                                               struct btrfs_file_extent_item);
+                       write_extent_buffer(leaf, &old,
+                                           (unsigned long)extent, sizeof(old));
+
+                       btrfs_set_file_extent_compression(leaf, extent,
+                                                         compression);
+                       btrfs_set_file_extent_encryption(leaf, extent,
+                                                        encryption);
+                       btrfs_set_file_extent_other_encoding(leaf, extent,
+                                                            other_encoding);
+                       btrfs_set_file_extent_offset(leaf, extent,
+                                   le64_to_cpu(old.offset) + end - key.offset);
+                       WARN_ON(le64_to_cpu(old.num_bytes) <
+                               (extent_end - end));
+                       btrfs_set_file_extent_num_bytes(leaf, extent,
+                                                       extent_end - end);
+
+                       /*
+                        * set the ram bytes to the size of the full extent
+                        * before splitting.  This is a worst case flag,
+                        * but it's the best we can do because we don't know
+                        * how splitting affects compression
+                        */
+                       btrfs_set_file_extent_ram_bytes(leaf, extent,
+                                                       ram_bytes);
+                       btrfs_set_file_extent_type(leaf, extent, found_type);
+
+                       btrfs_mark_buffer_dirty(path->nodes[0]);
+
+                       if (disk_bytenr != 0) {
+                               ret = btrfs_update_extent_ref(trans, root,
+                                               disk_bytenr, orig_parent,
+                                               leaf->start,
+                                               root->root_key.objectid,
+                                               trans->transid, ins.objectid);
+
+                               BUG_ON(ret);
+                       }
+                       btrfs_release_path(root, path);
+                       if (disk_bytenr != 0)
+                               inode_add_bytes(inode, extent_end - end);
+               }
+
+               if (found_extent && !keep) {
+                       u64 old_disk_bytenr = le64_to_cpu(old.disk_bytenr);
+
+                       if (old_disk_bytenr != 0) {
+                               inode_sub_bytes(inode,
+                                               le64_to_cpu(old.num_bytes));
+                               ret = btrfs_free_extent(trans, root,
+                                               old_disk_bytenr,
+                                               le64_to_cpu(old.disk_num_bytes),
+                                               leaf_start, root_owner,
+                                               root_gen, key.objectid, 0);
+                               BUG_ON(ret);
+                               *hint_byte = old_disk_bytenr;
+                       }
+               }
+
+               if (search_start >= end) {
+                       ret = 0;
+                       goto out;
+               }
+       }
+out:
+       btrfs_free_path(path);
+       if (locked_end > end) {
+               unlock_extent(&BTRFS_I(inode)->io_tree, end, locked_end - 1,
+                             GFP_NOFS);
+       }
+       btrfs_check_file(root, inode);
+       return ret;
+}
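+
+/*
+ * Worked example of the truncate + bookend path above (illustrative):
+ * dropping [40K, 60K) from a regular extent at key.offset 0 covering
+ * [0, 100K) first truncates the item in place to num_bytes = 40K,
+ * then inserts a bookend item at offset 60K whose file_extent_offset
+ * is bumped by 60K and whose num_bytes is 40K, with an extra ref
+ * taken on the shared disk extent so both items can point at it.
+ */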
+
+static int extent_mergeable(struct extent_buffer *leaf, int slot,
+                           u64 objectid, u64 bytenr, u64 *start, u64 *end)
+{
+       struct btrfs_file_extent_item *fi;
+       struct btrfs_key key;
+       u64 extent_end;
+
+       if (slot < 0 || slot >= btrfs_header_nritems(leaf))
+               return 0;
+
+       btrfs_item_key_to_cpu(leaf, &key, slot);
+       if (key.objectid != objectid || key.type != BTRFS_EXTENT_DATA_KEY)
+               return 0;
+
+       fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
+       if (btrfs_file_extent_type(leaf, fi) != BTRFS_FILE_EXTENT_REG ||
+           btrfs_file_extent_disk_bytenr(leaf, fi) != bytenr ||
+           btrfs_file_extent_compression(leaf, fi) ||
+           btrfs_file_extent_encryption(leaf, fi) ||
+           btrfs_file_extent_other_encoding(leaf, fi))
+               return 0;
+
+       extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);
+       if ((*start && *start != key.offset) || (*end && *end != extent_end))
+               return 0;
+
+       *start = key.offset;
+       *end = extent_end;
+       return 1;
+}
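+
+/*
+ * Usage sketch for extent_mergeable (illustrative): callers pass
+ * in/out bounds to ask a directional question, e.g.
+ *
+ *   other_start = end; other_end = 0;
+ *   extent_mergeable(leaf, slot + 1, ino, bytenr,
+ *                    &other_start, &other_end);
+ *
+ * succeeds only if the next slot is a plain REG extent on the same
+ * disk bytenr that starts exactly at 'end'; a zero bound means
+ * "unconstrained" and is filled in on success.
+ */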
+
+/*
+ * Mark the extent in the range start - end as written.
+ *
+ * This changes the extent type from 'pre-allocated' to 'regular'.  If only
+ * part of the extent is marked as written, the extent will be split into
+ * two or three.
+ */
+int btrfs_mark_extent_written(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct inode *inode, u64 start, u64 end)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_path *path;
+       struct btrfs_file_extent_item *fi;
+       struct btrfs_key key;
+       u64 bytenr;
+       u64 num_bytes;
+       u64 extent_end;
+       u64 extent_offset;
+       u64 other_start;
+       u64 other_end;
+       u64 split = start;
+       u64 locked_end = end;
+       u64 orig_parent;
+       int extent_type;
+       int split_end = 1;
+       int ret;
+
+       btrfs_drop_extent_cache(inode, start, end - 1, 0);
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+again:
+       key.objectid = inode->i_ino;
+       key.type = BTRFS_EXTENT_DATA_KEY;
+       if (split == start)
+               key.offset = split;
+       else
+               key.offset = split - 1;
+
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret > 0 && path->slots[0] > 0)
+               path->slots[0]--;
+
+       leaf = path->nodes[0];
+       btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+       BUG_ON(key.objectid != inode->i_ino ||
+              key.type != BTRFS_EXTENT_DATA_KEY);
+       fi = btrfs_item_ptr(leaf, path->slots[0],
+                           struct btrfs_file_extent_item);
+       extent_type = btrfs_file_extent_type(leaf, fi);
+       BUG_ON(extent_type != BTRFS_FILE_EXTENT_PREALLOC);
+       extent_end = key.offset + btrfs_file_extent_num_bytes(leaf, fi);
+       BUG_ON(key.offset > start || extent_end < end);
+
+       bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+       num_bytes = btrfs_file_extent_disk_num_bytes(leaf, fi);
+       extent_offset = btrfs_file_extent_offset(leaf, fi);
+
+       if (key.offset == start)
+               split = end;
+
+       if (key.offset == start && extent_end == end) {
+               int del_nr = 0;
+               int del_slot = 0;
+               u64 leaf_owner = btrfs_header_owner(leaf);
+               u64 leaf_gen = btrfs_header_generation(leaf);
+               other_start = end;
+               other_end = 0;
+               if (extent_mergeable(leaf, path->slots[0] + 1, inode->i_ino,
+                                    bytenr, &other_start, &other_end)) {
+                       extent_end = other_end;
+                       del_slot = path->slots[0] + 1;
+                       del_nr++;
+                       ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
+                                               leaf->start, leaf_owner,
+                                               leaf_gen, inode->i_ino, 0);
+                       BUG_ON(ret);
+               }
+               other_start = 0;
+               other_end = start;
+               if (extent_mergeable(leaf, path->slots[0] - 1, inode->i_ino,
+                                    bytenr, &other_start, &other_end)) {
+                       key.offset = other_start;
+                       del_slot = path->slots[0];
+                       del_nr++;
+                       ret = btrfs_free_extent(trans, root, bytenr, num_bytes,
+                                               leaf->start, leaf_owner,
+                                               leaf_gen, inode->i_ino, 0);
+                       BUG_ON(ret);
+               }
+               split_end = 0;
+               if (del_nr == 0) {
+                       btrfs_set_file_extent_type(leaf, fi,
+                                                  BTRFS_FILE_EXTENT_REG);
+                       goto done;
+               }
+
+               fi = btrfs_item_ptr(leaf, del_slot - 1,
+                                   struct btrfs_file_extent_item);
+               btrfs_set_file_extent_type(leaf, fi, BTRFS_FILE_EXTENT_REG);
+               btrfs_set_file_extent_num_bytes(leaf, fi,
+                                               extent_end - key.offset);
+               btrfs_mark_buffer_dirty(leaf);
+
+               ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
+               BUG_ON(ret);
+               goto done;
+       } else if (split == start) {
+               if (locked_end < extent_end) {
+                       ret = try_lock_extent(&BTRFS_I(inode)->io_tree,
+                                       locked_end, extent_end - 1, GFP_NOFS);
+                       if (!ret) {
+                               btrfs_release_path(root, path);
+                               lock_extent(&BTRFS_I(inode)->io_tree,
+                                       locked_end, extent_end - 1, GFP_NOFS);
+                               locked_end = extent_end;
+                               goto again;
+                       }
+                       locked_end = extent_end;
+               }
+               btrfs_set_file_extent_num_bytes(leaf, fi, split - key.offset);
+               extent_offset += split - key.offset;
+       } else  {
+               BUG_ON(key.offset != start);
+               btrfs_set_file_extent_offset(leaf, fi, extent_offset +
+                                            split - key.offset);
+               btrfs_set_file_extent_num_bytes(leaf, fi, extent_end - split);
+               key.offset = split;
+               btrfs_set_item_key_safe(trans, root, path, &key);
+               extent_end = split;
+       }
+
+       if (extent_end == end) {
+               split_end = 0;
+               extent_type = BTRFS_FILE_EXTENT_REG;
+       }
+       if (extent_end == end && split == start) {
+               other_start = end;
+               other_end = 0;
+               if (extent_mergeable(leaf, path->slots[0] + 1, inode->i_ino,
+                                    bytenr, &other_start, &other_end)) {
+                       path->slots[0]++;
+                       fi = btrfs_item_ptr(leaf, path->slots[0],
+                                           struct btrfs_file_extent_item);
+                       key.offset = split;
+                       btrfs_set_item_key_safe(trans, root, path, &key);
+                       btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+                       btrfs_set_file_extent_num_bytes(leaf, fi,
+                                                       other_end - split);
+                       goto done;
+               }
+       }
+       if (extent_end == end && split == end) {
+               other_start = 0;
+               other_end = start;
+               if (extent_mergeable(leaf, path->slots[0] - 1, inode->i_ino,
+                                    bytenr, &other_start, &other_end)) {
+                       path->slots[0]--;
+                       fi = btrfs_item_ptr(leaf, path->slots[0],
+                                           struct btrfs_file_extent_item);
+                       btrfs_set_file_extent_num_bytes(leaf, fi, extent_end -
+                                                       other_start);
+                       goto done;
+               }
+       }
+
+       btrfs_mark_buffer_dirty(leaf);
+
+       orig_parent = leaf->start;
+       ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes,
+                                  orig_parent, root->root_key.objectid,
+                                  trans->transid, inode->i_ino);
+       BUG_ON(ret);
+       btrfs_release_path(root, path);
+
+       key.offset = start;
+       ret = btrfs_insert_empty_item(trans, root, path, &key, sizeof(*fi));
+       BUG_ON(ret);
+
+       leaf = path->nodes[0];
+       fi = btrfs_item_ptr(leaf, path->slots[0],
+                           struct btrfs_file_extent_item);
+       btrfs_set_file_extent_generation(leaf, fi, trans->transid);
+       btrfs_set_file_extent_type(leaf, fi, extent_type);
+       btrfs_set_file_extent_disk_bytenr(leaf, fi, bytenr);
+       btrfs_set_file_extent_disk_num_bytes(leaf, fi, num_bytes);
+       btrfs_set_file_extent_offset(leaf, fi, extent_offset);
+       btrfs_set_file_extent_num_bytes(leaf, fi, extent_end - key.offset);
+       btrfs_set_file_extent_ram_bytes(leaf, fi, num_bytes);
+       btrfs_set_file_extent_compression(leaf, fi, 0);
+       btrfs_set_file_extent_encryption(leaf, fi, 0);
+       btrfs_set_file_extent_other_encoding(leaf, fi, 0);
+
+       if (orig_parent != leaf->start) {
+               ret = btrfs_update_extent_ref(trans, root, bytenr,
+                                             orig_parent, leaf->start,
+                                             root->root_key.objectid,
+                                             trans->transid, inode->i_ino);
+               BUG_ON(ret);
+       }
+done:
+       btrfs_mark_buffer_dirty(leaf);
+       btrfs_release_path(root, path);
+       if (split_end && split == start) {
+               split = end;
+               goto again;
+       }
+       if (locked_end > end) {
+               unlock_extent(&BTRFS_I(inode)->io_tree, end, locked_end - 1,
+                             GFP_NOFS);
+       }
+       btrfs_free_path(path);
+       return 0;
+}
+
+/*
+ * this gets pages into the page cache and locks them down; it also
+ * properly waits for data=ordered extents to finish before allowing the
+ * pages to be modified.
+ */
+static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
+                        struct page **pages, size_t num_pages,
+                        loff_t pos, unsigned long first_index,
+                        unsigned long last_index, size_t write_bytes)
+{
+       int i;
+       unsigned long index = pos >> PAGE_CACHE_SHIFT;
+       struct inode *inode = fdentry(file)->d_inode;
+       int err = 0;
+       u64 start_pos;
+       u64 last_pos;
+
+       start_pos = pos & ~((u64)root->sectorsize - 1);
+       last_pos = ((u64)index + num_pages) << PAGE_CACHE_SHIFT;
+
+       if (start_pos > inode->i_size) {
+               err = btrfs_cont_expand(inode, start_pos);
+               if (err)
+                       return err;
+       }
+
+       memset(pages, 0, num_pages * sizeof(struct page *));
+again:
+       for (i = 0; i < num_pages; i++) {
+               pages[i] = grab_cache_page(inode->i_mapping, index + i);
+               if (!pages[i]) {
+                       err = -ENOMEM;
+                       while (i-- > 0) {
+                               unlock_page(pages[i]);
+                               page_cache_release(pages[i]);
+                       }
+                       return err;
+               }
+               wait_on_page_writeback(pages[i]);
+       }
+       if (start_pos < inode->i_size) {
+               struct btrfs_ordered_extent *ordered;
+               lock_extent(&BTRFS_I(inode)->io_tree,
+                           start_pos, last_pos - 1, GFP_NOFS);
+               ordered = btrfs_lookup_first_ordered_extent(inode,
+                                                           last_pos - 1);
+               if (ordered &&
+                   ordered->file_offset + ordered->len > start_pos &&
+                   ordered->file_offset < last_pos) {
+                       btrfs_put_ordered_extent(ordered);
+                       unlock_extent(&BTRFS_I(inode)->io_tree,
+                                     start_pos, last_pos - 1, GFP_NOFS);
+                       for (i = 0; i < num_pages; i++) {
+                               unlock_page(pages[i]);
+                               page_cache_release(pages[i]);
+                       }
+                       btrfs_wait_ordered_range(inode, start_pos,
+                                                last_pos - start_pos);
+                       goto again;
+               }
+               if (ordered)
+                       btrfs_put_ordered_extent(ordered);
+
+               clear_extent_bits(&BTRFS_I(inode)->io_tree, start_pos,
+                                 last_pos - 1, EXTENT_DIRTY | EXTENT_DELALLOC,
+                                 GFP_NOFS);
+               unlock_extent(&BTRFS_I(inode)->io_tree,
+                             start_pos, last_pos - 1, GFP_NOFS);
+       }
+       for (i = 0; i < num_pages; i++) {
+               clear_page_dirty_for_io(pages[i]);
+               set_page_extent_mapped(pages[i]);
+               WARN_ON(!PageLocked(pages[i]));
+       }
+       return 0;
+}
+
+static ssize_t btrfs_file_write(struct file *file, const char __user *buf,
+                               size_t count, loff_t *ppos)
+{
+       loff_t pos;
+       loff_t start_pos;
+       ssize_t num_written = 0;
+       ssize_t err = 0;
+       int ret = 0;
+       struct inode *inode = fdentry(file)->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct page **pages = NULL;
+       int nrptrs;
+       struct page *pinned[2];
+       unsigned long first_index;
+       unsigned long last_index;
+       int will_write;
+
+       will_write = ((file->f_flags & O_SYNC) || IS_SYNC(inode) ||
+                     (file->f_flags & O_DIRECT));
+
+       nrptrs = min((count + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE,
+                    PAGE_CACHE_SIZE / (sizeof(struct page *)));
+       pinned[0] = NULL;
+       pinned[1] = NULL;
+
+       pos = *ppos;
+       start_pos = pos;
+
+       vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+       current->backing_dev_info = inode->i_mapping->backing_dev_info;
+       err = generic_write_checks(file, &pos, &count, S_ISBLK(inode->i_mode));
+       if (err)
+               goto out_nolock;
+       if (count == 0)
+               goto out_nolock;
+
+       err = file_remove_suid(file);
+       if (err)
+               goto out_nolock;
+       file_update_time(file);
+
+       pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
+       if (!pages) {
+               err = -ENOMEM;
+               goto out_nolock;
+       }
+
+       mutex_lock(&inode->i_mutex);
+       BTRFS_I(inode)->sequence++;
+       first_index = pos >> PAGE_CACHE_SHIFT;
+       last_index = (pos + count) >> PAGE_CACHE_SHIFT;
+
+       /*
+        * there are lots of better ways to do this, but this code
+        * makes sure the first and last page in the file range are
+        * up to date and ready for cow
+        */
+       if ((pos & (PAGE_CACHE_SIZE - 1))) {
+               pinned[0] = grab_cache_page(inode->i_mapping, first_index);
+               if (!PageUptodate(pinned[0])) {
+                       ret = btrfs_readpage(NULL, pinned[0]);
+                       BUG_ON(ret);
+                       wait_on_page_locked(pinned[0]);
+               } else {
+                       unlock_page(pinned[0]);
+               }
+       }
+       if ((pos + count) & (PAGE_CACHE_SIZE - 1)) {
+               pinned[1] = grab_cache_page(inode->i_mapping, last_index);
+               if (!PageUptodate(pinned[1])) {
+                       ret = btrfs_readpage(NULL, pinned[1]);
+                       BUG_ON(ret);
+                       wait_on_page_locked(pinned[1]);
+               } else {
+                       unlock_page(pinned[1]);
+               }
+       }
+
+       while (count > 0) {
+               size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+               size_t write_bytes = min(count, nrptrs *
+                                       (size_t)PAGE_CACHE_SIZE -
+                                        offset);
+               size_t num_pages = (write_bytes + offset +
+                                   PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+
+               WARN_ON(num_pages > nrptrs);
+               memset(pages, 0, sizeof(struct page *) * nrptrs);
+
+               ret = btrfs_check_free_space(root, write_bytes, 0);
+               if (ret)
+                       goto out;
+
+               ret = prepare_pages(root, file, pages, num_pages,
+                                   pos, first_index, last_index,
+                                   write_bytes);
+               if (ret)
+                       goto out;
+
+               ret = btrfs_copy_from_user(pos, num_pages,
+                                          write_bytes, pages, buf);
+               if (ret) {
+                       btrfs_drop_pages(pages, num_pages);
+                       goto out;
+               }
+
+               ret = dirty_and_release_pages(NULL, root, file, pages,
+                                             num_pages, pos, write_bytes);
+               btrfs_drop_pages(pages, num_pages);
+               if (ret)
+                       goto out;
+
+               if (will_write) {
+                       btrfs_fdatawrite_range(inode->i_mapping, pos,
+                                              pos + write_bytes - 1,
+                                              WB_SYNC_NONE);
+               } else {
+                       balance_dirty_pages_ratelimited_nr(inode->i_mapping,
+                                                          num_pages);
+                       if (num_pages <
+                           (root->leafsize >> PAGE_CACHE_SHIFT) + 1)
+                               btrfs_btree_balance_dirty(root, 1);
+                       btrfs_throttle(root);
+               }
+
+               buf += write_bytes;
+               count -= write_bytes;
+               pos += write_bytes;
+               num_written += write_bytes;
+
+               cond_resched();
+       }
+out:
+       mutex_unlock(&inode->i_mutex);
+
+out_nolock:
+       kfree(pages);
+       if (pinned[0])
+               page_cache_release(pinned[0]);
+       if (pinned[1])
+               page_cache_release(pinned[1]);
+       *ppos = pos;
+
+       if (num_written > 0 && will_write) {
+               struct btrfs_trans_handle *trans;
+
+               err = btrfs_wait_ordered_range(inode, start_pos, num_written);
+               if (err)
+                       num_written = err;
+
+               if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
+                       trans = btrfs_start_transaction(root, 1);
+                       ret = btrfs_log_dentry_safe(trans, root,
+                                                   file->f_dentry);
+                       if (ret == 0) {
+                               btrfs_sync_log(trans, root);
+                               btrfs_end_transaction(trans, root);
+                       } else {
+                               btrfs_commit_transaction(trans, root);
+                       }
+               }
+               if (file->f_flags & O_DIRECT) {
+                       invalidate_mapping_pages(inode->i_mapping,
+                             start_pos >> PAGE_CACHE_SHIFT,
+                            (start_pos + num_written - 1) >> PAGE_CACHE_SHIFT);
+               }
+       }
+       current->backing_dev_info = NULL;
+       return num_written ? num_written : err;
+}
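+
+/*
+ * A minimal standalone sketch (illustrative only, not part of the btrfs
+ * sources) of the chunk math in the write loop above: a (pos, count) write
+ * is split into chunks of at most nrptrs pages, and each chunk spans every
+ * page its byte range touches.  A 4096 byte page size is assumed.
+ */
+#if 0
+#include <stdio.h>
+#include <stddef.h>
+
+#define SKETCH_PAGE_SIZE 4096UL
+
+static void sketch_chunks(unsigned long pos, size_t count, size_t nrptrs)
+{
+       while (count > 0) {
+               size_t offset = pos & (SKETCH_PAGE_SIZE - 1);
+               size_t room = nrptrs * SKETCH_PAGE_SIZE - offset;
+               size_t write_bytes = count < room ? count : room;
+               /* number of pages spanned by [pos, pos + write_bytes) */
+               size_t num_pages = (offset + write_bytes +
+                                   SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
+
+               printf("pos %lu: copy %zu bytes into %zu page(s)\n",
+                      pos, write_bytes, num_pages);
+               pos += write_bytes;
+               count -= write_bytes;
+       }
+}
+
+int main(void)
+{
+       /* an unaligned 200 byte write that crosses a page boundary */
+       sketch_chunks(4000, 200, 16);
+       return 0;
+}
+#endif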
+
+int btrfs_release_file(struct inode *inode, struct file *filp)
+{
+       if (filp->private_data)
+               btrfs_ioctl_trans_end(filp);
+       return 0;
+}
+
+/*
+ * fsync call for both files and directories.  This logs the inode into
+ * the tree log instead of forcing full commits whenever possible.
+ *
+ * It needs to call filemap_fdatawait so that all ordered extent updates
+ * in the metadata btree are up to date for copying to the log.
+ *
+ * It drops the inode mutex before doing the tree log commit.  This is an
+ * important optimization for directories because holding the mutex prevents
+ * new operations on the dir while we write to disk.
+ */
+int btrfs_sync_file(struct file *file, struct dentry *dentry, int datasync)
+{
+       struct inode *inode = dentry->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret = 0;
+       struct btrfs_trans_handle *trans;
+
+       /*
+        * check the transaction that last modified this inode
+        * and see if it's already been committed
+        */
+       if (!BTRFS_I(inode)->last_trans)
+               goto out;
+
+       mutex_lock(&root->fs_info->trans_mutex);
+       if (BTRFS_I(inode)->last_trans <=
+           root->fs_info->last_trans_committed) {
+               BTRFS_I(inode)->last_trans = 0;
+               mutex_unlock(&root->fs_info->trans_mutex);
+               goto out;
+       }
+       mutex_unlock(&root->fs_info->trans_mutex);
+
+       root->fs_info->tree_log_batch++;
+       filemap_fdatawrite(inode->i_mapping);
+       btrfs_wait_ordered_range(inode, 0, (u64)-1);
+       root->fs_info->tree_log_batch++;
+
+       /*
+        * ok we haven't committed the transaction yet, let's do a commit
+        */
+       if (file->private_data)
+               btrfs_ioctl_trans_end(file);
+
+       trans = btrfs_start_transaction(root, 1);
+       if (!trans) {
+               ret = -ENOMEM;
+               goto out;
+       }
+
+       ret = btrfs_log_dentry_safe(trans, root, file->f_dentry);
+       if (ret < 0)
+               goto out;
+
+       /* we've logged all the items and now have a consistent
+        * version of the file in the log.  It is possible that
+        * someone will come in and modify the file, but that's
+        * fine because the log is consistent on disk, and we
+        * have references to all of the file's extents
+        *
+        * It is possible that someone will come in and log the
+        * file again, but that will end up using the synchronization
+        * inside btrfs_sync_log to keep things safe.
+        */
+       mutex_unlock(&file->f_dentry->d_inode->i_mutex);
+
+       if (ret > 0) {
+               ret = btrfs_commit_transaction(trans, root);
+       } else {
+               btrfs_sync_log(trans, root);
+               ret = btrfs_end_transaction(trans, root);
+       }
+       mutex_lock(&file->f_dentry->d_inode->i_mutex);
+out:
+       return ret > 0 ? -EIO : ret;
+}
+
+static struct vm_operations_struct btrfs_file_vm_ops = {
+       .fault          = filemap_fault,
+       .page_mkwrite   = btrfs_page_mkwrite,
+};
+
+static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+       vma->vm_ops = &btrfs_file_vm_ops;
+       file_accessed(filp);
+       return 0;
+}
+
+struct file_operations btrfs_file_operations = {
+       .llseek         = generic_file_llseek,
+       .read           = do_sync_read,
+       .aio_read       = generic_file_aio_read,
+       .splice_read    = generic_file_splice_read,
+       .write          = btrfs_file_write,
+       .mmap           = btrfs_file_mmap,
+       .open           = generic_file_open,
+       .release        = btrfs_release_file,
+       .fsync          = btrfs_sync_file,
+       .unlocked_ioctl = btrfs_ioctl,
+#ifdef CONFIG_COMPAT
+       .compat_ioctl   = btrfs_ioctl,
+#endif
+};
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
new file mode 100644 (file)
index 0000000..d1e5f0e
--- /dev/null
@@ -0,0 +1,495 @@
+/*
+ * Copyright (C) 2008 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include "ctree.h"
+
+static int tree_insert_offset(struct rb_root *root, u64 offset,
+                             struct rb_node *node)
+{
+       struct rb_node **p = &root->rb_node;
+       struct rb_node *parent = NULL;
+       struct btrfs_free_space *info;
+
+       while (*p) {
+               parent = *p;
+               info = rb_entry(parent, struct btrfs_free_space, offset_index);
+
+               if (offset < info->offset)
+                       p = &(*p)->rb_left;
+               else if (offset > info->offset)
+                       p = &(*p)->rb_right;
+               else
+                       return -EEXIST;
+       }
+
+       rb_link_node(node, parent, p);
+       rb_insert_color(node, root);
+
+       return 0;
+}
+
+static int tree_insert_bytes(struct rb_root *root, u64 bytes,
+                            struct rb_node *node)
+{
+       struct rb_node **p = &root->rb_node;
+       struct rb_node *parent = NULL;
+       struct btrfs_free_space *info;
+
+       while (*p) {
+               parent = *p;
+               info = rb_entry(parent, struct btrfs_free_space, bytes_index);
+
+               if (bytes < info->bytes)
+                       p = &(*p)->rb_left;
+               else
+                       p = &(*p)->rb_right;
+       }
+
+       rb_link_node(node, parent, p);
+       rb_insert_color(node, root);
+
+       return 0;
+}
+
+/*
+ * searches the tree for the given offset.  If contains is set, we return
+ * the free space that contains the given offset.  If contains is not set,
+ * we return the free space that starts at or after the given offset and
+ * is at least bytes long.
+ */
+static struct btrfs_free_space *tree_search_offset(struct rb_root *root,
+                                                  u64 offset, u64 bytes,
+                                                  int contains)
+{
+       struct rb_node *n = root->rb_node;
+       struct btrfs_free_space *entry, *ret = NULL;
+
+       while (n) {
+               entry = rb_entry(n, struct btrfs_free_space, offset_index);
+
+               if (offset < entry->offset) {
+                       if (!contains &&
+                           (!ret || entry->offset < ret->offset) &&
+                           (bytes <= entry->bytes))
+                               ret = entry;
+                       n = n->rb_left;
+               } else if (offset > entry->offset) {
+                       if ((entry->offset + entry->bytes - 1) >= offset &&
+                           bytes <= entry->bytes) {
+                               ret = entry;
+                               break;
+                       }
+                       n = n->rb_right;
+               } else {
+                       if (bytes > entry->bytes) {
+                               n = n->rb_right;
+                               continue;
+                       }
+                       ret = entry;
+                       break;
+               }
+       }
+
+       return ret;
+}
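+
+/*
+ * A standalone sketch (illustrative only) of the two lookup modes above,
+ * with a sorted array standing in for the rbtree: with "contains" set,
+ * find the range holding the offset; otherwise find the first range at or
+ * after the offset that is at least "bytes" long.
+ */
+#if 0
+#include <stdio.h>
+
+struct range { unsigned long long offset, bytes; };
+
+static struct range *search_offset(struct range *r, int n,
+                                  unsigned long long offset,
+                                  unsigned long long bytes, int contains)
+{
+       int i;
+
+       for (i = 0; i < n; i++) {
+               if (contains) {
+                       if (r[i].offset <= offset &&
+                           offset < r[i].offset + r[i].bytes)
+                               return &r[i];
+               } else if (r[i].offset >= offset && r[i].bytes >= bytes) {
+                       return &r[i];   /* array is sorted, first hit wins */
+               }
+       }
+       return NULL;
+}
+
+int main(void)
+{
+       struct range free_space[] = { { 0, 4096 }, { 8192, 16384 } };
+       struct range *hit = search_offset(free_space, 2, 9000, 0, 1);
+
+       if (hit)
+               printf("offset 9000 lives in [%llu, %llu)\n", hit->offset,
+                      hit->offset + hit->bytes);
+       return 0;
+}
+#endif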
+
+/*
+ * return a chunk of at least bytes in size, as close to offset as we can get.
+ */
+static struct btrfs_free_space *tree_search_bytes(struct rb_root *root,
+                                                 u64 offset, u64 bytes)
+{
+       struct rb_node *n = root->rb_node;
+       struct btrfs_free_space *entry, *ret = NULL;
+
+       while (n) {
+               entry = rb_entry(n, struct btrfs_free_space, bytes_index);
+
+               if (bytes < entry->bytes) {
+                       /*
+                        * We prefer a hole as close in size to what we are
+                        * asking for as possible, so we don't take small
+                        * slivers out of huge holes, but we also want to get
+                        * as close to the offset as possible to avoid a whole
+                        * lot of fragmentation.
+                        */
+                       if (offset <= entry->offset) {
+                               if (!ret)
+                                       ret = entry;
+                               else if (entry->bytes < ret->bytes)
+                                       ret = entry;
+                               else if (entry->offset < ret->offset)
+                                       ret = entry;
+                       }
+                       n = n->rb_left;
+               } else if (bytes > entry->bytes) {
+                       n = n->rb_right;
+               } else {
+                       /*
+                        * Ok, we may have multiple chunks of the wanted size,
+                        * so we don't want to take the first one we find; we
+                        * want the one closest to our given offset, so keep
+                        * searching just in case there's a better match.
+                        */
+                       n = n->rb_right;
+                       if (offset > entry->offset)
+                               continue;
+                       else if (!ret || entry->offset < ret->offset)
+                               ret = entry;
+               }
+       }
+
+       return ret;
+}
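+
+/*
+ * A standalone sketch (illustrative only) of roughly the best-fit policy
+ * above: among all ranges big enough, prefer the smallest, breaking ties
+ * by the lower offset.  The real tree walk above also weighs closeness to
+ * the requested offset.
+ */
+#if 0
+#include <stdio.h>
+
+struct range { unsigned long long offset, bytes; };
+
+static struct range *best_fit(struct range *r, int n,
+                             unsigned long long bytes)
+{
+       struct range *best = NULL;
+       int i;
+
+       for (i = 0; i < n; i++) {
+               if (r[i].bytes < bytes)
+                       continue;
+               if (!best || r[i].bytes < best->bytes ||
+                   (r[i].bytes == best->bytes &&
+                    r[i].offset < best->offset))
+                       best = &r[i];
+       }
+       return best;
+}
+
+int main(void)
+{
+       struct range free_space[] = { { 0, 65536 }, { 8192, 8192 },
+                                     { 40960, 8192 } };
+       struct range *hit = best_fit(free_space, 3, 4096);
+
+       if (hit)        /* picks the 8192 byte hole, not the 64K one */
+               printf("allocate from offset %llu\n", hit->offset);
+       return 0;
+}
+#endif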
+
+static void unlink_free_space(struct btrfs_block_group_cache *block_group,
+                             struct btrfs_free_space *info)
+{
+       rb_erase(&info->offset_index, &block_group->free_space_offset);
+       rb_erase(&info->bytes_index, &block_group->free_space_bytes);
+}
+
+static int link_free_space(struct btrfs_block_group_cache *block_group,
+                          struct btrfs_free_space *info)
+{
+       int ret = 0;
+
+       ret = tree_insert_offset(&block_group->free_space_offset, info->offset,
+                                &info->offset_index);
+       if (ret)
+               return ret;
+
+       ret = tree_insert_bytes(&block_group->free_space_bytes, info->bytes,
+                               &info->bytes_index);
+       if (ret)
+               return ret;
+
+       return ret;
+}
+
+static int __btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
+                                 u64 offset, u64 bytes)
+{
+       struct btrfs_free_space *right_info;
+       struct btrfs_free_space *left_info;
+       struct btrfs_free_space *info = NULL;
+       struct btrfs_free_space *alloc_info;
+       int ret = 0;
+
+       alloc_info = kzalloc(sizeof(struct btrfs_free_space), GFP_NOFS);
+       if (!alloc_info)
+               return -ENOMEM;
+
+       /*
+        * first we want to see if there is free space adjacent to the range we
+        * are adding; if there is, remove that struct and add a new one to
+        * cover the entire range
+        */
+       right_info = tree_search_offset(&block_group->free_space_offset,
+                                       offset+bytes, 0, 1);
+       left_info = tree_search_offset(&block_group->free_space_offset,
+                                      offset-1, 0, 1);
+
+       if (right_info && right_info->offset == offset+bytes) {
+               unlink_free_space(block_group, right_info);
+               info = right_info;
+               info->offset = offset;
+               info->bytes += bytes;
+       } else if (right_info && right_info->offset != offset+bytes) {
+               printk(KERN_ERR "btrfs adding space in the middle of an "
+                      "existing free space area. existing: "
+                      "offset=%llu, bytes=%llu. new: offset=%llu, "
+                      "bytes=%llu\n", (unsigned long long)right_info->offset,
+                      (unsigned long long)right_info->bytes,
+                      (unsigned long long)offset,
+                      (unsigned long long)bytes);
+               BUG();
+       }
+
+       if (left_info) {
+               unlink_free_space(block_group, left_info);
+
+               if (unlikely((left_info->offset + left_info->bytes) !=
+                            offset)) {
+                       printk(KERN_ERR "btrfs free space to the left "
+                              "of new free space isn't "
+                              "quite right. existing: offset=%llu, "
+                              "bytes=%llu. new: offset=%llu, bytes=%llu\n",
+                              (unsigned long long)left_info->offset,
+                              (unsigned long long)left_info->bytes,
+                              (unsigned long long)offset,
+                              (unsigned long long)bytes);
+                       BUG();
+               }
+
+               if (info) {
+                       info->offset = left_info->offset;
+                       info->bytes += left_info->bytes;
+                       kfree(left_info);
+               } else {
+                       info = left_info;
+                       info->bytes += bytes;
+               }
+       }
+
+       if (info) {
+               ret = link_free_space(block_group, info);
+               if (!ret)
+                       info = NULL;
+               goto out;
+       }
+
+       info = alloc_info;
+       alloc_info = NULL;
+       info->offset = offset;
+       info->bytes = bytes;
+
+       ret = link_free_space(block_group, info);
+       if (ret)
+               kfree(info);
+out:
+       if (ret) {
+               printk(KERN_ERR "btrfs: unable to add free space :%d\n", ret);
+               if (ret == -EEXIST)
+                       BUG();
+       }
+
+       kfree(alloc_info);
+
+       return ret;
+}
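+
+/*
+ * A standalone sketch (illustrative only) of the coalescing rule above:
+ * when a freed range touches the end of its left neighbor or the start of
+ * its right neighbor, the ranges are merged into one.
+ */
+#if 0
+#include <stdio.h>
+
+struct range { unsigned long long offset, bytes; };
+
+/* merge [offset, offset + bytes) with adjacent neighbors, if any */
+static struct range merge(struct range *left, struct range *right,
+                         unsigned long long offset, unsigned long long bytes)
+{
+       struct range out = { offset, bytes };
+
+       if (left && left->offset + left->bytes == offset) {
+               out.offset = left->offset;
+               out.bytes += left->bytes;
+       }
+       if (right && right->offset == offset + bytes)
+               out.bytes += right->bytes;
+       return out;
+}
+
+int main(void)
+{
+       struct range left = { 0, 4096 }, right = { 8192, 4096 };
+       /* freeing [4096, 8192) bridges both neighbors into [0, 12288) */
+       struct range out = merge(&left, &right, 4096, 4096);
+
+       printf("merged: [%llu, %llu)\n", out.offset, out.offset + out.bytes);
+       return 0;
+}
+#endif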
+
+static int
+__btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
+                         u64 offset, u64 bytes)
+{
+       struct btrfs_free_space *info;
+       int ret = 0;
+
+       info = tree_search_offset(&block_group->free_space_offset, offset, 0,
+                                 1);
+
+       if (info && info->offset == offset) {
+               if (info->bytes < bytes) {
+                       printk(KERN_ERR "Found free space at %llu, size %llu,"
+                              "trying to use %llu\n",
+                              (unsigned long long)info->offset,
+                              (unsigned long long)info->bytes,
+                              (unsigned long long)bytes);
+                       WARN_ON(1);
+                       ret = -EINVAL;
+                       goto out;
+               }
+               unlink_free_space(block_group, info);
+
+               if (info->bytes == bytes) {
+                       kfree(info);
+                       goto out;
+               }
+
+               info->offset += bytes;
+               info->bytes -= bytes;
+
+               ret = link_free_space(block_group, info);
+               BUG_ON(ret);
+       } else if (info && info->offset < offset &&
+                  info->offset + info->bytes >= offset + bytes) {
+               u64 old_start = info->offset;
+               /*
+                * we're freeing space in the middle of the info,
+                * this can happen during tree log replay
+                *
+                * first unlink the old info and then
+                * insert it again after the hole we're creating
+                */
+               unlink_free_space(block_group, info);
+               if (offset + bytes < info->offset + info->bytes) {
+                       u64 old_end = info->offset + info->bytes;
+
+                       info->offset = offset + bytes;
+                       info->bytes = old_end - info->offset;
+                       ret = link_free_space(block_group, info);
+                       BUG_ON(ret);
+               } else {
+                       /* the hole we're creating ends at the end
+                        * of the info struct, just free the info
+                        */
+                       kfree(info);
+               }
+
+               /* step two, insert a new info struct to cover anything
+                * before the hole
+                */
+               ret = __btrfs_add_free_space(block_group, old_start,
+                                            offset - old_start);
+               BUG_ON(ret);
+       } else {
+               WARN_ON(1);
+       }
+out:
+       return ret;
+}
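+
+/*
+ * A standalone sketch (illustrative only) of the middle-of-the-info case
+ * above: removing [offset, offset + bytes) from inside a larger free range
+ * leaves up to two pieces, one before the hole and one after it.
+ */
+#if 0
+#include <stdio.h>
+
+struct range { unsigned long long offset, bytes; };
+
+static int punch_hole(struct range in, unsigned long long offset,
+                     unsigned long long bytes, struct range out[2])
+{
+       unsigned long long end = in.offset + in.bytes;
+       int n = 0;
+
+       if (offset > in.offset) {               /* piece before the hole */
+               out[n].offset = in.offset;
+               out[n].bytes = offset - in.offset;
+               n++;
+       }
+       if (offset + bytes < end) {             /* piece after the hole */
+               out[n].offset = offset + bytes;
+               out[n].bytes = end - (offset + bytes);
+               n++;
+       }
+       return n;
+}
+
+int main(void)
+{
+       struct range out[2];
+       int i, n = punch_hole((struct range){ 0, 16384 }, 4096, 4096, out);
+
+       for (i = 0; i < n; i++)
+               printf("piece: [%llu, %llu)\n", out[i].offset,
+                      out[i].offset + out[i].bytes);
+       return 0;
+}
+#endif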
+
+int btrfs_add_free_space(struct btrfs_block_group_cache *block_group,
+                        u64 offset, u64 bytes)
+{
+       int ret;
+       struct btrfs_free_space *sp;
+
+       mutex_lock(&block_group->alloc_mutex);
+       ret = __btrfs_add_free_space(block_group, offset, bytes);
+       sp = tree_search_offset(&block_group->free_space_offset, offset, 0, 1);
+       BUG_ON(!sp);
+       mutex_unlock(&block_group->alloc_mutex);
+
+       return ret;
+}
+
+int btrfs_add_free_space_lock(struct btrfs_block_group_cache *block_group,
+                             u64 offset, u64 bytes)
+{
+       int ret;
+       struct btrfs_free_space *sp;
+
+       ret = __btrfs_add_free_space(block_group, offset, bytes);
+       sp = tree_search_offset(&block_group->free_space_offset, offset, 0, 1);
+       BUG_ON(!sp);
+
+       return ret;
+}
+
+int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group,
+                           u64 offset, u64 bytes)
+{
+       int ret = 0;
+
+       mutex_lock(&block_group->alloc_mutex);
+       ret = __btrfs_remove_free_space(block_group, offset, bytes);
+       mutex_unlock(&block_group->alloc_mutex);
+
+       return ret;
+}
+
+int btrfs_remove_free_space_lock(struct btrfs_block_group_cache *block_group,
+                                u64 offset, u64 bytes)
+{
+       int ret;
+
+       ret = __btrfs_remove_free_space(block_group, offset, bytes);
+
+       return ret;
+}
+
+void btrfs_dump_free_space(struct btrfs_block_group_cache *block_group,
+                          u64 bytes)
+{
+       struct btrfs_free_space *info;
+       struct rb_node *n;
+       int count = 0;
+
+       for (n = rb_first(&block_group->free_space_offset); n; n = rb_next(n)) {
+               info = rb_entry(n, struct btrfs_free_space, offset_index);
+               if (info->bytes >= bytes)
+                       count++;
+       }
+       printk(KERN_INFO "%d blocks of free space at or bigger than bytes is"
+              "\n", count);
+}
+
+u64 btrfs_block_group_free_space(struct btrfs_block_group_cache *block_group)
+{
+       struct btrfs_free_space *info;
+       struct rb_node *n;
+       u64 ret = 0;
+
+       for (n = rb_first(&block_group->free_space_offset); n;
+            n = rb_next(n)) {
+               info = rb_entry(n, struct btrfs_free_space, offset_index);
+               ret += info->bytes;
+       }
+
+       return ret;
+}
+
+void btrfs_remove_free_space_cache(struct btrfs_block_group_cache *block_group)
+{
+       struct btrfs_free_space *info;
+       struct rb_node *node;
+
+       mutex_lock(&block_group->alloc_mutex);
+       while ((node = rb_last(&block_group->free_space_bytes)) != NULL) {
+               info = rb_entry(node, struct btrfs_free_space, bytes_index);
+               unlink_free_space(block_group, info);
+               kfree(info);
+               if (need_resched()) {
+                       mutex_unlock(&block_group->alloc_mutex);
+                       cond_resched();
+                       mutex_lock(&block_group->alloc_mutex);
+               }
+       }
+       mutex_unlock(&block_group->alloc_mutex);
+}
+
+#if 0
+static struct btrfs_free_space *
+btrfs_find_free_space_offset(struct btrfs_block_group_cache *block_group,
+                             u64 offset, u64 bytes)
+{
+       struct btrfs_free_space *ret;
+
+       mutex_lock(&block_group->alloc_mutex);
+       ret = tree_search_offset(&block_group->free_space_offset, offset,
+                                bytes, 0);
+       mutex_unlock(&block_group->alloc_mutex);
+
+       return ret;
+}
+
+static struct btrfs_free_space *
+btrfs_find_free_space_bytes(struct btrfs_block_group_cache *block_group,
+                            u64 offset, u64 bytes)
+{
+       struct btrfs_free_space *ret;
+
+       mutex_lock(&block_group->alloc_mutex);
+
+       ret = tree_search_bytes(&block_group->free_space_bytes, offset, bytes);
+       mutex_unlock(&block_group->alloc_mutex);
+
+       return ret;
+}
+#endif
+
+struct btrfs_free_space *
+btrfs_find_free_space(struct btrfs_block_group_cache *block_group,
+                      u64 offset, u64 bytes)
+{
+       struct btrfs_free_space *ret = NULL;
+
+       ret = tree_search_offset(&block_group->free_space_offset, offset,
+                                bytes, 0);
+       if (!ret)
+               ret = tree_search_bytes(&block_group->free_space_bytes,
+                                       offset, bytes);
+
+       return ret;
+}
diff --git a/fs/btrfs/hash.h b/fs/btrfs/hash.h
new file mode 100644 (file)
index 0000000..2a020b2
--- /dev/null
@@ -0,0 +1,27 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __HASH__
+#define __HASH__
+
+#include "crc32c.h"
+static inline u64 btrfs_name_hash(const char *name, int len)
+{
+       return btrfs_crc32c((u32)~1, name, len);
+}
+#endif
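+
+/*
+ * A standalone sketch (illustrative only) of a crc32c-style name hash like
+ * the helper above: a bitwise CRC-32C (Castagnoli, reflected polynomial
+ * 0x82F63B78) seeded with ~1 the way btrfs_name_hash() seeds
+ * btrfs_crc32c(), assuming no final inversion is applied.
+ */
+#if 0
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+
+static uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
+{
+       const uint8_t *p = buf;
+       int i;
+
+       while (len--) {
+               crc ^= *p++;
+               for (i = 0; i < 8; i++)
+                       crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
+       }
+       return crc;
+}
+
+static uint64_t name_hash(const char *name, int len)
+{
+       return crc32c((uint32_t)~1, name, len);
+}
+
+int main(void)
+{
+       const char *name = "hello";
+
+       printf("hash of %s: %llu\n", name,
+              (unsigned long long)name_hash(name, strlen(name)));
+       return 0;
+}
+#endif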
diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
new file mode 100644 (file)
index 0000000..3d46fa1
--- /dev/null
@@ -0,0 +1,206 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+
+static int find_name_in_backref(struct btrfs_path *path, const char *name,
+                        int name_len, struct btrfs_inode_ref **ref_ret)
+{
+       struct extent_buffer *leaf;
+       struct btrfs_inode_ref *ref;
+       unsigned long ptr;
+       unsigned long name_ptr;
+       u32 item_size;
+       u32 cur_offset = 0;
+       int len;
+
+       leaf = path->nodes[0];
+       item_size = btrfs_item_size_nr(leaf, path->slots[0]);
+       ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+       while (cur_offset < item_size) {
+               ref = (struct btrfs_inode_ref *)(ptr + cur_offset);
+               len = btrfs_inode_ref_name_len(leaf, ref);
+               name_ptr = (unsigned long)(ref + 1);
+               cur_offset += len + sizeof(*ref);
+               if (len != name_len)
+                       continue;
+               if (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0) {
+                       *ref_ret = ref;
+                       return 1;
+               }
+       }
+       return 0;
+}
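+
+/*
+ * A standalone sketch (illustrative only) of scanning packed
+ * variable-length records the way the loop above walks inode refs: each
+ * record is a fixed header followed by its name bytes, and the cursor
+ * advances by header size plus name length.  The layout here (a bare
+ * u16 length) is a simplification of struct btrfs_inode_ref.
+ */
+#if 0
+#include <stdio.h>
+#include <string.h>
+#include <stdint.h>
+
+struct rec { uint16_t name_len; };     /* name bytes follow the header */
+
+static int find_name(const uint8_t *item, uint32_t item_size,
+                    const char *name, uint16_t name_len)
+{
+       uint32_t cur = 0;
+
+       while (cur + sizeof(struct rec) <= item_size) {
+               struct rec r;
+
+               memcpy(&r, item + cur, sizeof(r));
+               if (r.name_len == name_len &&
+                   !memcmp(item + cur + sizeof(r), name, name_len))
+                       return 1;
+               cur += sizeof(r) + r.name_len;
+       }
+       return 0;
+}
+
+int main(void)
+{
+       uint8_t item[16];
+       uint16_t l1 = 1, l2 = 3;
+       size_t n = 0;
+
+       /* pack two records: "a" then "dir" */
+       memcpy(item + n, &l1, sizeof(l1)); n += sizeof(l1);
+       item[n++] = 'a';
+       memcpy(item + n, &l2, sizeof(l2)); n += sizeof(l2);
+       memcpy(item + n, "dir", 3); n += 3;
+
+       printf("found: %d\n", find_name(item, (uint32_t)n, "dir", 3));
+       return 0;
+}
+#endif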
+
+int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          const char *name, int name_len,
+                          u64 inode_objectid, u64 ref_objectid, u64 *index)
+{
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct btrfs_inode_ref *ref;
+       struct extent_buffer *leaf;
+       unsigned long ptr;
+       unsigned long item_start;
+       u32 item_size;
+       u32 sub_item_len;
+       int ret;
+       int del_len = name_len + sizeof(*ref);
+
+       key.objectid = inode_objectid;
+       key.offset = ref_objectid;
+       btrfs_set_key_type(&key, BTRFS_INODE_REF_KEY);
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret > 0) {
+               ret = -ENOENT;
+               goto out;
+       } else if (ret < 0) {
+               goto out;
+       }
+       if (!find_name_in_backref(path, name, name_len, &ref)) {
+               ret = -ENOENT;
+               goto out;
+       }
+       leaf = path->nodes[0];
+       item_size = btrfs_item_size_nr(leaf, path->slots[0]);
+
+       if (index)
+               *index = btrfs_inode_ref_index(leaf, ref);
+
+       if (del_len == item_size) {
+               ret = btrfs_del_item(trans, root, path);
+               goto out;
+       }
+       ptr = (unsigned long)ref;
+       sub_item_len = name_len + sizeof(*ref);
+       item_start = btrfs_item_ptr_offset(leaf, path->slots[0]);
+       memmove_extent_buffer(leaf, ptr, ptr + sub_item_len,
+                             item_size - (ptr + sub_item_len - item_start));
+       ret = btrfs_truncate_item(trans, root, path,
+                                 item_size - sub_item_len, 1);
+       BUG_ON(ret);
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          const char *name, int name_len,
+                          u64 inode_objectid, u64 ref_objectid, u64 index)
+{
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct btrfs_inode_ref *ref;
+       unsigned long ptr;
+       int ret;
+       int ins_len = name_len + sizeof(*ref);
+
+       key.objectid = inode_objectid;
+       key.offset = ref_objectid;
+       btrfs_set_key_type(&key, BTRFS_INODE_REF_KEY);
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       ret = btrfs_insert_empty_item(trans, root, path, &key,
+                                     ins_len);
+       if (ret == -EEXIST) {
+               u32 old_size;
+
+               if (find_name_in_backref(path, name, name_len, &ref))
+                       goto out;
+
+               old_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
+               ret = btrfs_extend_item(trans, root, path, ins_len);
+               BUG_ON(ret);
+               ref = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                    struct btrfs_inode_ref);
+               ref = (struct btrfs_inode_ref *)((unsigned long)ref + old_size);
+               btrfs_set_inode_ref_name_len(path->nodes[0], ref, name_len);
+               btrfs_set_inode_ref_index(path->nodes[0], ref, index);
+               ptr = (unsigned long)(ref + 1);
+               ret = 0;
+       } else if (ret < 0) {
+               goto out;
+       } else {
+               ref = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                    struct btrfs_inode_ref);
+               btrfs_set_inode_ref_name_len(path->nodes[0], ref, name_len);
+               btrfs_set_inode_ref_index(path->nodes[0], ref, index);
+               ptr = (unsigned long)(ref + 1);
+       }
+       write_extent_buffer(path->nodes[0], name, ptr, name_len);
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_insert_empty_inode(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            struct btrfs_path *path, u64 objectid)
+{
+       struct btrfs_key key;
+       int ret;
+       key.objectid = objectid;
+       btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
+       key.offset = 0;
+
+       ret = btrfs_insert_empty_item(trans, root, path, &key,
+                                     sizeof(struct btrfs_inode_item));
+       if (ret == 0 && objectid > root->highest_inode)
+               root->highest_inode = objectid;
+       return ret;
+}
+
+int btrfs_lookup_inode(struct btrfs_trans_handle *trans, struct btrfs_root
+                      *root, struct btrfs_path *path,
+                      struct btrfs_key *location, int mod)
+{
+       int ins_len = mod < 0 ? -1 : 0;
+       int cow = mod != 0;
+       int ret;
+       int slot;
+       struct extent_buffer *leaf;
+       struct btrfs_key found_key;
+
+       ret = btrfs_search_slot(trans, root, location, path, ins_len, cow);
+       if (ret > 0 && btrfs_key_type(location) == BTRFS_ROOT_ITEM_KEY &&
+           location->offset == (u64)-1 && path->slots[0] != 0) {
+               slot = path->slots[0] - 1;
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &found_key, slot);
+               if (found_key.objectid == location->objectid &&
+                   btrfs_key_type(&found_key) == btrfs_key_type(location)) {
+                       path->slots[0]--;
+                       return 0;
+               }
+       }
+       return ret;
+}
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
new file mode 100644 (file)
index 0000000..2aa7987
--- /dev/null
@@ -0,0 +1,144 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+
+int btrfs_find_highest_inode(struct btrfs_root *root, u64 *objectid)
+{
+       struct btrfs_path *path;
+       int ret;
+       struct extent_buffer *l;
+       struct btrfs_key search_key;
+       struct btrfs_key found_key;
+       int slot;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       search_key.objectid = BTRFS_LAST_FREE_OBJECTID;
+       search_key.type = -1;
+       search_key.offset = (u64)-1;
+       ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
+       if (ret < 0)
+               goto error;
+       BUG_ON(ret == 0);
+       if (path->slots[0] > 0) {
+               slot = path->slots[0] - 1;
+               l = path->nodes[0];
+               btrfs_item_key_to_cpu(l, &found_key, slot);
+               *objectid = found_key.objectid;
+       } else {
+               *objectid = BTRFS_FIRST_FREE_OBJECTID;
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       return ret;
+}
+
+/*
+ * walks the btree of allocated inodes and finds a hole.
+ */
+int btrfs_find_free_objectid(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            u64 dirid, u64 *objectid)
+{
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       int ret;
+       int slot = 0;
+       u64 last_ino = 0;
+       int start_found;
+       struct extent_buffer *l;
+       struct btrfs_key search_key;
+       u64 search_start = dirid;
+
+       mutex_lock(&root->objectid_mutex);
+       if (root->last_inode_alloc >= BTRFS_FIRST_FREE_OBJECTID &&
+           root->last_inode_alloc < BTRFS_LAST_FREE_OBJECTID) {
+               *objectid = ++root->last_inode_alloc;
+               mutex_unlock(&root->objectid_mutex);
+               return 0;
+       }
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       search_start = max(search_start, BTRFS_FIRST_FREE_OBJECTID);
+       search_key.objectid = search_start;
+       search_key.type = 0;
+       search_key.offset = 0;
+
+       btrfs_init_path(path);
+       start_found = 0;
+       ret = btrfs_search_slot(trans, root, &search_key, path, 0, 0);
+       if (ret < 0)
+               goto error;
+
+       while (1) {
+               l = path->nodes[0];
+               slot = path->slots[0];
+               if (slot >= btrfs_header_nritems(l)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret == 0)
+                               continue;
+                       if (ret < 0)
+                               goto error;
+                       if (!start_found) {
+                               *objectid = search_start;
+                               start_found = 1;
+                               goto found;
+                       }
+                       *objectid = last_ino > search_start ?
+                               last_ino : search_start;
+                       goto found;
+               }
+               btrfs_item_key_to_cpu(l, &key, slot);
+               if (key.objectid >= search_start) {
+                       if (start_found) {
+                               if (last_ino < search_start)
+                                       last_ino = search_start;
+                               if (key.objectid > last_ino) {
+                                       *objectid = last_ino;
+                                       goto found;
+                               }
+                       } else if (key.objectid > search_start) {
+                               *objectid = search_start;
+                               goto found;
+                       }
+               }
+               if (key.objectid >= BTRFS_LAST_FREE_OBJECTID)
+                       break;
+
+               start_found = 1;
+               last_ino = key.objectid + 1;
+               path->slots[0]++;
+       }
+       BUG_ON(1);
+found:
+       btrfs_release_path(root, path);
+       btrfs_free_path(path);
+       BUG_ON(*objectid < search_start);
+       mutex_unlock(&root->objectid_mutex);
+       return 0;
+error:
+       btrfs_release_path(root, path);
+       btrfs_free_path(path);
+       mutex_unlock(&root->objectid_mutex);
+       return ret;
+}
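+
+/*
+ * A standalone sketch (illustrative only) of the hole search above: walk
+ * ids in sorted order and return the first unused one at or after a
+ * starting point.
+ */
+#if 0
+#include <stdio.h>
+
+static unsigned long long first_free(const unsigned long long *used, int n,
+                                    unsigned long long start)
+{
+       unsigned long long next = start;
+       int i;
+
+       for (i = 0; i < n; i++) {
+               if (used[i] < next)
+                       continue;       /* before the search window */
+               if (used[i] > next)
+                       break;          /* found a hole */
+               next++;                 /* id taken, try the next one */
+       }
+       return next;
+}
+
+int main(void)
+{
+       unsigned long long used[] = { 256, 257, 258, 260 };
+
+       /* 259 is the first hole at or after 256 */
+       printf("first free id: %llu\n", first_free(used, 4, 256));
+       return 0;
+}
+#endif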
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
new file mode 100644 (file)
index 0000000..8adfe05
--- /dev/null
@@ -0,0 +1,5035 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/buffer_head.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
+#include <linux/mpage.h>
+#include <linux/swap.h>
+#include <linux/writeback.h>
+#include <linux/statfs.h>
+#include <linux/compat.h>
+#include <linux/bit_spinlock.h>
+#include <linux/version.h>
+#include <linux/xattr.h>
+#include <linux/posix_acl.h>
+#include <linux/falloc.h>
+#include "compat.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "btrfs_inode.h"
+#include "ioctl.h"
+#include "print-tree.h"
+#include "volumes.h"
+#include "ordered-data.h"
+#include "xattr.h"
+#include "tree-log.h"
+#include "ref-cache.h"
+#include "compression.h"
+
+struct btrfs_iget_args {
+       u64 ino;
+       struct btrfs_root *root;
+};
+
+static struct inode_operations btrfs_dir_inode_operations;
+static struct inode_operations btrfs_symlink_inode_operations;
+static struct inode_operations btrfs_dir_ro_inode_operations;
+static struct inode_operations btrfs_special_inode_operations;
+static struct inode_operations btrfs_file_inode_operations;
+static struct address_space_operations btrfs_aops;
+static struct address_space_operations btrfs_symlink_aops;
+static struct file_operations btrfs_dir_file_operations;
+static struct extent_io_ops btrfs_extent_io_ops;
+
+static struct kmem_cache *btrfs_inode_cachep;
+struct kmem_cache *btrfs_trans_handle_cachep;
+struct kmem_cache *btrfs_transaction_cachep;
+struct kmem_cache *btrfs_bit_radix_cachep;
+struct kmem_cache *btrfs_path_cachep;
+
+#define S_SHIFT 12
+static unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = {
+       [S_IFREG >> S_SHIFT]    = BTRFS_FT_REG_FILE,
+       [S_IFDIR >> S_SHIFT]    = BTRFS_FT_DIR,
+       [S_IFCHR >> S_SHIFT]    = BTRFS_FT_CHRDEV,
+       [S_IFBLK >> S_SHIFT]    = BTRFS_FT_BLKDEV,
+       [S_IFIFO >> S_SHIFT]    = BTRFS_FT_FIFO,
+       [S_IFSOCK >> S_SHIFT]   = BTRFS_FT_SOCK,
+       [S_IFLNK >> S_SHIFT]    = BTRFS_FT_SYMLINK,
+};
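+
+/*
+ * A brief sketch (illustrative only) of how a table like the one above is
+ * indexed: mask the mode down to its format bits and shift them into the
+ * table's range.  The wrapper name here is hypothetical.
+ */
+#if 0
+static inline unsigned char type_by_mode(unsigned int mode)
+{
+       return btrfs_type_by_mode[(mode & S_IFMT) >> S_SHIFT];
+}
+#endif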
+
+static void btrfs_truncate(struct inode *inode);
+static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end);
+static noinline int cow_file_range(struct inode *inode,
+                                  struct page *locked_page,
+                                  u64 start, u64 end, int *page_started,
+                                  unsigned long *nr_written, int unlock);
+
+/*
+ * a very lame attempt at stopping writes when the FS is 85% full.  There
+ * are countless ways this is incorrect, but it is better than nothing.
+ */
+int btrfs_check_free_space(struct btrfs_root *root, u64 num_required,
+                          int for_del)
+{
+       u64 total;
+       u64 used;
+       u64 thresh;
+       int ret = 0;
+
+       spin_lock(&root->fs_info->delalloc_lock);
+       total = btrfs_super_total_bytes(&root->fs_info->super_copy);
+       used = btrfs_super_bytes_used(&root->fs_info->super_copy);
+       if (for_del)
+               thresh = total * 90;
+       else
+               thresh = total * 85;
+
+       do_div(thresh, 100);
+
+       if (used + root->fs_info->delalloc_bytes + num_required > thresh)
+               ret = -ENOSPC;
+       spin_unlock(&root->fs_info->delalloc_lock);
+       return ret;
+}
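+
+/*
+ * A standalone sketch (illustrative only) of the threshold math above: an
+ * integer percentage of a 64-bit total, multiplied first and then divided
+ * by 100 the way do_div() is used here.  Note that total * pct can
+ * overflow for totals above roughly 2^57 bytes.
+ */
+#if 0
+#include <stdio.h>
+
+static unsigned long long percent_of(unsigned long long total, int pct)
+{
+       return total * pct / 100;
+}
+
+int main(void)
+{
+       unsigned long long total = 100ULL << 30;        /* 100 GiB fs */
+
+       printf("85%% threshold: %llu bytes\n", percent_of(total, 85));
+       return 0;
+}
+#endif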
+
+/*
+ * this does all the hard work for inserting an inline extent into
+ * the btree.  The caller should have done a btrfs_drop_extents so that
+ * no overlapping inline items exist in the btree
+ */
+static noinline int insert_inline_extent(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root, struct inode *inode,
+                               u64 start, size_t size, size_t compressed_size,
+                               struct page **compressed_pages)
+{
+       struct btrfs_key key;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct page *page = NULL;
+       char *kaddr;
+       unsigned long ptr;
+       struct btrfs_file_extent_item *ei;
+       int err = 0;
+       int ret;
+       size_t cur_size = size;
+       size_t datasize;
+       unsigned long offset;
+       int use_compress = 0;
+
+       if (compressed_size && compressed_pages) {
+               use_compress = 1;
+               cur_size = compressed_size;
+       }
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       btrfs_set_trans_block_group(trans, inode);
+
+       key.objectid = inode->i_ino;
+       key.offset = start;
+       btrfs_set_key_type(&key, BTRFS_EXTENT_DATA_KEY);
+       datasize = btrfs_file_extent_calc_inline_size(cur_size);
+
+       inode_add_bytes(inode, size);
+       ret = btrfs_insert_empty_item(trans, root, path, &key,
+                                     datasize);
+       if (ret) {
+               err = ret;
+               goto fail;
+       }
+       leaf = path->nodes[0];
+       ei = btrfs_item_ptr(leaf, path->slots[0],
+                           struct btrfs_file_extent_item);
+       btrfs_set_file_extent_generation(leaf, ei, trans->transid);
+       btrfs_set_file_extent_type(leaf, ei, BTRFS_FILE_EXTENT_INLINE);
+       btrfs_set_file_extent_encryption(leaf, ei, 0);
+       btrfs_set_file_extent_other_encoding(leaf, ei, 0);
+       btrfs_set_file_extent_ram_bytes(leaf, ei, size);
+       ptr = btrfs_file_extent_inline_start(ei);
+
+       if (use_compress) {
+               struct page *cpage;
+               int i = 0;
+               while (compressed_size > 0) {
+                       cpage = compressed_pages[i];
+                       cur_size = min_t(unsigned long, compressed_size,
+                                      PAGE_CACHE_SIZE);
+
+                       kaddr = kmap(cpage);
+                       write_extent_buffer(leaf, kaddr, ptr, cur_size);
+                       kunmap(cpage);
+
+                       i++;
+                       ptr += cur_size;
+                       compressed_size -= cur_size;
+               }
+               btrfs_set_file_extent_compression(leaf, ei,
+                                                 BTRFS_COMPRESS_ZLIB);
+       } else {
+               page = find_get_page(inode->i_mapping,
+                                    start >> PAGE_CACHE_SHIFT);
+               btrfs_set_file_extent_compression(leaf, ei, 0);
+               kaddr = kmap_atomic(page, KM_USER0);
+               offset = start & (PAGE_CACHE_SIZE - 1);
+               write_extent_buffer(leaf, kaddr + offset, ptr, size);
+               kunmap_atomic(kaddr, KM_USER0);
+               page_cache_release(page);
+       }
+       btrfs_mark_buffer_dirty(leaf);
+       btrfs_free_path(path);
+
+       BTRFS_I(inode)->disk_i_size = inode->i_size;
+       btrfs_update_inode(trans, root, inode);
+       return 0;
+fail:
+       btrfs_free_path(path);
+       return err;
+}
+
+
+/*
+ * conditionally insert an inline extent into the file.  This
+ * does the checks required to make sure the data is small enough
+ * to fit as an inline extent.
+ */
+static int cow_file_range_inline(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                struct inode *inode, u64 start, u64 end,
+                                size_t compressed_size,
+                                struct page **compressed_pages)
+{
+       u64 isize = i_size_read(inode);
+       u64 actual_end = min(end + 1, isize);
+       u64 inline_len = actual_end - start;
+       u64 aligned_end = (end + root->sectorsize - 1) &
+                       ~((u64)root->sectorsize - 1);
+       u64 hint_byte;
+       u64 data_len = inline_len;
+       int ret;
+
+       if (compressed_size)
+               data_len = compressed_size;
+
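+       /*
+        * inline extents only make sense for small data at the front of
+        * the file: the range must start at offset 0, fit inside a single
+        * page, reach i_size, and stay under both the per-item inline
+        * limit and the max_inline mount option; uncompressed data that
+        * already ends on a sector boundary gains nothing from inlining.
+        */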
+       if (start > 0 ||
+           actual_end >= PAGE_CACHE_SIZE ||
+           data_len >= BTRFS_MAX_INLINE_DATA_SIZE(root) ||
+           (!compressed_size &&
+           (actual_end & (root->sectorsize - 1)) == 0) ||
+           end + 1 < isize ||
+           data_len > root->fs_info->max_inline) {
+               return 1;
+       }
+
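+       /*
+        * drop any extents that overlap the inline range; aligned_end
+        * rounds end + 1 up to a sector boundary, e.g. with a 4k
+        * sectorsize an end offset of 4999 gives aligned_end = 8192.
+        */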
+       ret = btrfs_drop_extents(trans, root, inode, start,
+                                aligned_end, start, &hint_byte);
+       BUG_ON(ret);
+
+       if (isize > actual_end)
+               inline_len = min_t(u64, isize, actual_end);
+       ret = insert_inline_extent(trans, root, inode, start,
+                                  inline_len, compressed_size,
+                                  compressed_pages);
+       BUG_ON(ret);
+       btrfs_drop_extent_cache(inode, start, aligned_end, 0);
+       return 0;
+}
+
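+/*
+ * an async_extent describes one chunk produced by the compression phase
+ * (pages == NULL means the chunk fell back to uncompressed IO); async_cow
+ * carries the whole delalloc range and its list of async_extents through
+ * the ordered work queue.
+ */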
+struct async_extent {
+       u64 start;
+       u64 ram_size;
+       u64 compressed_size;
+       struct page **pages;
+       unsigned long nr_pages;
+       struct list_head list;
+};
+
+struct async_cow {
+       struct inode *inode;
+       struct btrfs_root *root;
+       struct page *locked_page;
+       u64 start;
+       u64 end;
+       struct list_head extents;
+       struct btrfs_work work;
+};
+
+static noinline int add_async_extent(struct async_cow *cow,
+                                    u64 start, u64 ram_size,
+                                    u64 compressed_size,
+                                    struct page **pages,
+                                    unsigned long nr_pages)
+{
+       struct async_extent *async_extent;
+
+       async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS);
+       BUG_ON(!async_extent);
+       async_extent->start = start;
+       async_extent->ram_size = ram_size;
+       async_extent->compressed_size = compressed_size;
+       async_extent->pages = pages;
+       async_extent->nr_pages = nr_pages;
+       list_add_tail(&async_extent->list, &cow->extents);
+       return 0;
+}
+
+/*
+ * we create compressed extents in two phases.  The first
+ * phase compresses a range of pages that have already been
+ * locked (both pages and state bits are locked).
+ *
+ * This is done inside an ordered work queue, and the compression
+ * is spread across many cpus.  The actual IO submission is step
+ * two, and the ordered work queue takes care of making sure that
+ * happens in the same order things were put onto the queue by
+ * writepages and friends.
+ *
+ * If this code finds it can't get good compression, it puts an
+ * entry onto the work queue to write the uncompressed bytes.  This
+ * makes sure that both compressed inodes and uncompressed inodes
+ * are written in the same order that pdflush sent them down.
+ */
+static noinline int compress_file_range(struct inode *inode,
+                                       struct page *locked_page,
+                                       u64 start, u64 end,
+                                       struct async_cow *async_cow,
+                                       int *num_added)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       u64 num_bytes;
+       u64 orig_start;
+       u64 disk_num_bytes;
+       u64 blocksize = root->sectorsize;
+       u64 actual_end;
+       u64 isize = i_size_read(inode);
+       int ret = 0;
+       struct page **pages = NULL;
+       unsigned long nr_pages;
+       unsigned long nr_pages_ret = 0;
+       unsigned long total_compressed = 0;
+       unsigned long total_in = 0;
+       unsigned long max_compressed = 128 * 1024;
+       unsigned long max_uncompressed = 128 * 1024;
+       int i;
+       int will_compress;
+
+       orig_start = start;
+
+       actual_end = min_t(u64, isize, end + 1);
+again:
+       will_compress = 0;
+       nr_pages = (end >> PAGE_CACHE_SHIFT) - (start >> PAGE_CACHE_SHIFT) + 1;
+       nr_pages = min(nr_pages, (128 * 1024UL) / PAGE_CACHE_SIZE);
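+       /* work on at most 128k of data per pass, i.e. 32 pages of 4k */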
+
+       total_compressed = actual_end - start;
+
+       /* we want to make sure that the amount of ram required to uncompress
+        * an extent is reasonable, so we limit the total size in ram
+        * of a compressed extent to 128k.  This is a crucial number
+        * because it also controls how easily we can spread reads across
+        * cpus for decompression.
+        *
+        * We also want to make sure the amount of IO required to do
+        * a random read is reasonably small, so we limit the size of
+        * a compressed extent to 128k.
+        */
+       total_compressed = min(total_compressed, max_uncompressed);
+       num_bytes = (end - start + blocksize) & ~(blocksize - 1);
+       num_bytes = max(blocksize,  num_bytes);
+       disk_num_bytes = num_bytes;
+       total_in = 0;
+       ret = 0;
+
+       /*
+        * we do compression for mount -o compress and when the
+        * inode has not been flagged as nocompress.  This flag can
+        * change at any time if we discover bad compression ratios.
+        */
+       if (!btrfs_test_flag(inode, NOCOMPRESS) &&
+           btrfs_test_opt(root, COMPRESS)) {
+               WARN_ON(pages);
+               pages = kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);
+
+               ret = btrfs_zlib_compress_pages(inode->i_mapping, start,
+                                               total_compressed, pages,
+                                               nr_pages, &nr_pages_ret,
+                                               &total_in,
+                                               &total_compressed,
+                                               max_compressed);
+
+               if (!ret) {
+                       unsigned long offset = total_compressed &
+                               (PAGE_CACHE_SIZE - 1);
+                       struct page *page = pages[nr_pages_ret - 1];
+                       char *kaddr;
+
+                       /* zero the tail end of the last page, we might be
+                        * sending it down to disk
+                        */
+                       if (offset) {
+                               kaddr = kmap_atomic(page, KM_USER0);
+                               memset(kaddr + offset, 0,
+                                      PAGE_CACHE_SIZE - offset);
+                               kunmap_atomic(kaddr, KM_USER0);
+                       }
+                       will_compress = 1;
+               }
+       }
+       if (start == 0) {
+               trans = btrfs_join_transaction(root, 1);
+               BUG_ON(!trans);
+               btrfs_set_trans_block_group(trans, inode);
+
+               /* let's try to make an inline extent */
+               if (ret || total_in < (actual_end - start)) {
+                       /* we didn't compress the entire range, try
+                        * to make an uncompressed inline extent.
+                        */
+                       ret = cow_file_range_inline(trans, root, inode,
+                                                   start, end, 0, NULL);
+               } else {
+                       /* try making a compressed inline extent */
+                       ret = cow_file_range_inline(trans, root, inode,
+                                                   start, end,
+                                                   total_compressed, pages);
+               }
+               btrfs_end_transaction(trans, root);
+               if (ret == 0) {
+                       /*
+                        * inline extent creation worked, we don't need
+                        * to create any more async work items.  Unlock
+                        * and free up our temp pages.
+                        */
+                       extent_clear_unlock_delalloc(inode,
+                                                    &BTRFS_I(inode)->io_tree,
+                                                    start, end, NULL, 1, 0,
+                                                    0, 1, 1, 1);
+                       ret = 0;
+                       goto free_pages_out;
+               }
+       }
+
+       if (will_compress) {
+               /*
+                * we aren't doing an inline extent, so round the compressed
+                * size up to a block size boundary so the allocator does
+                * sane things
+                */
+               total_compressed = (total_compressed + blocksize - 1) &
+                       ~(blocksize - 1);
+
+               /*
+                * one last check to make sure the compression is really a
+                * win, compare the page count read with the blocks on disk
+                */
+               total_in = (total_in + PAGE_CACHE_SIZE - 1) &
+                       ~(PAGE_CACHE_SIZE - 1);
+               if (total_compressed >= total_in) {
+                       will_compress = 0;
+               } else {
+                       disk_num_bytes = total_compressed;
+                       num_bytes = total_in;
+               }
+       }
+       if (!will_compress && pages) {
+               /*
+                * the compression code ran but failed to make things smaller,
+                * free any pages it allocated and our page pointer array
+                */
+               for (i = 0; i < nr_pages_ret; i++) {
+                       WARN_ON(pages[i]->mapping);
+                       page_cache_release(pages[i]);
+               }
+               kfree(pages);
+               pages = NULL;
+               total_compressed = 0;
+               nr_pages_ret = 0;
+
+               /* flag the file so we don't compress in the future */
+               btrfs_set_flag(inode, NOCOMPRESS);
+       }
+       if (will_compress) {
+               *num_added += 1;
+
+               /* the async work queues will take care of doing actual
+                * allocation on disk for these compressed pages,
+                * and will submit them to the elevator.
+                */
+               add_async_extent(async_cow, start, num_bytes,
+                                total_compressed, pages, nr_pages_ret);
+
+               if (start + num_bytes < end && start + num_bytes < actual_end) {
+                       start += num_bytes;
+                       pages = NULL;
+                       cond_resched();
+                       goto again;
+               }
+       } else {
+               /*
+                * No compression, but we still need to write the pages in
+                * the file we've been given so far.  redirty the locked
+                * page if it corresponds to our extent and set things up
+                * for the async work queue to run cow_file_range to do
+                * the normal delalloc dance
+                */
+               if (page_offset(locked_page) >= start &&
+                   page_offset(locked_page) <= end) {
+                       __set_page_dirty_nobuffers(locked_page);
+                       /* unlocked later on in the async handlers */
+               }
+               add_async_extent(async_cow, start, end - start + 1, 0, NULL, 0);
+               *num_added += 1;
+       }
+
+out:
+       return 0;
+
+free_pages_out:
+       for (i = 0; i < nr_pages_ret; i++) {
+               WARN_ON(pages[i]->mapping);
+               page_cache_release(pages[i]);
+       }
+       kfree(pages);
+
+       goto out;
+}
+
+/*
+ * phase two of compressed writeback.  This is the ordered portion
+ * of the code, which only gets called in the order the work was
+ * queued.  We walk all the async extents created by compress_file_range
+ * and send them down to the disk.
+ */
+static noinline int submit_compressed_extents(struct inode *inode,
+                                             struct async_cow *async_cow)
+{
+       struct async_extent *async_extent;
+       u64 alloc_hint = 0;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_key ins;
+       struct extent_map *em;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+       struct extent_io_tree *io_tree;
+       int ret;
+
+       if (list_empty(&async_cow->extents))
+               return 0;
+
+       trans = btrfs_join_transaction(root, 1);
+
+       while (!list_empty(&async_cow->extents)) {
+               async_extent = list_entry(async_cow->extents.next,
+                                         struct async_extent, list);
+               list_del(&async_extent->list);
+
+               io_tree = &BTRFS_I(inode)->io_tree;
+
+               /* did the compression code fall back to uncompressed IO? */
+               if (!async_extent->pages) {
+                       int page_started = 0;
+                       unsigned long nr_written = 0;
+
+                       lock_extent(io_tree, async_extent->start,
+                                   async_extent->start +
+                                   async_extent->ram_size - 1, GFP_NOFS);
+
+                       /* allocate blocks */
+                       cow_file_range(inode, async_cow->locked_page,
+                                      async_extent->start,
+                                      async_extent->start +
+                                      async_extent->ram_size - 1,
+                                      &page_started, &nr_written, 0);
+
+                       /*
+                        * if page_started, cow_file_range inserted an
+                        * inline extent and took care of all the unlocking
+                        * and IO for us.  Otherwise, we need to submit
+                        * all those pages down to the drive.
+                        */
+                       if (!page_started)
+                               extent_write_locked_range(io_tree,
+                                                 inode, async_extent->start,
+                                                 async_extent->start +
+                                                 async_extent->ram_size - 1,
+                                                 btrfs_get_extent,
+                                                 WB_SYNC_ALL);
+                       kfree(async_extent);
+                       cond_resched();
+                       continue;
+               }
+
+               lock_extent(io_tree, async_extent->start,
+                           async_extent->start + async_extent->ram_size - 1,
+                           GFP_NOFS);
+               /*
+                * here we're doing allocation and writeback of the
+                * compressed pages
+                */
+               btrfs_drop_extent_cache(inode, async_extent->start,
+                                       async_extent->start +
+                                       async_extent->ram_size - 1, 0);
+
+               ret = btrfs_reserve_extent(trans, root,
+                                          async_extent->compressed_size,
+                                          async_extent->compressed_size,
+                                          0, alloc_hint,
+                                          (u64)-1, &ins, 1);
+               BUG_ON(ret);
+               em = alloc_extent_map(GFP_NOFS);
+               em->start = async_extent->start;
+               em->len = async_extent->ram_size;
+               em->orig_start = em->start;
+
+               em->block_start = ins.objectid;
+               em->block_len = ins.offset;
+               em->bdev = root->fs_info->fs_devices->latest_bdev;
+               set_bit(EXTENT_FLAG_PINNED, &em->flags);
+               set_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
+
+               while (1) {
+                       spin_lock(&em_tree->lock);
+                       ret = add_extent_mapping(em_tree, em);
+                       spin_unlock(&em_tree->lock);
+                       if (ret != -EEXIST) {
+                               free_extent_map(em);
+                               break;
+                       }
+                       btrfs_drop_extent_cache(inode, async_extent->start,
+                                               async_extent->start +
+                                               async_extent->ram_size - 1, 0);
+               }
+
+               ret = btrfs_add_ordered_extent(inode, async_extent->start,
+                                              ins.objectid,
+                                              async_extent->ram_size,
+                                              ins.offset,
+                                              BTRFS_ORDERED_COMPRESSED);
+               BUG_ON(ret);
+
+               btrfs_end_transaction(trans, root);
+
+               /*
+                * clear dirty, set writeback and unlock the pages.
+                */
+               extent_clear_unlock_delalloc(inode,
+                                            &BTRFS_I(inode)->io_tree,
+                                            async_extent->start,
+                                            async_extent->start +
+                                            async_extent->ram_size - 1,
+                                            NULL, 1, 1, 0, 1, 1, 0);
+
+               ret = btrfs_submit_compressed_write(inode,
+                                   async_extent->start,
+                                   async_extent->ram_size,
+                                   ins.objectid,
+                                   ins.offset, async_extent->pages,
+                                   async_extent->nr_pages);
+
+               BUG_ON(ret);
+               trans = btrfs_join_transaction(root, 1);
+               alloc_hint = ins.objectid + ins.offset;
+               kfree(async_extent);
+               cond_resched();
+       }
+
+       btrfs_end_transaction(trans, root);
+       return 0;
+}
+
+/*
+ * when extent_io.c finds a delayed allocation range in the file,
+ * the callbacks end up in this code.  The basic idea is to
+ * allocate extents on disk for the range, and create ordered data structs
+ * in ram to track those extents.
+ *
+ * locked_page is the page that writepage had locked already.  We use
+ * it to make sure we don't do extra locks or unlocks.
+ *
+ * *page_started is set to one if we unlock locked_page and do everything
+ * required to start IO on it.  It may be clean and already done with
+ * IO when we return.
+ */
+static noinline int cow_file_range(struct inode *inode,
+                                  struct page *locked_page,
+                                  u64 start, u64 end, int *page_started,
+                                  unsigned long *nr_written,
+                                  int unlock)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       u64 alloc_hint = 0;
+       u64 num_bytes;
+       unsigned long ram_size;
+       u64 disk_num_bytes;
+       u64 cur_alloc_size;
+       u64 blocksize = root->sectorsize;
+       u64 actual_end;
+       u64 isize = i_size_read(inode);
+       struct btrfs_key ins;
+       struct extent_map *em;
+       struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+       int ret = 0;
+
+       trans = btrfs_join_transaction(root, 1);
+       BUG_ON(!trans);
+       btrfs_set_trans_block_group(trans, inode);
+
+       actual_end = min_t(u64, isize, end + 1);
+
+       num_bytes = (end - start + blocksize) & ~(blocksize - 1);
+       num_bytes = max(blocksize,  num_bytes);
+       disk_num_bytes = num_bytes;
+       ret = 0;
+
+       if (start == 0) {
+               /* let's try to make an inline extent */
+               ret = cow_file_range_inline(trans, root, inode,
+                                           start, end, 0, NULL);
+               if (ret == 0) {
+                       extent_clear_unlock_delalloc(inode,
+                                                    &BTRFS_I(inode)->io_tree,
+                                                    start, end, NULL, 1, 1,
+                                                    1, 1, 1, 1);
+                       *nr_written = *nr_written +
+                            (end - start + PAGE_CACHE_SIZE) / PAGE_CACHE_SIZE;
+                       *page_started = 1;
+                       ret = 0;
+                       goto out;
+               }
+       }
+
+       BUG_ON(disk_num_bytes >
+              btrfs_super_total_bytes(&root->fs_info->super_copy));
+
+       btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
+
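+       /*
+        * allocate disk space a chunk at a time (at most max_extent bytes),
+        * creating an extent_map and an ordered extent for each chunk until
+        * the whole range is covered.
+        */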
+       while (disk_num_bytes > 0) {
+               cur_alloc_size = min(disk_num_bytes, root->fs_info->max_extent);
+               ret = btrfs_reserve_extent(trans, root, cur_alloc_size,
+                                          root->sectorsize, 0, alloc_hint,
+                                          (u64)-1, &ins, 1);
+               BUG_ON(ret);
+
+               em = alloc_extent_map(GFP_NOFS);
+               em->start = start;
+               em->orig_start = em->start;
+
+               ram_size = ins.offset;
+               em->len = ins.offset;
+
+               em->block_start = ins.objectid;
+               em->block_len = ins.offset;
+               em->bdev = root->fs_info->fs_devices->latest_bdev;
+               set_bit(EXTENT_FLAG_PINNED, &em->flags);
+
+               while (1) {
+                       spin_lock(&em_tree->lock);
+                       ret = add_extent_mapping(em_tree, em);
+                       spin_unlock(&em_tree->lock);
+                       if (ret != -EEXIST) {
+                               free_extent_map(em);
+                               break;
+                       }
+                       btrfs_drop_extent_cache(inode, start,
+                                               start + ram_size - 1, 0);
+               }
+
+               cur_alloc_size = ins.offset;
+               ret = btrfs_add_ordered_extent(inode, start, ins.objectid,
+                                              ram_size, cur_alloc_size, 0);
+               BUG_ON(ret);
+
+               if (root->root_key.objectid ==
+                   BTRFS_DATA_RELOC_TREE_OBJECTID) {
+                       ret = btrfs_reloc_clone_csums(inode, start,
+                                                     cur_alloc_size);
+                       BUG_ON(ret);
+               }
+
+               if (disk_num_bytes < cur_alloc_size)
+                       break;
+
+               /* we're not doing compressed IO, don't unlock the first
+                * page (which the caller expects to stay locked), don't
+                * clear any dirty bits and don't set any writeback bits
+                */
+               extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
+                                            start, start + ram_size - 1,
+                                            locked_page, unlock, 1,
+                                            1, 0, 0, 0);
+               disk_num_bytes -= cur_alloc_size;
+               num_bytes -= cur_alloc_size;
+               alloc_hint = ins.objectid + ins.offset;
+               start += cur_alloc_size;
+       }
+out:
+       ret = 0;
+       btrfs_end_transaction(trans, root);
+
+       return ret;
+}
+
+/*
+ * work queue callback to start compression on a file and its pages
+ */
+static noinline void async_cow_start(struct btrfs_work *work)
+{
+       struct async_cow *async_cow;
+       int num_added = 0;
+       async_cow = container_of(work, struct async_cow, work);
+
+       compress_file_range(async_cow->inode, async_cow->locked_page,
+                           async_cow->start, async_cow->end, async_cow,
+                           &num_added);
+       if (num_added == 0)
+               async_cow->inode = NULL;
+}
+
+/*
+ * work queue callback to submit previously compressed pages
+ */
+static noinline void async_cow_submit(struct btrfs_work *work)
+{
+       struct async_cow *async_cow;
+       struct btrfs_root *root;
+       unsigned long nr_pages;
+
+       async_cow = container_of(work, struct async_cow, work);
+
+       root = async_cow->root;
+       nr_pages = (async_cow->end - async_cow->start + PAGE_CACHE_SIZE) >>
+               PAGE_CACHE_SHIFT;
+
+       atomic_sub(nr_pages, &root->fs_info->async_delalloc_pages);
+
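+       /* wake throttled writers once pending delalloc drops below ~5MB */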
+       if (atomic_read(&root->fs_info->async_delalloc_pages) <
+           5 * 1024 * 1024 &&
+           waitqueue_active(&root->fs_info->async_submit_wait))
+               wake_up(&root->fs_info->async_submit_wait);
+
+       if (async_cow->inode)
+               submit_compressed_extents(async_cow->inode, async_cow);
+}
+
+static noinline void async_cow_free(struct btrfs_work *work)
+{
+       struct async_cow *async_cow;
+       async_cow = container_of(work, struct async_cow, work);
+       kfree(async_cow);
+}
+
+static int cow_file_range_async(struct inode *inode, struct page *locked_page,
+                               u64 start, u64 end, int *page_started,
+                               unsigned long *nr_written)
+{
+       struct async_cow *async_cow;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       unsigned long nr_pages;
+       u64 cur_end;
+       int limit = 10 * 1024 * 1024;
+
+       if (!btrfs_test_opt(root, COMPRESS)) {
+               return cow_file_range(inode, locked_page, start, end,
+                                     page_started, nr_written, 1);
+       }
+
+       clear_extent_bit(&BTRFS_I(inode)->io_tree, start, end, EXTENT_LOCKED |
+                        EXTENT_DELALLOC, 1, 0, GFP_NOFS);
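+       /*
+        * carve the range into chunks (512k each, unless compression is
+        * already ruled out for this inode) and queue one async_cow work
+        * item per chunk.
+        */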
+       while (start < end) {
+               async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
+               BUG_ON(!async_cow);
+               async_cow->inode = inode;
+               async_cow->root = root;
+               async_cow->locked_page = locked_page;
+               async_cow->start = start;
+
+               if (btrfs_test_flag(inode, NOCOMPRESS))
+                       cur_end = end;
+               else
+                       cur_end = min(end, start + 512 * 1024 - 1);
+
+               async_cow->end = cur_end;
+               INIT_LIST_HEAD(&async_cow->extents);
+
+               async_cow->work.func = async_cow_start;
+               async_cow->work.ordered_func = async_cow_submit;
+               async_cow->work.ordered_free = async_cow_free;
+               async_cow->work.flags = 0;
+
+               nr_pages = (cur_end - start + PAGE_CACHE_SIZE) >>
+                       PAGE_CACHE_SHIFT;
+               atomic_add(nr_pages, &root->fs_info->async_delalloc_pages);
+
+               btrfs_queue_worker(&root->fs_info->delalloc_workers,
+                                  &async_cow->work);
+
+               if (atomic_read(&root->fs_info->async_delalloc_pages) > limit) {
+                       wait_event(root->fs_info->async_submit_wait,
+                          (atomic_read(&root->fs_info->async_delalloc_pages) <
+                           limit));
+               }
+
+               while (atomic_read(&root->fs_info->async_submit_draining) &&
+                     atomic_read(&root->fs_info->async_delalloc_pages)) {
+                       wait_event(root->fs_info->async_submit_wait,
+                         (atomic_read(&root->fs_info->async_delalloc_pages) ==
+                          0));
+               }
+
+               *nr_written += nr_pages;
+               start = cur_end + 1;
+       }
+       *page_started = 1;
+       return 0;
+}
+
+static noinline int csum_exist_in_range(struct btrfs_root *root,
+                                       u64 bytenr, u64 num_bytes)
+{
+       int ret;
+       struct btrfs_ordered_sum *sums;
+       LIST_HEAD(list);
+
+       ret = btrfs_lookup_csums_range(root->fs_info->csum_root, bytenr,
+                                      bytenr + num_bytes - 1, &list);
+       if (ret == 0 && list_empty(&list))
+               return 0;
+
+       while (!list_empty(&list)) {
+               sums = list_entry(list.next, struct btrfs_ordered_sum, list);
+               list_del(&sums->list);
+               kfree(sums);
+       }
+       return 1;
+}
+
+/*
+ * called during nocow writeback.  This checks for snapshots or COW copies
+ * of the extents that exist in the file, and COWs the file as required.
+ *
+ * If no cow copies or snapshots exist, we write directly to the existing
+ * blocks on disk
+ */
+static int run_delalloc_nocow(struct inode *inode, struct page *locked_page,
+                             u64 start, u64 end, int *page_started, int force,
+                             unsigned long *nr_written)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       struct extent_buffer *leaf;
+       struct btrfs_path *path;
+       struct btrfs_file_extent_item *fi;
+       struct btrfs_key found_key;
+       u64 cow_start;
+       u64 cur_offset;
+       u64 extent_end;
+       u64 disk_bytenr;
+       u64 num_bytes;
+       int extent_type;
+       int ret;
+       int type;
+       int nocow;
+       int check_prev = 1;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       trans = btrfs_join_transaction(root, 1);
+       BUG_ON(!trans);
+
+       cow_start = (u64)-1;
+       cur_offset = start;
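+       /*
+        * walk the file extent items covering the range; cow_start == -1
+        * means there is currently no pending stretch waiting to be COW'd.
+        */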
+       while (1) {
+               ret = btrfs_lookup_file_extent(trans, root, path, inode->i_ino,
+                                              cur_offset, 0);
+               BUG_ON(ret < 0);
+               if (ret > 0 && path->slots[0] > 0 && check_prev) {
+                       leaf = path->nodes[0];
+                       btrfs_item_key_to_cpu(leaf, &found_key,
+                                             path->slots[0] - 1);
+                       if (found_key.objectid == inode->i_ino &&
+                           found_key.type == BTRFS_EXTENT_DATA_KEY)
+                               path->slots[0]--;
+               }
+               check_prev = 0;
+next_slot:
+               leaf = path->nodes[0];
+               if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+                       ret = btrfs_next_leaf(root, path);
+                       BUG_ON(ret < 0);
+                       if (ret > 0)
+                               break;
+                       leaf = path->nodes[0];
+               }
+
+               nocow = 0;
+               disk_bytenr = 0;
+               num_bytes = 0;
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+               if (found_key.objectid > inode->i_ino ||
+                   found_key.type > BTRFS_EXTENT_DATA_KEY ||
+                   found_key.offset > end)
+                       break;
+
+               if (found_key.offset > cur_offset) {
+                       extent_end = found_key.offset;
+                       goto out_check;
+               }
+
+               fi = btrfs_item_ptr(leaf, path->slots[0],
+                                   struct btrfs_file_extent_item);
+               extent_type = btrfs_file_extent_type(leaf, fi);
+
+               if (extent_type == BTRFS_FILE_EXTENT_REG ||
+                   extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+                       disk_bytenr = btrfs_file_extent_disk_bytenr(leaf, fi);
+                       extent_end = found_key.offset +
+                               btrfs_file_extent_num_bytes(leaf, fi);
+                       if (extent_end <= start) {
+                               path->slots[0]++;
+                               goto next_slot;
+                       }
+                       if (disk_bytenr == 0)
+                               goto out_check;
+                       if (btrfs_file_extent_compression(leaf, fi) ||
+                           btrfs_file_extent_encryption(leaf, fi) ||
+                           btrfs_file_extent_other_encoding(leaf, fi))
+                               goto out_check;
+                       if (extent_type == BTRFS_FILE_EXTENT_REG && !force)
+                               goto out_check;
+                       if (btrfs_extent_readonly(root, disk_bytenr))
+                               goto out_check;
+                       if (btrfs_cross_ref_exist(trans, root, inode->i_ino,
+                                                 disk_bytenr))
+                               goto out_check;
+                       disk_bytenr += btrfs_file_extent_offset(leaf, fi);
+                       disk_bytenr += cur_offset - found_key.offset;
+                       num_bytes = min(end + 1, extent_end) - cur_offset;
+                       /*
+                        * force cow if a csum exists in the range.
+                        * this ensures that the csums for a given extent
+                        * are either all valid or do not exist.
+                        */
+                       if (csum_exist_in_range(root, disk_bytenr, num_bytes))
+                               goto out_check;
+                       nocow = 1;
+               } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+                       extent_end = found_key.offset +
+                               btrfs_file_extent_inline_len(leaf, fi);
+                       extent_end = ALIGN(extent_end, root->sectorsize);
+               } else {
+                       BUG_ON(1);
+               }
+out_check:
+               if (extent_end <= start) {
+                       path->slots[0]++;
+                       goto next_slot;
+               }
+               if (!nocow) {
+                       if (cow_start == (u64)-1)
+                               cow_start = cur_offset;
+                       cur_offset = extent_end;
+                       if (cur_offset > end)
+                               break;
+                       path->slots[0]++;
+                       goto next_slot;
+               }
+
+               btrfs_release_path(root, path);
+               if (cow_start != (u64)-1) {
+                       ret = cow_file_range(inode, locked_page, cow_start,
+                                       found_key.offset - 1, page_started,
+                                       nr_written, 1);
+                       BUG_ON(ret);
+                       cow_start = (u64)-1;
+               }
+
+               if (extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
+                       struct extent_map *em;
+                       struct extent_map_tree *em_tree;
+                       em_tree = &BTRFS_I(inode)->extent_tree;
+                       em = alloc_extent_map(GFP_NOFS);
+                       em->start = cur_offset;
+                       em->orig_start = em->start;
+                       em->len = num_bytes;
+                       em->block_len = num_bytes;
+                       em->block_start = disk_bytenr;
+                       em->bdev = root->fs_info->fs_devices->latest_bdev;
+                       set_bit(EXTENT_FLAG_PINNED, &em->flags);
+                       while (1) {
+                               spin_lock(&em_tree->lock);
+                               ret = add_extent_mapping(em_tree, em);
+                               spin_unlock(&em_tree->lock);
+                               if (ret != -EEXIST) {
+                                       free_extent_map(em);
+                                       break;
+                               }
+                               btrfs_drop_extent_cache(inode, em->start,
+                                               em->start + em->len - 1, 0);
+                       }
+                       type = BTRFS_ORDERED_PREALLOC;
+               } else {
+                       type = BTRFS_ORDERED_NOCOW;
+               }
+
+               ret = btrfs_add_ordered_extent(inode, cur_offset, disk_bytenr,
+                                              num_bytes, num_bytes, type);
+               BUG_ON(ret);
+
+               extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
+                                       cur_offset, cur_offset + num_bytes - 1,
+                                       locked_page, 1, 1, 1, 0, 0, 0);
+               cur_offset = extent_end;
+               if (cur_offset > end)
+                       break;
+       }
+       btrfs_release_path(root, path);
+
+       if (cur_offset <= end && cow_start == (u64)-1)
+               cow_start = cur_offset;
+       if (cow_start != (u64)-1) {
+               ret = cow_file_range(inode, locked_page, cow_start, end,
+                                    page_started, nr_written, 1);
+               BUG_ON(ret);
+       }
+
+       ret = btrfs_end_transaction(trans, root);
+       BUG_ON(ret);
+       btrfs_free_path(path);
+       return 0;
+}
+
+/*
+ * extent_io.c callback to do delayed allocation processing
+ */
+static int run_delalloc_range(struct inode *inode, struct page *locked_page,
+                             u64 start, u64 end, int *page_started,
+                             unsigned long *nr_written)
+{
+       int ret;
+
+       if (btrfs_test_flag(inode, NODATACOW))
+               ret = run_delalloc_nocow(inode, locked_page, start, end,
+                                        page_started, 1, nr_written);
+       else if (btrfs_test_flag(inode, PREALLOC))
+               ret = run_delalloc_nocow(inode, locked_page, start, end,
+                                        page_started, 0, nr_written);
+       else
+               ret = cow_file_range_async(inode, locked_page, start, end,
+                                          page_started, nr_written);
+
+       return ret;
+}
+
+/*
+ * extent_io.c set_bit_hook, used to track delayed allocation
+ * bytes in this file, and to maintain the list of inodes that
+ * have pending delalloc work to be done.
+ */
+static int btrfs_set_bit_hook(struct inode *inode, u64 start, u64 end,
+                      unsigned long old, unsigned long bits)
+{
+       /*
+        * set_bit and clear_bit hooks normally require _irqsave/restore
+        * but in this case, we are only testing for the DELALLOC
+        * bit, which is only set or cleared with irqs on
+        */
+       if (!(old & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
+               struct btrfs_root *root = BTRFS_I(inode)->root;
+               spin_lock(&root->fs_info->delalloc_lock);
+               BTRFS_I(inode)->delalloc_bytes += end - start + 1;
+               root->fs_info->delalloc_bytes += end - start + 1;
+               if (list_empty(&BTRFS_I(inode)->delalloc_inodes)) {
+                       list_add_tail(&BTRFS_I(inode)->delalloc_inodes,
+                                     &root->fs_info->delalloc_inodes);
+               }
+               spin_unlock(&root->fs_info->delalloc_lock);
+       }
+       return 0;
+}
+
+/*
+ * extent_io.c clear_bit_hook, see set_bit_hook for why
+ */
+static int btrfs_clear_bit_hook(struct inode *inode, u64 start, u64 end,
+                        unsigned long old, unsigned long bits)
+{
+       /*
+        * set_bit and clear_bit hooks normally require _irqsave/restore
+        * but in this case, we are only testing for the DELALLOC
+        * bit, which is only set or cleared with irqs on
+        */
+       if ((old & EXTENT_DELALLOC) && (bits & EXTENT_DELALLOC)) {
+               struct btrfs_root *root = BTRFS_I(inode)->root;
+
+               spin_lock(&root->fs_info->delalloc_lock);
+               if (end - start + 1 > root->fs_info->delalloc_bytes) {
+                       printk(KERN_INFO "btrfs warning: delalloc account "
+                              "%llu %llu\n",
+                              (unsigned long long)end - start + 1,
+                              (unsigned long long)
+                              root->fs_info->delalloc_bytes);
+                       root->fs_info->delalloc_bytes = 0;
+                       BTRFS_I(inode)->delalloc_bytes = 0;
+               } else {
+                       root->fs_info->delalloc_bytes -= end - start + 1;
+                       BTRFS_I(inode)->delalloc_bytes -= end - start + 1;
+               }
+               if (BTRFS_I(inode)->delalloc_bytes == 0 &&
+                   !list_empty(&BTRFS_I(inode)->delalloc_inodes)) {
+                       list_del_init(&BTRFS_I(inode)->delalloc_inodes);
+               }
+               spin_unlock(&root->fs_info->delalloc_lock);
+       }
+       return 0;
+}
+
+/*
+ * extent_io.c merge_bio_hook, this must check the chunk tree to make sure
+ * we don't create bios that span stripes or chunks
+ */
+int btrfs_merge_bio_hook(struct page *page, unsigned long offset,
+                        size_t size, struct bio *bio,
+                        unsigned long bio_flags)
+{
+       struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
+       struct btrfs_mapping_tree *map_tree;
+       u64 logical = (u64)bio->bi_sector << 9;
+       u64 length = 0;
+       u64 map_length;
+       int ret;
+
+       if (bio_flags & EXTENT_BIO_COMPRESSED)
+               return 0;
+
+       length = bio->bi_size;
+       map_tree = &root->fs_info->mapping_tree;
+       map_length = length;
+       ret = btrfs_map_block(map_tree, READ, logical,
+                             &map_length, NULL, 0);
+
+       if (map_length < length + size)
+               return 1;
+       return 0;
+}
+
+/*
+ * in order to insert checksums into the metadata in large chunks,
+ * we wait until bio submission time.   All the pages in the bio are
+ * checksummed and sums are attached onto the ordered extent record.
+ *
+ * At IO completion time the csums attached on the ordered extent record
+ * are inserted into the btree
+ */
+static int __btrfs_submit_bio_start(struct inode *inode, int rw,
+                                   struct bio *bio, int mirror_num,
+                                   unsigned long bio_flags)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret = 0;
+
+       ret = btrfs_csum_one_bio(root, inode, bio, 0, 0);
+       BUG_ON(ret);
+       return 0;
+}
+
+/*
+ * in order to insert checksums into the metadata in large chunks,
+ * we wait until bio submission time.   All the pages in the bio are
+ * checksummed and sums are attached onto the ordered extent record.
+ *
+ * At IO completion time the csums attached on the ordered extent record
+ * are inserted into the btree.  This is the second half of the async
+ * submission: the checksums are already done, so just map the bio.
+ */
+static int __btrfs_submit_bio_done(struct inode *inode, int rw, struct bio *bio,
+                         int mirror_num, unsigned long bio_flags)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       return btrfs_map_bio(root, rw, bio, mirror_num, 1);
+}
+
+/*
+ * extent_io.c submission hook. This does the right thing for csum calculation
+ * on write, or reading the csums from the tree before a read
+ */
+static int btrfs_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
+                         int mirror_num, unsigned long bio_flags)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret = 0;
+       int skip_sum;
+
+       skip_sum = btrfs_test_flag(inode, NODATASUM);
+
+       ret = btrfs_bio_wq_end_io(root->fs_info, bio, 0);
+       BUG_ON(ret);
+
+       if (!(rw & (1 << BIO_RW))) {
+               if (bio_flags & EXTENT_BIO_COMPRESSED) {
+                       return btrfs_submit_compressed_read(inode, bio,
+                                                   mirror_num, bio_flags);
+               } else if (!skip_sum)
+                       btrfs_lookup_bio_sums(root, inode, bio, NULL);
+               goto mapit;
+       } else if (!skip_sum) {
+               /* csum items have already been cloned */
+               if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
+                       goto mapit;
+               /* we're doing a write, do the async checksumming */
+               return btrfs_wq_submit_bio(BTRFS_I(inode)->root->fs_info,
+                                  inode, rw, bio, mirror_num,
+                                  bio_flags, __btrfs_submit_bio_start,
+                                  __btrfs_submit_bio_done);
+       }
+
+mapit:
+       return btrfs_map_bio(root, rw, bio, mirror_num, 0);
+}
+
+/*
+ * given a list of ordered sums, record them in the inode.  This happens
+ * at IO completion time based on sums calculated at bio submission time.
+ */
+static noinline int add_pending_csums(struct btrfs_trans_handle *trans,
+                            struct inode *inode, u64 file_offset,
+                            struct list_head *list)
+{
+       struct list_head *cur;
+       struct btrfs_ordered_sum *sum;
+
+       btrfs_set_trans_block_group(trans, inode);
+       list_for_each(cur, list) {
+               sum = list_entry(cur, struct btrfs_ordered_sum, list);
+               btrfs_csum_file_blocks(trans,
+                      BTRFS_I(inode)->root->fs_info->csum_root, sum);
+       }
+       return 0;
+}
+
+int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end)
+{
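+       /*
+        * delalloc ranges are inclusive, so end is expected to sit on the
+        * last byte of a page; a page aligned end is almost certainly a bug.
+        */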
+       WARN_ON((end & (PAGE_CACHE_SIZE - 1)) == 0);
+       return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
+                                  GFP_NOFS);
+}
+
+/* see btrfs_writepage_start_hook for details on why this is required */
+struct btrfs_writepage_fixup {
+       struct page *page;
+       struct btrfs_work work;
+};
+
+static void btrfs_writepage_fixup_worker(struct btrfs_work *work)
+{
+       struct btrfs_writepage_fixup *fixup;
+       struct btrfs_ordered_extent *ordered;
+       struct page *page;
+       struct inode *inode;
+       u64 page_start;
+       u64 page_end;
+
+       fixup = container_of(work, struct btrfs_writepage_fixup, work);
+       page = fixup->page;
+again:
+       lock_page(page);
+       if (!page->mapping || !PageDirty(page) || !PageChecked(page)) {
+               ClearPageChecked(page);
+               goto out_page;
+       }
+
+       inode = page->mapping->host;
+       page_start = page_offset(page);
+       page_end = page_offset(page) + PAGE_CACHE_SIZE - 1;
+
+       lock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end, GFP_NOFS);
+
+       /* already ordered? We're done */
+       if (test_range_bit(&BTRFS_I(inode)->io_tree, page_start, page_end,
+                            EXTENT_ORDERED, 0)) {
+               goto out;
+       }
+
+       ordered = btrfs_lookup_ordered_extent(inode, page_start);
+       if (ordered) {
+               unlock_extent(&BTRFS_I(inode)->io_tree, page_start,
+                             page_end, GFP_NOFS);
+               unlock_page(page);
+               btrfs_start_ordered_extent(inode, ordered, 1);
+               goto again;
+       }
+
+       btrfs_set_extent_delalloc(inode, page_start, page_end);
+       ClearPageChecked(page);
+out:
+       unlock_extent(&BTRFS_I(inode)->io_tree, page_start, page_end, GFP_NOFS);
+out_page:
+       unlock_page(page);
+       page_cache_release(page);
+}
+
+/*
+ * There are a few paths in the higher layers of the kernel that directly
+ * set the page dirty bit without asking the filesystem if it is a
+ * good idea.  This causes problems because we want to make sure COW
+ * properly happens and the data=ordered rules are followed.
+ *
+ * In our case any range that doesn't have the ORDERED bit set
+ * hasn't been properly set up for IO.  We kick off an async process
+ * to fix it up.  The async helper will wait for ordered extents, set
+ * the delalloc bit and make it safe to write the page.
+ */
+static int btrfs_writepage_start_hook(struct page *page, u64 start, u64 end)
+{
+       struct inode *inode = page->mapping->host;
+       struct btrfs_writepage_fixup *fixup;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret;
+
+       ret = test_range_bit(&BTRFS_I(inode)->io_tree, start, end,
+                            EXTENT_ORDERED, 0);
+       if (ret)
+               return 0;
+
+       if (PageChecked(page))
+               return -EAGAIN;
+
+       fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
+       if (!fixup)
+               return -EAGAIN;
+
+       SetPageChecked(page);
+       page_cache_get(page);
+       fixup->work.func = btrfs_writepage_fixup_worker;
+       fixup->page = page;
+       btrfs_queue_worker(&root->fs_info->fixup_workers, &fixup->work);
+       return -EAGAIN;
+}
+
+static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
+                                      struct inode *inode, u64 file_pos,
+                                      u64 disk_bytenr, u64 disk_num_bytes,
+                                      u64 num_bytes, u64 ram_bytes,
+                                      u8 compression, u8 encryption,
+                                      u16 other_encoding, int extent_type)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_file_extent_item *fi;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_key ins;
+       u64 hint;
+       int ret;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       ret = btrfs_drop_extents(trans, root, inode, file_pos,
+                                file_pos + num_bytes, file_pos, &hint);
+       BUG_ON(ret);
+
+       ins.objectid = inode->i_ino;
+       ins.offset = file_pos;
+       ins.type = BTRFS_EXTENT_DATA_KEY;
+       ret = btrfs_insert_empty_item(trans, root, path, &ins, sizeof(*fi));
+       BUG_ON(ret);
+       leaf = path->nodes[0];
+       fi = btrfs_item_ptr(leaf, path->slots[0],
+                           struct btrfs_file_extent_item);
+       btrfs_set_file_extent_generation(leaf, fi, trans->transid);
+       btrfs_set_file_extent_type(leaf, fi, extent_type);
+       btrfs_set_file_extent_disk_bytenr(leaf, fi, disk_bytenr);
+       btrfs_set_file_extent_disk_num_bytes(leaf, fi, disk_num_bytes);
+       btrfs_set_file_extent_offset(leaf, fi, 0);
+       btrfs_set_file_extent_num_bytes(leaf, fi, num_bytes);
+       btrfs_set_file_extent_ram_bytes(leaf, fi, ram_bytes);
+       btrfs_set_file_extent_compression(leaf, fi, compression);
+       btrfs_set_file_extent_encryption(leaf, fi, encryption);
+       btrfs_set_file_extent_other_encoding(leaf, fi, other_encoding);
+       btrfs_mark_buffer_dirty(leaf);
+
+       inode_add_bytes(inode, num_bytes);
+       btrfs_drop_extent_cache(inode, file_pos, file_pos + num_bytes - 1, 0);
+
+       ins.objectid = disk_bytenr;
+       ins.offset = disk_num_bytes;
+       ins.type = BTRFS_EXTENT_ITEM_KEY;
+       ret = btrfs_alloc_reserved_extent(trans, root, leaf->start,
+                                         root->root_key.objectid,
+                                         trans->transid, inode->i_ino, &ins);
+       BUG_ON(ret);
+
+       btrfs_free_path(path);
+       return 0;
+}
+
+/* as ordered data IO finishes, this gets called so we can finish
+ * an ordered extent if the range of bytes in the file it covers is
+ * fully written.
+ */
+static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_ordered_extent *ordered_extent;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       int compressed = 0;
+       int ret;
+
+       ret = btrfs_dec_test_ordered_pending(inode, start, end - start + 1);
+       if (!ret)
+               return 0;
+
+       trans = btrfs_join_transaction(root, 1);
+
+       ordered_extent = btrfs_lookup_ordered_extent(inode, start);
+       BUG_ON(!ordered_extent);
+       if (test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags))
+               goto nocow;
+
+       lock_extent(io_tree, ordered_extent->file_offset,
+                   ordered_extent->file_offset + ordered_extent->len - 1,
+                   GFP_NOFS);
+
+       if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered_extent->flags))
+               compressed = 1;
+       if (test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags)) {
+               BUG_ON(compressed);
+               ret = btrfs_mark_extent_written(trans, root, inode,
+                                               ordered_extent->file_offset,
+                                               ordered_extent->file_offset +
+                                               ordered_extent->len);
+               BUG_ON(ret);
+       } else {
+               ret = insert_reserved_file_extent(trans, inode,
+                                               ordered_extent->file_offset,
+                                               ordered_extent->start,
+                                               ordered_extent->disk_len,
+                                               ordered_extent->len,
+                                               ordered_extent->len,
+                                               compressed, 0, 0,
+                                               BTRFS_FILE_EXTENT_REG);
+               BUG_ON(ret);
+       }
+       unlock_extent(io_tree, ordered_extent->file_offset,
+                   ordered_extent->file_offset + ordered_extent->len - 1,
+                   GFP_NOFS);
+nocow:
+       add_pending_csums(trans, inode, ordered_extent->file_offset,
+                         &ordered_extent->list);
+
+       mutex_lock(&BTRFS_I(inode)->extent_mutex);
+       btrfs_ordered_update_i_size(inode, ordered_extent);
+       btrfs_update_inode(trans, root, inode);
+       btrfs_remove_ordered_extent(inode, ordered_extent);
+       mutex_unlock(&BTRFS_I(inode)->extent_mutex);
+
+       /* once for us */
+       btrfs_put_ordered_extent(ordered_extent);
+       /* once for the tree */
+       btrfs_put_ordered_extent(ordered_extent);
+
+       btrfs_end_transaction(trans, root);
+       return 0;
+}
+
+static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+                               struct extent_state *state, int uptodate)
+{
+       return btrfs_finish_ordered_io(page->mapping->host, start, end);
+}
+
+/*
+ * When IO fails, either with EIO or csum verification fails, we
+ * try other mirrors that might have a good copy of the data.  This
+ * io_failure_record is used to record state as we go through all the
+ * mirrors.  If another mirror has good data, the page is set up to date
+ * and things continue.  If a good mirror can't be found, the original
+ * bio end_io callback is called to indicate things have failed.
+ */
+struct io_failure_record {
+       struct page *page;
+       u64 start;
+       u64 len;
+       u64 logical;
+       unsigned long bio_flags;
+       int last_mirror;
+};
+
+static int btrfs_io_failed_hook(struct bio *failed_bio,
+                        struct page *page, u64 start, u64 end,
+                        struct extent_state *state)
+{
+       struct io_failure_record *failrec = NULL;
+       u64 private;
+       struct extent_map *em;
+       struct inode *inode = page->mapping->host;
+       struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+       struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+       struct bio *bio;
+       int num_copies;
+       int ret;
+       int rw;
+       u64 logical;
+
+       ret = get_state_private(failure_tree, start, &private);
+       if (ret) {
+               failrec = kmalloc(sizeof(*failrec), GFP_NOFS);
+               if (!failrec)
+                       return -ENOMEM;
+               failrec->start = start;
+               failrec->len = end - start + 1;
+               failrec->last_mirror = 0;
+               failrec->bio_flags = 0;
+
+               spin_lock(&em_tree->lock);
+               em = lookup_extent_mapping(em_tree, start, failrec->len);
+               if (em->start > start || em->start + em->len < start) {
+                       free_extent_map(em);
+                       em = NULL;
+               }
+               spin_unlock(&em_tree->lock);
+
+               if (!em || IS_ERR(em)) {
+                       kfree(failrec);
+                       return -EIO;
+               }
+               logical = start - em->start;
+               logical = em->block_start + logical;
+               if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+                       logical = em->block_start;
+                       failrec->bio_flags = EXTENT_BIO_COMPRESSED;
+               }
+               failrec->logical = logical;
+               free_extent_map(em);
+               set_extent_bits(failure_tree, start, end, EXTENT_LOCKED |
+                               EXTENT_DIRTY, GFP_NOFS);
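+               /* stash the failrec pointer in the tree's u64 private field */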
+               set_state_private(failure_tree, start,
+                                (u64)(unsigned long)failrec);
+       } else {
+               failrec = (struct io_failure_record *)(unsigned long)private;
+       }
+       num_copies = btrfs_num_copies(
+                             &BTRFS_I(inode)->root->fs_info->mapping_tree,
+                             failrec->logical, failrec->len);
+       failrec->last_mirror++;
+       if (!state) {
+               spin_lock(&BTRFS_I(inode)->io_tree.lock);
+               state = find_first_extent_bit_state(&BTRFS_I(inode)->io_tree,
+                                                   failrec->start,
+                                                   EXTENT_LOCKED);
+               if (state && state->start != failrec->start)
+                       state = NULL;
+               spin_unlock(&BTRFS_I(inode)->io_tree.lock);
+       }
+       if (!state || failrec->last_mirror > num_copies) {
+               set_state_private(failure_tree, failrec->start, 0);
+               clear_extent_bits(failure_tree, failrec->start,
+                                 failrec->start + failrec->len - 1,
+                                 EXTENT_LOCKED | EXTENT_DIRTY, GFP_NOFS);
+               kfree(failrec);
+               return -EIO;
+       }
+       bio = bio_alloc(GFP_NOFS, 1);
+       bio->bi_private = state;
+       bio->bi_end_io = failed_bio->bi_end_io;
+       bio->bi_sector = failrec->logical >> 9;
+       bio->bi_bdev = failed_bio->bi_bdev;
+       bio->bi_size = 0;
+
+       bio_add_page(bio, page, failrec->len, start - page_offset(page));
+       if (failed_bio->bi_rw & (1 << BIO_RW))
+               rw = WRITE;
+       else
+               rw = READ;
+
+       BTRFS_I(inode)->io_tree.ops->submit_bio_hook(inode, rw, bio,
+                                                     failrec->last_mirror,
+                                                     failrec->bio_flags);
+       return 0;
+}
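
The hook above retries the failed range against each remaining mirror and only reports EIO once last_mirror exceeds the number of copies. A small userspace sketch of that retry policy; read_mirror() is a hypothetical stand-in for resubmitting the bio, not a btrfs call.

#include <errno.h>
#include <stdio.h>

#define NUM_COPIES 2

/* pretend mirror 2 holds the only good copy */
static int read_mirror(int mirror)
{
	return mirror == 2 ? 0 : -EIO;
}

static int read_with_retries(void)
{
	int last_mirror = 0;

	while (1) {
		last_mirror++;
		if (last_mirror > NUM_COPIES)
			return -EIO;	/* no good copy anywhere */
		if (read_mirror(last_mirror) == 0)
			return 0;	/* this mirror had good data */
	}
}

int main(void)
{
	printf("read %s\n", read_with_retries() ? "failed" : "ok");
	return 0;
}
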
+
+/*
+ * each time an IO finishes, we do a fast check in the IO failure tree
+ * to see if we need to process or clean up an io_failure_record
+ */
+static int btrfs_clean_io_failures(struct inode *inode, u64 start)
+{
+       u64 private;
+       u64 private_failure;
+       struct io_failure_record *failure;
+       int ret;
+
+       private = 0;
+       if (count_range_bits(&BTRFS_I(inode)->io_failure_tree, &private,
+                            (u64)-1, 1, EXTENT_DIRTY)) {
+               ret = get_state_private(&BTRFS_I(inode)->io_failure_tree,
+                                       start, &private_failure);
+               if (ret == 0) {
+                       failure = (struct io_failure_record *)(unsigned long)
+                                  private_failure;
+                       set_state_private(&BTRFS_I(inode)->io_failure_tree,
+                                         failure->start, 0);
+                       clear_extent_bits(&BTRFS_I(inode)->io_failure_tree,
+                                         failure->start,
+                                         failure->start + failure->len - 1,
+                                         EXTENT_DIRTY | EXTENT_LOCKED,
+                                         GFP_NOFS);
+                       kfree(failure);
+               }
+       }
+       return 0;
+}
+
+/*
+ * when reads are done, we need to check csums to verify the data is
+ * correct.  If there's a match, we allow the bio to finish.  If not, we
+ * go through the io_failure_record routines to find good copies
+ */
+static int btrfs_readpage_end_io_hook(struct page *page, u64 start, u64 end,
+                              struct extent_state *state)
+{
+       size_t offset = start - ((u64)page->index << PAGE_CACHE_SHIFT);
+       struct inode *inode = page->mapping->host;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       char *kaddr;
+       u64 private = ~(u32)0;
+       int ret;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       u32 csum = ~(u32)0;
+
+       if (PageChecked(page)) {
+               ClearPageChecked(page);
+               goto good;
+       }
+       if (btrfs_test_flag(inode, NODATASUM))
+               return 0;
+
+       if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
+           test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1)) {
+               clear_extent_bits(io_tree, start, end, EXTENT_NODATASUM,
+                                 GFP_NOFS);
+               return 0;
+       }
+
+       if (state && state->start == start) {
+               private = state->private;
+               ret = 0;
+       } else {
+               ret = get_state_private(io_tree, start, &private);
+       }
+       kaddr = kmap_atomic(page, KM_USER0);
+       if (ret)
+               goto zeroit;
+
+       csum = btrfs_csum_data(root, kaddr + offset, csum, end - start + 1);
+       btrfs_csum_final(csum, (char *)&csum);
+       if (csum != private)
+               goto zeroit;
+
+       kunmap_atomic(kaddr, KM_USER0);
+good:
+       /* if the io failure tree for this inode is non-empty,
+        * check to see if we've recovered from a failed IO
+        */
+       btrfs_clean_io_failures(inode, start);
+       return 0;
+
+zeroit:
+       printk(KERN_INFO "btrfs csum failed ino %lu off %llu csum %u "
+              "private %llu\n", page->mapping->host->i_ino,
+              (unsigned long long)start, csum,
+              (unsigned long long)private);
+       memset(kaddr + offset, 1, end - start + 1);
+       flush_dcache_page(page);
+       kunmap_atomic(kaddr, KM_USER0);
+       if (private == 0)
+               return 0;
+       return -EIO;
+}
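
A sketch of the verify-or-poison control flow above, with a toy checksum standing in for btrfs's actual csum routines. The point is the ordering: a mismatched page is overwritten before -EIO is returned, so stale disk data is never exposed to the reader.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint32_t toy_csum(const unsigned char *buf, size_t len)
{
	uint32_t sum = ~(uint32_t)0;
	size_t i;

	for (i = 0; i < len; i++)
		sum = sum * 31 + buf[i];
	return sum;
}

static int end_io_check(unsigned char *buf, size_t len, uint32_t stored)
{
	if (toy_csum(buf, len) == stored)
		return 0;		/* data verified, bio may finish */

	memset(buf, 1, len);		/* poison the page, as zeroit: does */
	return -5;			/* -EIO */
}

int main(void)
{
	unsigned char data[16] = "hello, csums";
	uint32_t good = toy_csum(data, sizeof(data));

	printf("intact:    %d\n", end_io_check(data, sizeof(data), good));
	data[0] ^= 0xff;		/* simulate on-disk corruption */
	printf("corrupted: %d\n", end_io_check(data, sizeof(data), good));
	return 0;
}
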
+
+/*
+ * This creates an orphan entry for the given inode in case something goes
+ * wrong in the middle of an unlink/truncate.
+ */
+int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret = 0;
+
+       spin_lock(&root->list_lock);
+
+       /* already on the orphan list, we're good */
+       if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+               spin_unlock(&root->list_lock);
+               return 0;
+       }
+
+       list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+
+       spin_unlock(&root->list_lock);
+
+       /*
+        * insert an orphan item to track this unlinked/truncated file
+        */
+       ret = btrfs_insert_orphan_item(trans, root, inode->i_ino);
+
+       return ret;
+}
+
+/*
+ * We have done the truncate/delete so we can go ahead and remove the orphan
+ * item for this particular inode.
+ */
+int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret = 0;
+
+       spin_lock(&root->list_lock);
+
+       if (list_empty(&BTRFS_I(inode)->i_orphan)) {
+               spin_unlock(&root->list_lock);
+               return 0;
+       }
+
+       list_del_init(&BTRFS_I(inode)->i_orphan);
+       if (!trans) {
+               spin_unlock(&root->list_lock);
+               return 0;
+       }
+
+       spin_unlock(&root->list_lock);
+
+       ret = btrfs_del_orphan_item(trans, root, inode->i_ino);
+
+       return ret;
+}
+
+/*
+ * this cleans up any orphans that may be left on the list from the last use
+ * of this root.
+ */
+void btrfs_orphan_cleanup(struct btrfs_root *root)
+{
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_item *item;
+       struct btrfs_key key, found_key;
+       struct btrfs_trans_handle *trans;
+       struct inode *inode;
+       int ret = 0, nr_unlink = 0, nr_truncate = 0;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return;
+       path->reada = -1;
+
+       key.objectid = BTRFS_ORPHAN_OBJECTID;
+       btrfs_set_key_type(&key, BTRFS_ORPHAN_ITEM_KEY);
+       key.offset = (u64)-1;
+
+       while (1) {
+               ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+               if (ret < 0) {
+                       printk(KERN_ERR
+                              "Error searching slot for orphan: %d\n", ret);
+                       break;
+               }
+
+               /*
+                * ret == 0 means we found what we were searching for, which
+                * is weird, but possible, so only screw with the path if we
+                * didn't find the key and see if we have stuff that matches
+                */
+               if (ret > 0) {
+                       if (path->slots[0] == 0)
+                               break;
+                       path->slots[0]--;
+               }
+
+               /* pull out the item */
+               leaf = path->nodes[0];
+               item = btrfs_item_nr(leaf, path->slots[0]);
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+               /* make sure the item matches what we want */
+               if (found_key.objectid != BTRFS_ORPHAN_OBJECTID)
+                       break;
+               if (btrfs_key_type(&found_key) != BTRFS_ORPHAN_ITEM_KEY)
+                       break;
+
+               /* release the path since we're done with it */
+               btrfs_release_path(root, path);
+
+               /*
+                * this is where we are basically btrfs_lookup, without the
+                * crossing root thing.  we store the inode number in the
+                * offset of the orphan item.
+                */
+               inode = btrfs_iget_locked(root->fs_info->sb,
+                                         found_key.offset, root);
+               if (!inode)
+                       break;
+
+               if (inode->i_state & I_NEW) {
+                       BTRFS_I(inode)->root = root;
+
+                       /* have to set the location manually */
+                       BTRFS_I(inode)->location.objectid = inode->i_ino;
+                       BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
+                       BTRFS_I(inode)->location.offset = 0;
+
+                       btrfs_read_locked_inode(inode);
+                       unlock_new_inode(inode);
+               }
+
+               /*
+                * add this inode to the orphan list so btrfs_orphan_del does
+                * the proper thing when we hit it
+                */
+               spin_lock(&root->list_lock);
+               list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+               spin_unlock(&root->list_lock);
+
+               /*
+                * if this is a bad inode, it means we actually succeeded in
+                * removing the inode, but not the orphan record, which means
+                * we need to manually delete the orphan since iput will just
+                * do a destroy_inode
+                */
+               if (is_bad_inode(inode)) {
+                       trans = btrfs_start_transaction(root, 1);
+                       btrfs_orphan_del(trans, inode);
+                       btrfs_end_transaction(trans, root);
+                       iput(inode);
+                       continue;
+               }
+
+               /* if we have links, this was a truncate, let's do that */
+               if (inode->i_nlink) {
+                       nr_truncate++;
+                       btrfs_truncate(inode);
+               } else {
+                       nr_unlink++;
+               }
+
+               /* this will do delete_inode and everything for us */
+               iput(inode);
+       }
+
+       if (nr_unlink)
+               printk(KERN_INFO "btrfs: unlinked %d orphans\n", nr_unlink);
+       if (nr_truncate)
+               printk(KERN_INFO "btrfs: truncated %d orphans\n", nr_truncate);
+
+       btrfs_free_path(path);
+}
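
A toy model of the orphan protocol implemented by btrfs_orphan_add/del and the cleanup above, with an in-memory array standing in for orphan items in the tree: record intent before the unlink/truncate starts, clear it when the work commits, and whatever is still recorded at "mount" was interrupted by a crash and must be finished off.

#include <stdio.h>

#define MAX_ORPHANS 8

static unsigned long orphans[MAX_ORPHANS];	/* 0 = empty slot */

static void orphan_add(unsigned long ino)
{
	int i;

	for (i = 0; i < MAX_ORPHANS; i++)
		if (!orphans[i]) {
			orphans[i] = ino;
			return;
		}
}

static void orphan_del(unsigned long ino)
{
	int i;

	for (i = 0; i < MAX_ORPHANS; i++)
		if (orphans[i] == ino)
			orphans[i] = 0;
}

static void orphan_cleanup(void)
{
	int i;

	for (i = 0; i < MAX_ORPHANS; i++)
		if (orphans[i])
			printf("replaying delete of inode %lu\n", orphans[i]);
}

int main(void)
{
	orphan_add(100);	/* unlink of inode 100 starts */
	orphan_del(100);	/* ...and commits cleanly */
	orphan_add(200);	/* unlink of inode 200 starts... */
	/* crash here: inode 200 never finished */
	orphan_cleanup();	/* mount-time replay finds inode 200 */
	return 0;
}
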
+
+/*
+ * read an inode from the btree into the in-memory inode
+ */
+void btrfs_read_locked_inode(struct inode *inode)
+{
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_inode_item *inode_item;
+       struct btrfs_timespec *tspec;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_key location;
+       u64 alloc_group_block;
+       u32 rdev;
+       int ret;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       memcpy(&location, &BTRFS_I(inode)->location, sizeof(location));
+
+       ret = btrfs_lookup_inode(NULL, root, path, &location, 0);
+       if (ret)
+               goto make_bad;
+
+       leaf = path->nodes[0];
+       inode_item = btrfs_item_ptr(leaf, path->slots[0],
+                                   struct btrfs_inode_item);
+
+       inode->i_mode = btrfs_inode_mode(leaf, inode_item);
+       inode->i_nlink = btrfs_inode_nlink(leaf, inode_item);
+       inode->i_uid = btrfs_inode_uid(leaf, inode_item);
+       inode->i_gid = btrfs_inode_gid(leaf, inode_item);
+       btrfs_i_size_write(inode, btrfs_inode_size(leaf, inode_item));
+
+       tspec = btrfs_inode_atime(inode_item);
+       inode->i_atime.tv_sec = btrfs_timespec_sec(leaf, tspec);
+       inode->i_atime.tv_nsec = btrfs_timespec_nsec(leaf, tspec);
+
+       tspec = btrfs_inode_mtime(inode_item);
+       inode->i_mtime.tv_sec = btrfs_timespec_sec(leaf, tspec);
+       inode->i_mtime.tv_nsec = btrfs_timespec_nsec(leaf, tspec);
+
+       tspec = btrfs_inode_ctime(inode_item);
+       inode->i_ctime.tv_sec = btrfs_timespec_sec(leaf, tspec);
+       inode->i_ctime.tv_nsec = btrfs_timespec_nsec(leaf, tspec);
+
+       inode_set_bytes(inode, btrfs_inode_nbytes(leaf, inode_item));
+       BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
+       BTRFS_I(inode)->sequence = btrfs_inode_sequence(leaf, inode_item);
+       inode->i_generation = BTRFS_I(inode)->generation;
+       inode->i_rdev = 0;
+       rdev = btrfs_inode_rdev(leaf, inode_item);
+
+       BTRFS_I(inode)->index_cnt = (u64)-1;
+       BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
+
+       alloc_group_block = btrfs_inode_block_group(leaf, inode_item);
+       BTRFS_I(inode)->block_group = btrfs_find_block_group(root, 0,
+                                               alloc_group_block, 0);
+       btrfs_free_path(path);
+       inode_item = NULL;
+
+       switch (inode->i_mode & S_IFMT) {
+       case S_IFREG:
+               inode->i_mapping->a_ops = &btrfs_aops;
+               inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+               BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
+               inode->i_fop = &btrfs_file_operations;
+               inode->i_op = &btrfs_file_inode_operations;
+               break;
+       case S_IFDIR:
+               inode->i_fop = &btrfs_dir_file_operations;
+               if (root == root->fs_info->tree_root)
+                       inode->i_op = &btrfs_dir_ro_inode_operations;
+               else
+                       inode->i_op = &btrfs_dir_inode_operations;
+               break;
+       case S_IFLNK:
+               inode->i_op = &btrfs_symlink_inode_operations;
+               inode->i_mapping->a_ops = &btrfs_symlink_aops;
+               inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+               break;
+       default:
+               init_special_inode(inode, inode->i_mode, rdev);
+               break;
+       }
+       return;
+
+make_bad:
+       btrfs_free_path(path);
+       make_bad_inode(inode);
+}
+
+/*
+ * given a leaf and an inode, copy the inode fields into the leaf
+ */
+static void fill_inode_item(struct btrfs_trans_handle *trans,
+                           struct extent_buffer *leaf,
+                           struct btrfs_inode_item *item,
+                           struct inode *inode)
+{
+       btrfs_set_inode_uid(leaf, item, inode->i_uid);
+       btrfs_set_inode_gid(leaf, item, inode->i_gid);
+       btrfs_set_inode_size(leaf, item, BTRFS_I(inode)->disk_i_size);
+       btrfs_set_inode_mode(leaf, item, inode->i_mode);
+       btrfs_set_inode_nlink(leaf, item, inode->i_nlink);
+
+       btrfs_set_timespec_sec(leaf, btrfs_inode_atime(item),
+                              inode->i_atime.tv_sec);
+       btrfs_set_timespec_nsec(leaf, btrfs_inode_atime(item),
+                               inode->i_atime.tv_nsec);
+
+       btrfs_set_timespec_sec(leaf, btrfs_inode_mtime(item),
+                              inode->i_mtime.tv_sec);
+       btrfs_set_timespec_nsec(leaf, btrfs_inode_mtime(item),
+                               inode->i_mtime.tv_nsec);
+
+       btrfs_set_timespec_sec(leaf, btrfs_inode_ctime(item),
+                              inode->i_ctime.tv_sec);
+       btrfs_set_timespec_nsec(leaf, btrfs_inode_ctime(item),
+                               inode->i_ctime.tv_nsec);
+
+       btrfs_set_inode_nbytes(leaf, item, inode_get_bytes(inode));
+       btrfs_set_inode_generation(leaf, item, BTRFS_I(inode)->generation);
+       btrfs_set_inode_sequence(leaf, item, BTRFS_I(inode)->sequence);
+       btrfs_set_inode_transid(leaf, item, trans->transid);
+       btrfs_set_inode_rdev(leaf, item, inode->i_rdev);
+       btrfs_set_inode_flags(leaf, item, BTRFS_I(inode)->flags);
+       btrfs_set_inode_block_group(leaf, item, BTRFS_I(inode)->block_group);
+}
+
+/*
+ * copy everything in the in-memory inode into the btree.
+ */
+noinline int btrfs_update_inode(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root, struct inode *inode)
+{
+       struct btrfs_inode_item *inode_item;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       int ret;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_lookup_inode(trans, root, path,
+                                &BTRFS_I(inode)->location, 1);
+       if (ret) {
+               if (ret > 0)
+                       ret = -ENOENT;
+               goto failed;
+       }
+
+       leaf = path->nodes[0];
+       inode_item = btrfs_item_ptr(leaf, path->slots[0],
+                                 struct btrfs_inode_item);
+
+       fill_inode_item(trans, leaf, inode_item, inode);
+       btrfs_mark_buffer_dirty(leaf);
+       btrfs_set_inode_last_trans(trans, inode);
+       ret = 0;
+failed:
+       btrfs_free_path(path);
+       return ret;
+}
+
+/*
+ * unlink helper that gets used here in inode.c and in the tree logging
+ * recovery code.  It removes a link in a directory with a given name, and
+ * also drops the back refs in the inode to the directory
+ */
+int btrfs_unlink_inode(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *root,
+                      struct inode *dir, struct inode *inode,
+                      const char *name, int name_len)
+{
+       struct btrfs_path *path;
+       int ret = 0;
+       struct extent_buffer *leaf;
+       struct btrfs_dir_item *di;
+       struct btrfs_key key;
+       u64 index;
+
+       path = btrfs_alloc_path();
+       if (!path) {
+               ret = -ENOMEM;
+               goto err;
+       }
+
+       di = btrfs_lookup_dir_item(trans, root, path, dir->i_ino,
+                                   name, name_len, -1);
+       if (IS_ERR(di)) {
+               ret = PTR_ERR(di);
+               goto err;
+       }
+       if (!di) {
+               ret = -ENOENT;
+               goto err;
+       }
+       leaf = path->nodes[0];
+       btrfs_dir_item_key_to_cpu(leaf, di, &key);
+       ret = btrfs_delete_one_dir_name(trans, root, path, di);
+       if (ret)
+               goto err;
+       btrfs_release_path(root, path);
+
+       ret = btrfs_del_inode_ref(trans, root, name, name_len,
+                                 inode->i_ino,
+                                 dir->i_ino, &index);
+       if (ret) {
+               printk(KERN_INFO "btrfs failed to delete reference to %.*s, "
+                      "inode %lu parent %lu\n", name_len, name,
+                      inode->i_ino, dir->i_ino);
+               goto err;
+       }
+
+       di = btrfs_lookup_dir_index_item(trans, root, path, dir->i_ino,
+                                        index, name, name_len, -1);
+       if (IS_ERR(di)) {
+               ret = PTR_ERR(di);
+               goto err;
+       }
+       if (!di) {
+               ret = -ENOENT;
+               goto err;
+       }
+       ret = btrfs_delete_one_dir_name(trans, root, path, di);
+       btrfs_release_path(root, path);
+
+       ret = btrfs_del_inode_ref_in_log(trans, root, name, name_len,
+                                        inode, dir->i_ino);
+       BUG_ON(ret != 0 && ret != -ENOENT);
+       if (ret != -ENOENT)
+               BTRFS_I(dir)->log_dirty_trans = trans->transid;
+
+       ret = btrfs_del_dir_entries_in_log(trans, root, name, name_len,
+                                          dir, index);
+       BUG_ON(ret);
+err:
+       btrfs_free_path(path);
+       if (ret)
+               goto out;
+
+       btrfs_i_size_write(dir, dir->i_size - name_len * 2);
+       inode->i_ctime = dir->i_mtime = dir->i_ctime = CURRENT_TIME;
+       btrfs_update_inode(trans, root, dir);
+       btrfs_drop_nlink(inode);
+       ret = btrfs_update_inode(trans, root, inode);
+       dir->i_sb->s_dirt = 1;
+out:
+       return ret;
+}
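
The "dir->i_size - name_len * 2" above reflects that btrfs charges each name against the directory size twice, once for the hash-keyed dir item and once for the sequence-keyed dir index entry, so a removal must subtract both. A sketch of that accounting:

#include <stdio.h>
#include <string.h>

static unsigned long dir_size;

static void dir_add_entry(const char *name)
{
	dir_size += 2 * strlen(name);	/* dir item + dir index */
}

static void dir_del_entry(const char *name)
{
	dir_size -= 2 * strlen(name);	/* drop both charges */
}

int main(void)
{
	dir_add_entry("hello");
	dir_add_entry("world");
	printf("size after adds:   %lu\n", dir_size);	/* 20 */
	dir_del_entry("hello");
	printf("size after unlink: %lu\n", dir_size);	/* 10 */
	return 0;
}
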
+
+static int btrfs_unlink(struct inode *dir, struct dentry *dentry)
+{
+       struct btrfs_root *root;
+       struct btrfs_trans_handle *trans;
+       struct inode *inode = dentry->d_inode;
+       int ret;
+       unsigned long nr = 0;
+
+       root = BTRFS_I(dir)->root;
+
+       ret = btrfs_check_free_space(root, 1, 1);
+       if (ret)
+               goto fail;
+
+       trans = btrfs_start_transaction(root, 1);
+
+       btrfs_set_trans_block_group(trans, dir);
+       ret = btrfs_unlink_inode(trans, root, dir, dentry->d_inode,
+                                dentry->d_name.name, dentry->d_name.len);
+
+       if (inode->i_nlink == 0)
+               ret = btrfs_orphan_add(trans, inode);
+
+       nr = trans->blocks_used;
+
+       btrfs_end_transaction_throttle(trans, root);
+fail:
+       btrfs_btree_balance_dirty(root, nr);
+       return ret;
+}
+
+static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
+{
+       struct inode *inode = dentry->d_inode;
+       int err = 0;
+       int ret;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       struct btrfs_trans_handle *trans;
+       unsigned long nr = 0;
+
+       /*
+        * the FIRST_FREE_OBJECTID check makes sure we don't try to rmdir
+        * the root of a subvolume or snapshot
+        */
+       if (inode->i_size > BTRFS_EMPTY_DIR_SIZE ||
+           inode->i_ino == BTRFS_FIRST_FREE_OBJECTID) {
+               return -ENOTEMPTY;
+       }
+
+       ret = btrfs_check_free_space(root, 1, 1);
+       if (ret)
+               goto fail;
+
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, dir);
+
+       err = btrfs_orphan_add(trans, inode);
+       if (err)
+               goto fail_trans;
+
+       /* now the directory is empty */
+       err = btrfs_unlink_inode(trans, root, dir, dentry->d_inode,
+                                dentry->d_name.name, dentry->d_name.len);
+       if (!err)
+               btrfs_i_size_write(inode, 0);
+
+fail_trans:
+       nr = trans->blocks_used;
+       ret = btrfs_end_transaction_throttle(trans, root);
+fail:
+       btrfs_btree_balance_dirty(root, nr);
+
+       if (ret && !err)
+               err = ret;
+       return err;
+}
+
+#if 0
+/*
+ * when truncating bytes in a file, it is possible to avoid reading
+ * the leaves that contain only checksum items.  This can be the
+ * majority of the IO required to delete a large file, but it must
+ * be done carefully.
+ *
+ * The keys in the level just above the leaves are checked to make sure
+ * the lowest key in a given leaf is a csum key, and starts at an offset
+ * after the new size.
+ *
+ * Then the key for the next leaf is checked to make sure it also has
+ * a checksum item for the same file.  If it does, we know our target leaf
+ * contains only checksum items, and it can be safely freed without reading
+ * it.
+ *
+ * This is just an optimization targeted at large files.  It may do
+ * nothing.  It will return 0 unless things went badly.
+ */
+static noinline int drop_csum_leaves(struct btrfs_trans_handle *trans,
+                                    struct btrfs_root *root,
+                                    struct btrfs_path *path,
+                                    struct inode *inode, u64 new_size)
+{
+       struct btrfs_key key;
+       int ret;
+       int nritems;
+       struct btrfs_key found_key;
+       struct btrfs_key other_key;
+       struct btrfs_leaf_ref *ref;
+       u64 leaf_gen;
+       u64 leaf_start;
+
+       path->lowest_level = 1;
+       key.objectid = inode->i_ino;
+       key.type = BTRFS_CSUM_ITEM_KEY;
+       key.offset = new_size;
+again:
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret < 0)
+               goto out;
+
+       if (path->nodes[1] == NULL) {
+               ret = 0;
+               goto out;
+       }
+       ret = 0;
+       btrfs_node_key_to_cpu(path->nodes[1], &found_key, path->slots[1]);
+       nritems = btrfs_header_nritems(path->nodes[1]);
+
+       if (!nritems)
+               goto out;
+
+       if (path->slots[1] >= nritems)
+               goto next_node;
+
+       /* did we find a key greater than anything we want to delete? */
+       if (found_key.objectid > inode->i_ino ||
+          (found_key.objectid == inode->i_ino && found_key.type > key.type))
+               goto out;
+
+       /* we check the next key in the node to make sure the leaf contains
+        * only checksum items.  This comparison doesn't work if our
+        * leaf is the last one in the node
+        */
+       if (path->slots[1] + 1 >= nritems) {
+next_node:
+               /* search forward from the last key in the node; this
+                * will bring us into the next node in the tree
+                */
+               btrfs_node_key_to_cpu(path->nodes[1], &found_key, nritems - 1);
+
+               /* unlikely, but we inc below, so check to be safe */
+               if (found_key.offset == (u64)-1)
+                       goto out;
+
+               /* search_forward needs a path with locks held; do the
+                * search again for the original key.  It is possible
+                * this will race with a balance and return a path that
+                * we could modify, but this drop is just an optimization
+                * and is allowed to miss some leaves.
+                */
+               btrfs_release_path(root, path);
+               found_key.offset++;
+
+               /* setup a max key for search_forward */
+               other_key.offset = (u64)-1;
+               other_key.type = key.type;
+               other_key.objectid = key.objectid;
+
+               path->keep_locks = 1;
+               ret = btrfs_search_forward(root, &found_key, &other_key,
+                                          path, 0, 0);
+               path->keep_locks = 0;
+               if (ret || found_key.objectid != key.objectid ||
+                   found_key.type != key.type) {
+                       ret = 0;
+                       goto out;
+               }
+
+               key.offset = found_key.offset;
+               btrfs_release_path(root, path);
+               cond_resched();
+               goto again;
+       }
+
+       /* we know there's one more slot after us in the tree;
+        * read that key so we can verify it is also a checksum item
+        */
+       btrfs_node_key_to_cpu(path->nodes[1], &other_key, path->slots[1] + 1);
+
+       if (found_key.objectid < inode->i_ino)
+               goto next_key;
+
+       if (found_key.type != key.type || found_key.offset < new_size)
+               goto next_key;
+
+       /*
+        * if the key for the next leaf isn't a csum key from this objectid,
+        * we can't be sure there aren't good items inside this leaf.
+        * Bail out
+        */
+       if (other_key.objectid != inode->i_ino || other_key.type != key.type)
+               goto out;
+
+       leaf_start = btrfs_node_blockptr(path->nodes[1], path->slots[1]);
+       leaf_gen = btrfs_node_ptr_generation(path->nodes[1], path->slots[1]);
+       /*
+        * it is safe to delete this leaf, it contains only
+        * csum items from this inode at an offset >= new_size
+        */
+       ret = btrfs_del_leaf(trans, root, path, leaf_start);
+       BUG_ON(ret);
+
+       if (root->ref_cows && leaf_gen < trans->transid) {
+               ref = btrfs_alloc_leaf_ref(root, 0);
+               if (ref) {
+                       ref->root_gen = root->root_key.offset;
+                       ref->bytenr = leaf_start;
+                       ref->owner = 0;
+                       ref->generation = leaf_gen;
+                       ref->nritems = 0;
+
+                       ret = btrfs_add_leaf_ref(root, ref, 0);
+                       WARN_ON(ret);
+                       btrfs_free_leaf_ref(root, ref);
+               } else {
+                       WARN_ON(1);
+               }
+       }
+next_key:
+       btrfs_release_path(root, path);
+
+       if (other_key.objectid == inode->i_ino &&
+           other_key.type == key.type && other_key.offset > key.offset) {
+               key.offset = other_key.offset;
+               cond_resched();
+               goto again;
+       }
+       ret = 0;
+out:
+       /* fixup any changes we've made to the path */
+       path->lowest_level = 0;
+       path->keep_locks = 0;
+       btrfs_release_path(root, path);
+       return ret;
+}
+
+#endif
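
A sketch of the boundary test the (currently #if 0'd) optimization above relies on. Since leaf keys are sorted by (objectid, type, offset), a leaf whose own first key and whose successor leaf's first key are both csum keys of the same inode at or past the new size can only contain deletable csum items, so it can be freed without ever reading it. The struct and helper below are illustrative, not btrfs types.

#include <stdbool.h>
#include <stdio.h>

struct key {
	unsigned long objectid;		/* inode number */
	int type;			/* CSUM_ITEM = 1 */
	unsigned long long offset;
};

#define CSUM_ITEM 1

static bool leaf_is_all_stale_csums(struct key first, struct key next_first,
				    unsigned long ino,
				    unsigned long long new_size)
{
	return first.objectid == ino && first.type == CSUM_ITEM &&
	       first.offset >= new_size &&
	       next_first.objectid == ino && next_first.type == CSUM_ITEM;
}

int main(void)
{
	struct key a = { 42, CSUM_ITEM, 8192 };
	struct key b = { 42, CSUM_ITEM, 65536 };
	struct key c = { 43, CSUM_ITEM, 0 };	/* next inode's csums */

	printf("free blind: %d\n", leaf_is_all_stale_csums(a, b, 42, 4096));
	printf("free blind: %d\n", leaf_is_all_stale_csums(a, c, 42, 4096));
	return 0;
}
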
+
+/*
+ * this can truncate away extent items, csum items and directory items.
+ * It starts at a high offset and removes keys until it can't find
+ * any higher than new_size.
+ *
+ * csum items that cross the new i_size are truncated to the new size
+ * as well.
+ *
+ * min_type is the minimum key type to truncate down to.  If set to 0, this
+ * will kill all the items on this inode, including the INODE_ITEM_KEY.
+ */
+noinline int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
+                                       struct btrfs_root *root,
+                                       struct inode *inode,
+                                       u64 new_size, u32 min_type)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       u32 found_type;
+       struct extent_buffer *leaf;
+       struct btrfs_file_extent_item *fi;
+       u64 extent_start = 0;
+       u64 extent_num_bytes = 0;
+       u64 item_end = 0;
+       u64 root_gen = 0;
+       u64 root_owner = 0;
+       int found_extent;
+       int del_item;
+       int pending_del_nr = 0;
+       int pending_del_slot = 0;
+       int extent_type = -1;
+       int encoding;
+       u64 mask = root->sectorsize - 1;
+
+       if (root->ref_cows)
+               btrfs_drop_extent_cache(inode, new_size & (~mask), (u64)-1, 0);
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       path->reada = -1;
+
+       /* FIXME, add redo link to tree so we don't leak on crash */
+       key.objectid = inode->i_ino;
+       key.offset = (u64)-1;
+       key.type = (u8)-1;
+
+       btrfs_init_path(path);
+
+search_again:
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret < 0)
+               goto error;
+
+       if (ret > 0) {
+               /* there are no items in the tree for us to truncate, we're
+                * done
+                */
+               if (path->slots[0] == 0) {
+                       ret = 0;
+                       goto error;
+               }
+               path->slots[0]--;
+       }
+
+       while (1) {
+               fi = NULL;
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               found_type = btrfs_key_type(&found_key);
+               encoding = 0;
+
+               if (found_key.objectid != inode->i_ino)
+                       break;
+
+               if (found_type < min_type)
+                       break;
+
+               item_end = found_key.offset;
+               if (found_type == BTRFS_EXTENT_DATA_KEY) {
+                       fi = btrfs_item_ptr(leaf, path->slots[0],
+                                           struct btrfs_file_extent_item);
+                       extent_type = btrfs_file_extent_type(leaf, fi);
+                       encoding = btrfs_file_extent_compression(leaf, fi);
+                       encoding |= btrfs_file_extent_encryption(leaf, fi);
+                       encoding |= btrfs_file_extent_other_encoding(leaf, fi);
+
+                       if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
+                               item_end +=
+                                   btrfs_file_extent_num_bytes(leaf, fi);
+                       } else {
+                               item_end += btrfs_file_extent_inline_len(leaf,
+                                                                        fi);
+                       }
+                       item_end--;
+               }
+               if (item_end < new_size) {
+                       if (found_type == BTRFS_DIR_ITEM_KEY)
+                               found_type = BTRFS_INODE_ITEM_KEY;
+                       else if (found_type == BTRFS_EXTENT_ITEM_KEY)
+                               found_type = BTRFS_EXTENT_DATA_KEY;
+                       else if (found_type == BTRFS_EXTENT_DATA_KEY)
+                               found_type = BTRFS_XATTR_ITEM_KEY;
+                       else if (found_type == BTRFS_XATTR_ITEM_KEY)
+                               found_type = BTRFS_INODE_REF_KEY;
+                       else if (found_type)
+                               found_type--;
+                       else
+                               break;
+                       btrfs_set_key_type(&key, found_type);
+                       goto next;
+               }
+               if (found_key.offset >= new_size)
+                       del_item = 1;
+               else
+                       del_item = 0;
+               found_extent = 0;
+
+               /* FIXME, shrink the extent if the ref count is only 1 */
+               if (found_type != BTRFS_EXTENT_DATA_KEY)
+                       goto delete;
+
+               if (extent_type != BTRFS_FILE_EXTENT_INLINE) {
+                       u64 num_dec;
+                       extent_start = btrfs_file_extent_disk_bytenr(leaf, fi);
+                       if (!del_item && !encoding) {
+                               u64 orig_num_bytes =
+                                       btrfs_file_extent_num_bytes(leaf, fi);
+                               extent_num_bytes = new_size -
+                                       found_key.offset + root->sectorsize - 1;
+                               extent_num_bytes = extent_num_bytes &
+                                       ~((u64)root->sectorsize - 1);
+                               btrfs_set_file_extent_num_bytes(leaf, fi,
+                                                        extent_num_bytes);
+                               num_dec = (orig_num_bytes -
+                                          extent_num_bytes);
+                               if (root->ref_cows && extent_start != 0)
+                                       inode_sub_bytes(inode, num_dec);
+                               btrfs_mark_buffer_dirty(leaf);
+                       } else {
+                               extent_num_bytes =
+                                       btrfs_file_extent_disk_num_bytes(leaf,
+                                                                        fi);
+                               /* FIXME blocksize != 4096 */
+                               num_dec = btrfs_file_extent_num_bytes(leaf, fi);
+                               if (extent_start != 0) {
+                                       found_extent = 1;
+                                       if (root->ref_cows)
+                                               inode_sub_bytes(inode, num_dec);
+                               }
+                               root_gen = btrfs_header_generation(leaf);
+                               root_owner = btrfs_header_owner(leaf);
+                       }
+               } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+                       /*
+                        * we can't truncate inline items that have had
+                        * special encodings
+                        */
+                       if (!del_item &&
+                           btrfs_file_extent_compression(leaf, fi) == 0 &&
+                           btrfs_file_extent_encryption(leaf, fi) == 0 &&
+                           btrfs_file_extent_other_encoding(leaf, fi) == 0) {
+                               u32 size = new_size - found_key.offset;
+
+                               if (root->ref_cows) {
+                                       inode_sub_bytes(inode, item_end + 1 -
+                                                       new_size);
+                               }
+                               size =
+                                   btrfs_file_extent_calc_inline_size(size);
+                               ret = btrfs_truncate_item(trans, root, path,
+                                                         size, 1);
+                               BUG_ON(ret);
+                       } else if (root->ref_cows) {
+                               inode_sub_bytes(inode, item_end + 1 -
+                                               found_key.offset);
+                       }
+               }
+delete:
+               if (del_item) {
+                       if (!pending_del_nr) {
+                               /* no pending yet, add ourselves */
+                               pending_del_slot = path->slots[0];
+                               pending_del_nr = 1;
+                       } else if (pending_del_nr &&
+                                  path->slots[0] + 1 == pending_del_slot) {
+                               /* hop on the pending chunk */
+                               pending_del_nr++;
+                               pending_del_slot = path->slots[0];
+                       } else {
+                               BUG();
+                       }
+               } else {
+                       break;
+               }
+               if (found_extent) {
+                       ret = btrfs_free_extent(trans, root, extent_start,
+                                               extent_num_bytes,
+                                               leaf->start, root_owner,
+                                               root_gen, inode->i_ino, 0);
+                       BUG_ON(ret);
+               }
+next:
+               if (path->slots[0] == 0) {
+                       if (pending_del_nr)
+                               goto del_pending;
+                       btrfs_release_path(root, path);
+                       goto search_again;
+               }
+
+               path->slots[0]--;
+               if (pending_del_nr &&
+                   path->slots[0] + 1 != pending_del_slot) {
+                       struct btrfs_key debug;
+del_pending:
+                       btrfs_item_key_to_cpu(path->nodes[0], &debug,
+                                             pending_del_slot);
+                       ret = btrfs_del_items(trans, root, path,
+                                             pending_del_slot,
+                                             pending_del_nr);
+                       BUG_ON(ret);
+                       pending_del_nr = 0;
+                       btrfs_release_path(root, path);
+                       goto search_again;
+               }
+       }
+       ret = 0;
+error:
+       if (pending_del_nr) {
+               ret = btrfs_del_items(trans, root, path, pending_del_slot,
+                                     pending_del_nr);
+       }
+       btrfs_free_path(path);
+       inode->i_sb->s_dirt = 1;
+       return ret;
+}
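
The deletion loop above walks slots from high offsets downward and batches runs of adjacent victims into a single btrfs_del_items() call rather than deleting one item at a time. A userspace sketch of that batching over a plain array; del_items() is a stand-in for btrfs_del_items().

#include <stdio.h>

static void del_items(int slot, int nr)
{
	printf("deleting %d item(s) starting at slot %d\n", nr, slot);
}

int main(void)
{
	int keep[] = { 1, 1, 0, 0, 0, 1, 0, 0 };	/* 0 = delete */
	int nritems = 8;
	int pending_del_nr = 0, pending_del_slot = 0;
	int slot;

	for (slot = nritems - 1; slot >= 0; slot--) {
		if (keep[slot]) {
			if (pending_del_nr) {	/* flush the pending run */
				del_items(pending_del_slot, pending_del_nr);
				pending_del_nr = 0;
			}
			continue;
		}
		if (pending_del_nr && slot + 1 == pending_del_slot)
			pending_del_nr++;	/* hop on the pending chunk */
		else
			pending_del_nr = 1;	/* start a new run */
		pending_del_slot = slot;
	}
	if (pending_del_nr)
		del_items(pending_del_slot, pending_del_nr);
	return 0;
}
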
+
+/*
+ * taken from block_truncate_page, but does cow as it zeros out
+ * any bytes left in the last page in the file.
+ */
+static int btrfs_truncate_page(struct address_space *mapping, loff_t from)
+{
+       struct inode *inode = mapping->host;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       struct btrfs_ordered_extent *ordered;
+       char *kaddr;
+       u32 blocksize = root->sectorsize;
+       pgoff_t index = from >> PAGE_CACHE_SHIFT;
+       unsigned offset = from & (PAGE_CACHE_SIZE-1);
+       struct page *page;
+       int ret = 0;
+       u64 page_start;
+       u64 page_end;
+
+       if ((offset & (blocksize - 1)) == 0)
+               goto out;
+
+       ret = -ENOMEM;
+again:
+       page = grab_cache_page(mapping, index);
+       if (!page)
+               goto out;
+
+       page_start = page_offset(page);
+       page_end = page_start + PAGE_CACHE_SIZE - 1;
+
+       if (!PageUptodate(page)) {
+               ret = btrfs_readpage(NULL, page);
+               lock_page(page);
+               if (page->mapping != mapping) {
+                       unlock_page(page);
+                       page_cache_release(page);
+                       goto again;
+               }
+               if (!PageUptodate(page)) {
+                       ret = -EIO;
+                       goto out_unlock;
+               }
+       }
+       wait_on_page_writeback(page);
+
+       lock_extent(io_tree, page_start, page_end, GFP_NOFS);
+       set_page_extent_mapped(page);
+
+       ordered = btrfs_lookup_ordered_extent(inode, page_start);
+       if (ordered) {
+               unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+               unlock_page(page);
+               page_cache_release(page);
+               btrfs_start_ordered_extent(inode, ordered, 1);
+               btrfs_put_ordered_extent(ordered);
+               goto again;
+       }
+
+       btrfs_set_extent_delalloc(inode, page_start, page_end);
+       ret = 0;
+       if (offset != PAGE_CACHE_SIZE) {
+               kaddr = kmap(page);
+               memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
+               flush_dcache_page(page);
+               kunmap(page);
+       }
+       ClearPageChecked(page);
+       set_page_dirty(page);
+       unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+
+out_unlock:
+       unlock_page(page);
+       page_cache_release(page);
+out:
+       return ret;
+}
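
A sketch of the offset math in btrfs_truncate_page(): the in-page offset of the new EOF is the size masked by the page size, and everything from there to the end of the page is zeroed so the tail of the last block never leaks stale bytes. PAGE_SZ stands in for PAGE_CACHE_SIZE.

#include <stdio.h>
#include <string.h>

#define PAGE_SZ 4096UL

static void truncate_page(unsigned char *page, unsigned long long from)
{
	unsigned offset = from & (PAGE_SZ - 1);	/* new EOF within its page */

	if (offset)	/* nothing to do if the size is page aligned */
		memset(page + offset, 0, PAGE_SZ - offset);
}

int main(void)
{
	static unsigned char page[PAGE_SZ];

	memset(page, 0xab, sizeof(page));	/* pretend stale file data */
	truncate_page(page, 5000);		/* 5000 & 4095 == 904 */
	printf("byte 903: %#x, byte 904: %#x\n", page[903], page[904]);
	return 0;
}
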
+
+int btrfs_cont_expand(struct inode *inode, loff_t size)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       struct extent_map *em;
+       u64 mask = root->sectorsize - 1;
+       u64 hole_start = (inode->i_size + mask) & ~mask;
+       u64 block_end = (size + mask) & ~mask;
+       u64 last_byte;
+       u64 cur_offset;
+       u64 hole_size;
+       int err;
+
+       if (size <= hole_start)
+               return 0;
+
+       err = btrfs_check_free_space(root, 1, 0);
+       if (err)
+               return err;
+
+       btrfs_truncate_page(inode->i_mapping, inode->i_size);
+
+       while (1) {
+               struct btrfs_ordered_extent *ordered;
+               btrfs_wait_ordered_range(inode, hole_start,
+                                        block_end - hole_start);
+               lock_extent(io_tree, hole_start, block_end - 1, GFP_NOFS);
+               ordered = btrfs_lookup_ordered_extent(inode, hole_start);
+               if (!ordered)
+                       break;
+               unlock_extent(io_tree, hole_start, block_end - 1, GFP_NOFS);
+               btrfs_put_ordered_extent(ordered);
+       }
+
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, inode);
+
+       cur_offset = hole_start;
+       while (1) {
+               em = btrfs_get_extent(inode, NULL, 0, cur_offset,
+                               block_end - cur_offset, 0);
+               BUG_ON(IS_ERR(em) || !em);
+               last_byte = min(extent_map_end(em), block_end);
+               last_byte = (last_byte + mask) & ~mask;
+               if (test_bit(EXTENT_FLAG_VACANCY, &em->flags)) {
+                       u64 hint_byte = 0;
+                       hole_size = last_byte - cur_offset;
+                       err = btrfs_drop_extents(trans, root, inode,
+                                                cur_offset,
+                                                cur_offset + hole_size,
+                                                cur_offset, &hint_byte);
+                       if (err)
+                               break;
+                       err = btrfs_insert_file_extent(trans, root,
+                                       inode->i_ino, cur_offset, 0,
+                                       0, hole_size, 0, hole_size,
+                                       0, 0, 0);
+                       btrfs_drop_extent_cache(inode, hole_start,
+                                       last_byte - 1, 0);
+               }
+               free_extent_map(em);
+               cur_offset = last_byte;
+               if (err || cur_offset >= block_end)
+                       break;
+       }
+
+       btrfs_end_transaction(trans, root);
+       unlock_extent(io_tree, hole_start, block_end - 1, GFP_NOFS);
+       return err;
+}
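
hole_start and block_end above both come from the standard power-of-two round-up "(x + mask) & ~mask". A quick demonstration with a 4096-byte sector size:

#include <stdio.h>

int main(void)
{
	unsigned long long sectorsize = 4096, mask = sectorsize - 1;
	unsigned long long sizes[] = { 0, 1, 4095, 4096, 4097 };
	int i;

	for (i = 0; i < 5; i++)
		printf("%llu rounds up to %llu\n",
		       sizes[i], (sizes[i] + mask) & ~mask);
	return 0;
}
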
+
+static int btrfs_setattr(struct dentry *dentry, struct iattr *attr)
+{
+       struct inode *inode = dentry->d_inode;
+       int err;
+
+       err = inode_change_ok(inode, attr);
+       if (err)
+               return err;
+
+       if (S_ISREG(inode->i_mode) &&
+           attr->ia_valid & ATTR_SIZE && attr->ia_size > inode->i_size) {
+               err = btrfs_cont_expand(inode, attr->ia_size);
+               if (err)
+                       return err;
+       }
+
+       err = inode_setattr(inode, attr);
+
+       if (!err && ((attr->ia_valid & ATTR_MODE)))
+               err = btrfs_acl_chmod(inode);
+       return err;
+}
+
+void btrfs_delete_inode(struct inode *inode)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       unsigned long nr;
+       int ret;
+
+       truncate_inode_pages(&inode->i_data, 0);
+       if (is_bad_inode(inode)) {
+               btrfs_orphan_del(NULL, inode);
+               goto no_delete;
+       }
+       btrfs_wait_ordered_range(inode, 0, (u64)-1);
+
+       btrfs_i_size_write(inode, 0);
+       trans = btrfs_join_transaction(root, 1);
+
+       btrfs_set_trans_block_group(trans, inode);
+       ret = btrfs_truncate_inode_items(trans, root, inode, inode->i_size, 0);
+       if (ret) {
+               btrfs_orphan_del(NULL, inode);
+               goto no_delete_lock;
+       }
+
+       btrfs_orphan_del(trans, inode);
+
+       nr = trans->blocks_used;
+       clear_inode(inode);
+
+       btrfs_end_transaction(trans, root);
+       btrfs_btree_balance_dirty(root, nr);
+       return;
+
+no_delete_lock:
+       nr = trans->blocks_used;
+       btrfs_end_transaction(trans, root);
+       btrfs_btree_balance_dirty(root, nr);
+no_delete:
+       clear_inode(inode);
+}
+
+/*
+ * this returns the key found in the dir entry in the location pointer.
+ * If no dir entries were found, location->objectid is 0.
+ */
+static int btrfs_inode_by_name(struct inode *dir, struct dentry *dentry,
+                              struct btrfs_key *location)
+{
+       const char *name = dentry->d_name.name;
+       int namelen = dentry->d_name.len;
+       struct btrfs_dir_item *di;
+       struct btrfs_path *path;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       int ret = 0;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       di = btrfs_lookup_dir_item(NULL, root, path, dir->i_ino, name,
+                                   namelen, 0);
+       if (IS_ERR(di))
+               ret = PTR_ERR(di);
+
+       if (!di || IS_ERR(di))
+               goto out_err;
+
+       btrfs_dir_item_key_to_cpu(path->nodes[0], di, location);
+out:
+       btrfs_free_path(path);
+       return ret;
+out_err:
+       location->objectid = 0;
+       goto out;
+}
+
+/*
+ * when we hit a tree root in a directory, the btrfs part of the inode
+ * needs to be changed to reflect the root directory of the tree root.  This
+ * is kind of like crossing a mount point.
+ */
+static int fixup_tree_root_location(struct btrfs_root *root,
+                            struct btrfs_key *location,
+                            struct btrfs_root **sub_root,
+                            struct dentry *dentry)
+{
+       struct btrfs_root_item *ri;
+
+       if (btrfs_key_type(location) != BTRFS_ROOT_ITEM_KEY)
+               return 0;
+       if (location->objectid == BTRFS_ROOT_TREE_OBJECTID)
+               return 0;
+
+       *sub_root = btrfs_read_fs_root(root->fs_info, location,
+                                       dentry->d_name.name,
+                                       dentry->d_name.len);
+       if (IS_ERR(*sub_root))
+               return PTR_ERR(*sub_root);
+
+       ri = &(*sub_root)->root_item;
+       location->objectid = btrfs_root_dirid(ri);
+       btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);
+       location->offset = 0;
+
+       return 0;
+}
+
+static noinline void init_btrfs_i(struct inode *inode)
+{
+       struct btrfs_inode *bi = BTRFS_I(inode);
+
+       bi->i_acl = NULL;
+       bi->i_default_acl = NULL;
+
+       bi->generation = 0;
+       bi->sequence = 0;
+       bi->last_trans = 0;
+       bi->logged_trans = 0;
+       bi->delalloc_bytes = 0;
+       bi->disk_i_size = 0;
+       bi->flags = 0;
+       bi->index_cnt = (u64)-1;
+       bi->log_dirty_trans = 0;
+       extent_map_tree_init(&BTRFS_I(inode)->extent_tree, GFP_NOFS);
+       extent_io_tree_init(&BTRFS_I(inode)->io_tree,
+                            inode->i_mapping, GFP_NOFS);
+       extent_io_tree_init(&BTRFS_I(inode)->io_failure_tree,
+                            inode->i_mapping, GFP_NOFS);
+       INIT_LIST_HEAD(&BTRFS_I(inode)->delalloc_inodes);
+       btrfs_ordered_inode_tree_init(&BTRFS_I(inode)->ordered_tree);
+       mutex_init(&BTRFS_I(inode)->extent_mutex);
+       mutex_init(&BTRFS_I(inode)->log_mutex);
+}
+
+static int btrfs_init_locked_inode(struct inode *inode, void *p)
+{
+       struct btrfs_iget_args *args = p;
+       inode->i_ino = args->ino;
+       init_btrfs_i(inode);
+       BTRFS_I(inode)->root = args->root;
+       return 0;
+}
+
+static int btrfs_find_actor(struct inode *inode, void *opaque)
+{
+       struct btrfs_iget_args *args = opaque;
+       return args->ino == inode->i_ino &&
+               args->root == BTRFS_I(inode)->root;
+}
+
+struct inode *btrfs_ilookup(struct super_block *s, u64 objectid,
+                           struct btrfs_root *root, int wait)
+{
+       struct inode *inode;
+       struct btrfs_iget_args args;
+       args.ino = objectid;
+       args.root = root;
+
+       if (wait) {
+               inode = ilookup5(s, objectid, btrfs_find_actor,
+                                (void *)&args);
+       } else {
+               inode = ilookup5_nowait(s, objectid, btrfs_find_actor,
+                                       (void *)&args);
+       }
+       return inode;
+}
+
+struct inode *btrfs_iget_locked(struct super_block *s, u64 objectid,
+                               struct btrfs_root *root)
+{
+       struct inode *inode;
+       struct btrfs_iget_args args;
+       args.ino = objectid;
+       args.root = root;
+
+       inode = iget5_locked(s, objectid, btrfs_find_actor,
+                            btrfs_init_locked_inode,
+                            (void *)&args);
+       return inode;
+}
+
+/* Get an inode object given its location and corresponding root.
+ * Returns in *is_new whether the inode was read from disk.
+ */
+struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
+                        struct btrfs_root *root, int *is_new)
+{
+       struct inode *inode;
+
+       inode = btrfs_iget_locked(s, location->objectid, root);
+       if (!inode)
+               return ERR_PTR(-EACCES);
+
+       if (inode->i_state & I_NEW) {
+               BTRFS_I(inode)->root = root;
+               memcpy(&BTRFS_I(inode)->location, location, sizeof(*location));
+               btrfs_read_locked_inode(inode);
+               unlock_new_inode(inode);
+               if (is_new)
+                       *is_new = 1;
+       } else {
+               if (is_new)
+                       *is_new = 0;
+       }
+
+       return inode;
+}
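
btrfs_iget() above uses the iget5_locked() pattern: a cache lookup keyed by (objectid, root) with a compare callback for hits and an init callback that runs only on a miss. A userspace sketch of the same find-or-create shape; the flat-array cache is purely illustrative.

#include <stdio.h>
#include <stdlib.h>

struct inode {
	unsigned long ino;
	void *root;	/* same ino can exist in different roots */
};

#define CACHE_SZ 8
static struct inode *cache[CACHE_SZ];

static struct inode *iget_locked(unsigned long ino, void *root)
{
	int i, empty = -1;

	for (i = 0; i < CACHE_SZ; i++) {
		if (!cache[i]) {
			if (empty < 0)
				empty = i;
		} else if (cache[i]->ino == ino && cache[i]->root == root) {
			return cache[i];	/* hit: same (ino, root) */
		}
	}
	if (empty < 0)
		return NULL;

	/* miss: allocate and do the init callback's job */
	cache[empty] = malloc(sizeof(struct inode));
	cache[empty]->ino = ino;
	cache[empty]->root = root;
	return cache[empty];
}

int main(void)
{
	int root_a, root_b;
	struct inode *x = iget_locked(5, &root_a);
	struct inode *y = iget_locked(5, &root_a);
	struct inode *z = iget_locked(5, &root_b);

	printf("same root -> same inode: %d\n", x == y);
	printf("other root -> new inode: %d\n", x != z);
	return 0;
}
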
+
+struct inode *btrfs_lookup_dentry(struct inode *dir, struct dentry *dentry)
+{
+       struct inode *inode;
+       struct btrfs_inode *bi = BTRFS_I(dir);
+       struct btrfs_root *root = bi->root;
+       struct btrfs_root *sub_root = root;
+       struct btrfs_key location;
+       int ret, new;
+
+       if (dentry->d_name.len > BTRFS_NAME_LEN)
+               return ERR_PTR(-ENAMETOOLONG);
+
+       ret = btrfs_inode_by_name(dir, dentry, &location);
+
+       if (ret < 0)
+               return ERR_PTR(ret);
+
+       inode = NULL;
+       if (location.objectid) {
+               ret = fixup_tree_root_location(root, &location, &sub_root,
+                                               dentry);
+               if (ret < 0)
+                       return ERR_PTR(ret);
+               if (ret > 0)
+                       return ERR_PTR(-ENOENT);
+               inode = btrfs_iget(dir->i_sb, &location, sub_root, &new);
+               if (IS_ERR(inode))
+                       return ERR_CAST(inode);
+       }
+       return inode;
+}
+
+static struct dentry *btrfs_lookup(struct inode *dir, struct dentry *dentry,
+                                  struct nameidata *nd)
+{
+       struct inode *inode;
+
+       if (dentry->d_name.len > BTRFS_NAME_LEN)
+               return ERR_PTR(-ENAMETOOLONG);
+
+       inode = btrfs_lookup_dentry(dir, dentry);
+       if (IS_ERR(inode))
+               return ERR_CAST(inode);
+
+       return d_splice_alias(inode, dentry);
+}
+
+static unsigned char btrfs_filetype_table[] = {
+       DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
+};
+
+static int btrfs_real_readdir(struct file *filp, void *dirent,
+                             filldir_t filldir)
+{
+       struct inode *inode = filp->f_dentry->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_item *item;
+       struct btrfs_dir_item *di;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       struct btrfs_path *path;
+       int ret;
+       u32 nritems;
+       struct extent_buffer *leaf;
+       int slot;
+       int advance;
+       unsigned char d_type;
+       int over = 0;
+       u32 di_cur;
+       u32 di_total;
+       u32 di_len;
+       int key_type = BTRFS_DIR_INDEX_KEY;
+       char tmp_name[32];
+       char *name_ptr;
+       int name_len;
+
+       /* FIXME, use a real flag for deciding about the key type */
+       if (root->fs_info->tree_root == root)
+               key_type = BTRFS_DIR_ITEM_KEY;
+
+       /* special case for "." */
+       if (filp->f_pos == 0) {
+               over = filldir(dirent, ".", 1,
+                              1, inode->i_ino,
+                              DT_DIR);
+               if (over)
+                       return 0;
+               filp->f_pos = 1;
+       }
+       /* special case for .., just use the back ref */
+       if (filp->f_pos == 1) {
+               u64 pino = parent_ino(filp->f_path.dentry);
+               over = filldir(dirent, "..", 2,
+                              2, pino, DT_DIR);
+               if (over)
+                       return 0;
+               filp->f_pos = 2;
+       }
+       path = btrfs_alloc_path();
+       path->reada = 2;
+
+       btrfs_set_key_type(&key, key_type);
+       key.offset = filp->f_pos;
+       key.objectid = inode->i_ino;
+
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto err;
+       advance = 0;
+
+       while (1) {
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               slot = path->slots[0];
+               if (advance || slot >= nritems) {
+                       if (slot >= nritems - 1) {
+                               ret = btrfs_next_leaf(root, path);
+                               if (ret)
+                                       break;
+                               leaf = path->nodes[0];
+                               nritems = btrfs_header_nritems(leaf);
+                               slot = path->slots[0];
+                       } else {
+                               slot++;
+                               path->slots[0]++;
+                       }
+               }
+
+               advance = 1;
+               item = btrfs_item_nr(leaf, slot);
+               btrfs_item_key_to_cpu(leaf, &found_key, slot);
+
+               if (found_key.objectid != key.objectid)
+                       break;
+               if (btrfs_key_type(&found_key) != key_type)
+                       break;
+               if (found_key.offset < filp->f_pos)
+                       continue;
+
+               filp->f_pos = found_key.offset;
+
+               di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
+               di_cur = 0;
+               di_total = btrfs_item_size(leaf, item);
+
+               while (di_cur < di_total) {
+                       struct btrfs_key location;
+
+                       name_len = btrfs_dir_name_len(leaf, di);
+                       if (name_len <= sizeof(tmp_name)) {
+                               name_ptr = tmp_name;
+                       } else {
+                               name_ptr = kmalloc(name_len, GFP_NOFS);
+                               if (!name_ptr) {
+                                       ret = -ENOMEM;
+                                       goto err;
+                               }
+                       }
+                       read_extent_buffer(leaf, name_ptr,
+                                          (unsigned long)(di + 1), name_len);
+
+                       d_type = btrfs_filetype_table[btrfs_dir_type(leaf, di)];
+                       btrfs_dir_item_key_to_cpu(leaf, di, &location);
+
+                       /* is this a reference to our own snapshot? If so
+                        * skip it
+                        */
+                       if (location.type == BTRFS_ROOT_ITEM_KEY &&
+                           location.objectid == root->root_key.objectid) {
+                               over = 0;
+                               goto skip;
+                       }
+                       over = filldir(dirent, name_ptr, name_len,
+                                      found_key.offset, location.objectid,
+                                      d_type);
+
+skip:
+                       if (name_ptr != tmp_name)
+                               kfree(name_ptr);
+
+                       if (over)
+                               goto nopos;
+                       di_len = btrfs_dir_name_len(leaf, di) +
+                                btrfs_dir_data_len(leaf, di) + sizeof(*di);
+                       di_cur += di_len;
+                       di = (struct btrfs_dir_item *)((char *)di + di_len);
+               }
+       }
+
+       /* Reached end of directory/root. Bump pos past the last item. */
+       if (key_type == BTRFS_DIR_INDEX_KEY)
+               filp->f_pos = INT_LIMIT(typeof(filp->f_pos));
+       else
+               filp->f_pos++;
+nopos:
+       ret = 0;
+err:
+       btrfs_free_path(path);
+       return ret;
+}
+
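+/*
+ * ->write_inode callback.  Metadata changes are made inside
+ * transactions, so a synchronous writeout only needs to commit the
+ * running transaction
+ */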
+int btrfs_write_inode(struct inode *inode, int wait)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       int ret = 0;
+
+       if (root->fs_info->btree_inode == inode)
+               return 0;
+
+       if (wait) {
+               trans = btrfs_join_transaction(root, 1);
+               btrfs_set_trans_block_group(trans, inode);
+               ret = btrfs_commit_transaction(trans, root);
+       }
+       return ret;
+}
+
+/*
+ * This is somewhat expensive, updating the tree every time the
+ * inode changes.  But, it is most likely to find the inode in cache.
+ * FIXME, needs more benchmarking...there are no reasons other than performance
+ * to keep or drop this code.
+ */
+void btrfs_dirty_inode(struct inode *inode)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+
+       trans = btrfs_join_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, inode);
+       btrfs_update_inode(trans, root, inode);
+       btrfs_end_transaction(trans, root);
+}
+
+/*
+ * find the highest existing sequence number in a directory
+ * and then set the in-memory index_cnt variable to reflect
+ * free sequence numbers
+ */
+static int btrfs_set_inode_index_count(struct inode *inode)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_key key, found_key;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       int ret;
+
+       key.objectid = inode->i_ino;
+       btrfs_set_key_type(&key, BTRFS_DIR_INDEX_KEY);
+       key.offset = (u64)-1;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto out;
+       /* FIXME: we should be able to handle this */
+       if (ret == 0)
+               goto out;
+       ret = 0;
+
+       /*
+        * MAGIC NUMBER EXPLANATION:
+        * directory searches are keyed on f_pos, and '.' and '..' occupy
+        * f_pos 0 and 1, so every real entry has to start at index 2
+        */
+       if (path->slots[0] == 0) {
+               BTRFS_I(inode)->index_cnt = 2;
+               goto out;
+       }
+
+       path->slots[0]--;
+
+       leaf = path->nodes[0];
+       btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+       if (found_key.objectid != inode->i_ino ||
+           btrfs_key_type(&found_key) != BTRFS_DIR_INDEX_KEY) {
+               BTRFS_I(inode)->index_cnt = 2;
+               goto out;
+       }
+
+       BTRFS_I(inode)->index_cnt = found_key.offset + 1;
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+/*
+ * helper to find a free sequence number in a given directory.  The
+ * current code is very simple; later versions will do smarter things
+ * in the btree
+ */
+int btrfs_set_inode_index(struct inode *dir, u64 *index)
+{
+       int ret = 0;
+
+       if (BTRFS_I(dir)->index_cnt == (u64)-1) {
+               ret = btrfs_set_inode_index_count(dir);
+               if (ret)
+                       return ret;
+       }
+
+       *index = BTRFS_I(dir)->index_cnt;
+       BTRFS_I(dir)->index_cnt++;
+
+       return ret;
+}
+
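+/*
+ * create a new in-core inode and insert its inode item and inode ref
+ * (the backref to 'dir') into the tree with a single btree insertion
+ */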
+static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
+                                    struct btrfs_root *root,
+                                    struct inode *dir,
+                                    const char *name, int name_len,
+                                    u64 ref_objectid, u64 objectid,
+                                    u64 alloc_hint, int mode, u64 *index)
+{
+       struct inode *inode;
+       struct btrfs_inode_item *inode_item;
+       struct btrfs_key *location;
+       struct btrfs_path *path;
+       struct btrfs_inode_ref *ref;
+       struct btrfs_key key[2];
+       u32 sizes[2];
+       unsigned long ptr;
+       int ret;
+       int owner;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       inode = new_inode(root->fs_info->sb);
+       if (!inode) {
+               btrfs_free_path(path);
+               return ERR_PTR(-ENOMEM);
+       }
+
+       if (dir) {
+               ret = btrfs_set_inode_index(dir, index);
+               if (ret) {
+                       btrfs_free_path(path);
+                       return ERR_PTR(ret);
+               }
+       }
+       /*
+        * index_cnt is ignored for everything but a dir,
+        * btrfs_set_inode_index_count has an explanation for the magic
+        * number
+        */
+       init_btrfs_i(inode);
+       BTRFS_I(inode)->index_cnt = 2;
+       BTRFS_I(inode)->root = root;
+       BTRFS_I(inode)->generation = trans->transid;
+
+       if (S_ISDIR(mode))
+               owner = 0;
+       else
+               owner = 1;
+       BTRFS_I(inode)->block_group =
+                       btrfs_find_block_group(root, 0, alloc_hint, owner);
+       if (S_ISREG(mode)) {
+               if (btrfs_test_opt(root, NODATASUM))
+                       btrfs_set_flag(inode, NODATASUM);
+               if (btrfs_test_opt(root, NODATACOW))
+                       btrfs_set_flag(inode, NODATACOW);
+       }
+
+       key[0].objectid = objectid;
+       btrfs_set_key_type(&key[0], BTRFS_INODE_ITEM_KEY);
+       key[0].offset = 0;
+
+       key[1].objectid = objectid;
+       btrfs_set_key_type(&key[1], BTRFS_INODE_REF_KEY);
+       key[1].offset = ref_objectid;
+
+       sizes[0] = sizeof(struct btrfs_inode_item);
+       sizes[1] = name_len + sizeof(*ref);
+
+       ret = btrfs_insert_empty_items(trans, root, path, key, sizes, 2);
+       if (ret != 0)
+               goto fail;
+
+       if (objectid > root->highest_inode)
+               root->highest_inode = objectid;
+
+       inode->i_uid = current_fsuid();
+       inode->i_gid = current_fsgid();
+       inode->i_mode = mode;
+       inode->i_ino = objectid;
+       inode_set_bytes(inode, 0);
+       inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+       inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                 struct btrfs_inode_item);
+       fill_inode_item(trans, path->nodes[0], inode_item, inode);
+
+       ref = btrfs_item_ptr(path->nodes[0], path->slots[0] + 1,
+                            struct btrfs_inode_ref);
+       btrfs_set_inode_ref_name_len(path->nodes[0], ref, name_len);
+       btrfs_set_inode_ref_index(path->nodes[0], ref, *index);
+       ptr = (unsigned long)(ref + 1);
+       write_extent_buffer(path->nodes[0], name, ptr, name_len);
+
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+       btrfs_free_path(path);
+
+       location = &BTRFS_I(inode)->location;
+       location->objectid = objectid;
+       location->offset = 0;
+       btrfs_set_key_type(location, BTRFS_INODE_ITEM_KEY);
+
+       insert_inode_hash(inode);
+       return inode;
+fail:
+       if (dir)
+               BTRFS_I(dir)->index_cnt--;
+       btrfs_free_path(path);
+       return ERR_PTR(ret);
+}
+
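+/* map the S_IFMT bits of an inode's mode to the on-disk dir item type */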
+static inline u8 btrfs_inode_type(struct inode *inode)
+{
+       return btrfs_type_by_mode[(inode->i_mode & S_IFMT) >> S_SHIFT];
+}
+
+/*
+ * utility function to add 'inode' into 'parent_inode' with
+ * a given name and a given sequence number.
+ * if 'add_backref' is true, also insert a backref from the
+ * inode to the parent directory.
+ */
+int btrfs_add_link(struct btrfs_trans_handle *trans,
+                  struct inode *parent_inode, struct inode *inode,
+                  const char *name, int name_len, int add_backref, u64 index)
+{
+       int ret;
+       struct btrfs_key key;
+       struct btrfs_root *root = BTRFS_I(parent_inode)->root;
+
+       key.objectid = inode->i_ino;
+       btrfs_set_key_type(&key, BTRFS_INODE_ITEM_KEY);
+       key.offset = 0;
+
+       ret = btrfs_insert_dir_item(trans, root, name, name_len,
+                                   parent_inode->i_ino,
+                                   &key, btrfs_inode_type(inode),
+                                   index);
+       if (ret == 0) {
+               if (add_backref) {
+                       ret = btrfs_insert_inode_ref(trans, root,
+                                                    name, name_len,
+                                                    inode->i_ino,
+                                                    parent_inode->i_ino,
+                                                    index);
+               }
+               btrfs_i_size_write(parent_inode, parent_inode->i_size +
+                                  name_len * 2);
+               parent_inode->i_mtime = parent_inode->i_ctime = CURRENT_TIME;
+               ret = btrfs_update_inode(trans, root, parent_inode);
+       }
+       return ret;
+}
+
+static int btrfs_add_nondir(struct btrfs_trans_handle *trans,
+                           struct dentry *dentry, struct inode *inode,
+                           int backref, u64 index)
+{
+       int err = btrfs_add_link(trans, dentry->d_parent->d_inode,
+                                inode, dentry->d_name.name,
+                                dentry->d_name.len, backref, index);
+       if (!err) {
+               d_instantiate(dentry, inode);
+               return 0;
+       }
+       if (err > 0)
+               err = -EEXIST;
+       return err;
+}
+
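+/*
+ * create a special file (device node, fifo or socket): allocate the
+ * inode, add the directory entry, then hand off to init_special_inode
+ */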
+static int btrfs_mknod(struct inode *dir, struct dentry *dentry,
+                       int mode, dev_t rdev)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       struct inode *inode = NULL;
+       int err;
+       int drop_inode = 0;
+       u64 objectid;
+       unsigned long nr = 0;
+       u64 index = 0;
+
+       if (!new_valid_dev(rdev))
+               return -EINVAL;
+
+       err = btrfs_check_free_space(root, 1, 0);
+       if (err)
+               goto fail;
+
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, dir);
+
+       err = btrfs_find_free_objectid(trans, root, dir->i_ino, &objectid);
+       if (err) {
+               err = -ENOSPC;
+               goto out_unlock;
+       }
+
+       inode = btrfs_new_inode(trans, root, dir, dentry->d_name.name,
+                               dentry->d_name.len,
+                               dentry->d_parent->d_inode->i_ino, objectid,
+                               BTRFS_I(dir)->block_group, mode, &index);
+       err = PTR_ERR(inode);
+       if (IS_ERR(inode))
+               goto out_unlock;
+
+       err = btrfs_init_acl(inode, dir);
+       if (err) {
+               drop_inode = 1;
+               goto out_unlock;
+       }
+
+       btrfs_set_trans_block_group(trans, inode);
+       err = btrfs_add_nondir(trans, dentry, inode, 0, index);
+       if (err)
+               drop_inode = 1;
+       else {
+               inode->i_op = &btrfs_special_inode_operations;
+               init_special_inode(inode, inode->i_mode, rdev);
+               btrfs_update_inode(trans, root, inode);
+       }
+       dir->i_sb->s_dirt = 1;
+       btrfs_update_inode_block_group(trans, inode);
+       btrfs_update_inode_block_group(trans, dir);
+out_unlock:
+       nr = trans->blocks_used;
+       btrfs_end_transaction_throttle(trans, root);
+fail:
+       if (drop_inode) {
+               inode_dec_link_count(inode);
+               iput(inode);
+       }
+       btrfs_btree_balance_dirty(root, nr);
+       return err;
+}
+
+static int btrfs_create(struct inode *dir, struct dentry *dentry,
+                       int mode, struct nameidata *nd)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       struct inode *inode = NULL;
+       int err;
+       int drop_inode = 0;
+       unsigned long nr = 0;
+       u64 objectid;
+       u64 index = 0;
+
+       err = btrfs_check_free_space(root, 1, 0);
+       if (err)
+               goto fail;
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, dir);
+
+       err = btrfs_find_free_objectid(trans, root, dir->i_ino, &objectid);
+       if (err) {
+               err = -ENOSPC;
+               goto out_unlock;
+       }
+
+       inode = btrfs_new_inode(trans, root, dir, dentry->d_name.name,
+                               dentry->d_name.len,
+                               dentry->d_parent->d_inode->i_ino,
+                               objectid, BTRFS_I(dir)->block_group, mode,
+                               &index);
+       err = PTR_ERR(inode);
+       if (IS_ERR(inode))
+               goto out_unlock;
+
+       err = btrfs_init_acl(inode, dir);
+       if (err) {
+               drop_inode = 1;
+               goto out_unlock;
+       }
+
+       btrfs_set_trans_block_group(trans, inode);
+       err = btrfs_add_nondir(trans, dentry, inode, 0, index);
+       if (err)
+               drop_inode = 1;
+       else {
+               inode->i_mapping->a_ops = &btrfs_aops;
+               inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+               inode->i_fop = &btrfs_file_operations;
+               inode->i_op = &btrfs_file_inode_operations;
+               BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
+       }
+       dir->i_sb->s_dirt = 1;
+       btrfs_update_inode_block_group(trans, inode);
+       btrfs_update_inode_block_group(trans, dir);
+out_unlock:
+       nr = trans->blocks_used;
+       btrfs_end_transaction_throttle(trans, root);
+fail:
+       if (drop_inode) {
+               inode_dec_link_count(inode);
+               iput(inode);
+       }
+       btrfs_btree_balance_dirty(root, nr);
+       return err;
+}
+
+static int btrfs_link(struct dentry *old_dentry, struct inode *dir,
+                     struct dentry *dentry)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       struct inode *inode = old_dentry->d_inode;
+       u64 index;
+       unsigned long nr = 0;
+       int err;
+       int drop_inode = 0;
+
+       if (inode->i_nlink == 0)
+               return -ENOENT;
+
+       btrfs_inc_nlink(inode);
+       err = btrfs_check_free_space(root, 1, 0);
+       if (err)
+               goto fail;
+       err = btrfs_set_inode_index(dir, &index);
+       if (err)
+               goto fail;
+
+       trans = btrfs_start_transaction(root, 1);
+
+       btrfs_set_trans_block_group(trans, dir);
+       atomic_inc(&inode->i_count);
+
+       err = btrfs_add_nondir(trans, dentry, inode, 1, index);
+
+       if (err)
+               drop_inode = 1;
+
+       dir->i_sb->s_dirt = 1;
+       btrfs_update_inode_block_group(trans, dir);
+       err = btrfs_update_inode(trans, root, inode);
+
+       if (err)
+               drop_inode = 1;
+
+       nr = trans->blocks_used;
+       btrfs_end_transaction_throttle(trans, root);
+fail:
+       if (drop_inode) {
+               inode_dec_link_count(inode);
+               iput(inode);
+       }
+       btrfs_btree_balance_dirty(root, nr);
+       return err;
+}
+
+static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
+{
+       struct inode *inode = NULL;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       int err = 0;
+       int drop_on_err = 0;
+       u64 objectid = 0;
+       u64 index = 0;
+       unsigned long nr = 1;
+
+       err = btrfs_check_free_space(root, 1, 0);
+       if (err)
+               goto out_unlock;
+
+       trans = btrfs_start_transaction(root, 1);
+       if (IS_ERR(trans)) {
+               err = PTR_ERR(trans);
+               goto out_unlock;
+       }
+       btrfs_set_trans_block_group(trans, dir);
+
+       err = btrfs_find_free_objectid(trans, root, dir->i_ino, &objectid);
+       if (err) {
+               err = -ENOSPC;
+               goto out_unlock;
+       }
+
+       inode = btrfs_new_inode(trans, root, dir, dentry->d_name.name,
+                               dentry->d_name.len,
+                               dentry->d_parent->d_inode->i_ino, objectid,
+                               BTRFS_I(dir)->block_group, S_IFDIR | mode,
+                               &index);
+       if (IS_ERR(inode)) {
+               err = PTR_ERR(inode);
+               goto out_fail;
+       }
+
+       drop_on_err = 1;
+
+       err = btrfs_init_acl(inode, dir);
+       if (err)
+               goto out_fail;
+
+       inode->i_op = &btrfs_dir_inode_operations;
+       inode->i_fop = &btrfs_dir_file_operations;
+       btrfs_set_trans_block_group(trans, inode);
+
+       btrfs_i_size_write(inode, 0);
+       err = btrfs_update_inode(trans, root, inode);
+       if (err)
+               goto out_fail;
+
+       err = btrfs_add_link(trans, dentry->d_parent->d_inode,
+                                inode, dentry->d_name.name,
+                                dentry->d_name.len, 0, index);
+       if (err)
+               goto out_fail;
+
+       d_instantiate(dentry, inode);
+       drop_on_err = 0;
+       dir->i_sb->s_dirt = 1;
+       btrfs_update_inode_block_group(trans, inode);
+       btrfs_update_inode_block_group(trans, dir);
+
+out_fail:
+       nr = trans->blocks_used;
+       btrfs_end_transaction_throttle(trans, root);
+
+out_unlock:
+       if (drop_on_err)
+               iput(inode);
+       btrfs_btree_balance_dirty(root, nr);
+       return err;
+}
+
+/* helper for btrfs_get_extent.  Given an existing extent in the tree,
+ * and an extent that you want to insert, deal with overlap and insert
+ * the new extent into the tree.
+ */
+static int merge_extent_mapping(struct extent_map_tree *em_tree,
+                               struct extent_map *existing,
+                               struct extent_map *em,
+                               u64 map_start, u64 map_len)
+{
+       u64 start_diff;
+
+       BUG_ON(map_start < em->start || map_start >= extent_map_end(em));
+       start_diff = map_start - em->start;
+       em->start = map_start;
+       em->len = map_len;
+       if (em->block_start < EXTENT_MAP_LAST_BYTE &&
+           !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
+               em->block_start += start_diff;
+               em->block_len -= start_diff;
+       }
+       return add_extent_mapping(em_tree, em);
+}
+
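+/*
+ * read a compressed inline extent: inflate it into the page, and zero
+ * whatever the decompressor could not fill so the page never exposes
+ * stale data
+ */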
+static noinline int uncompress_inline(struct btrfs_path *path,
+                                     struct inode *inode, struct page *page,
+                                     size_t pg_offset, u64 extent_offset,
+                                     struct btrfs_file_extent_item *item)
+{
+       int ret;
+       struct extent_buffer *leaf = path->nodes[0];
+       char *tmp;
+       size_t max_size;
+       unsigned long inline_size;
+       unsigned long ptr;
+
+       WARN_ON(pg_offset != 0);
+       max_size = btrfs_file_extent_ram_bytes(leaf, item);
+       inline_size = btrfs_file_extent_inline_item_len(leaf,
+                                       btrfs_item_nr(leaf, path->slots[0]));
+       tmp = kmalloc(inline_size, GFP_NOFS);
+       if (!tmp)
+               return -ENOMEM;
+       ptr = btrfs_file_extent_inline_start(item);
+
+       read_extent_buffer(leaf, tmp, ptr, inline_size);
+
+       max_size = min_t(unsigned long, PAGE_CACHE_SIZE, max_size);
+       ret = btrfs_zlib_decompress(tmp, page, extent_offset,
+                                   inline_size, max_size);
+       if (ret) {
+               char *kaddr = kmap_atomic(page, KM_USER0);
+               unsigned long copy_size = min_t(u64,
+                                 PAGE_CACHE_SIZE - pg_offset,
+                                 max_size - extent_offset);
+               memset(kaddr + pg_offset, 0, copy_size);
+               kunmap_atomic(kaddr, KM_USER0);
+       }
+       kfree(tmp);
+       return 0;
+}
+
+/*
+ * a bit scary, this does extent mapping from logical file offset to the disk.
+ * the ugly parts come from merging extents from the disk with the in-ram
+ * representation.  This gets more complex because of the data=ordered code,
+ * where the in-ram extents might be locked pending data=ordered completion.
+ *
+ * This also copies inline extents directly into the page.
+ */
+
+struct extent_map *btrfs_get_extent(struct inode *inode, struct page *page,
+                                   size_t pg_offset, u64 start, u64 len,
+                                   int create)
+{
+       int ret;
+       int err = 0;
+       u64 bytenr;
+       u64 extent_start = 0;
+       u64 extent_end = 0;
+       u64 objectid = inode->i_ino;
+       u32 found_type;
+       struct btrfs_path *path = NULL;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_file_extent_item *item;
+       struct extent_buffer *leaf;
+       struct btrfs_key found_key;
+       struct extent_map *em = NULL;
+       struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       struct btrfs_trans_handle *trans = NULL;
+       int compressed;
+
+again:
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, start, len);
+       if (em)
+               em->bdev = root->fs_info->fs_devices->latest_bdev;
+       spin_unlock(&em_tree->lock);
+
+       if (em) {
+               if (em->start > start || em->start + em->len <= start)
+                       free_extent_map(em);
+               else if (em->block_start == EXTENT_MAP_INLINE && page)
+                       free_extent_map(em);
+               else
+                       goto out;
+       }
+       em = alloc_extent_map(GFP_NOFS);
+       if (!em) {
+               err = -ENOMEM;
+               goto out;
+       }
+       em->bdev = root->fs_info->fs_devices->latest_bdev;
+       em->start = EXTENT_MAP_HOLE;
+       em->orig_start = EXTENT_MAP_HOLE;
+       em->len = (u64)-1;
+       em->block_len = (u64)-1;
+
+       if (!path) {
+               path = btrfs_alloc_path();
+               BUG_ON(!path);
+       }
+
+       ret = btrfs_lookup_file_extent(trans, root, path,
+                                      objectid, start, trans != NULL);
+       if (ret < 0) {
+               err = ret;
+               goto out;
+       }
+
+       if (ret != 0) {
+               if (path->slots[0] == 0)
+                       goto not_found;
+               path->slots[0]--;
+       }
+
+       leaf = path->nodes[0];
+       item = btrfs_item_ptr(leaf, path->slots[0],
+                             struct btrfs_file_extent_item);
+       /* are we inside the extent that was found? */
+       btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+       found_type = btrfs_key_type(&found_key);
+       if (found_key.objectid != objectid ||
+           found_type != BTRFS_EXTENT_DATA_KEY) {
+               goto not_found;
+       }
+
+       found_type = btrfs_file_extent_type(leaf, item);
+       extent_start = found_key.offset;
+       compressed = btrfs_file_extent_compression(leaf, item);
+       if (found_type == BTRFS_FILE_EXTENT_REG ||
+           found_type == BTRFS_FILE_EXTENT_PREALLOC) {
+               extent_end = extent_start +
+                      btrfs_file_extent_num_bytes(leaf, item);
+       } else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+               size_t size;
+               size = btrfs_file_extent_inline_len(leaf, item);
+               extent_end = (extent_start + size + root->sectorsize - 1) &
+                       ~((u64)root->sectorsize - 1);
+       }
+
+       if (start >= extent_end) {
+               path->slots[0]++;
+               if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret < 0) {
+                               err = ret;
+                               goto out;
+                       }
+                       if (ret > 0)
+                               goto not_found;
+                       leaf = path->nodes[0];
+               }
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               if (found_key.objectid != objectid ||
+                   found_key.type != BTRFS_EXTENT_DATA_KEY)
+                       goto not_found;
+               if (start + len <= found_key.offset)
+                       goto not_found;
+               em->start = start;
+               em->len = found_key.offset - start;
+               goto not_found_em;
+       }
+
+       if (found_type == BTRFS_FILE_EXTENT_REG ||
+           found_type == BTRFS_FILE_EXTENT_PREALLOC) {
+               em->start = extent_start;
+               em->len = extent_end - extent_start;
+               em->orig_start = extent_start -
+                                btrfs_file_extent_offset(leaf, item);
+               bytenr = btrfs_file_extent_disk_bytenr(leaf, item);
+               if (bytenr == 0) {
+                       em->block_start = EXTENT_MAP_HOLE;
+                       goto insert;
+               }
+               if (compressed) {
+                       set_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
+                       em->block_start = bytenr;
+                       em->block_len = btrfs_file_extent_disk_num_bytes(leaf,
+                                                                        item);
+               } else {
+                       bytenr += btrfs_file_extent_offset(leaf, item);
+                       em->block_start = bytenr;
+                       em->block_len = em->len;
+                       if (found_type == BTRFS_FILE_EXTENT_PREALLOC)
+                               set_bit(EXTENT_FLAG_PREALLOC, &em->flags);
+               }
+               goto insert;
+       } else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+               unsigned long ptr;
+               char *map;
+               size_t size;
+               size_t extent_offset;
+               size_t copy_size;
+
+               em->block_start = EXTENT_MAP_INLINE;
+               if (!page || create) {
+                       em->start = extent_start;
+                       em->len = extent_end - extent_start;
+                       goto out;
+               }
+
+               size = btrfs_file_extent_inline_len(leaf, item);
+               extent_offset = page_offset(page) + pg_offset - extent_start;
+               copy_size = min_t(u64, PAGE_CACHE_SIZE - pg_offset,
+                               size - extent_offset);
+               em->start = extent_start + extent_offset;
+               em->len = (copy_size + root->sectorsize - 1) &
+                       ~((u64)root->sectorsize - 1);
+               em->orig_start = EXTENT_MAP_INLINE;
+               if (compressed)
+                       set_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
+               ptr = btrfs_file_extent_inline_start(item) + extent_offset;
+               if (create == 0 && !PageUptodate(page)) {
+                       if (btrfs_file_extent_compression(leaf, item) ==
+                           BTRFS_COMPRESS_ZLIB) {
+                               ret = uncompress_inline(path, inode, page,
+                                                       pg_offset,
+                                                       extent_offset, item);
+                               BUG_ON(ret);
+                       } else {
+                               map = kmap(page);
+                               read_extent_buffer(leaf, map + pg_offset, ptr,
+                                                  copy_size);
+                               kunmap(page);
+                       }
+                       flush_dcache_page(page);
+               } else if (create && PageUptodate(page)) {
+                       if (!trans) {
+                               free_extent_map(em);
+                               em = NULL;
+                               btrfs_release_path(root, path);
+                               trans = btrfs_join_transaction(root, 1);
+                               goto again;
+                       }
+                       map = kmap(page);
+                       write_extent_buffer(leaf, map + pg_offset, ptr,
+                                           copy_size);
+                       kunmap(page);
+                       btrfs_mark_buffer_dirty(leaf);
+               }
+               set_extent_uptodate(io_tree, em->start,
+                                   extent_map_end(em) - 1, GFP_NOFS);
+               goto insert;
+       } else {
+               printk(KERN_ERR "btrfs unknown found_type %d\n", found_type);
+               WARN_ON(1);
+       }
+not_found:
+       em->start = start;
+       em->len = len;
+not_found_em:
+       em->block_start = EXTENT_MAP_HOLE;
+       set_bit(EXTENT_FLAG_VACANCY, &em->flags);
+insert:
+       btrfs_release_path(root, path);
+       if (em->start > start || extent_map_end(em) <= start) {
+               printk(KERN_ERR "Btrfs: bad extent! em: [%llu %llu] passed "
+                      "[%llu %llu]\n", (unsigned long long)em->start,
+                      (unsigned long long)em->len,
+                      (unsigned long long)start,
+                      (unsigned long long)len);
+               err = -EIO;
+               goto out;
+       }
+
+       err = 0;
+       spin_lock(&em_tree->lock);
+       ret = add_extent_mapping(em_tree, em);
+       /* it is possible that someone inserted the extent into the tree
+        * while we had the lock dropped.  It is also possible that
+        * an overlapping map exists in the tree
+        */
+       if (ret == -EEXIST) {
+               struct extent_map *existing;
+
+               ret = 0;
+
+               existing = lookup_extent_mapping(em_tree, start, len);
+               if (existing && (existing->start > start ||
+                   existing->start + existing->len <= start)) {
+                       free_extent_map(existing);
+                       existing = NULL;
+               }
+               if (!existing) {
+                       existing = lookup_extent_mapping(em_tree, em->start,
+                                                        em->len);
+                       if (existing) {
+                               err = merge_extent_mapping(em_tree, existing,
+                                                          em, start,
+                                                          root->sectorsize);
+                               free_extent_map(existing);
+                               if (err) {
+                                       free_extent_map(em);
+                                       em = NULL;
+                               }
+                       } else {
+                               err = -EIO;
+                               free_extent_map(em);
+                               em = NULL;
+                       }
+               } else {
+                       free_extent_map(em);
+                       em = existing;
+                       err = 0;
+               }
+       }
+       spin_unlock(&em_tree->lock);
+out:
+       if (path)
+               btrfs_free_path(path);
+       if (trans) {
+               ret = btrfs_end_transaction(trans, root);
+               if (!err)
+                       err = ret;
+       }
+       if (err) {
+               free_extent_map(em);
+               WARN_ON(1);
+               return ERR_PTR(err);
+       }
+       return em;
+}
+
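+/* O_DIRECT is not supported yet; reject it at the address_space level */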
+static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb,
+                       const struct iovec *iov, loff_t offset,
+                       unsigned long nr_segs)
+{
+       return -EINVAL;
+}
+
+static sector_t btrfs_bmap(struct address_space *mapping, sector_t iblock)
+{
+       return extent_bmap(mapping, iblock, btrfs_get_extent);
+}
+
+int btrfs_readpage(struct file *file, struct page *page)
+{
+       struct extent_io_tree *tree;
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       return extent_read_full_page(tree, page, btrfs_get_extent);
+}
+
+static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
+{
+       struct extent_io_tree *tree;
+
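+       /*
+        * if we are being called from memory reclaim (PF_MEMALLOC),
+        * writing the page out could recurse into the allocator, so
+        * just redirty it and let a later writeback pass handle it
+        */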
+       if (current->flags & PF_MEMALLOC) {
+               redirty_page_for_writepage(wbc, page);
+               unlock_page(page);
+               return 0;
+       }
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
+}
+
+int btrfs_writepages(struct address_space *mapping,
+                    struct writeback_control *wbc)
+{
+       struct extent_io_tree *tree;
+
+       tree = &BTRFS_I(mapping->host)->io_tree;
+       return extent_writepages(tree, mapping, btrfs_get_extent, wbc);
+}
+
+static int
+btrfs_readpages(struct file *file, struct address_space *mapping,
+               struct list_head *pages, unsigned nr_pages)
+{
+       struct extent_io_tree *tree;
+       tree = &BTRFS_I(mapping->host)->io_tree;
+       return extent_readpages(tree, mapping, pages, nr_pages,
+                               btrfs_get_extent);
+}
+static int __btrfs_releasepage(struct page *page, gfp_t gfp_flags)
+{
+       struct extent_io_tree *tree;
+       struct extent_map_tree *map;
+       int ret;
+
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       map = &BTRFS_I(page->mapping->host)->extent_tree;
+       ret = try_release_extent_mapping(map, tree, page, gfp_flags);
+       if (ret == 1) {
+               ClearPagePrivate(page);
+               set_page_private(page, 0);
+               page_cache_release(page);
+       }
+       return ret;
+}
+
+static int btrfs_releasepage(struct page *page, gfp_t gfp_flags)
+{
+       if (PageWriteback(page) || PageDirty(page))
+               return 0;
+       return __btrfs_releasepage(page, gfp_flags);
+}
+
+static void btrfs_invalidatepage(struct page *page, unsigned long offset)
+{
+       struct extent_io_tree *tree;
+       struct btrfs_ordered_extent *ordered;
+       u64 page_start = page_offset(page);
+       u64 page_end = page_start + PAGE_CACHE_SIZE - 1;
+
+       wait_on_page_writeback(page);
+       tree = &BTRFS_I(page->mapping->host)->io_tree;
+       if (offset) {
+               btrfs_releasepage(page, GFP_NOFS);
+               return;
+       }
+
+       lock_extent(tree, page_start, page_end, GFP_NOFS);
+       ordered = btrfs_lookup_ordered_extent(page->mapping->host,
+                                          page_offset(page));
+       if (ordered) {
+               /*
+                * IO on this page will never be started, so we need
+                * to account for any ordered extents now
+                */
+               clear_extent_bit(tree, page_start, page_end,
+                                EXTENT_DIRTY | EXTENT_DELALLOC |
+                                EXTENT_LOCKED, 1, 0, GFP_NOFS);
+               btrfs_finish_ordered_io(page->mapping->host,
+                                       page_start, page_end);
+               btrfs_put_ordered_extent(ordered);
+               lock_extent(tree, page_start, page_end, GFP_NOFS);
+       }
+       clear_extent_bit(tree, page_start, page_end,
+                EXTENT_LOCKED | EXTENT_DIRTY | EXTENT_DELALLOC |
+                EXTENT_ORDERED,
+                1, 1, GFP_NOFS);
+       __btrfs_releasepage(page, GFP_NOFS);
+
+       ClearPageChecked(page);
+       if (PagePrivate(page)) {
+               ClearPagePrivate(page);
+               set_page_private(page, 0);
+               page_cache_release(page);
+       }
+}
+
+/*
+ * btrfs_page_mkwrite() is not allowed to change the file size as it gets
+ * called from a page fault handler when a page is first dirtied. Hence we must
+ * be careful to check for EOF conditions here. We set the page up correctly
+ * for a written page which means we get ENOSPC checking when writing into
+ * holes and correct delalloc and unwritten extent mapping on filesystems that
+ * support these features.
+ *
+ * We are not allowed to take the i_mutex here so we have to play games to
+ * protect against truncate races as the page could now be beyond EOF.  Because
+ * vmtruncate() writes the inode size before removing pages, once we have the
+ * page lock we can determine safely if the page is beyond EOF. If it is not
+ * beyond EOF, then the page is guaranteed safe against truncation until we
+ * unlock the page.
+ */
+int btrfs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+       struct inode *inode = fdentry(vma->vm_file)->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       struct btrfs_ordered_extent *ordered;
+       char *kaddr;
+       unsigned long zero_start;
+       loff_t size;
+       int ret;
+       u64 page_start;
+       u64 page_end;
+
+       ret = btrfs_check_free_space(root, PAGE_CACHE_SIZE, 0);
+       if (ret)
+               goto out;
+
+       ret = -EINVAL;
+again:
+       lock_page(page);
+       size = i_size_read(inode);
+       page_start = page_offset(page);
+       page_end = page_start + PAGE_CACHE_SIZE - 1;
+
+       if ((page->mapping != inode->i_mapping) ||
+           (page_start >= size)) {
+               /* page got truncated out from underneath us */
+               goto out_unlock;
+       }
+       wait_on_page_writeback(page);
+
+       lock_extent(io_tree, page_start, page_end, GFP_NOFS);
+       set_page_extent_mapped(page);
+
+       /*
+        * we can't set the delalloc bits if there are pending ordered
+        * extents.  Drop our locks and wait for them to finish
+        */
+       ordered = btrfs_lookup_ordered_extent(inode, page_start);
+       if (ordered) {
+               unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+               unlock_page(page);
+               btrfs_start_ordered_extent(inode, ordered, 1);
+               btrfs_put_ordered_extent(ordered);
+               goto again;
+       }
+
+       btrfs_set_extent_delalloc(inode, page_start, page_end);
+       ret = 0;
+
+       /* page is wholly or partially inside EOF */
+       if (page_start + PAGE_CACHE_SIZE > size)
+               zero_start = size & ~PAGE_CACHE_MASK;
+       else
+               zero_start = PAGE_CACHE_SIZE;
+
+       if (zero_start != PAGE_CACHE_SIZE) {
+               kaddr = kmap(page);
+               memset(kaddr + zero_start, 0, PAGE_CACHE_SIZE - zero_start);
+               flush_dcache_page(page);
+               kunmap(page);
+       }
+       ClearPageChecked(page);
+       set_page_dirty(page);
+       unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+
+out_unlock:
+       unlock_page(page);
+out:
+       return ret;
+}
+
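+/*
+ * truncate an inode down to i_size.  The inode is put on the orphan
+ * list first so a crash in the middle can be cleaned up at mount time
+ */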
+static void btrfs_truncate(struct inode *inode)
+{
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret;
+       struct btrfs_trans_handle *trans;
+       unsigned long nr;
+       u64 mask = root->sectorsize - 1;
+
+       if (!S_ISREG(inode->i_mode))
+               return;
+       if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
+               return;
+
+       btrfs_truncate_page(inode->i_mapping, inode->i_size);
+       btrfs_wait_ordered_range(inode, inode->i_size & (~mask), (u64)-1);
+
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, inode);
+       btrfs_i_size_write(inode, inode->i_size);
+
+       ret = btrfs_orphan_add(trans, inode);
+       if (ret)
+               goto out;
+       /* FIXME, add redo link to tree so we don't leak on crash */
+       ret = btrfs_truncate_inode_items(trans, root, inode, inode->i_size,
+                                     BTRFS_EXTENT_DATA_KEY);
+       btrfs_update_inode(trans, root, inode);
+
+       ret = btrfs_orphan_del(trans, inode);
+       BUG_ON(ret);
+
+out:
+       nr = trans->blocks_used;
+       ret = btrfs_end_transaction_throttle(trans, root);
+       BUG_ON(ret);
+       btrfs_btree_balance_dirty(root, nr);
+}
+
+/*
+ * create a new subvolume directory/inode (helper for the ioctl).
+ */
+int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *new_root, struct dentry *dentry,
+                            u64 new_dirid, u64 alloc_hint)
+{
+       struct inode *inode;
+       int error;
+       u64 index = 0;
+
+       inode = btrfs_new_inode(trans, new_root, NULL, "..", 2, new_dirid,
+                               new_dirid, alloc_hint, S_IFDIR | 0700, &index);
+       if (IS_ERR(inode))
+               return PTR_ERR(inode);
+       inode->i_op = &btrfs_dir_inode_operations;
+       inode->i_fop = &btrfs_dir_file_operations;
+
+       inode->i_nlink = 1;
+       btrfs_i_size_write(inode, 0);
+
+       error = btrfs_update_inode(trans, new_root, inode);
+       if (error)
+               return error;
+
+       d_instantiate(dentry, inode);
+       return 0;
+}
+
+/* helper function for file defrag and space balancing.  This
+ * forces readahead on a given range of bytes in an inode
+ */
+unsigned long btrfs_force_ra(struct address_space *mapping,
+                             struct file_ra_state *ra, struct file *file,
+                             pgoff_t offset, pgoff_t last_index)
+{
+       pgoff_t req_size = last_index - offset + 1;
+
+       page_cache_sync_readahead(mapping, ra, file, offset, req_size);
+       return offset + req_size;
+}
+
+struct inode *btrfs_alloc_inode(struct super_block *sb)
+{
+       struct btrfs_inode *ei;
+
+       ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_NOFS);
+       if (!ei)
+               return NULL;
+       ei->last_trans = 0;
+       ei->logged_trans = 0;
+       btrfs_ordered_inode_tree_init(&ei->ordered_tree);
+       ei->i_acl = BTRFS_ACL_NOT_CACHED;
+       ei->i_default_acl = BTRFS_ACL_NOT_CACHED;
+       INIT_LIST_HEAD(&ei->i_orphan);
+       return &ei->vfs_inode;
+}
+
+void btrfs_destroy_inode(struct inode *inode)
+{
+       struct btrfs_ordered_extent *ordered;
+       WARN_ON(!list_empty(&inode->i_dentry));
+       WARN_ON(inode->i_data.nrpages);
+
+       if (BTRFS_I(inode)->i_acl &&
+           BTRFS_I(inode)->i_acl != BTRFS_ACL_NOT_CACHED)
+               posix_acl_release(BTRFS_I(inode)->i_acl);
+       if (BTRFS_I(inode)->i_default_acl &&
+           BTRFS_I(inode)->i_default_acl != BTRFS_ACL_NOT_CACHED)
+               posix_acl_release(BTRFS_I(inode)->i_default_acl);
+
+       spin_lock(&BTRFS_I(inode)->root->list_lock);
+       if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
+               printk(KERN_ERR "BTRFS: inode %lu still on the orphan"
+                      " list\n", inode->i_ino);
+               dump_stack();
+       }
+       spin_unlock(&BTRFS_I(inode)->root->list_lock);
+
+       while (1) {
+               ordered = btrfs_lookup_first_ordered_extent(inode, (u64)-1);
+               if (!ordered)
+                       break;
+               else {
+                       printk(KERN_ERR "btrfs found ordered "
+                              "extent %llu %llu on inode cleanup\n",
+                              (unsigned long long)ordered->file_offset,
+                              (unsigned long long)ordered->len);
+                       btrfs_remove_ordered_extent(inode, ordered);
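+                       /* once for the lookup ref, once for the ref the tree held */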
+                       btrfs_put_ordered_extent(ordered);
+                       btrfs_put_ordered_extent(ordered);
+               }
+       }
+       btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
+       kmem_cache_free(btrfs_inode_cachep, BTRFS_I(inode));
+}
+
+static void init_once(void *foo)
+{
+       struct btrfs_inode *ei = (struct btrfs_inode *) foo;
+
+       inode_init_once(&ei->vfs_inode);
+}
+
+void btrfs_destroy_cachep(void)
+{
+       if (btrfs_inode_cachep)
+               kmem_cache_destroy(btrfs_inode_cachep);
+       if (btrfs_trans_handle_cachep)
+               kmem_cache_destroy(btrfs_trans_handle_cachep);
+       if (btrfs_transaction_cachep)
+               kmem_cache_destroy(btrfs_transaction_cachep);
+       if (btrfs_bit_radix_cachep)
+               kmem_cache_destroy(btrfs_bit_radix_cachep);
+       if (btrfs_path_cachep)
+               kmem_cache_destroy(btrfs_path_cachep);
+}
+
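+/* wrapper around kmem_cache_create that adds the flags every btrfs cache wants */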
+struct kmem_cache *btrfs_cache_create(const char *name, size_t size,
+                                      unsigned long extra_flags,
+                                      void (*ctor)(void *))
+{
+       return kmem_cache_create(name, size, 0, (SLAB_RECLAIM_ACCOUNT |
+                                SLAB_MEM_SPREAD | extra_flags), ctor);
+}
+
+int btrfs_init_cachep(void)
+{
+       btrfs_inode_cachep = btrfs_cache_create("btrfs_inode_cache",
+                                         sizeof(struct btrfs_inode),
+                                         0, init_once);
+       if (!btrfs_inode_cachep)
+               goto fail;
+       btrfs_trans_handle_cachep =
+                       btrfs_cache_create("btrfs_trans_handle_cache",
+                                          sizeof(struct btrfs_trans_handle),
+                                          0, NULL);
+       if (!btrfs_trans_handle_cachep)
+               goto fail;
+       btrfs_transaction_cachep = btrfs_cache_create("btrfs_transaction_cache",
+                                            sizeof(struct btrfs_transaction),
+                                            0, NULL);
+       if (!btrfs_transaction_cachep)
+               goto fail;
+       btrfs_path_cachep = btrfs_cache_create("btrfs_path_cache",
+                                        sizeof(struct btrfs_path),
+                                        0, NULL);
+       if (!btrfs_path_cachep)
+               goto fail;
+       btrfs_bit_radix_cachep = btrfs_cache_create("btrfs_radix", 256,
+                                             SLAB_DESTROY_BY_RCU, NULL);
+       if (!btrfs_bit_radix_cachep)
+               goto fail;
+       return 0;
+fail:
+       btrfs_destroy_cachep();
+       return -ENOMEM;
+}
+
+static int btrfs_getattr(struct vfsmount *mnt,
+                        struct dentry *dentry, struct kstat *stat)
+{
+       struct inode *inode = dentry->d_inode;
+       generic_fillattr(inode, stat);
+       stat->dev = BTRFS_I(inode)->root->anon_super.s_dev;
+       stat->blksize = PAGE_CACHE_SIZE;
+       stat->blocks = (inode_get_bytes(inode) +
+                       BTRFS_I(inode)->delalloc_bytes) >> 9;
+       return 0;
+}
+
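+/*
+ * rename within one subvolume: unlink the old name (and any victim at
+ * the new name), then add the inode back under the new name with a
+ * fresh directory index
+ */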
+static int btrfs_rename(struct inode *old_dir, struct dentry *old_dentry,
+                          struct inode *new_dir, struct dentry *new_dentry)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(old_dir)->root;
+       struct inode *new_inode = new_dentry->d_inode;
+       struct inode *old_inode = old_dentry->d_inode;
+       struct timespec ctime = CURRENT_TIME;
+       u64 index = 0;
+       int ret;
+
+       /* we're not allowed to rename between subvolumes */
+       if (BTRFS_I(old_inode)->root->root_key.objectid !=
+           BTRFS_I(new_dir)->root->root_key.objectid)
+               return -EXDEV;
+
+       if (S_ISDIR(old_inode->i_mode) && new_inode &&
+           new_inode->i_size > BTRFS_EMPTY_DIR_SIZE) {
+               return -ENOTEMPTY;
+       }
+
+       /* to rename a snapshot or subvolume, we need to juggle the
+        * backrefs.  This isn't coded yet
+        */
+       if (old_inode->i_ino == BTRFS_FIRST_FREE_OBJECTID)
+               return -EXDEV;
+
+       ret = btrfs_check_free_space(root, 1, 0);
+       if (ret)
+               goto out_unlock;
+
+       trans = btrfs_start_transaction(root, 1);
+
+       btrfs_set_trans_block_group(trans, new_dir);
+
+       btrfs_inc_nlink(old_dentry->d_inode);
+       old_dir->i_ctime = old_dir->i_mtime = ctime;
+       new_dir->i_ctime = new_dir->i_mtime = ctime;
+       old_inode->i_ctime = ctime;
+
+       ret = btrfs_unlink_inode(trans, root, old_dir, old_dentry->d_inode,
+                                old_dentry->d_name.name,
+                                old_dentry->d_name.len);
+       if (ret)
+               goto out_fail;
+
+       if (new_inode) {
+               new_inode->i_ctime = CURRENT_TIME;
+               ret = btrfs_unlink_inode(trans, root, new_dir,
+                                        new_dentry->d_inode,
+                                        new_dentry->d_name.name,
+                                        new_dentry->d_name.len);
+               if (ret)
+                       goto out_fail;
+               if (new_inode->i_nlink == 0) {
+                       ret = btrfs_orphan_add(trans, new_dentry->d_inode);
+                       if (ret)
+                               goto out_fail;
+               }
+
+       }
+       ret = btrfs_set_inode_index(new_dir, &index);
+       if (ret)
+               goto out_fail;
+
+       ret = btrfs_add_link(trans, new_dentry->d_parent->d_inode,
+                            old_inode, new_dentry->d_name.name,
+                            new_dentry->d_name.len, 1, index);
+       if (ret)
+               goto out_fail;
+
+out_fail:
+       btrfs_end_transaction_throttle(trans, root);
+out_unlock:
+       return ret;
+}
+
+/*
+ * some fairly slow code that needs optimization. This walks the list
+ * of all the inodes with pending delalloc and forces them to disk.
+ */
+int btrfs_start_delalloc_inodes(struct btrfs_root *root)
+{
+       struct list_head *head = &root->fs_info->delalloc_inodes;
+       struct btrfs_inode *binode;
+       struct inode *inode;
+
+       if (root->fs_info->sb->s_flags & MS_RDONLY)
+               return -EROFS;
+
+       spin_lock(&root->fs_info->delalloc_lock);
+       while (!list_empty(head)) {
+               binode = list_entry(head->next, struct btrfs_inode,
+                                   delalloc_inodes);
+               inode = igrab(&binode->vfs_inode);
+               if (!inode)
+                       list_del_init(&binode->delalloc_inodes);
+               spin_unlock(&root->fs_info->delalloc_lock);
+               if (inode) {
+                       filemap_flush(inode->i_mapping);
+                       iput(inode);
+               }
+               cond_resched();
+               spin_lock(&root->fs_info->delalloc_lock);
+       }
+       spin_unlock(&root->fs_info->delalloc_lock);
+
+       /* the filemap_flush will queue IO into the worker threads, but
+        * we have to make sure the IO is actually started and that
+        * ordered extents get created before we return
+        */
+       atomic_inc(&root->fs_info->async_submit_draining);
+       while (atomic_read(&root->fs_info->nr_async_submits) ||
+             atomic_read(&root->fs_info->async_delalloc_pages)) {
+               wait_event(root->fs_info->async_submit_wait,
+                  (atomic_read(&root->fs_info->nr_async_submits) == 0 &&
+                   atomic_read(&root->fs_info->async_delalloc_pages) == 0));
+       }
+       atomic_dec(&root->fs_info->async_submit_draining);
+       return 0;
+}
+
+static int btrfs_symlink(struct inode *dir, struct dentry *dentry,
+                        const char *symname)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(dir)->root;
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct inode *inode = NULL;
+       int err;
+       int drop_inode = 0;
+       u64 objectid;
+       u64 index = 0;
+       int name_len;
+       int datasize;
+       unsigned long ptr;
+       struct btrfs_file_extent_item *ei;
+       struct extent_buffer *leaf;
+       unsigned long nr = 0;
+
+       name_len = strlen(symname) + 1;
+       if (name_len > BTRFS_MAX_INLINE_DATA_SIZE(root))
+               return -ENAMETOOLONG;
+
+       err = btrfs_check_free_space(root, 1, 0);
+       if (err)
+               goto out_fail;
+
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, dir);
+
+       err = btrfs_find_free_objectid(trans, root, dir->i_ino, &objectid);
+       if (err) {
+               err = -ENOSPC;
+               goto out_unlock;
+       }
+
+       inode = btrfs_new_inode(trans, root, dir, dentry->d_name.name,
+                               dentry->d_name.len,
+                               dentry->d_parent->d_inode->i_ino, objectid,
+                               BTRFS_I(dir)->block_group, S_IFLNK|S_IRWXUGO,
+                               &index);
+       err = PTR_ERR(inode);
+       if (IS_ERR(inode))
+               goto out_unlock;
+
+       err = btrfs_init_acl(inode, dir);
+       if (err) {
+               drop_inode = 1;
+               goto out_unlock;
+       }
+
+       btrfs_set_trans_block_group(trans, inode);
+       err = btrfs_add_nondir(trans, dentry, inode, 0, index);
+       if (err)
+               drop_inode = 1;
+       else {
+               inode->i_mapping->a_ops = &btrfs_aops;
+               inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+               inode->i_fop = &btrfs_file_operations;
+               inode->i_op = &btrfs_file_inode_operations;
+               BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops;
+       }
+       dir->i_sb->s_dirt = 1;
+       btrfs_update_inode_block_group(trans, inode);
+       btrfs_update_inode_block_group(trans, dir);
+       if (drop_inode)
+               goto out_unlock;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       key.objectid = inode->i_ino;
+       key.offset = 0;
+       btrfs_set_key_type(&key, BTRFS_EXTENT_DATA_KEY);
+       datasize = btrfs_file_extent_calc_inline_size(name_len);
+       err = btrfs_insert_empty_item(trans, root, path, &key,
+                                     datasize);
+       if (err) {
+               btrfs_free_path(path);
+               drop_inode = 1;
+               goto out_unlock;
+       }
+       leaf = path->nodes[0];
+       ei = btrfs_item_ptr(leaf, path->slots[0],
+                           struct btrfs_file_extent_item);
+       btrfs_set_file_extent_generation(leaf, ei, trans->transid);
+       btrfs_set_file_extent_type(leaf, ei,
+                                  BTRFS_FILE_EXTENT_INLINE);
+       btrfs_set_file_extent_encryption(leaf, ei, 0);
+       btrfs_set_file_extent_compression(leaf, ei, 0);
+       btrfs_set_file_extent_other_encoding(leaf, ei, 0);
+       btrfs_set_file_extent_ram_bytes(leaf, ei, name_len);
+
+       ptr = btrfs_file_extent_inline_start(ei);
+       write_extent_buffer(leaf, symname, ptr, name_len);
+       btrfs_mark_buffer_dirty(leaf);
+       btrfs_free_path(path);
+
+       inode->i_op = &btrfs_symlink_inode_operations;
+       inode->i_mapping->a_ops = &btrfs_symlink_aops;
+       inode->i_mapping->backing_dev_info = &root->fs_info->bdi;
+       inode_set_bytes(inode, name_len);
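+       /* i_size excludes the trailing NUL that the inline extent stores */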
+       btrfs_i_size_write(inode, name_len - 1);
+       err = btrfs_update_inode(trans, root, inode);
+       if (err)
+               drop_inode = 1;
+
+out_unlock:
+       nr = trans->blocks_used;
+       btrfs_end_transaction_throttle(trans, root);
+out_fail:
+       if (drop_inode) {
+               inode_dec_link_count(inode);
+               iput(inode);
+       }
+       btrfs_btree_balance_dirty(root, nr);
+       return err;
+}
+
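+/*
+ * reserve on-disk extents for [start, end) and record them as PREALLOC
+ * file extents; preallocated space reads back as zeros until written
+ */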
+static int prealloc_file_range(struct inode *inode, u64 start, u64 end,
+                              u64 alloc_hint, int mode)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_key ins;
+       u64 alloc_size;
+       u64 cur_offset = start;
+       u64 num_bytes = end - start;
+       int ret = 0;
+
+       trans = btrfs_join_transaction(root, 1);
+       BUG_ON(!trans);
+       btrfs_set_trans_block_group(trans, inode);
+
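+       /* carve the range into extents no larger than max_extent */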
+       while (num_bytes > 0) {
+               alloc_size = min(num_bytes, root->fs_info->max_extent);
+               ret = btrfs_reserve_extent(trans, root, alloc_size,
+                                          root->sectorsize, 0, alloc_hint,
+                                          (u64)-1, &ins, 1);
+               if (ret) {
+                       WARN_ON(1);
+                       goto out;
+               }
+               ret = insert_reserved_file_extent(trans, inode,
+                                                 cur_offset, ins.objectid,
+                                                 ins.offset, ins.offset,
+                                                 ins.offset, 0, 0, 0,
+                                                 BTRFS_FILE_EXTENT_PREALLOC);
+               BUG_ON(ret);
+               num_bytes -= ins.offset;
+               cur_offset += ins.offset;
+               alloc_hint = ins.objectid + ins.offset;
+       }
+out:
+       if (cur_offset > start) {
+               inode->i_ctime = CURRENT_TIME;
+               btrfs_set_flag(inode, PREALLOC);
+               if (!(mode & FALLOC_FL_KEEP_SIZE) &&
+                   cur_offset > i_size_read(inode))
+                       btrfs_i_size_write(inode, cur_offset);
+               ret = btrfs_update_inode(trans, root, inode);
+               BUG_ON(ret);
+       }
+
+       btrfs_end_transaction(trans, root);
+       return ret;
+}
+
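+/*
+ * fallocate: round the range out to sector boundaries, then plug any
+ * holes in it with preallocated extents
+ */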
+static long btrfs_fallocate(struct inode *inode, int mode,
+                           loff_t offset, loff_t len)
+{
+       u64 cur_offset;
+       u64 last_byte;
+       u64 alloc_start;
+       u64 alloc_end;
+       u64 alloc_hint = 0;
+       u64 mask = BTRFS_I(inode)->root->sectorsize - 1;
+       struct extent_map *em;
+       int ret;
+
+       alloc_start = offset & ~mask;
+       alloc_end = (offset + len + mask) & ~mask;
+
+       mutex_lock(&inode->i_mutex);
+       if (alloc_start > inode->i_size) {
+               ret = btrfs_cont_expand(inode, alloc_start);
+               if (ret)
+                       goto out;
+       }
+
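+       /*
+        * lock the extent range, retrying until no ordered (in-flight)
+        * I/O overlaps it
+        */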
+       while (1) {
+               struct btrfs_ordered_extent *ordered;
+               lock_extent(&BTRFS_I(inode)->io_tree, alloc_start,
+                           alloc_end - 1, GFP_NOFS);
+               ordered = btrfs_lookup_first_ordered_extent(inode,
+                                                           alloc_end - 1);
+               if (ordered &&
+                   ordered->file_offset + ordered->len > alloc_start &&
+                   ordered->file_offset < alloc_end) {
+                       btrfs_put_ordered_extent(ordered);
+                       unlock_extent(&BTRFS_I(inode)->io_tree,
+                                     alloc_start, alloc_end - 1, GFP_NOFS);
+                       btrfs_wait_ordered_range(inode, alloc_start,
+                                                alloc_end - alloc_start);
+               } else {
+                       if (ordered)
+                               btrfs_put_ordered_extent(ordered);
+                       break;
+               }
+       }
+
+       cur_offset = alloc_start;
+       while (1) {
+               em = btrfs_get_extent(inode, NULL, 0, cur_offset,
+                                     alloc_end - cur_offset, 0);
+               BUG_ON(IS_ERR(em) || !em);
+               last_byte = min(extent_map_end(em), alloc_end);
+               last_byte = (last_byte + mask) & ~mask;
+               if (em->block_start == EXTENT_MAP_HOLE) {
+                       ret = prealloc_file_range(inode, cur_offset,
+                                       last_byte, alloc_hint, mode);
+                       if (ret < 0) {
+                               free_extent_map(em);
+                               break;
+                       }
+               }
+               if (em->block_start <= EXTENT_MAP_LAST_BYTE)
+                       alloc_hint = em->block_start;
+               free_extent_map(em);
+
+               cur_offset = last_byte;
+               if (cur_offset >= alloc_end) {
+                       ret = 0;
+                       break;
+               }
+       }
+       unlock_extent(&BTRFS_I(inode)->io_tree, alloc_start, alloc_end - 1,
+                     GFP_NOFS);
+out:
+       mutex_unlock(&inode->i_mutex);
+       return ret;
+}
+
+static int btrfs_set_page_dirty(struct page *page)
+{
+       return __set_page_dirty_nobuffers(page);
+}
+
+static int btrfs_permission(struct inode *inode, int mask)
+{
+       if (btrfs_test_flag(inode, READONLY) && (mask & MAY_WRITE))
+               return -EACCES;
+       return generic_permission(inode, mask, btrfs_check_acl);
+}
+
+static struct inode_operations btrfs_dir_inode_operations = {
+       .getattr        = btrfs_getattr,
+       .lookup         = btrfs_lookup,
+       .create         = btrfs_create,
+       .unlink         = btrfs_unlink,
+       .link           = btrfs_link,
+       .mkdir          = btrfs_mkdir,
+       .rmdir          = btrfs_rmdir,
+       .rename         = btrfs_rename,
+       .symlink        = btrfs_symlink,
+       .setattr        = btrfs_setattr,
+       .mknod          = btrfs_mknod,
+       .setxattr       = btrfs_setxattr,
+       .getxattr       = btrfs_getxattr,
+       .listxattr      = btrfs_listxattr,
+       .removexattr    = btrfs_removexattr,
+       .permission     = btrfs_permission,
+};
+static struct inode_operations btrfs_dir_ro_inode_operations = {
+       .lookup         = btrfs_lookup,
+       .permission     = btrfs_permission,
+};
+static struct file_operations btrfs_dir_file_operations = {
+       .llseek         = generic_file_llseek,
+       .read           = generic_read_dir,
+       .readdir        = btrfs_real_readdir,
+       .unlocked_ioctl = btrfs_ioctl,
+#ifdef CONFIG_COMPAT
+       .compat_ioctl   = btrfs_ioctl,
+#endif
+       .release        = btrfs_release_file,
+       .fsync          = btrfs_sync_file,
+};
+
+static struct extent_io_ops btrfs_extent_io_ops = {
+       .fill_delalloc = run_delalloc_range,
+       .submit_bio_hook = btrfs_submit_bio_hook,
+       .merge_bio_hook = btrfs_merge_bio_hook,
+       .readpage_end_io_hook = btrfs_readpage_end_io_hook,
+       .writepage_end_io_hook = btrfs_writepage_end_io_hook,
+       .writepage_start_hook = btrfs_writepage_start_hook,
+       .readpage_io_failed_hook = btrfs_io_failed_hook,
+       .set_bit_hook = btrfs_set_bit_hook,
+       .clear_bit_hook = btrfs_clear_bit_hook,
+};
+
+static struct address_space_operations btrfs_aops = {
+       .readpage       = btrfs_readpage,
+       .writepage      = btrfs_writepage,
+       .writepages     = btrfs_writepages,
+       .readpages      = btrfs_readpages,
+       .sync_page      = block_sync_page,
+       .bmap           = btrfs_bmap,
+       .direct_IO      = btrfs_direct_IO,
+       .invalidatepage = btrfs_invalidatepage,
+       .releasepage    = btrfs_releasepage,
+       .set_page_dirty = btrfs_set_page_dirty,
+};
+
+static struct address_space_operations btrfs_symlink_aops = {
+       .readpage       = btrfs_readpage,
+       .writepage      = btrfs_writepage,
+       .invalidatepage = btrfs_invalidatepage,
+       .releasepage    = btrfs_releasepage,
+};
+
+static struct inode_operations btrfs_file_inode_operations = {
+       .truncate       = btrfs_truncate,
+       .getattr        = btrfs_getattr,
+       .setattr        = btrfs_setattr,
+       .setxattr       = btrfs_setxattr,
+       .getxattr       = btrfs_getxattr,
+       .listxattr      = btrfs_listxattr,
+       .removexattr    = btrfs_removexattr,
+       .permission     = btrfs_permission,
+       .fallocate      = btrfs_fallocate,
+};
+static struct inode_operations btrfs_special_inode_operations = {
+       .getattr        = btrfs_getattr,
+       .setattr        = btrfs_setattr,
+       .permission     = btrfs_permission,
+       .setxattr       = btrfs_setxattr,
+       .getxattr       = btrfs_getxattr,
+       .listxattr      = btrfs_listxattr,
+       .removexattr    = btrfs_removexattr,
+};
+static struct inode_operations btrfs_symlink_inode_operations = {
+       .readlink       = generic_readlink,
+       .follow_link    = page_follow_link_light,
+       .put_link       = page_put_link,
+       .permission     = btrfs_permission,
+};
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
new file mode 100644 (file)
index 0000000..c2aa33e
--- /dev/null
@@ -0,0 +1,1132 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/kernel.h>
+#include <linux/bio.h>
+#include <linux/buffer_head.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/fsnotify.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
+#include <linux/mount.h>
+#include <linux/mpage.h>
+#include <linux/namei.h>
+#include <linux/swap.h>
+#include <linux/writeback.h>
+#include <linux/statfs.h>
+#include <linux/compat.h>
+#include <linux/bit_spinlock.h>
+#include <linux/security.h>
+#include <linux/version.h>
+#include <linux/xattr.h>
+#include <linux/vmalloc.h>
+#include "compat.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "btrfs_inode.h"
+#include "ioctl.h"
+#include "print-tree.h"
+#include "volumes.h"
+#include "locking.h"
+
+static noinline int create_subvol(struct btrfs_root *root,
+                                 struct dentry *dentry,
+                                 char *name, int namelen)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_key key;
+       struct btrfs_root_item root_item;
+       struct btrfs_inode_item *inode_item;
+       struct extent_buffer *leaf;
+       struct btrfs_root *new_root = root;
+       struct inode *dir;
+       int ret;
+       int err;
+       u64 objectid;
+       u64 new_dirid = BTRFS_FIRST_FREE_OBJECTID;
+       u64 index = 0;
+       unsigned long nr = 1;
+
+       ret = btrfs_check_free_space(root, 1, 0);
+       if (ret)
+               goto fail_commit;
+
+       trans = btrfs_start_transaction(root, 1);
+       BUG_ON(!trans);
+
+       ret = btrfs_find_free_objectid(trans, root->fs_info->tree_root,
+                                      0, &objectid);
+       if (ret)
+               goto fail;
+
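+       /* allocate an empty leaf to serve as the root of the new subvolume */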
+       leaf = btrfs_alloc_free_block(trans, root, root->leafsize, 0,
+                                     objectid, trans->transid, 0, 0, 0);
+       if (IS_ERR(leaf)) {
+               ret = PTR_ERR(leaf);
+               goto fail;
+       }
+
+       btrfs_set_header_nritems(leaf, 0);
+       btrfs_set_header_level(leaf, 0);
+       btrfs_set_header_bytenr(leaf, leaf->start);
+       btrfs_set_header_generation(leaf, trans->transid);
+       btrfs_set_header_owner(leaf, objectid);
+
+       write_extent_buffer(leaf, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(leaf),
+                           BTRFS_FSID_SIZE);
+       btrfs_mark_buffer_dirty(leaf);
+
+       inode_item = &root_item.inode;
+       memset(inode_item, 0, sizeof(*inode_item));
+       inode_item->generation = cpu_to_le64(1);
+       inode_item->size = cpu_to_le64(3);
+       inode_item->nlink = cpu_to_le32(1);
+       inode_item->nbytes = cpu_to_le64(root->leafsize);
+       inode_item->mode = cpu_to_le32(S_IFDIR | 0755);
+
+       btrfs_set_root_bytenr(&root_item, leaf->start);
+       btrfs_set_root_generation(&root_item, trans->transid);
+       btrfs_set_root_level(&root_item, 0);
+       btrfs_set_root_refs(&root_item, 1);
+       btrfs_set_root_used(&root_item, 0);
+       btrfs_set_root_last_snapshot(&root_item, 0);
+
+       memset(&root_item.drop_progress, 0, sizeof(root_item.drop_progress));
+       root_item.drop_level = 0;
+
+       btrfs_tree_unlock(leaf);
+       free_extent_buffer(leaf);
+       leaf = NULL;
+
+       btrfs_set_root_dirid(&root_item, new_dirid);
+
+       key.objectid = objectid;
+       key.offset = 1;
+       btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
+       ret = btrfs_insert_root(trans, root->fs_info->tree_root, &key,
+                               &root_item);
+       if (ret)
+               goto fail;
+
+       /*
+        * insert the directory item
+        */
+       key.offset = (u64)-1;
+       dir = dentry->d_parent->d_inode;
+       ret = btrfs_set_inode_index(dir, &index);
+       BUG_ON(ret);
+
+       ret = btrfs_insert_dir_item(trans, root,
+                                   name, namelen, dir->i_ino, &key,
+                                   BTRFS_FT_DIR, index);
+       if (ret)
+               goto fail;
+
+       btrfs_i_size_write(dir, dir->i_size + namelen * 2);
+       ret = btrfs_update_inode(trans, root, dir);
+       BUG_ON(ret);
+
+       /* add the backref first */
+       ret = btrfs_add_root_ref(trans, root->fs_info->tree_root,
+                                objectid, BTRFS_ROOT_BACKREF_KEY,
+                                root->root_key.objectid,
+                                dir->i_ino, index, name, namelen);
+
+       BUG_ON(ret);
+
+       /* now add the forward ref */
+       ret = btrfs_add_root_ref(trans, root->fs_info->tree_root,
+                                root->root_key.objectid, BTRFS_ROOT_REF_KEY,
+                                objectid,
+                                dir->i_ino, index, name, namelen);
+
+       BUG_ON(ret);
+
+       ret = btrfs_commit_transaction(trans, root);
+       if (ret)
+               goto fail_commit;
+
+       new_root = btrfs_read_fs_root_no_name(root->fs_info, &key);
+       BUG_ON(!new_root);
+
+       trans = btrfs_start_transaction(new_root, 1);
+       BUG_ON(!trans);
+
+       ret = btrfs_create_subvol_root(trans, new_root, dentry, new_dirid,
+                                      BTRFS_I(dir)->block_group);
+       if (ret)
+               goto fail;
+
+fail:
+       nr = trans->blocks_used;
+       err = btrfs_commit_transaction(trans, new_root);
+       if (err && !ret)
+               ret = err;
+fail_commit:
+       btrfs_btree_balance_dirty(root, nr);
+       return ret;
+}
+
+static int create_snapshot(struct btrfs_root *root, struct dentry *dentry,
+                          char *name, int namelen)
+{
+       struct btrfs_pending_snapshot *pending_snapshot;
+       struct btrfs_trans_handle *trans;
+       int ret = 0;
+       int err;
+       unsigned long nr = 0;
+
+       if (!root->ref_cows)
+               return -EINVAL;
+
+       ret = btrfs_check_free_space(root, 1, 0);
+       if (ret)
+               goto fail_unlock;
+
+       pending_snapshot = kzalloc(sizeof(*pending_snapshot), GFP_NOFS);
+       if (!pending_snapshot) {
+               ret = -ENOMEM;
+               goto fail_unlock;
+       }
+       pending_snapshot->name = kmalloc(namelen + 1, GFP_NOFS);
+       if (!pending_snapshot->name) {
+               ret = -ENOMEM;
+               kfree(pending_snapshot);
+               goto fail_unlock;
+       }
+       memcpy(pending_snapshot->name, name, namelen);
+       pending_snapshot->name[namelen] = '\0';
+       pending_snapshot->dentry = dentry;
+       trans = btrfs_start_transaction(root, 1);
+       BUG_ON(!trans);
+       pending_snapshot->root = root;
+       list_add(&pending_snapshot->list,
+                &trans->transaction->pending_snapshots);
+       err = btrfs_commit_transaction(trans, root);
+       if (err && !ret)
+               ret = err;
+
+fail_unlock:
+       btrfs_btree_balance_dirty(root, nr);
+       return ret;
+}
+
+/* copy of may_create() in fs/namei.c */
+static inline int btrfs_may_create(struct inode *dir, struct dentry *child)
+{
+       if (child->d_inode)
+               return -EEXIST;
+       if (IS_DEADDIR(dir))
+               return -ENOENT;
+       return inode_permission(dir, MAY_WRITE | MAY_EXEC);
+}
+
+/*
+ * Create a new subvolume below @parent.  This is largely modeled after
+ * sys_mkdirat and vfs_mkdir, but we only do a single component lookup
+ * inside this filesystem so it's quite a bit simpler.
+ */
+static noinline int btrfs_mksubvol(struct path *parent, char *name,
+                                  int mode, int namelen,
+                                  struct btrfs_root *snap_src)
+{
+       struct dentry *dentry;
+       int error;
+
+       mutex_lock_nested(&parent->dentry->d_inode->i_mutex, I_MUTEX_PARENT);
+
+       dentry = lookup_one_len(name, parent->dentry, namelen);
+       error = PTR_ERR(dentry);
+       if (IS_ERR(dentry))
+               goto out_unlock;
+
+       error = -EEXIST;
+       if (dentry->d_inode)
+               goto out_dput;
+
+       if (!IS_POSIXACL(parent->dentry->d_inode))
+               mode &= ~current->fs->umask;
+
+       error = mnt_want_write(parent->mnt);
+       if (error)
+               goto out_dput;
+
+       error = btrfs_may_create(parent->dentry->d_inode, dentry);
+       if (error)
+               goto out_drop_write;
+
+       /*
+        * Actually perform the low-level subvolume creation after all
+        * this VFS fuss.
+        *
+        * Eventually we want to pass in an inode under which we create this
+        * subvolume, but for now all are under the filesystem root.
+        *
+        * Also we should pass on the mode eventually to allow creating new
+        * subvolumes with specific mode bits.
+        */
+       if (snap_src) {
+               struct dentry *dir = dentry->d_parent;
+               struct dentry *test = dir->d_parent;
+               struct btrfs_path *path = btrfs_alloc_path();
+               int ret;
+               u64 test_oid;
+               u64 parent_oid = BTRFS_I(dir->d_inode)->root->root_key.objectid;
+
+               test_oid = snap_src->root_key.objectid;
+
+               ret = btrfs_find_root_ref(snap_src->fs_info->tree_root,
+                                         path, parent_oid, test_oid);
+               if (ret == 0)
+                       goto create;
+               btrfs_release_path(snap_src->fs_info->tree_root, path);
+
+               /* we need to make sure we aren't creating a directory loop
+                * by taking a snapshot of something that has our current
+                * subvol in its directory tree.  So, this loops through
+                * the dentries and checks the forward refs for each subvolume
+                * to see if it references the subvolume where we are
+                * placing this new snapshot.
+                */
+               while (1) {
+                       if (!test ||
+                           dir == snap_src->fs_info->sb->s_root ||
+                           test == snap_src->fs_info->sb->s_root ||
+                           test->d_inode->i_sb != snap_src->fs_info->sb) {
+                               break;
+                       }
+                       if (S_ISLNK(test->d_inode->i_mode)) {
+                               printk(KERN_INFO "Btrfs symlink in snapshot "
+                                      "path, failed\n");
+                               error = -EMLINK;
+                               btrfs_free_path(path);
+                               goto out_drop_write;
+                       }
+                       test_oid =
+                               BTRFS_I(test->d_inode)->root->root_key.objectid;
+                       ret = btrfs_find_root_ref(snap_src->fs_info->tree_root,
+                                 path, test_oid, parent_oid);
+                       if (ret == 0) {
+                               printk(KERN_INFO "Btrfs snapshot creation "
+                                      "failed, looping\n");
+                               error = -EMLINK;
+                               btrfs_free_path(path);
+                               goto out_drop_write;
+                       }
+                       btrfs_release_path(snap_src->fs_info->tree_root, path);
+                       test = test->d_parent;
+               }
+create:
+               btrfs_free_path(path);
+               error = create_snapshot(snap_src, dentry, name, namelen);
+       } else {
+               error = create_subvol(BTRFS_I(parent->dentry->d_inode)->root,
+                                     dentry, name, namelen);
+       }
+       if (error)
+               goto out_drop_write;
+
+       fsnotify_mkdir(parent->dentry->d_inode, dentry);
+out_drop_write:
+       mnt_drop_write(parent->mnt);
+out_dput:
+       dput(dentry);
+out_unlock:
+       mutex_unlock(&parent->dentry->d_inode->i_mutex);
+       return error;
+}
+
+static int btrfs_defrag_file(struct file *file)
+{
+       struct inode *inode = fdentry(file)->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       struct btrfs_ordered_extent *ordered;
+       struct page *page;
+       unsigned long last_index;
+       unsigned long ra_pages = root->fs_info->bdi.ra_pages;
+       unsigned long total_read = 0;
+       u64 page_start;
+       u64 page_end;
+       unsigned long i;
+       int ret;
+
+       ret = btrfs_check_free_space(root, inode->i_size, 0);
+       if (ret)
+               return -ENOSPC;
+
+       mutex_lock(&inode->i_mutex);
+       last_index = inode->i_size >> PAGE_CACHE_SHIFT;
+       for (i = 0; i <= last_index; i++) {
+               if (total_read % ra_pages == 0) {
+                       btrfs_force_ra(inode->i_mapping, &file->f_ra, file, i,
+                                      min(last_index, i + ra_pages - 1));
+               }
+               total_read++;
+again:
+               page = grab_cache_page(inode->i_mapping, i);
+               if (!page)
+                       goto out_unlock;
+               if (!PageUptodate(page)) {
+                       btrfs_readpage(NULL, page);
+                       lock_page(page);
+                       if (!PageUptodate(page)) {
+                               unlock_page(page);
+                               page_cache_release(page);
+                               goto out_unlock;
+                       }
+               }
+
+               wait_on_page_writeback(page);
+
+               page_start = (u64)page->index << PAGE_CACHE_SHIFT;
+               page_end = page_start + PAGE_CACHE_SIZE - 1;
+               lock_extent(io_tree, page_start, page_end, GFP_NOFS);
+
+               ordered = btrfs_lookup_ordered_extent(inode, page_start);
+               if (ordered) {
+                       unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+                       unlock_page(page);
+                       page_cache_release(page);
+                       btrfs_start_ordered_extent(inode, ordered, 1);
+                       btrfs_put_ordered_extent(ordered);
+                       goto again;
+               }
+               set_page_extent_mapped(page);
+
+               /*
+                * this makes sure page_mkwrite is called on the
+                * page if it is dirtied again later
+                */
+               clear_page_dirty_for_io(page);
+
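+               /* tag the range delalloc so writeback reallocates it */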
+               btrfs_set_extent_delalloc(inode, page_start, page_end);
+
+               unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
+               set_page_dirty(page);
+               unlock_page(page);
+               page_cache_release(page);
+               balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
+       }
+
+out_unlock:
+       mutex_unlock(&inode->i_mutex);
+       return 0;
+}
+
+/*
+ * Called inside transaction, so use GFP_NOFS
+ */
+
+static int btrfs_ioctl_resize(struct btrfs_root *root, void __user *arg)
+{
+       u64 new_size;
+       u64 old_size;
+       u64 devid = 1;
+       struct btrfs_ioctl_vol_args *vol_args;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_device *device = NULL;
+       char *sizestr;
+       char *devstr = NULL;
+       int ret = 0;
+       int namelen;
+       int mod = 0;
+
+       if (root->fs_info->sb->s_flags & MS_RDONLY)
+               return -EROFS;
+
+       if (!capable(CAP_SYS_ADMIN))
+               return -EPERM;
+
+       vol_args = kmalloc(sizeof(*vol_args), GFP_NOFS);
+
+       if (!vol_args)
+               return -ENOMEM;
+
+       if (copy_from_user(vol_args, arg, sizeof(*vol_args))) {
+               ret = -EFAULT;
+               goto out;
+       }
+
+       vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
+       namelen = strlen(vol_args->name);
+
+       mutex_lock(&root->fs_info->volume_mutex);
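+       /*
+        * the argument has the form "[devid:]size"; size may be "max" or
+        * prefixed with '+' or '-' for a relative change
+        */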
+       sizestr = vol_args->name;
+       devstr = strchr(sizestr, ':');
+       if (devstr) {
+               char *end;
+               sizestr = devstr + 1;
+               *devstr = '\0';
+               devstr = vol_args->name;
+               devid = simple_strtoull(devstr, &end, 10);
+               printk(KERN_INFO "resizing devid %llu\n", devid);
+       }
+       device = btrfs_find_device(root, devid, NULL, NULL);
+       if (!device) {
+               printk(KERN_INFO "resizer unable to find device %llu\n", devid);
+               ret = -EINVAL;
+               goto out_unlock;
+       }
+       if (!strcmp(sizestr, "max")) {
+               new_size = device->bdev->bd_inode->i_size;
+       } else {
+               if (sizestr[0] == '-') {
+                       mod = -1;
+                       sizestr++;
+               } else if (sizestr[0] == '+') {
+                       mod = 1;
+                       sizestr++;
+               }
+               new_size = btrfs_parse_size(sizestr);
+               if (new_size == 0) {
+                       ret = -EINVAL;
+                       goto out_unlock;
+               }
+       }
+
+       old_size = device->total_bytes;
+
+       if (mod < 0) {
+               if (new_size > old_size) {
+                       ret = -EINVAL;
+                       goto out_unlock;
+               }
+               new_size = old_size - new_size;
+       } else if (mod > 0) {
+               new_size = old_size + new_size;
+       }
+
+       if (new_size < 256 * 1024 * 1024) {
+               ret = -EINVAL;
+               goto out_unlock;
+       }
+       if (new_size > device->bdev->bd_inode->i_size) {
+               ret = -EFBIG;
+               goto out_unlock;
+       }
+
+       do_div(new_size, root->sectorsize);
+       new_size *= root->sectorsize;
+
+       printk(KERN_INFO "new size for %s is %llu\n",
+               device->name, (unsigned long long)new_size);
+
+       if (new_size > old_size) {
+               trans = btrfs_start_transaction(root, 1);
+               ret = btrfs_grow_device(trans, device, new_size);
+               btrfs_commit_transaction(trans, root);
+       } else {
+               ret = btrfs_shrink_device(device, new_size);
+       }
+
+out_unlock:
+       mutex_unlock(&root->fs_info->volume_mutex);
+out:
+       kfree(vol_args);
+       return ret;
+}
+
+static noinline int btrfs_ioctl_snap_create(struct file *file,
+                                           void __user *arg, int subvol)
+{
+       struct btrfs_root *root = BTRFS_I(fdentry(file)->d_inode)->root;
+       struct btrfs_ioctl_vol_args *vol_args;
+       struct btrfs_dir_item *di;
+       struct btrfs_path *path;
+       struct file *src_file;
+       u64 root_dirid;
+       int namelen;
+       int ret = 0;
+
+       if (root->fs_info->sb->s_flags & MS_RDONLY)
+               return -EROFS;
+
+       vol_args = kmalloc(sizeof(*vol_args), GFP_NOFS);
+
+       if (!vol_args)
+               return -ENOMEM;
+
+       if (copy_from_user(vol_args, arg, sizeof(*vol_args))) {
+               ret = -EFAULT;
+               goto out;
+       }
+
+       vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
+       namelen = strlen(vol_args->name);
+       if (strchr(vol_args->name, '/')) {
+               ret = -EINVAL;
+               goto out;
+       }
+
+       path = btrfs_alloc_path();
+       if (!path) {
+               ret = -ENOMEM;
+               goto out;
+       }
+
+       root_dirid = root->fs_info->sb->s_root->d_inode->i_ino;
+       di = btrfs_lookup_dir_item(NULL, root->fs_info->tree_root,
+                           path, root_dirid,
+                           vol_args->name, namelen, 0);
+       btrfs_free_path(path);
+
+       if (di && !IS_ERR(di)) {
+               ret = -EEXIST;
+               goto out;
+       }
+
+       if (IS_ERR(di)) {
+               ret = PTR_ERR(di);
+               goto out;
+       }
+
+       if (subvol) {
+               ret = btrfs_mksubvol(&file->f_path, vol_args->name,
+                                    file->f_path.dentry->d_inode->i_mode,
+                                    namelen, NULL);
+       } else {
+               struct inode *src_inode;
+               src_file = fget(vol_args->fd);
+               if (!src_file) {
+                       ret = -EINVAL;
+                       goto out;
+               }
+
+               src_inode = src_file->f_path.dentry->d_inode;
+               if (src_inode->i_sb != file->f_path.dentry->d_inode->i_sb) {
+                       printk(KERN_INFO "btrfs: Snapshot src from "
+                              "another FS\n");
+                       ret = -EINVAL;
+                       fput(src_file);
+                       goto out;
+               }
+               ret = btrfs_mksubvol(&file->f_path, vol_args->name,
+                            file->f_path.dentry->d_inode->i_mode,
+                            namelen, BTRFS_I(src_inode)->root);
+               fput(src_file);
+       }
+
+out:
+       kfree(vol_args);
+       return ret;
+}
+
+static int btrfs_ioctl_defrag(struct file *file)
+{
+       struct inode *inode = fdentry(file)->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       int ret;
+
+       ret = mnt_want_write(file->f_path.mnt);
+       if (ret)
+               return ret;
+
+       switch (inode->i_mode & S_IFMT) {
+       case S_IFDIR:
+               if (!capable(CAP_SYS_ADMIN)) {
+                       ret = -EPERM;
+                       goto out;
+               }
+               btrfs_defrag_root(root, 0);
+               btrfs_defrag_root(root->fs_info->extent_root, 0);
+               break;
+       case S_IFREG:
+               if (!(file->f_mode & FMODE_WRITE)) {
+                       ret = -EINVAL;
+                       goto out;
+               }
+               btrfs_defrag_file(file);
+               break;
+       }
+out:
+       mnt_drop_write(file->f_path.mnt);
+       return ret;
+}
+
+static long btrfs_ioctl_add_dev(struct btrfs_root *root, void __user *arg)
+{
+       struct btrfs_ioctl_vol_args *vol_args;
+       int ret;
+
+       if (!capable(CAP_SYS_ADMIN))
+               return -EPERM;
+
+       vol_args = kmalloc(sizeof(*vol_args), GFP_NOFS);
+
+       if (!vol_args)
+               return -ENOMEM;
+
+       if (copy_from_user(vol_args, arg, sizeof(*vol_args))) {
+               ret = -EFAULT;
+               goto out;
+       }
+       vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
+       ret = btrfs_init_new_device(root, vol_args->name);
+
+out:
+       kfree(vol_args);
+       return ret;
+}
+
+static long btrfs_ioctl_rm_dev(struct btrfs_root *root, void __user *arg)
+{
+       struct btrfs_ioctl_vol_args *vol_args;
+       int ret;
+
+       if (!capable(CAP_SYS_ADMIN))
+               return -EPERM;
+
+       if (root->fs_info->sb->s_flags & MS_RDONLY)
+               return -EROFS;
+
+       vol_args = kmalloc(sizeof(*vol_args), GFP_NOFS);
+
+       if (!vol_args)
+               return -ENOMEM;
+
+       if (copy_from_user(vol_args, arg, sizeof(*vol_args))) {
+               ret = -EFAULT;
+               goto out;
+       }
+       vol_args->name[BTRFS_PATH_NAME_MAX] = '\0';
+       ret = btrfs_rm_device(root, vol_args->name);
+
+out:
+       kfree(vol_args);
+       return ret;
+}
+
+static long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+               u64 off, u64 olen, u64 destoff)
+{
+       struct inode *inode = fdentry(file)->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct file *src_file;
+       struct inode *src;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       char *buf;
+       struct btrfs_key key;
+       u32 nritems;
+       int slot;
+       int ret;
+       u64 len = olen;
+       u64 bs = root->fs_info->sb->s_blocksize;
+       u64 hint_byte;
+
+       /*
+        * TODO:
+        * - split compressed inline extents.  annoying: we need to
+        *   decompress into destination's address_space (the file offset
+        *   may change, so source mapping won't do), then recompress (or
+        *   otherwise reinsert) a subrange.
+        * - allow ranges within the same file to be cloned (provided
+        *   they don't overlap)?
+        */
+
+       /* the destination must be opened for writing */
+       if (!(file->f_mode & FMODE_WRITE))
+               return -EINVAL;
+
+       ret = mnt_want_write(file->f_path.mnt);
+       if (ret)
+               return ret;
+
+       src_file = fget(srcfd);
+       if (!src_file) {
+               ret = -EBADF;
+               goto out_drop_write;
+       }
+       src = src_file->f_dentry->d_inode;
+
+       ret = -EINVAL;
+       if (src == inode)
+               goto out_fput;
+
+       ret = -EISDIR;
+       if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
+               goto out_fput;
+
+       ret = -EXDEV;
+       if (src->i_sb != inode->i_sb || BTRFS_I(src)->root != root)
+               goto out_fput;
+
+       ret = -ENOMEM;
+       buf = vmalloc(btrfs_level_size(root, 0));
+       if (!buf)
+               goto out_fput;
+
+       path = btrfs_alloc_path();
+       if (!path) {
+               vfree(buf);
+               goto out_fput;
+       }
+       path->reada = 2;
+
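+       /* lock the two inodes in address order to avoid ABBA deadlocks */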
+       if (inode < src) {
+               mutex_lock(&inode->i_mutex);
+               mutex_lock(&src->i_mutex);
+       } else {
+               mutex_lock(&src->i_mutex);
+               mutex_lock(&inode->i_mutex);
+       }
+
+       /* determine range to clone */
+       ret = -EINVAL;
+       if (off >= src->i_size || off + len > src->i_size)
+               goto out_unlock;
+       if (len == 0)
+               olen = len = src->i_size - off;
+       /* if we extend to eof, continue to block boundary */
+       if (off + len == src->i_size)
+               len = ((src->i_size + bs-1) & ~(bs-1)) - off;
+
+       /* verify the end result is block aligned */
+       if ((off & (bs-1)) ||
+           ((off + len) & (bs-1)))
+               goto out_unlock;
+
+       /* do any pending delalloc/csum calc on src, one way or
+          another, and lock file content */
+       while (1) {
+               struct btrfs_ordered_extent *ordered;
+               lock_extent(&BTRFS_I(src)->io_tree, off, off+len, GFP_NOFS);
+               ordered = btrfs_lookup_first_ordered_extent(inode, off+len);
+               if (BTRFS_I(src)->delalloc_bytes == 0 && !ordered)
+                       break;
+               unlock_extent(&BTRFS_I(src)->io_tree, off, off+len, GFP_NOFS);
+               if (ordered)
+                       btrfs_put_ordered_extent(ordered);
+               btrfs_wait_ordered_range(src, off, off+len);
+       }
+
+       trans = btrfs_start_transaction(root, 1);
+       BUG_ON(!trans);
+
+       /* punch hole in destination first */
+       btrfs_drop_extents(trans, root, inode, off, off+len, 0, &hint_byte);
+
+       /* clone data */
+       key.objectid = src->i_ino;
+       key.type = BTRFS_EXTENT_DATA_KEY;
+       key.offset = 0;
+
+       while (1) {
+               /*
+                * note the key will change type as we walk through the
+                * tree.
+                */
+               ret = btrfs_search_slot(trans, root, &key, path, 0, 0);
+               if (ret < 0)
+                       goto out;
+
+               nritems = btrfs_header_nritems(path->nodes[0]);
+               if (path->slots[0] >= nritems) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret < 0)
+                               goto out;
+                       if (ret > 0)
+                               break;
+                       nritems = btrfs_header_nritems(path->nodes[0]);
+               }
+               leaf = path->nodes[0];
+               slot = path->slots[0];
+
+               btrfs_item_key_to_cpu(leaf, &key, slot);
+               if (btrfs_key_type(&key) > BTRFS_EXTENT_DATA_KEY ||
+                   key.objectid != src->i_ino)
+                       break;
+
+               if (btrfs_key_type(&key) == BTRFS_EXTENT_DATA_KEY) {
+                       struct btrfs_file_extent_item *extent;
+                       int type;
+                       u32 size;
+                       struct btrfs_key new_key;
+                       u64 disko = 0, diskl = 0;
+                       u64 datao = 0, datal = 0;
+                       u8 comp;
+
+                       size = btrfs_item_size_nr(leaf, slot);
+                       read_extent_buffer(leaf, buf,
+                                          btrfs_item_ptr_offset(leaf, slot),
+                                          size);
+
+                       extent = btrfs_item_ptr(leaf, slot,
+                                               struct btrfs_file_extent_item);
+                       comp = btrfs_file_extent_compression(leaf, extent);
+                       type = btrfs_file_extent_type(leaf, extent);
+                       if (type == BTRFS_FILE_EXTENT_REG) {
+                               disko = btrfs_file_extent_disk_bytenr(leaf,
+                                                                     extent);
+                               diskl = btrfs_file_extent_disk_num_bytes(leaf,
+                                                                extent);
+                               datao = btrfs_file_extent_offset(leaf, extent);
+                               datal = btrfs_file_extent_num_bytes(leaf,
+                                                                   extent);
+                       } else if (type == BTRFS_FILE_EXTENT_INLINE) {
+                               /* take upper bound, may be compressed */
+                               datal = btrfs_file_extent_ram_bytes(leaf,
+                                                                   extent);
+                       }
+                       btrfs_release_path(root, path);
+
+                       if (key.offset + datal < off ||
+                           key.offset >= off+len)
+                               goto next;
+
+                       memcpy(&new_key, &key, sizeof(new_key));
+                       new_key.objectid = inode->i_ino;
+                       new_key.offset = key.offset + destoff - off;
+
+                       if (type == BTRFS_FILE_EXTENT_REG) {
+                               ret = btrfs_insert_empty_item(trans, root, path,
+                                                             &new_key, size);
+                               if (ret)
+                                       goto out;
+
+                               leaf = path->nodes[0];
+                               slot = path->slots[0];
+                               write_extent_buffer(leaf, buf,
+                                           btrfs_item_ptr_offset(leaf, slot),
+                                           size);
+
+                               extent = btrfs_item_ptr(leaf, slot,
+                                               struct btrfs_file_extent_item);
+
+                               if (off > key.offset) {
+                                       datao += off - key.offset;
+                                       datal -= off - key.offset;
+                               }
+                               if (key.offset + datao + datal >
+                                   off + len)
+                                       datal = off + len - key.offset - datao;
+                               /* disko == 0 means it's a hole */
+                               if (!disko)
+                                       datao = 0;
+
+                               btrfs_set_file_extent_offset(leaf, extent,
+                                                            datao);
+                               btrfs_set_file_extent_num_bytes(leaf, extent,
+                                                               datal);
+                               if (disko) {
+                                       inode_add_bytes(inode, datal);
+                                       ret = btrfs_inc_extent_ref(trans, root,
+                                                  disko, diskl, leaf->start,
+                                                  root->root_key.objectid,
+                                                  trans->transid,
+                                                  inode->i_ino);
+                                       BUG_ON(ret);
+                               }
+                       } else if (type == BTRFS_FILE_EXTENT_INLINE) {
+                               u64 skip = 0;
+                               u64 trim = 0;
+                               if (off > key.offset) {
+                                       skip = off - key.offset;
+                                       new_key.offset += skip;
+                               }
+
+                               if (key.offset + datal > off+len)
+                                       trim = key.offset + datal - (off+len);
+
+                               if (comp && (skip || trim)) {
+                                       ret = -EINVAL;
+                                       goto out;
+                               }
+                               size -= skip + trim;
+                               datal -= skip + trim;
+                               ret = btrfs_insert_empty_item(trans, root, path,
+                                                             &new_key, size);
+                               if (ret)
+                                       goto out;
+
+                               if (skip) {
+                                       u32 start =
+                                         btrfs_file_extent_calc_inline_size(0);
+                                       memmove(buf+start, buf+start+skip,
+                                               datal);
+                               }
+
+                               leaf = path->nodes[0];
+                               slot = path->slots[0];
+                               write_extent_buffer(leaf, buf,
+                                           btrfs_item_ptr_offset(leaf, slot),
+                                           size);
+                               inode_add_bytes(inode, datal);
+                       }
+
+                       btrfs_mark_buffer_dirty(leaf);
+               }
+
+next:
+               btrfs_release_path(root, path);
+               key.offset++;
+       }
+       ret = 0;
+out:
+       btrfs_release_path(root, path);
+       if (ret == 0) {
+               inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+               if (destoff + olen > inode->i_size)
+                       btrfs_i_size_write(inode, destoff + olen);
+               BTRFS_I(inode)->flags = BTRFS_I(src)->flags;
+               ret = btrfs_update_inode(trans, root, inode);
+       }
+       btrfs_end_transaction(trans, root);
+       unlock_extent(&BTRFS_I(src)->io_tree, off, off+len, GFP_NOFS);
+       if (ret)
+               vmtruncate(inode, 0);
+out_unlock:
+       mutex_unlock(&src->i_mutex);
+       mutex_unlock(&inode->i_mutex);
+       vfree(buf);
+       btrfs_free_path(path);
+out_fput:
+       fput(src_file);
+out_drop_write:
+       mnt_drop_write(file->f_path.mnt);
+       return ret;
+}
+
+static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
+{
+       struct btrfs_ioctl_clone_range_args args;
+
+       if (copy_from_user(&args, argp, sizeof(args)))
+               return -EFAULT;
+       return btrfs_ioctl_clone(file, args.src_fd, args.src_offset,
+                                args.src_length, args.dest_offset);
+}
+
+/*
+ * there are many ways the trans_start and trans_end ioctls can lead
+ * to deadlocks.  They should only be used by applications that
+ * basically own the machine, and have a very in-depth understanding
+ * of all the possible deadlocks and enospc problems.
+ */
+static long btrfs_ioctl_trans_start(struct file *file)
+{
+       struct inode *inode = fdentry(file)->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       int ret = 0;
+
+       if (!capable(CAP_SYS_ADMIN))
+               return -EPERM;
+
+       if (file->private_data) {
+               ret = -EINPROGRESS;
+               goto out;
+       }
+
+       ret = mnt_want_write(file->f_path.mnt);
+       if (ret)
+               goto out;
+
+       mutex_lock(&root->fs_info->trans_mutex);
+       root->fs_info->open_ioctl_trans++;
+       mutex_unlock(&root->fs_info->trans_mutex);
+
+       trans = btrfs_start_ioctl_transaction(root, 0);
+       if (trans)
+               file->private_data = trans;
+       else
+               ret = -ENOMEM;
+       /*printk(KERN_INFO "btrfs_ioctl_trans_start on %p\n", file);*/
+out:
+       return ret;
+}
+
+/*
+ * there are many ways the trans_start and trans_end ioctls can lead
+ * to deadlocks.  They should only be used by applications that
+ * basically own the machine, and have a very in-depth understanding
+ * of all the possible deadlocks and enospc problems.
+ */
+long btrfs_ioctl_trans_end(struct file *file)
+{
+       struct inode *inode = fdentry(file)->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       int ret = 0;
+
+       trans = file->private_data;
+       if (!trans) {
+               ret = -EINVAL;
+               goto out;
+       }
+       btrfs_end_transaction(trans, root);
+       file->private_data = NULL;
+
+       mutex_lock(&root->fs_info->trans_mutex);
+       root->fs_info->open_ioctl_trans--;
+       mutex_unlock(&root->fs_info->trans_mutex);
+
+       mnt_drop_write(file->f_path.mnt);
+
+out:
+       return ret;
+}
+
+long btrfs_ioctl(struct file *file, unsigned int
+               cmd, unsigned long arg)
+{
+       struct btrfs_root *root = BTRFS_I(fdentry(file)->d_inode)->root;
+       void __user *argp = (void __user *)arg;
+
+       switch (cmd) {
+       case BTRFS_IOC_SNAP_CREATE:
+               return btrfs_ioctl_snap_create(file, argp, 0);
+       case BTRFS_IOC_SUBVOL_CREATE:
+               return btrfs_ioctl_snap_create(file, argp, 1);
+       case BTRFS_IOC_DEFRAG:
+               return btrfs_ioctl_defrag(file);
+       case BTRFS_IOC_RESIZE:
+               return btrfs_ioctl_resize(root, argp);
+       case BTRFS_IOC_ADD_DEV:
+               return btrfs_ioctl_add_dev(root, argp);
+       case BTRFS_IOC_RM_DEV:
+               return btrfs_ioctl_rm_dev(root, argp);
+       case BTRFS_IOC_BALANCE:
+               return btrfs_balance(root->fs_info->dev_root);
+       case BTRFS_IOC_CLONE:
+               return btrfs_ioctl_clone(file, arg, 0, 0, 0);
+       case BTRFS_IOC_CLONE_RANGE:
+               return btrfs_ioctl_clone_range(file, argp);
+       case BTRFS_IOC_TRANS_START:
+               return btrfs_ioctl_trans_start(file);
+       case BTRFS_IOC_TRANS_END:
+               return btrfs_ioctl_trans_end(file);
+       case BTRFS_IOC_SYNC:
+               btrfs_sync_fs(file->f_dentry->d_sb, 1);
+               return 0;
+       }
+
+       return -ENOTTY;
+}
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
new file mode 100644 (file)
index 0000000..78049ea
--- /dev/null
@@ -0,0 +1,67 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __IOCTL_
+#define __IOCTL_
+#include <linux/ioctl.h>
+
+#define BTRFS_IOCTL_MAGIC 0x94
+#define BTRFS_VOL_NAME_MAX 255
+#define BTRFS_PATH_NAME_MAX 3072
+
+struct btrfs_ioctl_vol_args {
+       __s64 fd;
+       char name[BTRFS_PATH_NAME_MAX + 1];
+};
+
+#define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
+                                  struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_DEFRAG _IOW(BTRFS_IOCTL_MAGIC, 2, \
+                                  struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_RESIZE _IOW(BTRFS_IOCTL_MAGIC, 3, \
+                                  struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
+                                  struct btrfs_ioctl_vol_args)
+/* trans start and trans end are dangerous, and only for
+ * use by applications that know how to avoid the
+ * resulting deadlocks
+ */
+#define BTRFS_IOC_TRANS_START  _IO(BTRFS_IOCTL_MAGIC, 6)
+#define BTRFS_IOC_TRANS_END    _IO(BTRFS_IOCTL_MAGIC, 7)
+#define BTRFS_IOC_SYNC         _IO(BTRFS_IOCTL_MAGIC, 8)
+
+#define BTRFS_IOC_CLONE        _IOW(BTRFS_IOCTL_MAGIC, 9, int)
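+/* illustrative userspace call (not part of this header): the clone ioctl
+ * takes the source fd as its argument and reflinks the whole file into
+ * the fd the ioctl is issued on, e.g.
+ *
+ *	if (ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd) < 0)
+ *		perror("BTRFS_IOC_CLONE");
+ *
+ * both descriptors must refer to files on the same btrfs filesystem
+ */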
+#define BTRFS_IOC_ADD_DEV _IOW(BTRFS_IOCTL_MAGIC, 10, \
+                                  struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_RM_DEV _IOW(BTRFS_IOCTL_MAGIC, 11, \
+                                  struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_BALANCE _IOW(BTRFS_IOCTL_MAGIC, 12, \
+                                  struct btrfs_ioctl_vol_args)
+struct btrfs_ioctl_clone_range_args {
+       __s64 src_fd;
+       __u64 src_offset, src_length;
+       __u64 dest_offset;
+};
+
+#define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \
+                                 struct btrfs_ioctl_clone_range_args)
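+/* illustrative userspace call (assumed names, not part of this header):
+ * clone one block-aligned megabyte from the start of src_fd into the
+ * destination; src_offset and src_length must be block aligned
+ *
+ *	struct btrfs_ioctl_clone_range_args args = {
+ *		.src_fd = src_fd,
+ *		.src_offset = 0,
+ *		.src_length = 1024 * 1024,
+ *		.dest_offset = 0,
+ *	};
+ *	ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, &args);
+ */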
+
+#define BTRFS_IOC_SUBVOL_CREATE _IOW(BTRFS_IOCTL_MAGIC, 14, \
+                                  struct btrfs_ioctl_vol_args)
+
+#endif
diff --git a/fs/btrfs/locking.c b/fs/btrfs/locking.c
new file mode 100644 (file)
index 0000000..39bae77
--- /dev/null
@@ -0,0 +1,88 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+#include <linux/sched.h>
+#include <linux/gfp.h>
+#include <linux/pagemap.h>
+#include <linux/spinlock.h>
+#include <linux/page-flags.h>
+#include <asm/bug.h>
+#include "ctree.h"
+#include "extent_io.h"
+#include "locking.h"
+
+/*
+ * locks the per buffer mutex in an extent buffer.  This uses adaptive locks
+ * and the spin is not tuned very extensively.  The spinning does make a big
+ * difference in almost every workload, but spinning for the right amount of
+ * time needs some help.
+ *
+ * In general, we want to spin as long as the lock holder is doing btree
+ * searches, and we should give up if they are in more expensive code.
+ */
+
+int btrfs_tree_lock(struct extent_buffer *eb)
+{
+       int i;
+
+       if (mutex_trylock(&eb->mutex))
+               return 0;
+       for (i = 0; i < 512; i++) {
+               cpu_relax();
+               if (mutex_trylock(&eb->mutex))
+                       return 0;
+       }
+       cpu_relax();
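+       /* spinning failed; sleep, keyed by tree level for lockdep nesting */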
+       mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(eb));
+       return 0;
+}
+
+int btrfs_try_tree_lock(struct extent_buffer *eb)
+{
+       return mutex_trylock(&eb->mutex);
+}
+
+int btrfs_tree_unlock(struct extent_buffer *eb)
+{
+       mutex_unlock(&eb->mutex);
+       return 0;
+}
+
+int btrfs_tree_locked(struct extent_buffer *eb)
+{
+       return mutex_is_locked(&eb->mutex);
+}
+
+/*
+ * btrfs_search_slot uses this to decide if it should drop its locks
+ * before doing something expensive like allocating free blocks for cow.
+ */
+int btrfs_path_lock_waiting(struct btrfs_path *path, int level)
+{
+       int i;
+       struct extent_buffer *eb;
+       for (i = level; i <= level + 1 && i < BTRFS_MAX_LEVEL; i++) {
+               eb = path->nodes[i];
+               if (!eb)
+                       break;
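+               /* make sure waiters queued by other cpus are visible */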
+               smp_mb();
+               if (!list_empty(&eb->mutex.wait_list))
+                       return 1;
+       }
+       return 0;
+}
+
diff --git a/fs/btrfs/locking.h b/fs/btrfs/locking.h
new file mode 100644 (file)
index 0000000..bc1faef
--- /dev/null
@@ -0,0 +1,27 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_LOCKING_
+#define __BTRFS_LOCKING_
+
+int btrfs_tree_lock(struct extent_buffer *eb);
+int btrfs_tree_unlock(struct extent_buffer *eb);
+int btrfs_tree_locked(struct extent_buffer *eb);
+int btrfs_try_tree_lock(struct extent_buffer *eb);
+int btrfs_path_lock_waiting(struct btrfs_path *path, int level);
+#endif
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
new file mode 100644 (file)
index 0000000..a209401
--- /dev/null
@@ -0,0 +1,730 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include <linux/writeback.h>
+#include <linux/pagevec.h>
+#include "ctree.h"
+#include "transaction.h"
+#include "btrfs_inode.h"
+#include "extent_io.h"
+
+static u64 entry_end(struct btrfs_ordered_extent *entry)
+{
+       if (entry->file_offset + entry->len < entry->file_offset)
+               return (u64)-1;
+       return entry->file_offset + entry->len;
+}
+
+/*
+ * returns NULL if the insertion worked, otherwise returns the existing
+ * node in the tree that already covers this offset
+ */
+static struct rb_node *tree_insert(struct rb_root *root, u64 file_offset,
+                                  struct rb_node *node)
+{
+       struct rb_node **p = &root->rb_node;
+       struct rb_node *parent = NULL;
+       struct btrfs_ordered_extent *entry;
+
+       while (*p) {
+               parent = *p;
+               entry = rb_entry(parent, struct btrfs_ordered_extent, rb_node);
+
+               if (file_offset < entry->file_offset)
+                       p = &(*p)->rb_left;
+               else if (file_offset >= entry_end(entry))
+                       p = &(*p)->rb_right;
+               else
+                       return parent;
+       }
+
+       rb_link_node(node, parent, p);
+       rb_insert_color(node, root);
+       return NULL;
+}
+
+/*
+ * look for a given offset in the tree, and if it can't be found return the
+ * first lesser offset
+ */
+static struct rb_node *__tree_search(struct rb_root *root, u64 file_offset,
+                                    struct rb_node **prev_ret)
+{
+       struct rb_node *n = root->rb_node;
+       struct rb_node *prev = NULL;
+       struct rb_node *test;
+       struct btrfs_ordered_extent *entry;
+       struct btrfs_ordered_extent *prev_entry = NULL;
+
+       while (n) {
+               entry = rb_entry(n, struct btrfs_ordered_extent, rb_node);
+               prev = n;
+               prev_entry = entry;
+
+               if (file_offset < entry->file_offset)
+                       n = n->rb_left;
+               else if (file_offset >= entry_end(entry))
+                       n = n->rb_right;
+               else
+                       return n;
+       }
+       if (!prev_ret)
+               return NULL;
+
+       while (prev && file_offset >= entry_end(prev_entry)) {
+               test = rb_next(prev);
+               if (!test)
+                       break;
+               prev_entry = rb_entry(test, struct btrfs_ordered_extent,
+                                     rb_node);
+               if (file_offset < entry_end(prev_entry))
+                       break;
+
+               prev = test;
+       }
+       if (prev)
+               prev_entry = rb_entry(prev, struct btrfs_ordered_extent,
+                                     rb_node);
+       while (prev && file_offset < entry_end(prev_entry)) {
+               test = rb_prev(prev);
+               if (!test)
+                       break;
+               prev_entry = rb_entry(test, struct btrfs_ordered_extent,
+                                     rb_node);
+               prev = test;
+       }
+       *prev_ret = prev;
+       return NULL;
+}
+
+/*
+ * helper to check if a given offset is inside a given entry
+ */
+static int offset_in_entry(struct btrfs_ordered_extent *entry, u64 file_offset)
+{
+       if (file_offset < entry->file_offset ||
+           entry->file_offset + entry->len <= file_offset)
+               return 0;
+       return 1;
+}
+
+/*
+ * look for the first ordered struct that has this offset, otherwise
+ * the first one less than this offset
+ */
+static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
+                                         u64 file_offset)
+{
+       struct rb_root *root = &tree->tree;
+       struct rb_node *prev;
+       struct rb_node *ret;
+       struct btrfs_ordered_extent *entry;
+
+       if (tree->last) {
+               entry = rb_entry(tree->last, struct btrfs_ordered_extent,
+                                rb_node);
+               if (offset_in_entry(entry, file_offset))
+                       return tree->last;
+       }
+       ret = __tree_search(root, file_offset, &prev);
+       if (!ret)
+               ret = prev;
+       if (ret)
+               tree->last = ret;
+       return ret;
+}
+
+/* allocate and add a new ordered_extent into the per-inode tree.
+ * file_offset is the logical offset in the file
+ *
+ * start is the disk block number of an extent already reserved in the
+ * extent allocation tree
+ *
+ * len is the length of the extent
+ *
+ * This also sets the EXTENT_ORDERED bit on the range in the inode.
+ *
+ * The tree is given a single reference on the ordered extent that was
+ * inserted.
+ */
+int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
+                            u64 start, u64 len, u64 disk_len, int type)
+{
+       struct btrfs_ordered_inode_tree *tree;
+       struct rb_node *node;
+       struct btrfs_ordered_extent *entry;
+
+       tree = &BTRFS_I(inode)->ordered_tree;
+       entry = kzalloc(sizeof(*entry), GFP_NOFS);
+       if (!entry)
+               return -ENOMEM;
+
+       mutex_lock(&tree->mutex);
+       entry->file_offset = file_offset;
+       entry->start = start;
+       entry->len = len;
+       entry->disk_len = disk_len;
+       entry->inode = inode;
+       if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
+               set_bit(type, &entry->flags);
+
+       /* one ref for the tree */
+       atomic_set(&entry->refs, 1);
+       init_waitqueue_head(&entry->wait);
+       INIT_LIST_HEAD(&entry->list);
+       INIT_LIST_HEAD(&entry->root_extent_list);
+
+       node = tree_insert(&tree->tree, file_offset,
+                          &entry->rb_node);
+       BUG_ON(node);
+
+       set_extent_ordered(&BTRFS_I(inode)->io_tree, file_offset,
+                          entry_end(entry) - 1, GFP_NOFS);
+
+       spin_lock(&BTRFS_I(inode)->root->fs_info->ordered_extent_lock);
+       list_add_tail(&entry->root_extent_list,
+                     &BTRFS_I(inode)->root->fs_info->ordered_extents);
+       spin_unlock(&BTRFS_I(inode)->root->fs_info->ordered_extent_lock);
+
+       mutex_unlock(&tree->mutex);
+       return 0;
+}
+
+/*
+ * Add a struct btrfs_ordered_sum into the list of checksums to be inserted
+ * when an ordered extent is finished.  If the list covers more than one
+ * ordered extent, it is split across the extents it spans.
+ */
+int btrfs_add_ordered_sum(struct inode *inode,
+                         struct btrfs_ordered_extent *entry,
+                         struct btrfs_ordered_sum *sum)
+{
+       struct btrfs_ordered_inode_tree *tree;
+
+       tree = &BTRFS_I(inode)->ordered_tree;
+       mutex_lock(&tree->mutex);
+       list_add_tail(&sum->list, &entry->list);
+       mutex_unlock(&tree->mutex);
+       return 0;
+}
+
+/*
+ * this is used to account for finished IO across a given range
+ * of the file.  The IO should not span ordered extents.  If
+ * a given ordered_extent is completely done, 1 is returned, otherwise
+ * 0.
+ *
+ * test_and_set_bit on a flag in the struct btrfs_ordered_extent is used
+ * to make sure this function only returns 1 once for a given ordered extent.
+ */
+int btrfs_dec_test_ordered_pending(struct inode *inode,
+                                  u64 file_offset, u64 io_size)
+{
+       struct btrfs_ordered_inode_tree *tree;
+       struct rb_node *node;
+       struct btrfs_ordered_extent *entry;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       int ret;
+
+       tree = &BTRFS_I(inode)->ordered_tree;
+       mutex_lock(&tree->mutex);
+       clear_extent_ordered(io_tree, file_offset, file_offset + io_size - 1,
+                            GFP_NOFS);
+       node = tree_search(tree, file_offset);
+       if (!node) {
+               ret = 1;
+               goto out;
+       }
+
+       entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
+       if (!offset_in_entry(entry, file_offset)) {
+               ret = 1;
+               goto out;
+       }
+
+       ret = test_range_bit(io_tree, entry->file_offset,
+                            entry->file_offset + entry->len - 1,
+                            EXTENT_ORDERED, 0);
+       if (ret == 0)
+               ret = test_and_set_bit(BTRFS_ORDERED_IO_DONE, &entry->flags);
+out:
+       mutex_unlock(&tree->mutex);
+       return ret == 0;
+}
+
+/*
+ * used to drop a reference on an ordered extent.  This will free
+ * the extent if the last reference is dropped
+ */
+int btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
+{
+       struct list_head *cur;
+       struct btrfs_ordered_sum *sum;
+
+       if (atomic_dec_and_test(&entry->refs)) {
+               while (!list_empty(&entry->list)) {
+                       cur = entry->list.next;
+                       sum = list_entry(cur, struct btrfs_ordered_sum, list);
+                       list_del(&sum->list);
+                       kfree(sum);
+               }
+               kfree(entry);
+       }
+       return 0;
+}
+
+/*
+ * remove an ordered extent from the tree.  No references are dropped,
+ * but anyone waiting on this extent is woken up.
+ */
+int btrfs_remove_ordered_extent(struct inode *inode,
+                               struct btrfs_ordered_extent *entry)
+{
+       struct btrfs_ordered_inode_tree *tree;
+       struct rb_node *node;
+
+       tree = &BTRFS_I(inode)->ordered_tree;
+       mutex_lock(&tree->mutex);
+       node = &entry->rb_node;
+       rb_erase(node, &tree->tree);
+       tree->last = NULL;
+       set_bit(BTRFS_ORDERED_COMPLETE, &entry->flags);
+
+       spin_lock(&BTRFS_I(inode)->root->fs_info->ordered_extent_lock);
+       list_del_init(&entry->root_extent_list);
+       spin_unlock(&BTRFS_I(inode)->root->fs_info->ordered_extent_lock);
+
+       mutex_unlock(&tree->mutex);
+       wake_up(&entry->wait);
+       return 0;
+}
+
+/*
+ * wait for all the ordered extents in a root.  This is done when balancing
+ * space between drives.
+ */
+int btrfs_wait_ordered_extents(struct btrfs_root *root, int nocow_only)
+{
+       struct list_head splice;
+       struct list_head *cur;
+       struct btrfs_ordered_extent *ordered;
+       struct inode *inode;
+
+       INIT_LIST_HEAD(&splice);
+
+       spin_lock(&root->fs_info->ordered_extent_lock);
+       list_splice_init(&root->fs_info->ordered_extents, &splice);
+       while (!list_empty(&splice)) {
+               cur = splice.next;
+               ordered = list_entry(cur, struct btrfs_ordered_extent,
+                                    root_extent_list);
+               if (nocow_only &&
+                   !test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags) &&
+                   !test_bit(BTRFS_ORDERED_PREALLOC, &ordered->flags)) {
+                       list_move(&ordered->root_extent_list,
+                                 &root->fs_info->ordered_extents);
+                       cond_resched_lock(&root->fs_info->ordered_extent_lock);
+                       continue;
+               }
+
+               list_del_init(&ordered->root_extent_list);
+               atomic_inc(&ordered->refs);
+
+               /*
+                * the inode may be getting freed (in sys_unlink path).
+                */
+               inode = igrab(ordered->inode);
+
+               spin_unlock(&root->fs_info->ordered_extent_lock);
+
+               if (inode) {
+                       btrfs_start_ordered_extent(inode, ordered, 1);
+                       btrfs_put_ordered_extent(ordered);
+                       iput(inode);
+               } else {
+                       btrfs_put_ordered_extent(ordered);
+               }
+
+               spin_lock(&root->fs_info->ordered_extent_lock);
+       }
+       spin_unlock(&root->fs_info->ordered_extent_lock);
+       return 0;
+}
+
+/*
+ * Used to start IO or wait for a given ordered extent to finish.
+ *
+ * If wait is one, this effectively waits on page writeback for all the pages
+ * in the extent, and it waits on the io completion code to insert
+ * metadata into the btree corresponding to the extent.
+ */
+void btrfs_start_ordered_extent(struct inode *inode,
+                                      struct btrfs_ordered_extent *entry,
+                                      int wait)
+{
+       u64 start = entry->file_offset;
+       u64 end = start + entry->len - 1;
+
+       /*
+        * pages in the range can be dirty, clean or writeback.  We
+        * start IO on any dirty ones so the wait doesn't stall waiting
+        * for pdflush to find them
+        */
+       btrfs_fdatawrite_range(inode->i_mapping, start, end, WB_SYNC_ALL);
+       if (wait) {
+               wait_event(entry->wait, test_bit(BTRFS_ORDERED_COMPLETE,
+                                                &entry->flags));
+       }
+}
+
+/*
+ * Used to wait on ordered extents across a large range of bytes.
+ */
+int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len)
+{
+       u64 end;
+       u64 orig_end;
+       u64 wait_end;
+       struct btrfs_ordered_extent *ordered;
+
+       if (start + len < start) {
+               orig_end = INT_LIMIT(loff_t);
+       } else {
+               orig_end = start + len - 1;
+               if (orig_end > INT_LIMIT(loff_t))
+                       orig_end = INT_LIMIT(loff_t);
+       }
+       wait_end = orig_end;
+again:
+       /* start IO across the range first to instantiate any delalloc
+        * extents
+        */
+       btrfs_fdatawrite_range(inode->i_mapping, start, orig_end, WB_SYNC_NONE);
+
+       /* The compression code will leave pages locked but return from
+        * writepage without setting the page writeback.  Starting again
+        * with WB_SYNC_ALL will end up waiting for the IO to actually start.
+        */
+       btrfs_fdatawrite_range(inode->i_mapping, start, orig_end, WB_SYNC_ALL);
+
+       btrfs_wait_on_page_writeback_range(inode->i_mapping,
+                                          start >> PAGE_CACHE_SHIFT,
+                                          orig_end >> PAGE_CACHE_SHIFT);
+
+       end = orig_end;
+       while (1) {
+               ordered = btrfs_lookup_first_ordered_extent(inode, end);
+               if (!ordered)
+                       break;
+               if (ordered->file_offset > orig_end) {
+                       btrfs_put_ordered_extent(ordered);
+                       break;
+               }
+               if (ordered->file_offset + ordered->len < start) {
+                       btrfs_put_ordered_extent(ordered);
+                       break;
+               }
+               btrfs_start_ordered_extent(inode, ordered, 1);
+               end = ordered->file_offset;
+               btrfs_put_ordered_extent(ordered);
+               if (end == 0 || end == start)
+                       break;
+               end--;
+       }
+       if (test_range_bit(&BTRFS_I(inode)->io_tree, start, orig_end,
+                          EXTENT_ORDERED | EXTENT_DELALLOC, 0)) {
+               schedule_timeout(1);
+               goto again;
+       }
+       return 0;
+}
+
+/*
+ * find an ordered extent corresponding to file_offset.  return NULL if
+ * nothing is found, otherwise take a reference on the extent and return it
+ */
+struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct inode *inode,
+                                                        u64 file_offset)
+{
+       struct btrfs_ordered_inode_tree *tree;
+       struct rb_node *node;
+       struct btrfs_ordered_extent *entry = NULL;
+
+       tree = &BTRFS_I(inode)->ordered_tree;
+       mutex_lock(&tree->mutex);
+       node = tree_search(tree, file_offset);
+       if (!node)
+               goto out;
+
+       entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
+       if (!offset_in_entry(entry, file_offset))
+               entry = NULL;
+       if (entry)
+               atomic_inc(&entry->refs);
+out:
+       mutex_unlock(&tree->mutex);
+       return entry;
+}
+
+/*
+ * lookup and return any extent before 'file_offset'.  NULL is returned
+ * if none is found
+ */
+struct btrfs_ordered_extent *
+btrfs_lookup_first_ordered_extent(struct inode *inode, u64 file_offset)
+{
+       struct btrfs_ordered_inode_tree *tree;
+       struct rb_node *node;
+       struct btrfs_ordered_extent *entry = NULL;
+
+       tree = &BTRFS_I(inode)->ordered_tree;
+       mutex_lock(&tree->mutex);
+       node = tree_search(tree, file_offset);
+       if (!node)
+               goto out;
+
+       entry = rb_entry(node, struct btrfs_ordered_extent, rb_node);
+       atomic_inc(&entry->refs);
+out:
+       mutex_unlock(&tree->mutex);
+       return entry;
+}
+
+/*
+ * After an extent is done, call this to conditionally update the on disk
+ * i_size.  i_size is updated to cover any fully written part of the file.
+ */
+int btrfs_ordered_update_i_size(struct inode *inode,
+                               struct btrfs_ordered_extent *ordered)
+{
+       struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
+       struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
+       u64 disk_i_size;
+       u64 new_i_size;
+       u64 i_size_test;
+       struct rb_node *node;
+       struct btrfs_ordered_extent *test;
+
+       mutex_lock(&tree->mutex);
+       disk_i_size = BTRFS_I(inode)->disk_i_size;
+
+       /*
+        * if the disk i_size is already at the inode->i_size, or
+        * this ordered extent is inside the disk i_size, we're done
+        */
+       if (disk_i_size >= inode->i_size ||
+           ordered->file_offset + ordered->len <= disk_i_size) {
+               goto out;
+       }
+
+       /*
+        * we can't update the disk_i_size if there are delalloc bytes
+        * between disk_i_size and this ordered extent
+        */
+       if (test_range_bit(io_tree, disk_i_size,
+                          ordered->file_offset + ordered->len - 1,
+                          EXTENT_DELALLOC, 0)) {
+               goto out;
+       }
+       /*
+        * walk backward from this ordered extent to disk_i_size.
+        * if we find an ordered extent then we can't update disk i_size
+        * yet
+        */
+       node = &ordered->rb_node;
+       while (1) {
+               node = rb_prev(node);
+               if (!node)
+                       break;
+               test = rb_entry(node, struct btrfs_ordered_extent, rb_node);
+               if (test->file_offset + test->len <= disk_i_size)
+                       break;
+               if (test->file_offset >= inode->i_size)
+                       break;
+               if (test->file_offset >= disk_i_size)
+                       goto out;
+       }
+       new_i_size = min_t(u64, entry_end(ordered), i_size_read(inode));
+
+       /*
+        * at this point, we know we can safely update i_size to at least
+        * the offset from this ordered extent.  But, we need to
+        * walk forward and see if ios from higher up in the file have
+        * finished.
+        */
+       node = rb_next(&ordered->rb_node);
+       i_size_test = 0;
+       if (node) {
+               /*
+                * do we have an area where IO might have finished
+                * between our ordered extent and the next one?
+                */
+               test = rb_entry(node, struct btrfs_ordered_extent, rb_node);
+               if (test->file_offset > entry_end(ordered))
+                       i_size_test = test->file_offset;
+       } else {
+               i_size_test = i_size_read(inode);
+       }
+
+       /*
+        * i_size_test is the end of a region after this ordered
+        * extent where there are no ordered extents.  As long as there
+        * are no delalloc bytes in this area, it is safe to update
+        * disk_i_size to the end of the region.
+        */
+       if (i_size_test > entry_end(ordered) &&
+           !test_range_bit(io_tree, entry_end(ordered), i_size_test - 1,
+                          EXTENT_DELALLOC, 0)) {
+               new_i_size = min_t(u64, i_size_test, i_size_read(inode));
+       }
+       BTRFS_I(inode)->disk_i_size = new_i_size;
+out:
+       mutex_unlock(&tree->mutex);
+       return 0;
+}
+
+/*
+ * search the ordered extents for one corresponding to 'offset' and
+ * try to find a checksum.  This is used because we allow pages to
+ * be reclaimed before their checksum is actually put into the btree
+ */
+int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr,
+                          u32 *sum)
+{
+       struct btrfs_ordered_sum *ordered_sum;
+       struct btrfs_sector_sum *sector_sums;
+       struct btrfs_ordered_extent *ordered;
+       struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree;
+       struct list_head *cur;
+       unsigned long num_sectors;
+       unsigned long i;
+       u32 sectorsize = BTRFS_I(inode)->root->sectorsize;
+       int ret = 1;
+
+       ordered = btrfs_lookup_ordered_extent(inode, offset);
+       if (!ordered)
+               return 1;
+
+       mutex_lock(&tree->mutex);
+       list_for_each_prev(cur, &ordered->list) {
+               ordered_sum = list_entry(cur, struct btrfs_ordered_sum, list);
+               if (disk_bytenr >= ordered_sum->bytenr) {
+                       num_sectors = ordered_sum->len / sectorsize;
+                       sector_sums = ordered_sum->sums;
+                       for (i = 0; i < num_sectors; i++) {
+                               if (sector_sums[i].bytenr == disk_bytenr) {
+                                       *sum = sector_sums[i].sum;
+                                       ret = 0;
+                                       goto out;
+                               }
+                       }
+               }
+       }
+out:
+       mutex_unlock(&tree->mutex);
+       btrfs_put_ordered_extent(ordered);
+       return ret;
+}
+
+/**
+ * taken from mm/filemap.c because it isn't exported
+ *
+ * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
+ * @mapping:   address space structure to write
+ * @start:     offset in bytes where the range starts
+ * @end:       offset in bytes where the range ends (inclusive)
+ * @sync_mode: enable synchronous operation
+ *
+ * Start writeback against all of a mapping's dirty pages that lie
+ * within the byte offsets <start, end> inclusive.
+ *
+ * If sync_mode is WB_SYNC_ALL then this is a "data integrity" operation, as
+ * opposed to a regular memory cleansing writeback.  The difference between
+ * these two operations is that if a dirty page/buffer is encountered, it must
+ * be waited upon, and not just skipped over.
+ */
+int btrfs_fdatawrite_range(struct address_space *mapping, loff_t start,
+                          loff_t end, int sync_mode)
+{
+       struct writeback_control wbc = {
+               .sync_mode = sync_mode,
+               .nr_to_write = mapping->nrpages * 2,
+               .range_start = start,
+               .range_end = end,
+               .for_writepages = 1,
+       };
+       return btrfs_writepages(mapping, &wbc);
+}
+
+/**
+ * taken from mm/filemap.c because it isn't exported
+ *
+ * wait_on_page_writeback_range - wait for writeback to complete
+ * @mapping:   target address_space
+ * @start:     beginning page index
+ * @end:       ending page index
+ *
+ * Wait for writeback to complete against pages indexed by start->end
+ * inclusive
+ */
+int btrfs_wait_on_page_writeback_range(struct address_space *mapping,
+                                      pgoff_t start, pgoff_t end)
+{
+       struct pagevec pvec;
+       int nr_pages;
+       int ret = 0;
+       pgoff_t index;
+
+       if (end < start)
+               return 0;
+
+       pagevec_init(&pvec, 0);
+       index = start;
+       while ((index <= end) &&
+                       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
+                       PAGECACHE_TAG_WRITEBACK,
+                       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)) != 0) {
+               unsigned i;
+
+               for (i = 0; i < nr_pages; i++) {
+                       struct page *page = pvec.pages[i];
+
+                       /* until radix tree lookup accepts end_index */
+                       if (page->index > end)
+                               continue;
+
+                       wait_on_page_writeback(page);
+                       if (PageError(page))
+                               ret = -EIO;
+               }
+               pagevec_release(&pvec);
+               cond_resched();
+       }
+
+       /* Check for outstanding write errors */
+       if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
+               ret = -ENOSPC;
+       if (test_and_clear_bit(AS_EIO, &mapping->flags))
+               ret = -EIO;
+
+       return ret;
+}
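
entry_end() above saturates to (u64)-1 instead of wrapping, so an extent whose length would overflow the offset space still sorts and compares sanely. A standalone check of that guard and of the offset_in_entry() containment test, with stdint types standing in for the kernel's u64:

#include <assert.h>
#include <stdint.h>

struct ordered_extent {
        uint64_t file_offset;
        uint64_t len;
};

static uint64_t entry_end(const struct ordered_extent *e)
{
        /* saturate instead of wrapping on overflow */
        if (e->file_offset + e->len < e->file_offset)
                return (uint64_t)-1;
        return e->file_offset + e->len;
}

static int offset_in_entry(const struct ordered_extent *e, uint64_t off)
{
        return off >= e->file_offset && off < entry_end(e);
}

int main(void)
{
        struct ordered_extent normal = { 4096, 8192 };
        struct ordered_extent huge = { (uint64_t)-2, 16 };

        assert(entry_end(&normal) == 12288);
        assert(offset_in_entry(&normal, 4096));
        assert(!offset_in_entry(&normal, 12288));
        /* would wrap to 14 without the saturation check */
        assert(entry_end(&huge) == (uint64_t)-1);
        return 0;
}
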
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
new file mode 100644 (file)
index 0000000..ab66d5e
--- /dev/null
@@ -0,0 +1,158 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_ORDERED_DATA__
+#define __BTRFS_ORDERED_DATA__
+
+/* one of these per inode */
+struct btrfs_ordered_inode_tree {
+       struct mutex mutex;
+       struct rb_root tree;
+       struct rb_node *last;
+};
+
+/*
+ * these are used to collect checksums done just before bios submission.
+ * They are attached via a list into the ordered extent, and
+ * checksum items are inserted into the tree after all the blocks in
+ * the ordered extent are on disk
+ */
+struct btrfs_sector_sum {
+       /* bytenr on disk */
+       u64 bytenr;
+       u32 sum;
+};
+
+struct btrfs_ordered_sum {
+       /* bytenr is the start of this extent on disk */
+       u64 bytenr;
+
+       /*
+        * this is the length in bytes covered by the sums array below.
+        */
+       unsigned long len;
+       struct list_head list;
+       /* last field is a variable length array of btrfs_sector_sums */
+       struct btrfs_sector_sum sums[];
+};
+
+/*
+ * bits for the flags field:
+ *
+ * BTRFS_ORDERED_IO_DONE is set when all of the blocks are written.
+ * It is used to make sure metadata is inserted into the tree only once
+ * per extent.
+ *
+ * BTRFS_ORDERED_COMPLETE is set when the extent is removed from the
+ * rbtree, just before waking any waiters.  It is used to indicate the
+ * IO is done and any metadata is inserted into the tree.
+ */
+#define BTRFS_ORDERED_IO_DONE 0 /* set when all the pages are written */
+
+#define BTRFS_ORDERED_COMPLETE 1 /* set when removed from the tree */
+
+#define BTRFS_ORDERED_NOCOW 2 /* set when we want to write in place */
+
+#define BTRFS_ORDERED_COMPRESSED 3 /* writing a compressed extent */
+
+#define BTRFS_ORDERED_PREALLOC 4 /* set when writing to prealloced extent */
+
+struct btrfs_ordered_extent {
+       /* logical offset in the file */
+       u64 file_offset;
+
+       /* disk byte number */
+       u64 start;
+
+       /* ram length of the extent in bytes */
+       u64 len;
+
+       /* extent length on disk */
+       u64 disk_len;
+
+       /* flags (described above) */
+       unsigned long flags;
+
+       /* reference count */
+       atomic_t refs;
+
+       /* the inode we belong to */
+       struct inode *inode;
+
+       /* list of checksums for insertion when the extent io is done */
+       struct list_head list;
+
+       /* used to wait for the BTRFS_ORDERED_COMPLETE bit */
+       wait_queue_head_t wait;
+
+       /* our friendly rbtree entry */
+       struct rb_node rb_node;
+
+       /* a per root list of all the pending ordered extents */
+       struct list_head root_extent_list;
+};
+
+/*
+ * calculates the total size you need to allocate for an ordered sum
+ * structure spanning 'bytes' in the file
+ */
+static inline int btrfs_ordered_sum_size(struct btrfs_root *root,
+                                        unsigned long bytes)
+{
+       unsigned long num_sectors = (bytes + root->sectorsize - 1) /
+               root->sectorsize;
+       num_sectors++;
+       return sizeof(struct btrfs_ordered_sum) +
+               num_sectors * sizeof(struct btrfs_sector_sum);
+}
+
+static inline void
+btrfs_ordered_inode_tree_init(struct btrfs_ordered_inode_tree *t)
+{
+       mutex_init(&t->mutex);
+       t->tree.rb_node = NULL;
+       t->last = NULL;
+}
+
+int btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry);
+int btrfs_remove_ordered_extent(struct inode *inode,
+                               struct btrfs_ordered_extent *entry);
+int btrfs_dec_test_ordered_pending(struct inode *inode,
+                                      u64 file_offset, u64 io_size);
+int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
+                            u64 start, u64 len, u64 disk_len, int type);
+int btrfs_add_ordered_sum(struct inode *inode,
+                         struct btrfs_ordered_extent *entry,
+                         struct btrfs_ordered_sum *sum);
+struct btrfs_ordered_extent *btrfs_lookup_ordered_extent(struct inode *inode,
+                                                        u64 file_offset);
+void btrfs_start_ordered_extent(struct inode *inode,
+                               struct btrfs_ordered_extent *entry, int wait);
+int btrfs_wait_ordered_range(struct inode *inode, u64 start, u64 len);
+struct btrfs_ordered_extent *
+btrfs_lookup_first_ordered_extent(struct inode *inode, u64 file_offset);
+int btrfs_ordered_update_i_size(struct inode *inode,
+                               struct btrfs_ordered_extent *ordered);
+int btrfs_find_ordered_sum(struct inode *inode, u64 offset, u64 disk_bytenr,
+                          u32 *sum);
+int btrfs_wait_on_page_writeback_range(struct address_space *mapping,
+                                      pgoff_t start, pgoff_t end);
+int btrfs_fdatawrite_range(struct address_space *mapping, loff_t start,
+                          loff_t end, int sync_mode);
+int btrfs_wait_ordered_extents(struct btrfs_root *root, int nocow_only);
+#endif
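
btrfs_ordered_sum_size() sizes a single allocation holding the fixed header plus a flexible array of per-sector sums: the byte count is rounded up to whole sectors and one spare slot is added. The same arithmetic as a self-contained program; the names here are stand-ins, not the btrfs types:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sector_sum {
        uint64_t bytenr;
        uint32_t sum;
};

struct ordered_sum {
        uint64_t bytenr;
        unsigned long len;
        struct sector_sum sums[];       /* flexible array member */
};

static size_t ordered_sum_size(unsigned long bytes, uint32_t sectorsize)
{
        /* round up to whole sectors, plus one spare slot */
        unsigned long num_sectors = (bytes + sectorsize - 1) / sectorsize + 1;

        return sizeof(struct ordered_sum) +
               num_sectors * sizeof(struct sector_sum);
}

int main(void)
{
        size_t sz = ordered_sum_size(10000, 4096);  /* 3 sectors + 1 spare */
        struct ordered_sum *s = malloc(sz);

        if (!s)
                return 1;
        memset(s, 0, sz);
        printf("allocation: %zu bytes\n", sz);
        free(s);
        return 0;
}
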
diff --git a/fs/btrfs/orphan.c b/fs/btrfs/orphan.c
new file mode 100644 (file)
index 0000000..3c0d52a
--- /dev/null
@@ -0,0 +1,67 @@
+/*
+ * Copyright (C) 2008 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+
+int btrfs_insert_orphan_item(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root, u64 offset)
+{
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       int ret = 0;
+
+       key.objectid = BTRFS_ORPHAN_OBJECTID;
+       btrfs_set_key_type(&key, BTRFS_ORPHAN_ITEM_KEY);
+       key.offset = offset;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_del_orphan_item(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, u64 offset)
+{
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       int ret = 0;
+
+       key.objectid = BTRFS_ORPHAN_OBJECTID;
+       btrfs_set_key_type(&key, BTRFS_ORPHAN_ITEM_KEY);
+       key.offset = offset;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret)
+               goto out;
+
+       ret = btrfs_del_item(trans, root, path);
+
+out:
+       btrfs_free_path(path);
+       return ret;
+}
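
All orphan items share the single objectid BTRFS_ORPHAN_OBJECTID and differ only in key.offset, so insertion and deletion reduce to exact-match operations on the (objectid, type, offset) triple. A sketch of the lexicographic comparison such keys rely on, using a simplified struct rather than the on-disk btrfs_key layout:

#include <stdint.h>

struct key {
        uint64_t objectid;
        uint8_t type;
        uint64_t offset;
};

/* lexicographic compare: objectid first, then type, then offset */
static int key_cmp(const struct key *a, const struct key *b)
{
        if (a->objectid != b->objectid)
                return a->objectid < b->objectid ? -1 : 1;
        if (a->type != b->type)
                return a->type < b->type ? -1 : 1;
        if (a->offset != b->offset)
                return a->offset < b->offset ? -1 : 1;
        return 0;
}

btrfs_del_orphan_item() above searches for exactly such a triple and deletes whatever slot the search lands on.
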
diff --git a/fs/btrfs/print-tree.c b/fs/btrfs/print-tree.c
new file mode 100644 (file)
index 0000000..5f8f218
--- /dev/null
@@ -0,0 +1,216 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "print-tree.h"
+
+static void print_chunk(struct extent_buffer *eb, struct btrfs_chunk *chunk)
+{
+       int num_stripes = btrfs_chunk_num_stripes(eb, chunk);
+       int i;
+       printk(KERN_INFO "\t\tchunk length %llu owner %llu type %llu "
+              "num_stripes %d\n",
+              (unsigned long long)btrfs_chunk_length(eb, chunk),
+              (unsigned long long)btrfs_chunk_owner(eb, chunk),
+              (unsigned long long)btrfs_chunk_type(eb, chunk),
+              num_stripes);
+       for (i = 0; i < num_stripes; i++) {
+               printk(KERN_INFO "\t\t\tstripe %d devid %llu offset %llu\n", i,
+                     (unsigned long long)btrfs_stripe_devid_nr(eb, chunk, i),
+                     (unsigned long long)btrfs_stripe_offset_nr(eb, chunk, i));
+       }
+}
+
+static void print_dev_item(struct extent_buffer *eb,
+                          struct btrfs_dev_item *dev_item)
+{
+       printk(KERN_INFO "\t\tdev item devid %llu "
+              "total_bytes %llu bytes used %llu\n",
+              (unsigned long long)btrfs_device_id(eb, dev_item),
+              (unsigned long long)btrfs_device_total_bytes(eb, dev_item),
+              (unsigned long long)btrfs_device_bytes_used(eb, dev_item));
+}
+
+void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
+{
+       int i;
+       u32 nr = btrfs_header_nritems(l);
+       struct btrfs_item *item;
+       struct btrfs_extent_item *ei;
+       struct btrfs_root_item *ri;
+       struct btrfs_dir_item *di;
+       struct btrfs_inode_item *ii;
+       struct btrfs_block_group_item *bi;
+       struct btrfs_file_extent_item *fi;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       struct btrfs_extent_ref *ref;
+       struct btrfs_dev_extent *dev_extent;
+       u32 type;
+
+       printk(KERN_INFO "leaf %llu total ptrs %d free space %d\n",
+               (unsigned long long)btrfs_header_bytenr(l), nr,
+               btrfs_leaf_free_space(root, l));
+       for (i = 0; i < nr; i++) {
+               item = btrfs_item_nr(l, i);
+               btrfs_item_key_to_cpu(l, &key, i);
+               type = btrfs_key_type(&key);
+               printk(KERN_INFO "\titem %d key (%llu %x %llu) itemoff %d "
+                      "itemsize %d\n",
+                       i,
+                       (unsigned long long)key.objectid, type,
+                       (unsigned long long)key.offset,
+                       btrfs_item_offset(l, item), btrfs_item_size(l, item));
+               switch (type) {
+               case BTRFS_INODE_ITEM_KEY:
+                       ii = btrfs_item_ptr(l, i, struct btrfs_inode_item);
+                       printk(KERN_INFO "\t\tinode generation %llu size %llu "
+                              "mode %o\n",
+                              (unsigned long long)
+                              btrfs_inode_generation(l, ii),
+                             (unsigned long long)btrfs_inode_size(l, ii),
+                              btrfs_inode_mode(l, ii));
+                       break;
+               case BTRFS_DIR_ITEM_KEY:
+                       di = btrfs_item_ptr(l, i, struct btrfs_dir_item);
+                       btrfs_dir_item_key_to_cpu(l, di, &found_key);
+                       printk(KERN_INFO "\t\tdir oid %llu type %u\n",
+                               (unsigned long long)found_key.objectid,
+                               btrfs_dir_type(l, di));
+                       break;
+               case BTRFS_ROOT_ITEM_KEY:
+                       ri = btrfs_item_ptr(l, i, struct btrfs_root_item);
+                       printk(KERN_INFO "\t\troot data bytenr %llu refs %u\n",
+                               (unsigned long long)
+                               btrfs_disk_root_bytenr(l, ri),
+                               btrfs_disk_root_refs(l, ri));
+                       break;
+               case BTRFS_EXTENT_ITEM_KEY:
+                       ei = btrfs_item_ptr(l, i, struct btrfs_extent_item);
+                       printk(KERN_INFO "\t\textent data refs %u\n",
+                               btrfs_extent_refs(l, ei));
+                       break;
+               case BTRFS_EXTENT_REF_KEY:
+                       ref = btrfs_item_ptr(l, i, struct btrfs_extent_ref);
+                       printk(KERN_INFO "\t\textent back ref root %llu "
+                              "gen %llu owner %llu num_refs %lu\n",
+                              (unsigned long long)btrfs_ref_root(l, ref),
+                              (unsigned long long)btrfs_ref_generation(l, ref),
+                              (unsigned long long)btrfs_ref_objectid(l, ref),
+                              (unsigned long)btrfs_ref_num_refs(l, ref));
+                       break;
+
+               case BTRFS_EXTENT_DATA_KEY:
+                       fi = btrfs_item_ptr(l, i,
+                                           struct btrfs_file_extent_item);
+                       if (btrfs_file_extent_type(l, fi) ==
+                           BTRFS_FILE_EXTENT_INLINE) {
+                               printk(KERN_INFO "\t\tinline extent data "
+                                      "size %u\n",
+                                      btrfs_file_extent_inline_len(l, fi));
+                               break;
+                       }
+                       printk(KERN_INFO "\t\textent data disk bytenr %llu "
+                              "nr %llu\n",
+                              (unsigned long long)
+                              btrfs_file_extent_disk_bytenr(l, fi),
+                              (unsigned long long)
+                              btrfs_file_extent_disk_num_bytes(l, fi));
+                       printk(KERN_INFO "\t\textent data offset %llu "
+                              "nr %llu ram %llu\n",
+                              (unsigned long long)
+                              btrfs_file_extent_offset(l, fi),
+                              (unsigned long long)
+                              btrfs_file_extent_num_bytes(l, fi),
+                              (unsigned long long)
+                              btrfs_file_extent_ram_bytes(l, fi));
+                       break;
+               case BTRFS_BLOCK_GROUP_ITEM_KEY:
+                       bi = btrfs_item_ptr(l, i,
+                                           struct btrfs_block_group_item);
+                       printk(KERN_INFO "\t\tblock group used %llu\n",
+                              (unsigned long long)
+                              btrfs_disk_block_group_used(l, bi));
+                       break;
+               case BTRFS_CHUNK_ITEM_KEY:
+                       print_chunk(l, btrfs_item_ptr(l, i,
+                                                     struct btrfs_chunk));
+                       break;
+               case BTRFS_DEV_ITEM_KEY:
+                       print_dev_item(l, btrfs_item_ptr(l, i,
+                                       struct btrfs_dev_item));
+                       break;
+               case BTRFS_DEV_EXTENT_KEY:
+                       dev_extent = btrfs_item_ptr(l, i,
+                                                   struct btrfs_dev_extent);
+                       printk(KERN_INFO "\t\tdev extent chunk_tree %llu\n"
+                              "\t\tchunk objectid %llu chunk offset %llu "
+                              "length %llu\n",
+                              (unsigned long long)
+                              btrfs_dev_extent_chunk_tree(l, dev_extent),
+                              (unsigned long long)
+                              btrfs_dev_extent_chunk_objectid(l, dev_extent),
+                              (unsigned long long)
+                              btrfs_dev_extent_chunk_offset(l, dev_extent),
+                              (unsigned long long)
+                              btrfs_dev_extent_length(l, dev_extent));
+               }
+       }
+}
+
+void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *c)
+{
+       int i;
+       u32 nr;
+       struct btrfs_key key;
+       int level;
+
+       if (!c)
+               return;
+       nr = btrfs_header_nritems(c);
+       level = btrfs_header_level(c);
+       if (level == 0) {
+               btrfs_print_leaf(root, c);
+               return;
+       }
+       printk(KERN_INFO "node %llu level %d total ptrs %d free spc %u\n",
+              (unsigned long long)btrfs_header_bytenr(c),
+              btrfs_header_level(c), nr,
+              (u32)BTRFS_NODEPTRS_PER_BLOCK(root) - nr);
+       for (i = 0; i < nr; i++) {
+               btrfs_node_key_to_cpu(c, &key, i);
+               printk(KERN_INFO "\tkey %d (%llu %u %llu) block %llu\n",
+                      i,
+                      (unsigned long long)key.objectid,
+                      key.type,
+                      (unsigned long long)key.offset,
+                      (unsigned long long)btrfs_node_blockptr(c, i));
+       }
+       for (i = 0; i < nr; i++) {
+               struct extent_buffer *next = read_tree_block(root,
+                                       btrfs_node_blockptr(c, i),
+                                       btrfs_level_size(root, level - 1),
+                                       btrfs_node_ptr_generation(c, i));
+               if (btrfs_is_leaf(next) &&
+                   btrfs_header_level(c) != 1)
+                       BUG();
+               if (btrfs_header_level(next) !=
+                       btrfs_header_level(c) - 1)
+                       BUG();
+               btrfs_print_tree(root, next);
+               free_extent_buffer(next);
+       }
+}
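
btrfs_print_tree() recurses from interior nodes toward leaves and BUG()s if a child does not sit exactly one level below its parent. A generic userspace version of the same walk over a small n-ary tree, reporting corruption instead of crashing; all names here are hypothetical:

#include <stdio.h>

#define MAX_CHILDREN 4

struct tnode {
        int level;                        /* 0 means leaf */
        int nr;                           /* number of children/items */
        struct tnode *child[MAX_CHILDREN];
};

static void print_tree(const struct tnode *n, int depth)
{
        int i;

        if (!n)
                return;
        printf("%*snode level %d ptrs %d\n", depth * 2, "", n->level, n->nr);
        if (n->level == 0)
                return;                   /* leaf: nothing below */
        for (i = 0; i < n->nr; i++) {
                /* children must sit exactly one level down */
                if (n->child[i]->level != n->level - 1) {
                        fprintf(stderr, "corrupt link at depth %d\n", depth);
                        return;
                }
                print_tree(n->child[i], depth + 1);
        }
}
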
diff --git a/fs/btrfs/print-tree.h b/fs/btrfs/print-tree.h
new file mode 100644 (file)
index 0000000..da75efe
--- /dev/null
@@ -0,0 +1,23 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __PRINT_TREE_
+#define __PRINT_TREE_
+void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l);
+void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *t);
+#endif
diff --git a/fs/btrfs/ref-cache.c b/fs/btrfs/ref-cache.c
new file mode 100644 (file)
index 0000000..6f0acc4
--- /dev/null
@@ -0,0 +1,230 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include "ctree.h"
+#include "ref-cache.h"
+#include "transaction.h"
+
+/*
+ * leaf refs are used to cache the information about which extents
+ * a given leaf has references on.  This allows us to process that leaf
+ * in btrfs_drop_snapshot without needing to read it back from disk.
+ */
+
+/*
+ * kmalloc a leaf reference struct and update the counters for the
+ * total ref cache size
+ */
+struct btrfs_leaf_ref *btrfs_alloc_leaf_ref(struct btrfs_root *root,
+                                           int nr_extents)
+{
+       struct btrfs_leaf_ref *ref;
+       size_t size = btrfs_leaf_ref_size(nr_extents);
+
+       ref = kmalloc(size, GFP_NOFS);
+       if (ref) {
+               spin_lock(&root->fs_info->ref_cache_lock);
+               root->fs_info->total_ref_cache_size += size;
+               spin_unlock(&root->fs_info->ref_cache_lock);
+
+               memset(ref, 0, sizeof(*ref));
+               atomic_set(&ref->usage, 1);
+               INIT_LIST_HEAD(&ref->list);
+       }
+       return ref;
+}
+
+/*
+ * free a leaf reference struct and update the counters for the
+ * total ref cache size
+ */
+void btrfs_free_leaf_ref(struct btrfs_root *root, struct btrfs_leaf_ref *ref)
+{
+       if (!ref)
+               return;
+       WARN_ON(atomic_read(&ref->usage) == 0);
+       if (atomic_dec_and_test(&ref->usage)) {
+               size_t size = btrfs_leaf_ref_size(ref->nritems);
+
+               BUG_ON(ref->in_tree);
+               kfree(ref);
+
+               spin_lock(&root->fs_info->ref_cache_lock);
+               root->fs_info->total_ref_cache_size -= size;
+               spin_unlock(&root->fs_info->ref_cache_lock);
+       }
+}
+
+static struct rb_node *tree_insert(struct rb_root *root, u64 bytenr,
+                                  struct rb_node *node)
+{
+       struct rb_node **p = &root->rb_node;
+       struct rb_node *parent = NULL;
+       struct btrfs_leaf_ref *entry;
+
+       while (*p) {
+               parent = *p;
+               entry = rb_entry(parent, struct btrfs_leaf_ref, rb_node);
+
+               if (bytenr < entry->bytenr)
+                       p = &(*p)->rb_left;
+               else if (bytenr > entry->bytenr)
+                       p = &(*p)->rb_right;
+               else
+                       return parent;
+       }
+
+       entry = rb_entry(node, struct btrfs_leaf_ref, rb_node);
+       rb_link_node(node, parent, p);
+       rb_insert_color(node, root);
+       return NULL;
+}
+
+static struct rb_node *tree_search(struct rb_root *root, u64 bytenr)
+{
+       struct rb_node *n = root->rb_node;
+       struct btrfs_leaf_ref *entry;
+
+       while (n) {
+               entry = rb_entry(n, struct btrfs_leaf_ref, rb_node);
+               WARN_ON(!entry->in_tree);
+
+               if (bytenr < entry->bytenr)
+                       n = n->rb_left;
+               else if (bytenr > entry->bytenr)
+                       n = n->rb_right;
+               else
+                       return n;
+       }
+       return NULL;
+}
+
+int btrfs_remove_leaf_refs(struct btrfs_root *root, u64 max_root_gen,
+                          int shared)
+{
+       struct btrfs_leaf_ref *ref = NULL;
+       struct btrfs_leaf_ref_tree *tree = root->ref_tree;
+
+       if (shared)
+               tree = &root->fs_info->shared_ref_tree;
+       if (!tree)
+               return 0;
+
+       spin_lock(&tree->lock);
+       while (!list_empty(&tree->list)) {
+               ref = list_entry(tree->list.next, struct btrfs_leaf_ref, list);
+               BUG_ON(ref->tree != tree);
+               if (ref->root_gen > max_root_gen)
+                       break;
+               if (!xchg(&ref->in_tree, 0)) {
+                       cond_resched_lock(&tree->lock);
+                       continue;
+               }
+
+               rb_erase(&ref->rb_node, &tree->root);
+               list_del_init(&ref->list);
+
+               spin_unlock(&tree->lock);
+               btrfs_free_leaf_ref(root, ref);
+               cond_resched();
+               spin_lock(&tree->lock);
+       }
+       spin_unlock(&tree->lock);
+       return 0;
+}
+
+/*
+ * find the leaf ref for a given extent.  This returns the ref struct with
+ * a usage reference incremented
+ */
+struct btrfs_leaf_ref *btrfs_lookup_leaf_ref(struct btrfs_root *root,
+                                            u64 bytenr)
+{
+       struct rb_node *rb;
+       struct btrfs_leaf_ref *ref = NULL;
+       struct btrfs_leaf_ref_tree *tree = root->ref_tree;
+again:
+       if (tree) {
+               spin_lock(&tree->lock);
+               rb = tree_search(&tree->root, bytenr);
+               if (rb)
+                       ref = rb_entry(rb, struct btrfs_leaf_ref, rb_node);
+               if (ref)
+                       atomic_inc(&ref->usage);
+               spin_unlock(&tree->lock);
+               if (ref)
+                       return ref;
+       }
+       if (tree != &root->fs_info->shared_ref_tree) {
+               tree = &root->fs_info->shared_ref_tree;
+               goto again;
+       }
+       return NULL;
+}
+
+/*
+ * add a fully filled in leaf ref struct
+ * remove all the refs older than a given root generation
+ */
+int btrfs_add_leaf_ref(struct btrfs_root *root, struct btrfs_leaf_ref *ref,
+                      int shared)
+{
+       int ret = 0;
+       struct rb_node *rb;
+       struct btrfs_leaf_ref_tree *tree = root->ref_tree;
+
+       if (shared)
+               tree = &root->fs_info->shared_ref_tree;
+
+       spin_lock(&tree->lock);
+       rb = tree_insert(&tree->root, ref->bytenr, &ref->rb_node);
+       if (rb) {
+               ret = -EEXIST;
+       } else {
+               atomic_inc(&ref->usage);
+               ref->tree = tree;
+               ref->in_tree = 1;
+               list_add_tail(&ref->list, &tree->list);
+       }
+       spin_unlock(&tree->lock);
+       return ret;
+}
+
+/*
+ * remove a single leaf ref from the tree.  This drops the ref held by the tree
+ * only
+ */
+int btrfs_remove_leaf_ref(struct btrfs_root *root, struct btrfs_leaf_ref *ref)
+{
+       struct btrfs_leaf_ref_tree *tree;
+
+       if (!xchg(&ref->in_tree, 0))
+               return 0;
+
+       tree = ref->tree;
+       spin_lock(&tree->lock);
+
+       rb_erase(&ref->rb_node, &tree->root);
+       list_del_init(&ref->list);
+
+       spin_unlock(&tree->lock);
+
+       btrfs_free_leaf_ref(root, ref);
+       return 0;
+}
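
The ref cache couples a per-ref usage count with a global total_ref_cache_size that grows on allocation and shrinks on the final put. A userspace analogue using C11 atomics; the counter and type names are this sketch's own, not the fs_info fields:

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

static _Atomic size_t total_cache_size;   /* stands in for the global counter */

struct cache_ref {
        atomic_int usage;
        size_t size;
        /* payload would follow */
};

struct cache_ref *cache_ref_alloc(size_t payload)
{
        size_t size = sizeof(struct cache_ref) + payload;
        struct cache_ref *ref = malloc(size);

        if (!ref)
                return NULL;
        memset(ref, 0, size);
        atomic_init(&ref->usage, 1);      /* the caller's reference */
        ref->size = size;
        atomic_fetch_add(&total_cache_size, size);
        return ref;
}

void cache_ref_put(struct cache_ref *ref)
{
        if (!ref)
                return;
        /* the last put frees the object and drops its accounted size */
        if (atomic_fetch_sub(&ref->usage, 1) == 1) {
                atomic_fetch_sub(&total_cache_size, ref->size);
                free(ref);
        }
}
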
diff --git a/fs/btrfs/ref-cache.h b/fs/btrfs/ref-cache.h
new file mode 100644 (file)
index 0000000..16f3183
--- /dev/null
@@ -0,0 +1,77 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+#ifndef __REFCACHE__
+#define __REFCACHE__
+
+struct btrfs_extent_info {
+       /* bytenr and num_bytes find the extent in the extent allocation tree */
+       u64 bytenr;
+       u64 num_bytes;
+
+       /* objectid and offset find the back reference for the file */
+       u64 objectid;
+       u64 offset;
+};
+
+struct btrfs_leaf_ref {
+       struct rb_node rb_node;
+       struct btrfs_leaf_ref_tree *tree;
+       int in_tree;
+       atomic_t usage;
+
+       u64 root_gen;
+       u64 bytenr;
+       u64 owner;
+       u64 generation;
+       int nritems;
+
+       struct list_head list;
+       struct btrfs_extent_info extents[];
+};
+
+static inline size_t btrfs_leaf_ref_size(int nr_extents)
+{
+       return sizeof(struct btrfs_leaf_ref) +
+              sizeof(struct btrfs_extent_info) * nr_extents;
+}
+
+static inline void btrfs_leaf_ref_tree_init(struct btrfs_leaf_ref_tree *tree)
+{
+       tree->root.rb_node = NULL;
+       INIT_LIST_HEAD(&tree->list);
+       spin_lock_init(&tree->lock);
+}
+
+static inline int btrfs_leaf_ref_tree_empty(struct btrfs_leaf_ref_tree *tree)
+{
+       return RB_EMPTY_ROOT(&tree->root);
+}
+
+struct btrfs_leaf_ref *btrfs_alloc_leaf_ref(struct btrfs_root *root,
+                                           int nr_extents);
+void btrfs_free_leaf_ref(struct btrfs_root *root, struct btrfs_leaf_ref *ref);
+struct btrfs_leaf_ref *btrfs_lookup_leaf_ref(struct btrfs_root *root,
+                                            u64 bytenr);
+int btrfs_add_leaf_ref(struct btrfs_root *root, struct btrfs_leaf_ref *ref,
+                      int shared);
+int btrfs_remove_leaf_refs(struct btrfs_root *root, u64 max_root_gen,
+                          int shared);
+int btrfs_remove_leaf_ref(struct btrfs_root *root, struct btrfs_leaf_ref *ref);
+
+#endif
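
The root-tree.c diff that follows finds the newest copy of a root by searching for (objectid, BTRFS_ROOT_ITEM_KEY, (u64)-1), which lands one slot past every matching key, and then stepping back a slot. The same trick over a sorted array standing in for a btree leaf; the names are illustrative only:

#include <stdint.h>

struct root_key {
        uint64_t objectid;
        uint64_t offset;      /* generation; the highest offset is newest */
};

/*
 * return the index of the entry with the highest offset for 'objectid',
 * or -1 if the objectid has no entries.  keys[] is sorted by
 * (objectid, offset), the way btree leaves are.
 */
static int find_last_root(const struct root_key *keys, int n,
                          uint64_t objectid)
{
        int lo = 0, hi = n;   /* find first index with objectid > target */

        while (lo < hi) {
                int mid = lo + (hi - lo) / 2;

                if (keys[mid].objectid <= objectid)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        /* step back one slot, as btrfs_find_last_root() does */
        if (lo == 0 || keys[lo - 1].objectid != objectid)
                return -1;
        return lo - 1;
}
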
diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
new file mode 100644 (file)
index 0000000..b48650d
--- /dev/null
@@ -0,0 +1,366 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include "ctree.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "print-tree.h"
+
+/*
+ * search forward for a root, starting with objectid 'search_start'.
+ * if a root key is found, the objectid we find is filled into 'found_objectid'
+ * and 0 is returned.  < 0 is returned on error, 1 if there is nothing
+ * left in the tree.
+ */
+int btrfs_search_root(struct btrfs_root *root, u64 search_start,
+                     u64 *found_objectid)
+{
+       struct btrfs_path *path;
+       struct btrfs_key search_key;
+       int ret;
+
+       root = root->fs_info->tree_root;
+       search_key.objectid = search_start;
+       search_key.type = (u8)-1;
+       search_key.offset = (u64)-1;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+again:
+       ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
+       if (ret < 0)
+               goto out;
+       if (ret == 0) {
+               ret = 1;
+               goto out;
+       }
+       if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
+               ret = btrfs_next_leaf(root, path);
+               if (ret)
+                       goto out;
+       }
+       btrfs_item_key_to_cpu(path->nodes[0], &search_key, path->slots[0]);
+       if (search_key.type != BTRFS_ROOT_ITEM_KEY) {
+               search_key.offset++;
+               btrfs_release_path(root, path);
+               goto again;
+       }
+       ret = 0;
+       *found_objectid = search_key.objectid;
+
+out:
+       btrfs_free_path(path);
+       return ret;
+}
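+
+/*
+ * Sketch of a typical caller (illustrative, not code from this commit):
+ * walking every root objectid in the tree of tree roots:
+ *
+ *	u64 objectid = 0;
+ *	u64 found;
+ *
+ *	while (btrfs_search_root(root, objectid, &found) == 0) {
+ *		(use 'found' here)
+ *		objectid = found + 1;
+ *	}
+ */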
+
+/*
+ * look up the root with the highest offset for a given objectid.  The key we
+ * find is copied into 'key'.  Returns 0 if something was found, 1 if nothing
+ * was found, and < 0 on error.
+ */
+int btrfs_find_last_root(struct btrfs_root *root, u64 objectid,
+                       struct btrfs_root_item *item, struct btrfs_key *key)
+{
+       struct btrfs_path *path;
+       struct btrfs_key search_key;
+       struct btrfs_key found_key;
+       struct extent_buffer *l;
+       int ret;
+       int slot;
+
+       search_key.objectid = objectid;
+       search_key.type = BTRFS_ROOT_ITEM_KEY;
+       search_key.offset = (u64)-1;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
+       if (ret < 0)
+               goto out;
+
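+       /*
+        * offset (u64)-1 can never be an exact match, so btrfs_search_slot
+        * returns > 0 positioned just past the last key that sorts before
+        * the search key.  Stepping back one slot therefore lands on the
+        * highest offset stored for this objectid, if any.
+        */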
+       BUG_ON(ret == 0);
+       l = path->nodes[0];
+       BUG_ON(path->slots[0] == 0);
+       slot = path->slots[0] - 1;
+       btrfs_item_key_to_cpu(l, &found_key, slot);
+       if (found_key.objectid != objectid) {
+               ret = 1;
+               goto out;
+       }
+       read_extent_buffer(l, item, btrfs_item_ptr_offset(l, slot),
+                          sizeof(*item));
+       memcpy(key, &found_key, sizeof(found_key));
+       ret = 0;
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+/*
+ * copy the data in 'item' into the btree
+ */
+int btrfs_update_root(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *key, struct btrfs_root_item
+                     *item)
+{
+       struct btrfs_path *path;
+       struct extent_buffer *l;
+       int ret;
+       int slot;
+       unsigned long ptr;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_search_slot(trans, root, key, path, 0, 1);
+       if (ret < 0)
+               goto out;
+
+       if (ret != 0) {
+               btrfs_print_leaf(root, path->nodes[0]);
+               printk(KERN_CRIT "unable to update root key %llu %u %llu\n",
+                      (unsigned long long)key->objectid, key->type,
+                      (unsigned long long)key->offset);
+               BUG_ON(1);
+       }
+
+       l = path->nodes[0];
+       slot = path->slots[0];
+       ptr = btrfs_item_ptr_offset(l, slot);
+       write_extent_buffer(l, item, ptr, sizeof(*item));
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+       btrfs_release_path(root, path);
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_insert_root(struct btrfs_trans_handle *trans, struct btrfs_root
+                     *root, struct btrfs_key *key, struct btrfs_root_item
+                     *item)
+{
+       int ret;
+       ret = btrfs_insert_item(trans, root, key, item, sizeof(*item));
+       return ret;
+}
+
+/*
+ * at mount time we want to find all the old transaction snapshots that were in
+ * the process of being deleted when we crashed.  This is any root item with an
+ * offset lower than the latest root.  They need to be queued for deletion to
+ * finish what was happening when we crashed.
+ */
+int btrfs_find_dead_roots(struct btrfs_root *root, u64 objectid,
+                         struct btrfs_root *latest)
+{
+       struct btrfs_root *dead_root;
+       struct btrfs_item *item;
+       struct btrfs_root_item *ri;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       struct btrfs_path *path;
+       int ret;
+       u32 nritems;
+       struct extent_buffer *leaf;
+       int slot;
+
+       key.objectid = objectid;
+       btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
+       key.offset = 0;
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+again:
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto err;
+       while (1) {
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               slot = path->slots[0];
+               if (slot >= nritems) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret)
+                               break;
+                       leaf = path->nodes[0];
+                       nritems = btrfs_header_nritems(leaf);
+                       slot = path->slots[0];
+               }
+               item = btrfs_item_nr(leaf, slot);
+               btrfs_item_key_to_cpu(leaf, &key, slot);
+               if (btrfs_key_type(&key) != BTRFS_ROOT_ITEM_KEY)
+                       goto next;
+
+               if (key.objectid < objectid)
+                       goto next;
+
+               if (key.objectid > objectid)
+                       break;
+
+               ri = btrfs_item_ptr(leaf, slot, struct btrfs_root_item);
+               if (btrfs_disk_root_refs(leaf, ri) != 0)
+                       goto next;
+
+               memcpy(&found_key, &key, sizeof(key));
+               key.offset++;
+               btrfs_release_path(root, path);
+               dead_root =
+                       btrfs_read_fs_root_no_radix(root->fs_info->tree_root,
+                                                   &found_key);
+               if (IS_ERR(dead_root)) {
+                       ret = PTR_ERR(dead_root);
+                       goto err;
+               }
+
+               if (objectid == BTRFS_TREE_RELOC_OBJECTID)
+                       ret = btrfs_add_dead_reloc_root(dead_root);
+               else
+                       ret = btrfs_add_dead_root(dead_root, latest);
+               if (ret)
+                       goto err;
+               goto again;
+next:
+               slot++;
+               path->slots[0]++;
+       }
+       ret = 0;
+err:
+       btrfs_free_path(path);
+       return ret;
+}
+
+/* drop the root item for 'key' from 'root' */
+int btrfs_del_root(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+                  struct btrfs_key *key)
+{
+       struct btrfs_path *path;
+       int ret;
+       u32 refs;
+       struct btrfs_root_item *ri;
+       struct extent_buffer *leaf;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_search_slot(trans, root, key, path, -1, 1);
+       if (ret < 0)
+               goto out;
+
+       BUG_ON(ret != 0);
+       leaf = path->nodes[0];
+       ri = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_item);
+
+       refs = btrfs_disk_root_refs(leaf, ri);
+       BUG_ON(refs != 0);
+       ret = btrfs_del_item(trans, root, path);
+out:
+       btrfs_release_path(root, path);
+       btrfs_free_path(path);
+       return ret;
+}
+
+#if 0 /* this will get used when snapshot deletion is implemented */
+int btrfs_del_root_ref(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *tree_root,
+                      u64 root_id, u8 type, u64 ref_id)
+{
+       struct btrfs_key key;
+       int ret;
+       struct btrfs_path *path;
+
+       path = btrfs_alloc_path();
+
+       key.objectid = root_id;
+       key.type = type;
+       key.offset = ref_id;
+
+       ret = btrfs_search_slot(trans, tree_root, &key, path, -1, 1);
+       BUG_ON(ret);
+
+       ret = btrfs_del_item(trans, tree_root, path);
+       BUG_ON(ret);
+
+       btrfs_free_path(path);
+       return ret;
+}
+#endif
+
+int btrfs_find_root_ref(struct btrfs_root *tree_root,
+                       struct btrfs_path *path,
+                       u64 root_id, u64 ref_id)
+{
+       struct btrfs_key key;
+       int ret;
+
+       key.objectid = root_id;
+       key.type = BTRFS_ROOT_REF_KEY;
+       key.offset = ref_id;
+
+       ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
+       return ret;
+}
+
+/*
+ * add a btrfs_root_ref item.  type is either BTRFS_ROOT_REF_KEY
+ * or BTRFS_ROOT_BACKREF_KEY.
+ *
+ * The dirid, sequence, name and name_len refer to the directory entry
+ * that is referencing the root.
+ *
+ * For a forward ref, the root_id is the id of the tree referencing
+ * the root and ref_id is the id of the subvol or snapshot.
+ *
+ * For a back ref the root_id is the id of the subvol or snapshot and
+ * ref_id is the id of the tree referencing it.
+ */
+int btrfs_add_root_ref(struct btrfs_trans_handle *trans,
+                      struct btrfs_root *tree_root,
+                      u64 root_id, u8 type, u64 ref_id,
+                      u64 dirid, u64 sequence,
+                      const char *name, int name_len)
+{
+       struct btrfs_key key;
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_root_ref *ref;
+       struct extent_buffer *leaf;
+       unsigned long ptr;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       key.objectid = root_id;
+       key.type = type;
+       key.offset = ref_id;
+
+       ret = btrfs_insert_empty_item(trans, tree_root, path, &key,
+                                     sizeof(*ref) + name_len);
+       BUG_ON(ret);
+
+       leaf = path->nodes[0];
+       ref = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_root_ref);
+       btrfs_set_root_ref_dirid(leaf, ref, dirid);
+       btrfs_set_root_ref_sequence(leaf, ref, sequence);
+       btrfs_set_root_ref_name_len(leaf, ref, name_len);
+       ptr = (unsigned long)(ref + 1);
+       write_extent_buffer(leaf, name, ptr, name_len);
+       btrfs_mark_buffer_dirty(leaf);
+
+       btrfs_free_path(path);
+       return ret;
+}
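+
+/*
+ * An illustrative sketch (snapshot creation itself is outside this file):
+ * a subvolume or snapshot would typically get a forward ref and a back
+ * ref added in the same transaction, roughly:
+ *
+ *	btrfs_add_root_ref(trans, tree_root, parent_root_id,
+ *			   BTRFS_ROOT_REF_KEY, snap_root_id,
+ *			   dirid, sequence, name, name_len);
+ *	btrfs_add_root_ref(trans, tree_root, snap_root_id,
+ *			   BTRFS_ROOT_BACKREF_KEY, parent_root_id,
+ *			   dirid, sequence, name, name_len);
+ *
+ * where parent_root_id and snap_root_id are placeholder names for the
+ * two tree ids involved.
+ */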
diff --git a/fs/btrfs/struct-funcs.c b/fs/btrfs/struct-funcs.c
new file mode 100644 (file)
index 0000000..c0f7eca
--- /dev/null
@@ -0,0 +1,139 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/highmem.h>
+
+/* this is some deeply nasty code.  ctree.h has a different
+ * definition for this BTRFS_SETGET_FUNCS macro, behind a #ifndef
+ *
+ * The end result is that anyone who #includes ctree.h gets a
+ * declaration for the btrfs_set_foo functions and btrfs_foo functions
+ *
+ * This file declares the macro and then #includes ctree.h, which results
+ * in cpp creating the functions here based on the template below.
+ *
+ * These setget functions do all the extent_buffer related mapping
+ * required to efficiently read and write specific fields in the extent
+ * buffers.  Every pointer to metadata items in btrfs is really just
+ * an unsigned long offset into the extent buffer which has been
+ * cast to a specific type.  This gives us all the gcc type checking.
+ *
+ * The extent buffer api is used to do all the kmapping and page
+ * spanning work required to get extent buffers in highmem and have
+ * a metadata blocksize different from the page size.
+ *
+ * The macro starts with a simple function prototype declaration so that
+ * sparse won't complain about it being static.
+ */
+
+#define BTRFS_SETGET_FUNCS(name, type, member, bits)                   \
+u##bits btrfs_##name(struct extent_buffer *eb, type *s);               \
+void btrfs_set_##name(struct extent_buffer *eb, type *s, u##bits val); \
+u##bits btrfs_##name(struct extent_buffer *eb,                         \
+                                  type *s)                             \
+{                                                                      \
+       unsigned long part_offset = (unsigned long)s;                   \
+       unsigned long offset = part_offset + offsetof(type, member);    \
+       type *p;                                                        \
+       /* ugly, but we want the fast path here */                      \
+       if (eb->map_token && offset >= eb->map_start &&                 \
+           offset + sizeof(((type *)0)->member) <= eb->map_start +     \
+           eb->map_len) {                                              \
+               p = (type *)(eb->kaddr + part_offset - eb->map_start);  \
+               return le##bits##_to_cpu(p->member);                    \
+       }                                                               \
+       {                                                               \
+               int err;                                                \
+               char *map_token;                                        \
+               char *kaddr;                                            \
+               int unmap_on_exit = (eb->map_token == NULL);            \
+               unsigned long map_start;                                \
+               unsigned long map_len;                                  \
+               u##bits res;                                            \
+               err = map_extent_buffer(eb, offset,                     \
+                               sizeof(((type *)0)->member),            \
+                               &map_token, &kaddr,                     \
+                               &map_start, &map_len, KM_USER1);        \
+               if (err) {                                              \
+                       __le##bits leres;                               \
+                       read_eb_member(eb, s, type, member, &leres);    \
+                       return le##bits##_to_cpu(leres);                \
+               }                                                       \
+               p = (type *)(kaddr + part_offset - map_start);          \
+               res = le##bits##_to_cpu(p->member);                     \
+               if (unmap_on_exit)                                      \
+                       unmap_extent_buffer(eb, map_token, KM_USER1);   \
+               return res;                                             \
+       }                                                               \
+}                                                                      \
+void btrfs_set_##name(struct extent_buffer *eb,                                \
+                                   type *s, u##bits val)               \
+{                                                                      \
+       unsigned long part_offset = (unsigned long)s;                   \
+       unsigned long offset = part_offset + offsetof(type, member);    \
+       type *p;                                                        \
+       /* ugly, but we want the fast path here */                      \
+       if (eb->map_token && offset >= eb->map_start &&                 \
+           offset + sizeof(((type *)0)->member) <= eb->map_start +     \
+           eb->map_len) {                                              \
+               p = (type *)(eb->kaddr + part_offset - eb->map_start);  \
+               p->member = cpu_to_le##bits(val);                       \
+               return;                                                 \
+       }                                                               \
+       {                                                               \
+               int err;                                                \
+               char *map_token;                                        \
+               char *kaddr;                                            \
+               int unmap_on_exit = (eb->map_token == NULL);            \
+               unsigned long map_start;                                \
+               unsigned long map_len;                                  \
+               err = map_extent_buffer(eb, offset,                     \
+                               sizeof(((type *)0)->member),            \
+                               &map_token, &kaddr,                     \
+                               &map_start, &map_len, KM_USER1);        \
+               if (err) {                                              \
+                       __le##bits val2;                                \
+                       val2 = cpu_to_le##bits(val);                    \
+                       write_eb_member(eb, s, type, member, &val2);    \
+                       return;                                         \
+               }                                                       \
+               p = (type *)(kaddr + part_offset - map_start);          \
+               p->member = cpu_to_le##bits(val);                       \
+               if (unmap_on_exit)                                      \
+                       unmap_extent_buffer(eb, map_token, KM_USER1);   \
+       }                                                               \
+}
+
+#include "ctree.h"
+
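+/*
+ * To illustrate the expansion: ctree.h invokes, for example,
+ *
+ *	BTRFS_SETGET_FUNCS(disk_root_refs, struct btrfs_root_item, refs, 32);
+ *
+ * which generates here
+ *
+ *	u32 btrfs_disk_root_refs(struct extent_buffer *eb,
+ *				 struct btrfs_root_item *s);
+ *	void btrfs_set_disk_root_refs(struct extent_buffer *eb,
+ *				      struct btrfs_root_item *s, u32 val);
+ *
+ * reading and writing the little-endian 'refs' member in place inside the
+ * extent buffer (see btrfs_del_root() in root-tree.c above for a caller).
+ */
+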
+void btrfs_node_key(struct extent_buffer *eb,
+                   struct btrfs_disk_key *disk_key, int nr)
+{
+       unsigned long ptr = btrfs_node_key_ptr_offset(nr);
+       if (eb->map_token && ptr >= eb->map_start &&
+           ptr + sizeof(*disk_key) <= eb->map_start + eb->map_len) {
+               memcpy(disk_key, eb->kaddr + ptr - eb->map_start,
+                       sizeof(*disk_key));
+               return;
+       } else if (eb->map_token) {
+               unmap_extent_buffer(eb, eb->map_token, KM_USER1);
+               eb->map_token = NULL;
+       }
+       read_eb_member(eb, (struct btrfs_key_ptr *)ptr,
+                      struct btrfs_key_ptr, key, disk_key);
+}
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
new file mode 100644 (file)
index 0000000..b4c101d
--- /dev/null
@@ -0,0 +1,720 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/blkdev.h>
+#include <linux/module.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/smp_lock.h>
+#include <linux/backing-dev.h>
+#include <linux/mount.h>
+#include <linux/mpage.h>
+#include <linux/swap.h>
+#include <linux/writeback.h>
+#include <linux/statfs.h>
+#include <linux/compat.h>
+#include <linux/parser.h>
+#include <linux/ctype.h>
+#include <linux/namei.h>
+#include <linux/miscdevice.h>
+#include <linux/version.h>
+#include "compat.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "btrfs_inode.h"
+#include "ioctl.h"
+#include "print-tree.h"
+#include "xattr.h"
+#include "volumes.h"
+#include "version.h"
+#include "export.h"
+#include "compression.h"
+
+#define BTRFS_SUPER_MAGIC 0x9123683E
+
+static struct super_operations btrfs_super_ops;
+
+static void btrfs_put_super(struct super_block *sb)
+{
+       struct btrfs_root *root = btrfs_sb(sb);
+       int ret;
+
+       ret = close_ctree(root);
+       sb->s_fs_info = NULL;
+}
+
+enum {
+       Opt_degraded, Opt_subvol, Opt_device, Opt_nodatasum, Opt_nodatacow,
+       Opt_max_extent, Opt_max_inline, Opt_alloc_start, Opt_nobarrier,
+       Opt_ssd, Opt_thread_pool, Opt_noacl, Opt_compress, Opt_err,
+};
+
+static match_table_t tokens = {
+       {Opt_degraded, "degraded"},
+       {Opt_subvol, "subvol=%s"},
+       {Opt_device, "device=%s"},
+       {Opt_nodatasum, "nodatasum"},
+       {Opt_nodatacow, "nodatacow"},
+       {Opt_nobarrier, "nobarrier"},
+       {Opt_max_extent, "max_extent=%s"},
+       {Opt_max_inline, "max_inline=%s"},
+       {Opt_alloc_start, "alloc_start=%s"},
+       {Opt_thread_pool, "thread_pool=%d"},
+       {Opt_compress, "compress"},
+       {Opt_ssd, "ssd"},
+       {Opt_noacl, "noacl"},
+       {Opt_err, NULL},
+};
+
+u64 btrfs_parse_size(char *str)
+{
+       u64 res;
+       int mult = 1;
+       char *end;
+       char last;
+
+       res = simple_strtoul(str, &end, 10);
+
+       last = end[0];
+       if (isalpha(last)) {
+               last = tolower(last);
+               switch (last) {
+               case 'g':
+                       mult *= 1024;
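+                       /* fall through */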
+               case 'm':
+                       mult *= 1024;
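+                       /* fall through */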
+               case 'k':
+                       mult *= 1024;
+               }
+               res = res * mult;
+       }
+       return res;
+}
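+
+/*
+ * Worked examples (illustration only): "64k" parses to 65536 and "10m"
+ * to 10485760, via the intentional case fall-through above; a bare
+ * "4096" stays 4096, and an unrecognized suffix leaves the number
+ * unchanged.
+ */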
+
+/*
+ * Regular mount options parser.  Everything that is needed only when
+ * reading in a new superblock is parsed here.
+ */
+int btrfs_parse_options(struct btrfs_root *root, char *options)
+{
+       struct btrfs_fs_info *info = root->fs_info;
+       substring_t args[MAX_OPT_ARGS];
+       char *p, *num;
+       int intarg;
+
+       if (!options)
+               return 0;
+
+       /*
+        * strsep changes the string, duplicate it because parse_options
+        * gets called twice
+        */
+       options = kstrdup(options, GFP_NOFS);
+       if (!options)
+               return -ENOMEM;
+
+       while ((p = strsep(&options, ",")) != NULL) {
+               int token;
+               if (!*p)
+                       continue;
+
+               token = match_token(p, tokens, args);
+               switch (token) {
+               case Opt_degraded:
+                       printk(KERN_INFO "btrfs: allowing degraded mounts\n");
+                       btrfs_set_opt(info->mount_opt, DEGRADED);
+                       break;
+               case Opt_subvol:
+               case Opt_device:
+                       /*
+                        * These are parsed by btrfs_parse_early_options
+                        * and can be happily ignored here.
+                        */
+                       break;
+               case Opt_nodatasum:
+                       printk(KERN_INFO "btrfs: setting nodatasum\n");
+                       btrfs_set_opt(info->mount_opt, NODATASUM);
+                       break;
+               case Opt_nodatacow:
+                       printk(KERN_INFO "btrfs: setting nodatacow\n");
+                       btrfs_set_opt(info->mount_opt, NODATACOW);
+                       btrfs_set_opt(info->mount_opt, NODATASUM);
+                       break;
+               case Opt_compress:
+                       printk(KERN_INFO "btrfs: use compression\n");
+                       btrfs_set_opt(info->mount_opt, COMPRESS);
+                       break;
+               case Opt_ssd:
+                       printk(KERN_INFO "btrfs: use ssd allocation scheme\n");
+                       btrfs_set_opt(info->mount_opt, SSD);
+                       break;
+               case Opt_nobarrier:
+                       printk(KERN_INFO "btrfs: turning off barriers\n");
+                       btrfs_set_opt(info->mount_opt, NOBARRIER);
+                       break;
+               case Opt_thread_pool:
+                       intarg = 0;
+                       match_int(&args[0], &intarg);
+                       if (intarg) {
+                               info->thread_pool_size = intarg;
+                               printk(KERN_INFO "btrfs: thread pool %d\n",
+                                      info->thread_pool_size);
+                       }
+                       break;
+               case Opt_max_extent:
+                       num = match_strdup(&args[0]);
+                       if (num) {
+                               info->max_extent = btrfs_parse_size(num);
+                               kfree(num);
+
+                               info->max_extent = max_t(u64,
+                                       info->max_extent, root->sectorsize);
+                               printk(KERN_INFO "btrfs: max_extent at %llu\n",
+                                      info->max_extent);
+                       }
+                       break;
+               case Opt_max_inline:
+                       num = match_strdup(&args[0]);
+                       if (num) {
+                               info->max_inline = btrfs_parse_size(num);
+                               kfree(num);
+
+                               if (info->max_inline) {
+                                       info->max_inline = max_t(u64,
+                                               info->max_inline,
+                                               root->sectorsize);
+                               }
+                               printk(KERN_INFO "btrfs: max_inline at %llu\n",
+                                       info->max_inline);
+                       }
+                       break;
+               case Opt_alloc_start:
+                       num = match_strdup(&args[0]);
+                       if (num) {
+                               info->alloc_start = btrfs_parse_size(num);
+                               kfree(num);
+                               printk(KERN_INFO
+                                       "btrfs: allocations start at %llu\n",
+                                       info->alloc_start);
+                       }
+                       break;
+               case Opt_noacl:
+                       root->fs_info->sb->s_flags &= ~MS_POSIXACL;
+                       break;
+               default:
+                       break;
+               }
+       }
+       kfree(options);
+       return 0;
+}
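+
+/*
+ * A hypothetical invocation (not taken from this commit):
+ *
+ *	mount -t btrfs -o compress,ssd,thread_pool=8,max_inline=4k /dev/sdb /mnt
+ *
+ * sets the COMPRESS and SSD bits in mount_opt, a thread pool of 8, and an
+ * inline extent limit of 4096 bytes (clamped to at least the sector size).
+ */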
+
+/*
+ * Parse mount options that are required early in the mount process.
+ *
+ * All other options will be parsed much later in the mount process and
+ * only when we need to allocate a new super block.
+ */
+static int btrfs_parse_early_options(const char *options, fmode_t flags,
+               void *holder, char **subvol_name,
+               struct btrfs_fs_devices **fs_devices)
+{
+       substring_t args[MAX_OPT_ARGS];
+       char *opts, *p;
+       int error = 0;
+
+       if (!options)
+               goto out;
+
+       /*
+        * strsep changes the string, duplicate it because parse_options
+        * gets called twice
+        */
+       opts = kstrdup(options, GFP_KERNEL);
+       if (!opts)
+               return -ENOMEM;
+
+       while ((p = strsep(&opts, ",")) != NULL) {
+               int token;
+               if (!*p)
+                       continue;
+
+               token = match_token(p, tokens, args);
+               switch (token) {
+               case Opt_subvol:
+                       *subvol_name = match_strdup(&args[0]);
+                       break;
+               case Opt_device:
+                       error = btrfs_scan_one_device(match_strdup(&args[0]),
+                                       flags, holder, fs_devices);
+                       if (error)
+                               goto out_free_opts;
+                       break;
+               default:
+                       break;
+               }
+       }
+
+ out_free_opts:
+       kfree(opts);
+ out:
+       /*
+        * If no subvolume name is specified we use the default one.  Allocate
+        * a copy of the string "." here so that code later in the
+        * mount path doesn't care if it's the default volume or another one.
+        */
+       if (!*subvol_name) {
+               *subvol_name = kstrdup(".", GFP_KERNEL);
+               if (!*subvol_name)
+                       return -ENOMEM;
+       }
+       return error;
+}
+
+static int btrfs_fill_super(struct super_block *sb,
+                           struct btrfs_fs_devices *fs_devices,
+                           void *data, int silent)
+{
+       struct inode *inode;
+       struct dentry *root_dentry;
+       struct btrfs_super_block *disk_super;
+       struct btrfs_root *tree_root;
+       struct btrfs_inode *bi;
+       int err;
+
+       sb->s_maxbytes = MAX_LFS_FILESIZE;
+       sb->s_magic = BTRFS_SUPER_MAGIC;
+       sb->s_op = &btrfs_super_ops;
+       sb->s_export_op = &btrfs_export_ops;
+       sb->s_xattr = btrfs_xattr_handlers;
+       sb->s_time_gran = 1;
+       sb->s_flags |= MS_POSIXACL;
+
+       tree_root = open_ctree(sb, fs_devices, (char *)data);
+
+       if (IS_ERR(tree_root)) {
+               printk("btrfs: open_ctree failed\n");
+               return PTR_ERR(tree_root);
+       }
+       sb->s_fs_info = tree_root;
+       disk_super = &tree_root->fs_info->super_copy;
+       inode = btrfs_iget_locked(sb, BTRFS_FIRST_FREE_OBJECTID,
+                                 tree_root->fs_info->fs_root);
+       if (!inode) {
+               err = -ENOMEM;
+               goto fail_close;
+       }
+       bi = BTRFS_I(inode);
+       bi->location.objectid = inode->i_ino;
+       bi->location.offset = 0;
+       bi->root = tree_root->fs_info->fs_root;
+
+       btrfs_set_key_type(&bi->location, BTRFS_INODE_ITEM_KEY);
+
+       if (inode->i_state & I_NEW) {
+               btrfs_read_locked_inode(inode);
+               unlock_new_inode(inode);
+       }
+
+       root_dentry = d_alloc_root(inode);
+       if (!root_dentry) {
+               iput(inode);
+               err = -ENOMEM;
+               goto fail_close;
+       }
+#if 0
+       /* this does the super kobj at the same time */
+       err = btrfs_sysfs_add_super(tree_root->fs_info);
+       if (err)
+               goto fail_close;
+#endif
+
+       sb->s_root = root_dentry;
+
+       save_mount_options(sb, data);
+       return 0;
+
+fail_close:
+       close_ctree(tree_root);
+       return err;
+}
+
+int btrfs_sync_fs(struct super_block *sb, int wait)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root;
+       int ret;
+       root = btrfs_sb(sb);
+
+       if (sb->s_flags & MS_RDONLY)
+               return 0;
+
+       sb->s_dirt = 0;
+       if (!wait) {
+               filemap_flush(root->fs_info->btree_inode->i_mapping);
+               return 0;
+       }
+
+       btrfs_start_delalloc_inodes(root);
+       btrfs_wait_ordered_extents(root, 0);
+
+       btrfs_clean_old_snapshots(root);
+       trans = btrfs_start_transaction(root, 1);
+       ret = btrfs_commit_transaction(trans, root);
+       sb->s_dirt = 0;
+       return ret;
+}
+
+static void btrfs_write_super(struct super_block *sb)
+{
+       sb->s_dirt = 0;
+}
+
+static int btrfs_test_super(struct super_block *s, void *data)
+{
+       struct btrfs_fs_devices *test_fs_devices = data;
+       struct btrfs_root *root = btrfs_sb(s);
+
+       return root->fs_info->fs_devices == test_fs_devices;
+}
+
+/*
+ * Find a superblock for the given device / mount point.
+ *
+ * Note:  This is based on get_sb_bdev from fs/super.c with a few additions
+ *       for multiple device setup.  Make sure to keep it in sync.
+ */
+static int btrfs_get_sb(struct file_system_type *fs_type, int flags,
+               const char *dev_name, void *data, struct vfsmount *mnt)
+{
+       char *subvol_name = NULL;
+       struct block_device *bdev = NULL;
+       struct super_block *s;
+       struct dentry *root;
+       struct btrfs_fs_devices *fs_devices = NULL;
+       fmode_t mode = FMODE_READ;
+       int error = 0;
+
+       if (!(flags & MS_RDONLY))
+               mode |= FMODE_WRITE;
+
+       error = btrfs_parse_early_options(data, mode, fs_type,
+                                         &subvol_name, &fs_devices);
+       if (error)
+               return error;
+
+       error = btrfs_scan_one_device(dev_name, mode, fs_type, &fs_devices);
+       if (error)
+               goto error_free_subvol_name;
+
+       error = btrfs_open_devices(fs_devices, mode, fs_type);
+       if (error)
+               goto error_free_subvol_name;
+
+       if (!(flags & MS_RDONLY) && fs_devices->rw_devices == 0) {
+               error = -EACCES;
+               goto error_close_devices;
+       }
+
+       bdev = fs_devices->latest_bdev;
+       s = sget(fs_type, btrfs_test_super, set_anon_super, fs_devices);
+       if (IS_ERR(s))
+               goto error_s;
+
+       if (s->s_root) {
+               if ((flags ^ s->s_flags) & MS_RDONLY) {
+                       up_write(&s->s_umount);
+                       deactivate_super(s);
+                       error = -EBUSY;
+                       goto error_close_devices;
+               }
+
+               btrfs_close_devices(fs_devices);
+       } else {
+               char b[BDEVNAME_SIZE];
+
+               s->s_flags = flags;
+               strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
+               error = btrfs_fill_super(s, fs_devices, data,
+                                        flags & MS_SILENT ? 1 : 0);
+               if (error) {
+                       up_write(&s->s_umount);
+                       deactivate_super(s);
+                       goto error_free_subvol_name;
+               }
+
+               btrfs_sb(s)->fs_info->bdev_holder = fs_type;
+               s->s_flags |= MS_ACTIVE;
+       }
+
+       if (!strcmp(subvol_name, "."))
+               root = dget(s->s_root);
+       else {
+               mutex_lock(&s->s_root->d_inode->i_mutex);
+               root = lookup_one_len(subvol_name, s->s_root,
+                                     strlen(subvol_name));
+               mutex_unlock(&s->s_root->d_inode->i_mutex);
+
+               if (IS_ERR(root)) {
+                       up_write(&s->s_umount);
+                       deactivate_super(s);
+                       error = PTR_ERR(root);
+                       goto error_free_subvol_name;
+               }
+               if (!root->d_inode) {
+                       dput(root);
+                       up_write(&s->s_umount);
+                       deactivate_super(s);
+                       error = -ENXIO;
+                       goto error_free_subvol_name;
+               }
+       }
+
+       mnt->mnt_sb = s;
+       mnt->mnt_root = root;
+
+       kfree(subvol_name);
+       return 0;
+
+error_s:
+       error = PTR_ERR(s);
+error_close_devices:
+       btrfs_close_devices(fs_devices);
+error_free_subvol_name:
+       kfree(subvol_name);
+       return error;
+}
+
+static int btrfs_remount(struct super_block *sb, int *flags, char *data)
+{
+       struct btrfs_root *root = btrfs_sb(sb);
+       int ret;
+
+       if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY))
+               return 0;
+
+       if (*flags & MS_RDONLY) {
+               sb->s_flags |= MS_RDONLY;
+
+               ret = btrfs_commit_super(root);
+               WARN_ON(ret);
+       } else {
+               if (root->fs_info->fs_devices->rw_devices == 0)
+                       return -EACCES;
+
+               if (btrfs_super_log_root(&root->fs_info->super_copy) != 0)
+                       return -EINVAL;
+
+               ret = btrfs_cleanup_reloc_trees(root);
+               WARN_ON(ret);
+
+               ret = btrfs_cleanup_fs_roots(root->fs_info);
+               WARN_ON(ret);
+
+               sb->s_flags &= ~MS_RDONLY;
+       }
+
+       return 0;
+}
+
+static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+       struct btrfs_root *root = btrfs_sb(dentry->d_sb);
+       struct btrfs_super_block *disk_super = &root->fs_info->super_copy;
+       int bits = dentry->d_sb->s_blocksize_bits;
+       __be32 *fsid = (__be32 *)root->fs_info->fsid;
+
+       buf->f_namelen = BTRFS_NAME_LEN;
+       buf->f_blocks = btrfs_super_total_bytes(disk_super) >> bits;
+       buf->f_bfree = buf->f_blocks -
+               (btrfs_super_bytes_used(disk_super) >> bits);
+       buf->f_bavail = buf->f_bfree;
+       buf->f_bsize = dentry->d_sb->s_blocksize;
+       buf->f_type = BTRFS_SUPER_MAGIC;
+
+       /* We treat it as constant endianness (it doesn't matter _which_)
+          because we want the fsid to come out the same whether mounted
+          on a big-endian or little-endian host */
+       buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
+       buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
+       /* Mask in the root object ID too, to disambiguate subvols */
+       buf->f_fsid.val[0] ^= BTRFS_I(dentry->d_inode)->root->objectid >> 32;
+       buf->f_fsid.val[1] ^= BTRFS_I(dentry->d_inode)->root->objectid;
+
+       return 0;
+}
+
+static struct file_system_type btrfs_fs_type = {
+       .owner          = THIS_MODULE,
+       .name           = "btrfs",
+       .get_sb         = btrfs_get_sb,
+       .kill_sb        = kill_anon_super,
+       .fs_flags       = FS_REQUIRES_DEV,
+};
+
+/*
+ * used by btrfsctl to scan devices when no FS is mounted
+ */
+static long btrfs_control_ioctl(struct file *file, unsigned int cmd,
+                               unsigned long arg)
+{
+       struct btrfs_ioctl_vol_args *vol;
+       struct btrfs_fs_devices *fs_devices;
+       int ret = 0;
+       int len;
+
+       if (!capable(CAP_SYS_ADMIN))
+               return -EPERM;
+
+       vol = kmalloc(sizeof(*vol), GFP_KERNEL);
+       if (!vol)
+               return -ENOMEM;
+       if (copy_from_user(vol, (void __user *)arg, sizeof(*vol))) {
+               ret = -EFAULT;
+               goto out;
+       }
+       len = strnlen(vol->name, BTRFS_PATH_NAME_MAX);
+       switch (cmd) {
+       case BTRFS_IOC_SCAN_DEV:
+               ret = btrfs_scan_one_device(vol->name, FMODE_READ,
+                                           &btrfs_fs_type, &fs_devices);
+               break;
+       }
+out:
+       kfree(vol);
+       return ret;
+}
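+
+/*
+ * An illustrative userspace sketch (not code from this commit) of how
+ * btrfsctl would drive this ioctl; the device path is a placeholder:
+ *
+ *	struct btrfs_ioctl_vol_args args;
+ *	int fd = open("/dev/btrfs-control", O_RDWR);
+ *
+ *	strncpy(args.name, "/dev/sdb", BTRFS_PATH_NAME_MAX);
+ *	ioctl(fd, BTRFS_IOC_SCAN_DEV, &args);
+ */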
+
+static void btrfs_write_super_lockfs(struct super_block *sb)
+{
+       struct btrfs_root *root = btrfs_sb(sb);
+       mutex_lock(&root->fs_info->transaction_kthread_mutex);
+       mutex_lock(&root->fs_info->cleaner_mutex);
+}
+
+static void btrfs_unlockfs(struct super_block *sb)
+{
+       struct btrfs_root *root = btrfs_sb(sb);
+       mutex_unlock(&root->fs_info->cleaner_mutex);
+       mutex_unlock(&root->fs_info->transaction_kthread_mutex);
+}
+
+static struct super_operations btrfs_super_ops = {
+       .delete_inode   = btrfs_delete_inode,
+       .put_super      = btrfs_put_super,
+       .write_super    = btrfs_write_super,
+       .sync_fs        = btrfs_sync_fs,
+       .show_options   = generic_show_options,
+       .write_inode    = btrfs_write_inode,
+       .dirty_inode    = btrfs_dirty_inode,
+       .alloc_inode    = btrfs_alloc_inode,
+       .destroy_inode  = btrfs_destroy_inode,
+       .statfs         = btrfs_statfs,
+       .remount_fs     = btrfs_remount,
+       .write_super_lockfs = btrfs_write_super_lockfs,
+       .unlockfs       = btrfs_unlockfs,
+};
+
+static const struct file_operations btrfs_ctl_fops = {
+       .unlocked_ioctl  = btrfs_control_ioctl,
+       .compat_ioctl = btrfs_control_ioctl,
+       .owner   = THIS_MODULE,
+};
+
+static struct miscdevice btrfs_misc = {
+       .minor          = MISC_DYNAMIC_MINOR,
+       .name           = "btrfs-control",
+       .fops           = &btrfs_ctl_fops
+};
+
+static int btrfs_interface_init(void)
+{
+       return misc_register(&btrfs_misc);
+}
+
+static void btrfs_interface_exit(void)
+{
+       if (misc_deregister(&btrfs_misc) < 0)
+               printk(KERN_INFO "misc_deregister failed for control device");
+}
+
+static int __init init_btrfs_fs(void)
+{
+       int err;
+
+       err = btrfs_init_sysfs();
+       if (err)
+               return err;
+
+       err = btrfs_init_cachep();
+       if (err)
+               goto free_sysfs;
+
+       err = extent_io_init();
+       if (err)
+               goto free_cachep;
+
+       err = extent_map_init();
+       if (err)
+               goto free_extent_io;
+
+       err = btrfs_interface_init();
+       if (err)
+               goto free_extent_map;
+
+       err = register_filesystem(&btrfs_fs_type);
+       if (err)
+               goto unregister_ioctl;
+
+       printk(KERN_INFO "%s loaded\n", BTRFS_BUILD_VERSION);
+       return 0;
+
+unregister_ioctl:
+       btrfs_interface_exit();
+free_extent_map:
+       extent_map_exit();
+free_extent_io:
+       extent_io_exit();
+free_cachep:
+       btrfs_destroy_cachep();
+free_sysfs:
+       btrfs_exit_sysfs();
+       return err;
+}
+
+static void __exit exit_btrfs_fs(void)
+{
+       btrfs_destroy_cachep();
+       extent_map_exit();
+       extent_io_exit();
+       btrfs_interface_exit();
+       unregister_filesystem(&btrfs_fs_type);
+       btrfs_exit_sysfs();
+       btrfs_cleanup_fs_uuids();
+       btrfs_zlib_exit();
+}
+
+module_init(init_btrfs_fs)
+module_exit(exit_btrfs_fs)
+
+MODULE_LICENSE("GPL");
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
new file mode 100644 (file)
index 0000000..a240b6f
--- /dev/null
@@ -0,0 +1,269 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/buffer_head.h>
+#include <linux/module.h>
+#include <linux/kobject.h>
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+
+static ssize_t root_blocks_used_show(struct btrfs_root *root, char *buf)
+{
+       return snprintf(buf, PAGE_SIZE, "%llu\n",
+               (unsigned long long)btrfs_root_used(&root->root_item));
+}
+
+static ssize_t root_block_limit_show(struct btrfs_root *root, char *buf)
+{
+       return snprintf(buf, PAGE_SIZE, "%llu\n",
+               (unsigned long long)btrfs_root_limit(&root->root_item));
+}
+
+static ssize_t super_blocks_used_show(struct btrfs_fs_info *fs, char *buf)
+{
+       return snprintf(buf, PAGE_SIZE, "%llu\n",
+               (unsigned long long)btrfs_super_bytes_used(&fs->super_copy));
+}
+
+static ssize_t super_total_blocks_show(struct btrfs_fs_info *fs, char *buf)
+{
+       return snprintf(buf, PAGE_SIZE, "%llu\n",
+               (unsigned long long)btrfs_super_total_bytes(&fs->super_copy));
+}
+
+static ssize_t super_blocksize_show(struct btrfs_fs_info *fs, char *buf)
+{
+       return snprintf(buf, PAGE_SIZE, "%llu\n",
+               (unsigned long long)btrfs_super_sectorsize(&fs->super_copy));
+}
+
+/* this is for root attrs (subvols/snapshots) */
+struct btrfs_root_attr {
+       struct attribute attr;
+       ssize_t (*show)(struct btrfs_root *, char *);
+       ssize_t (*store)(struct btrfs_root *, const char *, size_t);
+};
+
+#define ROOT_ATTR(name, mode, show, store) \
+static struct btrfs_root_attr btrfs_root_attr_##name = __ATTR(name, mode, \
+                                                             show, store)
+
+ROOT_ATTR(blocks_used, 0444,   root_blocks_used_show,  NULL);
+ROOT_ATTR(block_limit, 0644,   root_block_limit_show,  NULL);
+
+static struct attribute *btrfs_root_attrs[] = {
+       &btrfs_root_attr_blocks_used.attr,
+       &btrfs_root_attr_block_limit.attr,
+       NULL,
+};
+
+/* this is for super attrs (actual full fs) */
+struct btrfs_super_attr {
+       struct attribute attr;
+       ssize_t (*show)(struct btrfs_fs_info *, char *);
+       ssize_t (*store)(struct btrfs_fs_info *, const char *, size_t);
+};
+
+#define SUPER_ATTR(name, mode, show, store) \
+static struct btrfs_super_attr btrfs_super_attr_##name = __ATTR(name, mode, \
+                                                               show, store)
+
+SUPER_ATTR(blocks_used,                0444,   super_blocks_used_show,         NULL);
+SUPER_ATTR(total_blocks,       0444,   super_total_blocks_show,        NULL);
+SUPER_ATTR(blocksize,          0444,   super_blocksize_show,           NULL);
+
+static struct attribute *btrfs_super_attrs[] = {
+       &btrfs_super_attr_blocks_used.attr,
+       &btrfs_super_attr_total_blocks.attr,
+       &btrfs_super_attr_blocksize.attr,
+       NULL,
+};
+
+static ssize_t btrfs_super_attr_show(struct kobject *kobj,
+                                   struct attribute *attr, char *buf)
+{
+       struct btrfs_fs_info *fs = container_of(kobj, struct btrfs_fs_info,
+                                               super_kobj);
+       struct btrfs_super_attr *a = container_of(attr,
+                                                 struct btrfs_super_attr,
+                                                 attr);
+
+       return a->show ? a->show(fs, buf) : 0;
+}
+
+static ssize_t btrfs_super_attr_store(struct kobject *kobj,
+                                    struct attribute *attr,
+                                    const char *buf, size_t len)
+{
+       struct btrfs_fs_info *fs = container_of(kobj, struct btrfs_fs_info,
+                                               super_kobj);
+       struct btrfs_super_attr *a = container_of(attr,
+                                                 struct btrfs_super_attr,
+                                                 attr);
+
+       return a->store ? a->store(fs, buf, len) : 0;
+}
+
+static ssize_t btrfs_root_attr_show(struct kobject *kobj,
+                                   struct attribute *attr, char *buf)
+{
+       struct btrfs_root *root = container_of(kobj, struct btrfs_root,
+                                               root_kobj);
+       struct btrfs_root_attr *a = container_of(attr,
+                                                struct btrfs_root_attr,
+                                                attr);
+
+       return a->show ? a->show(root, buf) : 0;
+}
+
+static ssize_t btrfs_root_attr_store(struct kobject *kobj,
+                                    struct attribute *attr,
+                                    const char *buf, size_t len)
+{
+       struct btrfs_root *root = container_of(kobj, struct btrfs_root,
+                                               root_kobj);
+       struct btrfs_root_attr *a = container_of(attr,
+                                                struct btrfs_root_attr,
+                                                attr);
+       return a->store ? a->store(root, buf, len) : 0;
+}
+
+static void btrfs_super_release(struct kobject *kobj)
+{
+       struct btrfs_fs_info *fs = container_of(kobj, struct btrfs_fs_info,
+                                               super_kobj);
+       complete(&fs->kobj_unregister);
+}
+
+static void btrfs_root_release(struct kobject *kobj)
+{
+       struct btrfs_root *root = container_of(kobj, struct btrfs_root,
+                                               root_kobj);
+       complete(&root->kobj_unregister);
+}
+
+static struct sysfs_ops btrfs_super_attr_ops = {
+       .show   = btrfs_super_attr_show,
+       .store  = btrfs_super_attr_store,
+};
+
+static struct sysfs_ops btrfs_root_attr_ops = {
+       .show   = btrfs_root_attr_show,
+       .store  = btrfs_root_attr_store,
+};
+
+static struct kobj_type btrfs_root_ktype = {
+       .default_attrs  = btrfs_root_attrs,
+       .sysfs_ops      = &btrfs_root_attr_ops,
+       .release        = btrfs_root_release,
+};
+
+static struct kobj_type btrfs_super_ktype = {
+       .default_attrs  = btrfs_super_attrs,
+       .sysfs_ops      = &btrfs_super_attr_ops,
+       .release        = btrfs_super_release,
+};
+
+/* /sys/fs/btrfs/ entry */
+static struct kset *btrfs_kset;
+
+int btrfs_sysfs_add_super(struct btrfs_fs_info *fs)
+{
+       int error;
+       char *name;
+       char c;
+       int len = strlen(fs->sb->s_id) + 1;
+       int i;
+
+       name = kmalloc(len, GFP_NOFS);
+       if (!name) {
+               error = -ENOMEM;
+               goto fail;
+       }
+
+       for (i = 0; i < len; i++) {
+               c = fs->sb->s_id[i];
+               if (c == '/' || c == '\\')
+                       c = '!';
+               name[i] = c;
+       }
+       name[len - 1] = '\0';
+
+       fs->super_kobj.kset = btrfs_kset;
+       error = kobject_init_and_add(&fs->super_kobj, &btrfs_super_ktype,
+                                    NULL, "%s", name);
+       kfree(name);
+       if (error)
+               goto fail;
+
+       return 0;
+
+fail:
+       printk(KERN_ERR "btrfs: sysfs creation for super failed\n");
+       return error;
+}
+
+int btrfs_sysfs_add_root(struct btrfs_root *root)
+{
+       int error;
+
+       error = kobject_init_and_add(&root->root_kobj, &btrfs_root_ktype,
+                                    &root->fs_info->super_kobj,
+                                    "%s", root->name);
+       if (error)
+               goto fail;
+
+       return 0;
+
+fail:
+       printk(KERN_ERR "btrfs: sysfs creation for root failed\n");
+       return error;
+}
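+
+/*
+ * The resulting layout, for illustration, with a device whose s_id is
+ * "sdb" and a root registered under the placeholder name "default":
+ *
+ *	/sys/fs/btrfs/sdb/blocks_used
+ *	/sys/fs/btrfs/sdb/total_blocks
+ *	/sys/fs/btrfs/sdb/blocksize
+ *	/sys/fs/btrfs/sdb/default/blocks_used
+ *	/sys/fs/btrfs/sdb/default/block_limit
+ *
+ * ('/' and '\\' in s_id are mapped to '!' by btrfs_sysfs_add_super())
+ */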
+
+void btrfs_sysfs_del_root(struct btrfs_root *root)
+{
+       kobject_put(&root->root_kobj);
+       wait_for_completion(&root->kobj_unregister);
+}
+
+void btrfs_sysfs_del_super(struct btrfs_fs_info *fs)
+{
+       kobject_put(&fs->super_kobj);
+       wait_for_completion(&fs->kobj_unregister);
+}
+
+int btrfs_init_sysfs(void)
+{
+       btrfs_kset = kset_create_and_add("btrfs", NULL, fs_kobj);
+       if (!btrfs_kset)
+               return -ENOMEM;
+       return 0;
+}
+
+void btrfs_exit_sysfs(void)
+{
+       kset_unregister(btrfs_kset);
+}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
new file mode 100644 (file)
index 0000000..8a08f94
--- /dev/null
@@ -0,0 +1,1097 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "locking.h"
+#include "ref-cache.h"
+#include "tree-log.h"
+
+#define BTRFS_ROOT_TRANS_TAG 0
+
+static noinline void put_transaction(struct btrfs_transaction *transaction)
+{
+       WARN_ON(transaction->use_count == 0);
+       transaction->use_count--;
+       if (transaction->use_count == 0) {
+               list_del_init(&transaction->list);
+               memset(transaction, 0, sizeof(*transaction));
+               kmem_cache_free(btrfs_transaction_cachep, transaction);
+       }
+}
+
+/*
+ * either allocate a new transaction or hop into the existing one
+ */
+static noinline int join_transaction(struct btrfs_root *root)
+{
+       struct btrfs_transaction *cur_trans;
+       cur_trans = root->fs_info->running_transaction;
+       if (!cur_trans) {
+               cur_trans = kmem_cache_alloc(btrfs_transaction_cachep,
+                                            GFP_NOFS);
+               BUG_ON(!cur_trans);
+               root->fs_info->generation++;
+               root->fs_info->last_alloc = 0;
+               root->fs_info->last_data_alloc = 0;
+               cur_trans->num_writers = 1;
+               cur_trans->num_joined = 0;
+               cur_trans->transid = root->fs_info->generation;
+               init_waitqueue_head(&cur_trans->writer_wait);
+               init_waitqueue_head(&cur_trans->commit_wait);
+               cur_trans->in_commit = 0;
+               cur_trans->blocked = 0;
+               cur_trans->use_count = 1;
+               cur_trans->commit_done = 0;
+               cur_trans->start_time = get_seconds();
+               INIT_LIST_HEAD(&cur_trans->pending_snapshots);
+               list_add_tail(&cur_trans->list, &root->fs_info->trans_list);
+               extent_io_tree_init(&cur_trans->dirty_pages,
+                                    root->fs_info->btree_inode->i_mapping,
+                                    GFP_NOFS);
+               spin_lock(&root->fs_info->new_trans_lock);
+               root->fs_info->running_transaction = cur_trans;
+               spin_unlock(&root->fs_info->new_trans_lock);
+       } else {
+               cur_trans->num_writers++;
+               cur_trans->num_joined++;
+       }
+
+       return 0;
+}
+
+/*
+ * this does all the record keeping required to make sure that a reference
+ * counted root is properly recorded in a given transaction.  This is required
+ * to make sure the old root from before we joined the transaction is deleted
+ * when the transaction commits
+ */
+noinline int btrfs_record_root_in_trans(struct btrfs_root *root)
+{
+       struct btrfs_dirty_root *dirty;
+       u64 running_trans_id = root->fs_info->running_transaction->transid;
+       if (root->ref_cows && root->last_trans < running_trans_id) {
+               WARN_ON(root == root->fs_info->extent_root);
+               if (root->root_item.refs != 0) {
+                       radix_tree_tag_set(&root->fs_info->fs_roots_radix,
+                                  (unsigned long)root->root_key.objectid,
+                                  BTRFS_ROOT_TRANS_TAG);
+
+                       dirty = kmalloc(sizeof(*dirty), GFP_NOFS);
+                       BUG_ON(!dirty);
+                       dirty->root = kmalloc(sizeof(*dirty->root), GFP_NOFS);
+                       BUG_ON(!dirty->root);
+                       dirty->latest_root = root;
+                       INIT_LIST_HEAD(&dirty->list);
+
+                       root->commit_root = btrfs_root_node(root);
+
+                       memcpy(dirty->root, root, sizeof(*root));
+                       spin_lock_init(&dirty->root->node_lock);
+                       spin_lock_init(&dirty->root->list_lock);
+                       mutex_init(&dirty->root->objectid_mutex);
+                       mutex_init(&dirty->root->log_mutex);
+                       INIT_LIST_HEAD(&dirty->root->dead_list);
+                       dirty->root->node = root->commit_root;
+                       dirty->root->commit_root = NULL;
+
+                       spin_lock(&root->list_lock);
+                       list_add(&dirty->root->dead_list, &root->dead_list);
+                       spin_unlock(&root->list_lock);
+
+                       root->dirty_root = dirty;
+               } else {
+                       WARN_ON(1);
+               }
+               root->last_trans = running_trans_id;
+       }
+       return 0;
+}
+
+/*
+ * wait for commit against the current transaction to become unblocked.
+ * When this is done, it is safe to start a new transaction, but the current
+ * transaction might not be fully on disk.
+ */
+static void wait_current_trans(struct btrfs_root *root)
+{
+       struct btrfs_transaction *cur_trans;
+
+       cur_trans = root->fs_info->running_transaction;
+       if (cur_trans && cur_trans->blocked) {
+               DEFINE_WAIT(wait);
+               cur_trans->use_count++;
+               while (1) {
+                       prepare_to_wait(&root->fs_info->transaction_wait, &wait,
+                                       TASK_UNINTERRUPTIBLE);
+                       if (cur_trans->blocked) {
+                               mutex_unlock(&root->fs_info->trans_mutex);
+                               schedule();
+                               mutex_lock(&root->fs_info->trans_mutex);
+                               finish_wait(&root->fs_info->transaction_wait,
+                                           &wait);
+                       } else {
+                               finish_wait(&root->fs_info->transaction_wait,
+                                           &wait);
+                               break;
+                       }
+               }
+               put_transaction(cur_trans);
+       }
+}
+
+static struct btrfs_trans_handle *start_transaction(struct btrfs_root *root,
+                                            int num_blocks, int wait)
+{
+       struct btrfs_trans_handle *h =
+               kmem_cache_alloc(btrfs_trans_handle_cachep, GFP_NOFS);
+       int ret;
+
+       mutex_lock(&root->fs_info->trans_mutex);
+       if (!root->fs_info->log_root_recovering &&
+           ((wait == 1 && !root->fs_info->open_ioctl_trans) || wait == 2))
+               wait_current_trans(root);
+       ret = join_transaction(root);
+       BUG_ON(ret);
+
+       btrfs_record_root_in_trans(root);
+       h->transid = root->fs_info->running_transaction->transid;
+       h->transaction = root->fs_info->running_transaction;
+       h->blocks_reserved = num_blocks;
+       h->blocks_used = 0;
+       h->block_group = 0;
+       h->alloc_exclude_nr = 0;
+       h->alloc_exclude_start = 0;
+       root->fs_info->running_transaction->use_count++;
+       mutex_unlock(&root->fs_info->trans_mutex);
+       return h;
+}
+
+struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root,
+                                                  int num_blocks)
+{
+       return start_transaction(root, num_blocks, 1);
+}
+struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root,
+                                                  int num_blocks)
+{
+       return start_transaction(root, num_blocks, 0);
+}
+
+struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *r,
+                                                        int num_blocks)
+{
+       return start_transaction(r, num_blocks, 2);
+}
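+
+/*
+ * the wait argument used above: 0 joins the running transaction without
+ * waiting for a blocked commit, 1 waits unless an application holds the
+ * ioctl transaction open (open_ioctl_trans), and 2 always waits
+ */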
+
+/* wait for a transaction commit to be fully complete */
+static noinline int wait_for_commit(struct btrfs_root *root,
+                                   struct btrfs_transaction *commit)
+{
+       DEFINE_WAIT(wait);
+       mutex_lock(&root->fs_info->trans_mutex);
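+       /*
+        * open coded wait loop: trans_mutex is dropped across schedule()
+        * so the committing task can make progress, then retaken before
+        * rechecking commit_done
+        */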
+       while (!commit->commit_done) {
+               prepare_to_wait(&commit->commit_wait, &wait,
+                               TASK_UNINTERRUPTIBLE);
+               if (commit->commit_done)
+                       break;
+               mutex_unlock(&root->fs_info->trans_mutex);
+               schedule();
+               mutex_lock(&root->fs_info->trans_mutex);
+       }
+       mutex_unlock(&root->fs_info->trans_mutex);
+       finish_wait(&commit->commit_wait, &wait);
+       return 0;
+}
+
+/*
+ * rate limit against the drop_snapshot code.  This helps to slow down new
+ * operations if the drop_snapshot code isn't able to keep up.
+ */
+static void throttle_on_drops(struct btrfs_root *root)
+{
+       struct btrfs_fs_info *info = root->fs_info;
+       int harder_count = 0;
+
+harder:
+       if (atomic_read(&info->throttles)) {
+               DEFINE_WAIT(wait);
+               int thr;
+               thr = atomic_read(&info->throttle_gen);
+
+               do {
+                       prepare_to_wait(&info->transaction_throttle,
+                                       &wait, TASK_UNINTERRUPTIBLE);
+                       if (!atomic_read(&info->throttles)) {
+                               finish_wait(&info->transaction_throttle, &wait);
+                               break;
+                       }
+                       schedule();
+                       finish_wait(&info->transaction_throttle, &wait);
+               } while (thr == atomic_read(&info->throttle_gen));
+               harder_count++;
+
+               if (root->fs_info->total_ref_cache_size > 1 * 1024 * 1024 &&
+                   harder_count < 2)
+                       goto harder;
+
+               if (root->fs_info->total_ref_cache_size > 5 * 1024 * 1024 &&
+                   harder_count < 10)
+                       goto harder;
+
+               if (root->fs_info->total_ref_cache_size > 10 * 1024 * 1024 &&
+                   harder_count < 20)
+                       goto harder;
+       }
+}
+
+void btrfs_throttle(struct btrfs_root *root)
+{
+       mutex_lock(&root->fs_info->trans_mutex);
+       if (!root->fs_info->open_ioctl_trans)
+               wait_current_trans(root);
+       mutex_unlock(&root->fs_info->trans_mutex);
+
+       throttle_on_drops(root);
+}
+
+static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, int throttle)
+{
+       struct btrfs_transaction *cur_trans;
+       struct btrfs_fs_info *info = root->fs_info;
+
+       mutex_lock(&info->trans_mutex);
+       cur_trans = info->running_transaction;
+       WARN_ON(cur_trans != trans->transaction);
+       WARN_ON(cur_trans->num_writers < 1);
+       cur_trans->num_writers--;
+
+       if (waitqueue_active(&cur_trans->writer_wait))
+               wake_up(&cur_trans->writer_wait);
+       put_transaction(cur_trans);
+       mutex_unlock(&info->trans_mutex);
+       memset(trans, 0, sizeof(*trans));
+       kmem_cache_free(btrfs_trans_handle_cachep, trans);
+
+       if (throttle)
+               throttle_on_drops(root);
+
+       return 0;
+}
+
+int btrfs_end_transaction(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root)
+{
+       return __btrfs_end_transaction(trans, root, 0);
+}
+
+int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root)
+{
+       return __btrfs_end_transaction(trans, root, 1);
+}
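+
+/*
+ * Example (illustrative): callers pair the helpers above as
+ *
+ *     trans = btrfs_start_transaction(root, 1);
+ *     ... modify the btree under the handle ...
+ *     ret = btrfs_end_transaction(trans, root);
+ *
+ * the throttle variant additionally slows the caller down against the
+ * snapshot dropping code via throttle_on_drops()
+ */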
+
+/*
+ * when btree blocks are allocated, they have some corresponding bits set for
+ * them in one of two extent_io trees.  This is used to make sure all of
+ * those extents are on disk for transaction or log commit
+ */
+int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
+                                       struct extent_io_tree *dirty_pages)
+{
+       int ret;
+       int err = 0;
+       int werr = 0;
+       struct page *page;
+       struct inode *btree_inode = root->fs_info->btree_inode;
+       u64 start = 0;
+       u64 end;
+       unsigned long index;
+
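+       /* first pass: write out every page in a range with EXTENT_DIRTY set */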
+       while (1) {
+               ret = find_first_extent_bit(dirty_pages, start, &start, &end,
+                                           EXTENT_DIRTY);
+               if (ret)
+                       break;
+               while (start <= end) {
+                       cond_resched();
+
+                       index = start >> PAGE_CACHE_SHIFT;
+                       start = (u64)(index + 1) << PAGE_CACHE_SHIFT;
+                       page = find_get_page(btree_inode->i_mapping, index);
+                       if (!page)
+                               continue;
+
+                       btree_lock_page_hook(page);
+                       if (!page->mapping) {
+                               unlock_page(page);
+                               page_cache_release(page);
+                               continue;
+                       }
+
+                       if (PageWriteback(page)) {
+                               if (PageDirty(page))
+                                       wait_on_page_writeback(page);
+                               else {
+                                       unlock_page(page);
+                                       page_cache_release(page);
+                                       continue;
+                               }
+                       }
+                       err = write_one_page(page, 0);
+                       if (err)
+                               werr = err;
+                       page_cache_release(page);
+               }
+       }
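+
+       /* second pass: clear the dirty bits and wait for writeback to finish */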
+       while (1) {
+               ret = find_first_extent_bit(dirty_pages, 0, &start, &end,
+                                           EXTENT_DIRTY);
+               if (ret)
+                       break;
+
+               clear_extent_dirty(dirty_pages, start, end, GFP_NOFS);
+               while (start <= end) {
+                       index = start >> PAGE_CACHE_SHIFT;
+                       start = (u64)(index + 1) << PAGE_CACHE_SHIFT;
+                       page = find_get_page(btree_inode->i_mapping, index);
+                       if (!page)
+                               continue;
+                       if (PageDirty(page)) {
+                               btree_lock_page_hook(page);
+                               wait_on_page_writeback(page);
+                               err = write_one_page(page, 0);
+                               if (err)
+                                       werr = err;
+                       }
+                       wait_on_page_writeback(page);
+                       page_cache_release(page);
+                       cond_resched();
+               }
+       }
+       if (err)
+               werr = err;
+       return werr;
+}
+
+int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
+                                    struct btrfs_root *root)
+{
+       if (!trans || !trans->transaction) {
+               struct inode *btree_inode;
+               btree_inode = root->fs_info->btree_inode;
+               return filemap_write_and_wait(btree_inode->i_mapping);
+       }
+       return btrfs_write_and_wait_marked_extents(root,
+                                          &trans->transaction->dirty_pages);
+}
+
+/*
+ * this is used to update the root pointer in the tree of tree roots.
+ *
+ * But, in the case of the extent allocation tree, updating the root
+ * pointer may allocate blocks which may change the root of the extent
+ * allocation tree.
+ *
+ * So, this loops, repeating the update, to make sure the cowonly root didn't
+ * change while the root pointer was being updated in the metadata.
+ */
+static int update_cowonly_root(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root)
+{
+       int ret;
+       u64 old_root_bytenr;
+       struct btrfs_root *tree_root = root->fs_info->tree_root;
+
+       btrfs_extent_post_op(trans, root);
+       btrfs_write_dirty_block_groups(trans, root);
+       btrfs_extent_post_op(trans, root);
+
+       while (1) {
+               old_root_bytenr = btrfs_root_bytenr(&root->root_item);
+               if (old_root_bytenr == root->node->start)
+                       break;
+               btrfs_set_root_bytenr(&root->root_item,
+                                      root->node->start);
+               btrfs_set_root_level(&root->root_item,
+                                    btrfs_header_level(root->node));
+               btrfs_set_root_generation(&root->root_item, trans->transid);
+
+               btrfs_extent_post_op(trans, root);
+
+               ret = btrfs_update_root(trans, tree_root,
+                                       &root->root_key,
+                                       &root->root_item);
+               BUG_ON(ret);
+               btrfs_write_dirty_block_groups(trans, root);
+               btrfs_extent_post_op(trans, root);
+       }
+       return 0;
+}
+
+/*
+ * update all the cowonly tree roots on disk
+ */
+int btrfs_commit_tree_roots(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root)
+{
+       struct btrfs_fs_info *fs_info = root->fs_info;
+       struct list_head *next;
+       struct extent_buffer *eb;
+
+       btrfs_extent_post_op(trans, fs_info->tree_root);
+
+       eb = btrfs_lock_root_node(fs_info->tree_root);
+       btrfs_cow_block(trans, fs_info->tree_root, eb, NULL, 0, &eb, 0);
+       btrfs_tree_unlock(eb);
+       free_extent_buffer(eb);
+
+       btrfs_extent_post_op(trans, fs_info->tree_root);
+
+       while (!list_empty(&fs_info->dirty_cowonly_roots)) {
+               next = fs_info->dirty_cowonly_roots.next;
+               list_del_init(next);
+               root = list_entry(next, struct btrfs_root, dirty_list);
+
+               update_cowonly_root(trans, root);
+       }
+       return 0;
+}
+
+/*
+ * dead roots are old snapshots that need to be deleted.  This allocates
+ * a dirty root struct and adds it into the list of dead roots that need to
+ * be deleted
+ */
+int btrfs_add_dead_root(struct btrfs_root *root, struct btrfs_root *latest)
+{
+       struct btrfs_dirty_root *dirty;
+
+       dirty = kmalloc(sizeof(*dirty), GFP_NOFS);
+       if (!dirty)
+               return -ENOMEM;
+       dirty->root = root;
+       dirty->latest_root = latest;
+
+       mutex_lock(&root->fs_info->trans_mutex);
+       list_add(&dirty->list, &latest->fs_info->dead_roots);
+       mutex_unlock(&root->fs_info->trans_mutex);
+       return 0;
+}
+
+/*
+ * at transaction commit time we need to schedule the old roots for
+ * deletion via btrfs_drop_snapshot.  This runs through all the
+ * reference counted roots that were modified in the current
+ * transaction and puts them into the drop list
+ */
+static noinline int add_dirty_roots(struct btrfs_trans_handle *trans,
+                                   struct radix_tree_root *radix,
+                                   struct list_head *list)
+{
+       struct btrfs_dirty_root *dirty;
+       struct btrfs_root *gang[8];
+       struct btrfs_root *root;
+       int i;
+       int ret;
+       int err = 0;
+       u32 refs;
+
+       while (1) {
+               ret = radix_tree_gang_lookup_tag(radix, (void **)gang, 0,
+                                                ARRAY_SIZE(gang),
+                                                BTRFS_ROOT_TRANS_TAG);
+               if (ret == 0)
+                       break;
+               for (i = 0; i < ret; i++) {
+                       root = gang[i];
+                       radix_tree_tag_clear(radix,
+                                    (unsigned long)root->root_key.objectid,
+                                    BTRFS_ROOT_TRANS_TAG);
+
+                       BUG_ON(!root->ref_tree);
+                       dirty = root->dirty_root;
+
+                       btrfs_free_log(trans, root);
+                       btrfs_free_reloc_root(trans, root);
+
+                       if (root->commit_root == root->node) {
+                               WARN_ON(root->node->start !=
+                                       btrfs_root_bytenr(&root->root_item));
+
+                               free_extent_buffer(root->commit_root);
+                               root->commit_root = NULL;
+                               root->dirty_root = NULL;
+
+                               spin_lock(&root->list_lock);
+                               list_del_init(&dirty->root->dead_list);
+                               spin_unlock(&root->list_lock);
+
+                               kfree(dirty->root);
+                               kfree(dirty);
+
+                               /* make sure to update the root on disk
+                                * so we get any updates to the block used
+                                * counts
+                                */
+                               err = btrfs_update_root(trans,
+                                               root->fs_info->tree_root,
+                                               &root->root_key,
+                                               &root->root_item);
+                               continue;
+                       }
+
+                       memset(&root->root_item.drop_progress, 0,
+                              sizeof(struct btrfs_disk_key));
+                       root->root_item.drop_level = 0;
+                       root->commit_root = NULL;
+                       root->dirty_root = NULL;
+                       root->root_key.offset = root->fs_info->generation;
+                       btrfs_set_root_bytenr(&root->root_item,
+                                             root->node->start);
+                       btrfs_set_root_level(&root->root_item,
+                                            btrfs_header_level(root->node));
+                       btrfs_set_root_generation(&root->root_item,
+                                                 root->root_key.offset);
+
+                       err = btrfs_insert_root(trans, root->fs_info->tree_root,
+                                               &root->root_key,
+                                               &root->root_item);
+                       if (err)
+                               break;
+
+                       refs = btrfs_root_refs(&dirty->root->root_item);
+                       btrfs_set_root_refs(&dirty->root->root_item, refs - 1);
+                       err = btrfs_update_root(trans, root->fs_info->tree_root,
+                                               &dirty->root->root_key,
+                                               &dirty->root->root_item);
+
+                       BUG_ON(err);
+                       if (refs == 1) {
+                               list_add(&dirty->list, list);
+                       } else {
+                               WARN_ON(1);
+                               free_extent_buffer(dirty->root->node);
+                               kfree(dirty->root);
+                               kfree(dirty);
+                       }
+               }
+       }
+       return err;
+}
+
+/*
+ * defrag a given btree.  If cacheonly == 1, this won't read from the disk,
+ * otherwise every leaf in the btree is read and defragged.
+ */
+int btrfs_defrag_root(struct btrfs_root *root, int cacheonly)
+{
+       struct btrfs_fs_info *info = root->fs_info;
+       int ret;
+       struct btrfs_trans_handle *trans;
+       unsigned long nr;
+
+       smp_mb();
+       if (root->defrag_running)
+               return 0;
+       trans = btrfs_start_transaction(root, 1);
+       while (1) {
+               root->defrag_running = 1;
+               ret = btrfs_defrag_leaves(trans, root, cacheonly);
+               nr = trans->blocks_used;
+               btrfs_end_transaction(trans, root);
+               btrfs_btree_balance_dirty(info->tree_root, nr);
+               cond_resched();
+
+               trans = btrfs_start_transaction(root, 1);
+               if (root->fs_info->closing || ret != -EAGAIN)
+                       break;
+       }
+       root->defrag_running = 0;
+       smp_mb();
+       btrfs_end_transaction(trans, root);
+       return 0;
+}
+
+/*
+ * Given a list of roots that need to be deleted, call btrfs_drop_snapshot on
+ * all of them
+ */
+static noinline int drop_dirty_roots(struct btrfs_root *tree_root,
+                                    struct list_head *list)
+{
+       struct btrfs_dirty_root *dirty;
+       struct btrfs_trans_handle *trans;
+       unsigned long nr;
+       u64 num_bytes;
+       u64 bytes_used;
+       u64 max_useless;
+       int ret = 0;
+       int err;
+
+       while (!list_empty(list)) {
+               struct btrfs_root *root;
+
+               dirty = list_entry(list->prev, struct btrfs_dirty_root, list);
+               list_del_init(&dirty->list);
+
+               num_bytes = btrfs_root_used(&dirty->root->root_item);
+               root = dirty->latest_root;
+               atomic_inc(&root->fs_info->throttles);
+
+               while (1) {
+                       trans = btrfs_start_transaction(tree_root, 1);
+                       mutex_lock(&root->fs_info->drop_mutex);
+                       ret = btrfs_drop_snapshot(trans, dirty->root);
+                       if (ret != -EAGAIN)
+                               break;
+                       mutex_unlock(&root->fs_info->drop_mutex);
+
+                       err = btrfs_update_root(trans,
+                                       tree_root,
+                                       &dirty->root->root_key,
+                                       &dirty->root->root_item);
+                       if (err)
+                               ret = err;
+                       nr = trans->blocks_used;
+                       ret = btrfs_end_transaction(trans, tree_root);
+                       BUG_ON(ret);
+
+                       btrfs_btree_balance_dirty(tree_root, nr);
+                       cond_resched();
+               }
+               BUG_ON(ret);
+               atomic_dec(&root->fs_info->throttles);
+               wake_up(&root->fs_info->transaction_throttle);
+
+               num_bytes -= btrfs_root_used(&dirty->root->root_item);
+               bytes_used = btrfs_root_used(&root->root_item);
+               if (num_bytes) {
+                       btrfs_record_root_in_trans(root);
+                       btrfs_set_root_used(&root->root_item,
+                                           bytes_used - num_bytes);
+               }
+
+               ret = btrfs_del_root(trans, tree_root, &dirty->root->root_key);
+               if (ret) {
+                       BUG();
+                       break;
+               }
+               mutex_unlock(&root->fs_info->drop_mutex);
+
+               spin_lock(&root->list_lock);
+               list_del_init(&dirty->root->dead_list);
+               if (!list_empty(&root->dead_list)) {
+                       struct btrfs_root *oldest;
+                       oldest = list_entry(root->dead_list.prev,
+                                           struct btrfs_root, dead_list);
+                       max_useless = oldest->root_key.offset - 1;
+               } else {
+                       max_useless = root->root_key.offset - 1;
+               }
+               spin_unlock(&root->list_lock);
+
+               nr = trans->blocks_used;
+               ret = btrfs_end_transaction(trans, tree_root);
+               BUG_ON(ret);
+
+               ret = btrfs_remove_leaf_refs(root, max_useless, 0);
+               BUG_ON(ret);
+
+               free_extent_buffer(dirty->root->node);
+               kfree(dirty->root);
+               kfree(dirty);
+
+               btrfs_btree_balance_dirty(tree_root, nr);
+               cond_resched();
+       }
+       return ret;
+}
+
+/*
+ * new snapshots need to be created at a very specific time in the
+ * transaction commit.  This does the actual creation
+ */
+static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
+                                  struct btrfs_fs_info *fs_info,
+                                  struct btrfs_pending_snapshot *pending)
+{
+       struct btrfs_key key;
+       struct btrfs_root_item *new_root_item;
+       struct btrfs_root *tree_root = fs_info->tree_root;
+       struct btrfs_root *root = pending->root;
+       struct extent_buffer *tmp;
+       struct extent_buffer *old;
+       int ret;
+       u64 objectid;
+
+       new_root_item = kmalloc(sizeof(*new_root_item), GFP_NOFS);
+       if (!new_root_item) {
+               ret = -ENOMEM;
+               goto fail;
+       }
+       ret = btrfs_find_free_objectid(trans, tree_root, 0, &objectid);
+       if (ret)
+               goto fail;
+
+       btrfs_record_root_in_trans(root);
+       btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
+       memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
+
+       key.objectid = objectid;
+       key.offset = trans->transid;
+       btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
+
+       old = btrfs_lock_root_node(root);
+       btrfs_cow_block(trans, root, old, NULL, 0, &old, 0);
+
+       btrfs_copy_root(trans, root, old, &tmp, objectid);
+       btrfs_tree_unlock(old);
+       free_extent_buffer(old);
+
+       btrfs_set_root_bytenr(new_root_item, tmp->start);
+       btrfs_set_root_level(new_root_item, btrfs_header_level(tmp));
+       btrfs_set_root_generation(new_root_item, trans->transid);
+       ret = btrfs_insert_root(trans, root->fs_info->tree_root, &key,
+                               new_root_item);
+       btrfs_tree_unlock(tmp);
+       free_extent_buffer(tmp);
+       if (ret)
+               goto fail;
+
+       key.offset = (u64)-1;
+       memcpy(&pending->root_key, &key, sizeof(key));
+fail:
+       kfree(new_root_item);
+       return ret;
+}
+
+static noinline int finish_pending_snapshot(struct btrfs_fs_info *fs_info,
+                                  struct btrfs_pending_snapshot *pending)
+{
+       int ret;
+       int namelen;
+       u64 index = 0;
+       struct btrfs_trans_handle *trans;
+       struct inode *parent_inode;
+       struct inode *inode;
+       struct btrfs_root *parent_root;
+
+       parent_inode = pending->dentry->d_parent->d_inode;
+       parent_root = BTRFS_I(parent_inode)->root;
+       trans = btrfs_join_transaction(parent_root, 1);
+
+       /*
+        * insert the directory item
+        */
+       namelen = strlen(pending->name);
+       ret = btrfs_set_inode_index(parent_inode, &index);
+       ret = btrfs_insert_dir_item(trans, parent_root,
+                           pending->name, namelen,
+                           parent_inode->i_ino,
+                           &pending->root_key, BTRFS_FT_DIR, index);
+
+       if (ret)
+               goto fail;
+
+       btrfs_i_size_write(parent_inode, parent_inode->i_size + namelen * 2);
+       ret = btrfs_update_inode(trans, parent_root, parent_inode);
+       BUG_ON(ret);
+
+       /* add the backref first */
+       ret = btrfs_add_root_ref(trans, parent_root->fs_info->tree_root,
+                                pending->root_key.objectid,
+                                BTRFS_ROOT_BACKREF_KEY,
+                                parent_root->root_key.objectid,
+                                parent_inode->i_ino, index, pending->name,
+                                namelen);
+
+       BUG_ON(ret);
+
+       /* now add the forward ref */
+       ret = btrfs_add_root_ref(trans, parent_root->fs_info->tree_root,
+                                parent_root->root_key.objectid,
+                                BTRFS_ROOT_REF_KEY,
+                                pending->root_key.objectid,
+                                parent_inode->i_ino, index, pending->name,
+                                namelen);
+
+       inode = btrfs_lookup_dentry(parent_inode, pending->dentry);
+       d_instantiate(pending->dentry, inode);
+fail:
+       btrfs_end_transaction(trans, fs_info->fs_root);
+       return ret;
+}
+
+/*
+ * create all the snapshots we've scheduled for creation
+ */
+static noinline int create_pending_snapshots(struct btrfs_trans_handle *trans,
+                                            struct btrfs_fs_info *fs_info)
+{
+       struct btrfs_pending_snapshot *pending;
+       struct list_head *head = &trans->transaction->pending_snapshots;
+       struct list_head *cur;
+       int ret;
+
+       list_for_each(cur, head) {
+               pending = list_entry(cur, struct btrfs_pending_snapshot, list);
+               ret = create_pending_snapshot(trans, fs_info, pending);
+               BUG_ON(ret);
+       }
+       return 0;
+}
+
+static noinline int finish_pending_snapshots(struct btrfs_trans_handle *trans,
+                                            struct btrfs_fs_info *fs_info)
+{
+       struct btrfs_pending_snapshot *pending;
+       struct list_head *head = &trans->transaction->pending_snapshots;
+       int ret;
+
+       while (!list_empty(head)) {
+               pending = list_entry(head->next,
+                                    struct btrfs_pending_snapshot, list);
+               ret = finish_pending_snapshot(fs_info, pending);
+               BUG_ON(ret);
+               list_del(&pending->list);
+               kfree(pending->name);
+               kfree(pending);
+       }
+       return 0;
+}
+
+int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root)
+{
+       unsigned long joined = 0;
+       unsigned long timeout = 1;
+       struct btrfs_transaction *cur_trans;
+       struct btrfs_transaction *prev_trans = NULL;
+       struct btrfs_root *chunk_root = root->fs_info->chunk_root;
+       struct list_head dirty_fs_roots;
+       struct extent_io_tree *pinned_copy;
+       DEFINE_WAIT(wait);
+       int ret;
+
+       INIT_LIST_HEAD(&dirty_fs_roots);
+       mutex_lock(&root->fs_info->trans_mutex);
+       if (trans->transaction->in_commit) {
+               cur_trans = trans->transaction;
+               trans->transaction->use_count++;
+               mutex_unlock(&root->fs_info->trans_mutex);
+               btrfs_end_transaction(trans, root);
+
+               ret = wait_for_commit(root, cur_trans);
+               BUG_ON(ret);
+
+               mutex_lock(&root->fs_info->trans_mutex);
+               put_transaction(cur_trans);
+               mutex_unlock(&root->fs_info->trans_mutex);
+
+               return 0;
+       }
+
+       pinned_copy = kmalloc(sizeof(*pinned_copy), GFP_NOFS);
+       if (!pinned_copy)
+               return -ENOMEM;
+
+       extent_io_tree_init(pinned_copy,
+                            root->fs_info->btree_inode->i_mapping, GFP_NOFS);
+
+       trans->transaction->in_commit = 1;
+       trans->transaction->blocked = 1;
+       cur_trans = trans->transaction;
+       if (cur_trans->list.prev != &root->fs_info->trans_list) {
+               prev_trans = list_entry(cur_trans->list.prev,
+                                       struct btrfs_transaction, list);
+               if (!prev_trans->commit_done) {
+                       prev_trans->use_count++;
+                       mutex_unlock(&root->fs_info->trans_mutex);
+
+                       wait_for_commit(root, prev_trans);
+
+                       mutex_lock(&root->fs_info->trans_mutex);
+                       put_transaction(prev_trans);
+               }
+       }
+
+       do {
+               int snap_pending = 0;
+               joined = cur_trans->num_joined;
+               if (!list_empty(&trans->transaction->pending_snapshots))
+                       snap_pending = 1;
+
+               WARN_ON(cur_trans != trans->transaction);
+               prepare_to_wait(&cur_trans->writer_wait, &wait,
+                               TASK_UNINTERRUPTIBLE);
+
+               if (cur_trans->num_writers > 1)
+                       timeout = MAX_SCHEDULE_TIMEOUT;
+               else
+                       timeout = 1;
+
+               mutex_unlock(&root->fs_info->trans_mutex);
+
+               if (snap_pending) {
+                       ret = btrfs_wait_ordered_extents(root, 1);
+                       BUG_ON(ret);
+               }
+
+               schedule_timeout(timeout);
+
+               mutex_lock(&root->fs_info->trans_mutex);
+               finish_wait(&cur_trans->writer_wait, &wait);
+       } while (cur_trans->num_writers > 1 ||
+                (cur_trans->num_joined != joined));
+
+       ret = create_pending_snapshots(trans, root->fs_info);
+       BUG_ON(ret);
+
+       WARN_ON(cur_trans != trans->transaction);
+
+       /* btrfs_commit_tree_roots is responsible for getting the
+        * various roots consistent with each other.  Every pointer
+        * in the tree of tree roots has to point to the most up to date
+        * root for every subvolume and other tree.  So, we have to keep
+        * the tree logging code from jumping in and changing any
+        * of the trees.
+        *
+        * At this point in the commit, there can't be any tree-log
+        * writers, but a little lower down we drop the trans mutex
+        * and let new people in.  By holding the tree_log_mutex
+        * from now until after the super is written, we avoid races
+        * with the tree-log code.
+        */
+       mutex_lock(&root->fs_info->tree_log_mutex);
+       /*
+        * keep tree reloc code from adding new reloc trees
+        */
+       mutex_lock(&root->fs_info->tree_reloc_mutex);
+
+       ret = add_dirty_roots(trans, &root->fs_info->fs_roots_radix,
+                             &dirty_fs_roots);
+       BUG_ON(ret);
+
+       /* add_dirty_roots gets rid of all the tree log roots, it is now
+        * safe to free the root of tree log roots
+        */
+       btrfs_free_log_root_tree(trans, root->fs_info);
+
+       ret = btrfs_commit_tree_roots(trans, root);
+       BUG_ON(ret);
+
+       cur_trans = root->fs_info->running_transaction;
+       spin_lock(&root->fs_info->new_trans_lock);
+       root->fs_info->running_transaction = NULL;
+       spin_unlock(&root->fs_info->new_trans_lock);
+       btrfs_set_super_generation(&root->fs_info->super_copy,
+                                  cur_trans->transid);
+       btrfs_set_super_root(&root->fs_info->super_copy,
+                            root->fs_info->tree_root->node->start);
+       btrfs_set_super_root_level(&root->fs_info->super_copy,
+                          btrfs_header_level(root->fs_info->tree_root->node));
+
+       btrfs_set_super_chunk_root(&root->fs_info->super_copy,
+                                  chunk_root->node->start);
+       btrfs_set_super_chunk_root_level(&root->fs_info->super_copy,
+                                        btrfs_header_level(chunk_root->node));
+       btrfs_set_super_chunk_root_generation(&root->fs_info->super_copy,
+                               btrfs_header_generation(chunk_root->node));
+
+       if (!root->fs_info->log_root_recovering) {
+               btrfs_set_super_log_root(&root->fs_info->super_copy, 0);
+               btrfs_set_super_log_root_level(&root->fs_info->super_copy, 0);
+       }
+
+       memcpy(&root->fs_info->super_for_commit, &root->fs_info->super_copy,
+              sizeof(root->fs_info->super_copy));
+
+       btrfs_copy_pinned(root, pinned_copy);
+
+       trans->transaction->blocked = 0;
+       wake_up(&root->fs_info->transaction_throttle);
+       wake_up(&root->fs_info->transaction_wait);
+
+       mutex_unlock(&root->fs_info->trans_mutex);
+       ret = btrfs_write_and_wait_transaction(trans, root);
+       BUG_ON(ret);
+       write_ctree_super(trans, root, 0);
+
+       /*
+        * the super is written, we can safely allow the tree-loggers
+        * to go about their business
+        */
+       mutex_unlock(&root->fs_info->tree_log_mutex);
+
+       btrfs_finish_extent_commit(trans, root, pinned_copy);
+       kfree(pinned_copy);
+
+       btrfs_drop_dead_reloc_roots(root);
+       mutex_unlock(&root->fs_info->tree_reloc_mutex);
+
+       /* do the directory inserts of any pending snapshot creations */
+       finish_pending_snapshots(trans, root->fs_info);
+
+       mutex_lock(&root->fs_info->trans_mutex);
+
+       cur_trans->commit_done = 1;
+       root->fs_info->last_trans_committed = cur_trans->transid;
+       wake_up(&cur_trans->commit_wait);
+
+       put_transaction(cur_trans);
+       put_transaction(cur_trans);
+
+       list_splice_init(&dirty_fs_roots, &root->fs_info->dead_roots);
+       if (root->fs_info->closing)
+               list_splice_init(&root->fs_info->dead_roots, &dirty_fs_roots);
+
+       mutex_unlock(&root->fs_info->trans_mutex);
+
+       kmem_cache_free(btrfs_trans_handle_cachep, trans);
+
+       if (root->fs_info->closing)
+               drop_dirty_roots(root->fs_info->tree_root, &dirty_fs_roots);
+       return ret;
+}
+
+/*
+ * interface function to delete all the snapshots we have scheduled for deletion
+ */
+int btrfs_clean_old_snapshots(struct btrfs_root *root)
+{
+       struct list_head dirty_roots;
+       INIT_LIST_HEAD(&dirty_roots);
+again:
+       mutex_lock(&root->fs_info->trans_mutex);
+       list_splice_init(&root->fs_info->dead_roots, &dirty_roots);
+       mutex_unlock(&root->fs_info->trans_mutex);
+
+       if (!list_empty(&dirty_roots)) {
+               drop_dirty_roots(root, &dirty_roots);
+               goto again;
+       }
+       return 0;
+}
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
new file mode 100644
index 0000000..ea29211
--- /dev/null
+++ b/fs/btrfs/transaction.h
@@ -0,0 +1,106 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __BTRFS_TRANSACTION__
+#define __BTRFS_TRANSACTION__
+#include "btrfs_inode.h"
+
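+/*
+ * one of these is allocated per running transaction.  num_writers counts
+ * the tasks that currently hold an open trans handle, use_count is a
+ * reference count dropped through put_transaction(), and blocked forces
+ * new writers to wait until the commit unblocks the transaction
+ */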
+struct btrfs_transaction {
+       u64 transid;
+       unsigned long num_writers;
+       unsigned long num_joined;
+       int in_commit;
+       int use_count;
+       int commit_done;
+       int blocked;
+       struct list_head list;
+       struct extent_io_tree dirty_pages;
+       unsigned long start_time;
+       wait_queue_head_t writer_wait;
+       wait_queue_head_t commit_wait;
+       struct list_head pending_snapshots;
+};
+
+struct btrfs_trans_handle {
+       u64 transid;
+       unsigned long blocks_reserved;
+       unsigned long blocks_used;
+       struct btrfs_transaction *transaction;
+       u64 block_group;
+       u64 alloc_exclude_start;
+       u64 alloc_exclude_nr;
+};
+
+struct btrfs_pending_snapshot {
+       struct dentry *dentry;
+       struct btrfs_root *root;
+       char *name;
+       struct btrfs_key root_key;
+       struct list_head list;
+};
+
+struct btrfs_dirty_root {
+       struct list_head list;
+       struct btrfs_root *root;
+       struct btrfs_root *latest_root;
+};
+
+static inline void btrfs_set_trans_block_group(struct btrfs_trans_handle *trans,
+                                              struct inode *inode)
+{
+       trans->block_group = BTRFS_I(inode)->block_group;
+}
+
+static inline void btrfs_update_inode_block_group(
+                                         struct btrfs_trans_handle *trans,
+                                         struct inode *inode)
+{
+       BTRFS_I(inode)->block_group = trans->block_group;
+}
+
+static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
+                                             struct inode *inode)
+{
+       BTRFS_I(inode)->last_trans = trans->transaction->transid;
+}
+
+int btrfs_end_transaction(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root);
+struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root,
+                                                  int num_blocks);
+struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root,
+                                                  int num_blocks);
+struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *r,
+                                                  int num_blocks);
+int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
+                                    struct btrfs_root *root);
+int btrfs_commit_tree_roots(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root);
+
+int btrfs_add_dead_root(struct btrfs_root *root, struct btrfs_root *latest);
+int btrfs_defrag_root(struct btrfs_root *root, int cacheonly);
+int btrfs_clean_old_snapshots(struct btrfs_root *root);
+int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root);
+int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root);
+void btrfs_throttle(struct btrfs_root *root);
+int btrfs_record_root_in_trans(struct btrfs_root *root);
+int btrfs_write_and_wait_marked_extents(struct btrfs_root *root,
+                                       struct extent_io_tree *dirty_pages);
+#endif
diff --git a/fs/btrfs/tree-defrag.c b/fs/btrfs/tree-defrag.c
new file mode 100644
index 0000000..3e8358c
--- /dev/null
+++ b/fs/btrfs/tree-defrag.c
@@ -0,0 +1,147 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include "ctree.h"
+#include "disk-io.h"
+#include "print-tree.h"
+#include "transaction.h"
+#include "locking.h"
+
+/* defrag all the leaves in a given btree.  If cache_only == 1, don't read
+ * things from disk, otherwise read all the leaves and try to get key order to
+ * better reflect disk order
+ */
+
+int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
+                       struct btrfs_root *root, int cache_only)
+{
+       struct btrfs_path *path = NULL;
+       struct btrfs_key key;
+       int ret = 0;
+       int wret;
+       int level;
+       int orig_level;
+       int is_extent = 0;
+       int next_key_ret = 0;
+       u64 last_ret = 0;
+       u64 min_trans = 0;
+
+       if (cache_only)
+               goto out;
+
+       if (root->fs_info->extent_root == root) {
+               /*
+                * there's recursion here right now in the tree locking,
+                * we can't defrag the extent root without deadlock
+                */
+               goto out;
+       }
+
+       if (root->ref_cows == 0 && !is_extent)
+               goto out;
+
+       if (btrfs_test_opt(root, SSD))
+               goto out;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       level = btrfs_header_level(root->node);
+       orig_level = level;
+
+       if (level == 0)
+               goto out;
+
+       if (root->defrag_progress.objectid == 0) {
+               struct extent_buffer *root_node;
+               u32 nritems;
+
+               root_node = btrfs_lock_root_node(root);
+               nritems = btrfs_header_nritems(root_node);
+               root->defrag_max.objectid = 0;
+               /* from above we know this is not a leaf */
+               btrfs_node_key_to_cpu(root_node, &root->defrag_max,
+                                     nritems - 1);
+               btrfs_tree_unlock(root_node);
+               free_extent_buffer(root_node);
+               memset(&key, 0, sizeof(key));
+       } else {
+               memcpy(&key, &root->defrag_progress, sizeof(key));
+       }
+
+       path->keep_locks = 1;
+       if (cache_only)
+               min_trans = root->defrag_trans_start;
+
+       ret = btrfs_search_forward(root, &key, NULL, path,
+                                  cache_only, min_trans);
+       if (ret < 0)
+               goto out;
+       if (ret > 0) {
+               ret = 0;
+               goto out;
+       }
+       btrfs_release_path(root, path);
+       wret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+
+       if (wret < 0) {
+               ret = wret;
+               goto out;
+       }
+       if (!path->nodes[1]) {
+               ret = 0;
+               goto out;
+       }
+       path->slots[1] = btrfs_header_nritems(path->nodes[1]);
+       next_key_ret = btrfs_find_next_key(root, path, &key, 1, cache_only,
+                                          min_trans);
+       ret = btrfs_realloc_node(trans, root,
+                                path->nodes[1], 0,
+                                cache_only, &last_ret,
+                                &root->defrag_progress);
+       WARN_ON(ret && ret != -EAGAIN);
+       if (next_key_ret == 0) {
+               memcpy(&root->defrag_progress, &key, sizeof(key));
+               ret = -EAGAIN;
+       }
+
+       btrfs_release_path(root, path);
+       if (is_extent)
+               btrfs_extent_post_op(trans, root);
+out:
+       if (path)
+               btrfs_free_path(path);
+       if (ret == -EAGAIN) {
+               if (root->defrag_max.objectid > root->defrag_progress.objectid)
+                       goto done;
+               if (root->defrag_max.type > root->defrag_progress.type)
+                       goto done;
+               if (root->defrag_max.offset > root->defrag_progress.offset)
+                       goto done;
+               ret = 0;
+       }
+done:
+       if (ret != -EAGAIN) {
+               memset(&root->defrag_progress, 0,
+                      sizeof(root->defrag_progress));
+               root->defrag_trans_start = trans->transid;
+       }
+       return ret;
+}
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
new file mode 100644
index 0000000..d81cda2
--- /dev/null
+++ b/fs/btrfs/tree-log.c
@@ -0,0 +1,2898 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#include <linux/sched.h>
+#include "ctree.h"
+#include "transaction.h"
+#include "disk-io.h"
+#include "locking.h"
+#include "print-tree.h"
+#include "compat.h"
+#include "tree-log.h"
+
+/* magic values for the inode_only field in btrfs_log_inode:
+ *
+ * LOG_INODE_ALL means to log everything
+ * LOG_INODE_EXISTS means to log just enough to recreate the inode
+ * during log replay
+ */
+#define LOG_INODE_ALL 0
+#define LOG_INODE_EXISTS 1
+
+/*
+ * stages for the tree walking.  The first
+ * stage (0) is to only pin down the blocks we find,
+ * the second stage (1) is to make sure that all the inodes
+ * we find in the log are created in the subvolume.
+ *
+ * The last stage is to deal with directories and links and extents
+ * and all the other fun semantics
+ */
+#define LOG_WALK_PIN_ONLY 0
+#define LOG_WALK_REPLAY_INODES 1
+#define LOG_WALK_REPLAY_ALL 2
+
+static int __btrfs_log_inode(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root, struct inode *inode,
+                            int inode_only);
+static int link_to_fixup_dir(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root,
+                            struct btrfs_path *path, u64 objectid);
+
+/*
+ * tree logging is a special write ahead log used to make sure that
+ * fsyncs and O_SYNCs can happen without doing full tree commits.
+ *
+ * Full tree commits are expensive because they require commonly
+ * modified blocks to be recowed, creating many dirty pages in the
+ * extent tree and a 4x-6x higher write load than ext3.
+ *
+ * Instead of doing a tree commit on every fsync, we use the
+ * key ranges and transaction ids to find items for a given file or directory
+ * that have changed in this transaction.  Those items are copied into
+ * a special tree (one per subvolume root), that tree is written to disk
+ * and then the fsync is considered complete.
+ *
+ * After a crash, items are copied out of the log-tree back into the
+ * subvolume tree.  Any file data extents found are recorded in the extent
+ * allocation tree, and the log-tree freed.
+ *
+ * The log tree is read three times: once to pin down all the extents it
+ * is using in ram, once to create all the inodes logged in the tree and
+ * once to do all the other items.
+ */
+
+/*
+ * btrfs_add_log_tree adds a new per-subvolume log tree into the
+ * tree of log tree roots.  This must be called with a tree log transaction
+ * running (see start_log_trans).
+ */
+static int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *root)
+{
+       struct btrfs_key key;
+       struct btrfs_root_item root_item;
+       struct btrfs_inode_item *inode_item;
+       struct extent_buffer *leaf;
+       struct btrfs_root *new_root = root;
+       int ret;
+       u64 objectid = root->root_key.objectid;
+
+       leaf = btrfs_alloc_free_block(trans, root, root->leafsize, 0,
+                                     BTRFS_TREE_LOG_OBJECTID,
+                                     trans->transid, 0, 0, 0);
+       if (IS_ERR(leaf)) {
+               ret = PTR_ERR(leaf);
+               return ret;
+       }
+
+       btrfs_set_header_nritems(leaf, 0);
+       btrfs_set_header_level(leaf, 0);
+       btrfs_set_header_bytenr(leaf, leaf->start);
+       btrfs_set_header_generation(leaf, trans->transid);
+       btrfs_set_header_owner(leaf, BTRFS_TREE_LOG_OBJECTID);
+
+       write_extent_buffer(leaf, root->fs_info->fsid,
+                           (unsigned long)btrfs_header_fsid(leaf),
+                           BTRFS_FSID_SIZE);
+       btrfs_mark_buffer_dirty(leaf);
+
+       inode_item = &root_item.inode;
+       memset(inode_item, 0, sizeof(*inode_item));
+       inode_item->generation = cpu_to_le64(1);
+       inode_item->size = cpu_to_le64(3);
+       inode_item->nlink = cpu_to_le32(1);
+       inode_item->nbytes = cpu_to_le64(root->leafsize);
+       inode_item->mode = cpu_to_le32(S_IFDIR | 0755);
+
+       btrfs_set_root_bytenr(&root_item, leaf->start);
+       btrfs_set_root_generation(&root_item, trans->transid);
+       btrfs_set_root_level(&root_item, 0);
+       btrfs_set_root_refs(&root_item, 0);
+       btrfs_set_root_used(&root_item, 0);
+
+       memset(&root_item.drop_progress, 0, sizeof(root_item.drop_progress));
+       root_item.drop_level = 0;
+
+       btrfs_tree_unlock(leaf);
+       free_extent_buffer(leaf);
+       leaf = NULL;
+
+       btrfs_set_root_dirid(&root_item, 0);
+
+       key.objectid = BTRFS_TREE_LOG_OBJECTID;
+       key.offset = objectid;
+       btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
+       ret = btrfs_insert_root(trans, root->fs_info->log_root_tree, &key,
+                               &root_item);
+       if (ret)
+               goto fail;
+
+       new_root = btrfs_read_fs_root_no_radix(root->fs_info->log_root_tree,
+                                              &key);
+       BUG_ON(!new_root);
+
+       WARN_ON(root->log_root);
+       root->log_root = new_root;
+
+       /*
+        * log trees do not get reference counted because they go away
+        * before a real commit is actually done.  They do store pointers
+        * to file data extents, and those reference counts still get
+        * updated (along with back refs to the log tree).
+        */
+       new_root->ref_cows = 0;
+       new_root->last_trans = trans->transid;
+
+       /*
+        * we need to make sure the root block for this new tree
+        * is marked as dirty in the dirty_log_pages tree.  This
+        * is how it gets flushed down to disk at tree log commit time.
+        *
+        * the tree logging mutex keeps others from coming in and changing
+        * the new_root->node, so we can safely access it here
+        */
+       set_extent_dirty(&new_root->dirty_log_pages, new_root->node->start,
+                        new_root->node->start + new_root->node->len - 1,
+                        GFP_NOFS);
+
+fail:
+       return ret;
+}
+
+/*
+ * start a sub transaction and set up the log tree.
+ * this increments the log tree writer count to make the people
+ * syncing the tree wait for us to finish
+ */
+static int start_log_trans(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root)
+{
+       int ret;
+       mutex_lock(&root->fs_info->tree_log_mutex);
+       if (!root->fs_info->log_root_tree) {
+               ret = btrfs_init_log_root_tree(trans, root->fs_info);
+               BUG_ON(ret);
+       }
+       if (!root->log_root) {
+               ret = btrfs_add_log_tree(trans, root);
+               BUG_ON(ret);
+       }
+       atomic_inc(&root->fs_info->tree_log_writers);
+       root->fs_info->tree_log_batch++;
+       mutex_unlock(&root->fs_info->tree_log_mutex);
+       return 0;
+}
+
+/*
+ * returns 0 if there was a log transaction running and we were able
+ * to join, or returns -ENOENT if there was no log transaction
+ * in progress
+ */
+static int join_running_log_trans(struct btrfs_root *root)
+{
+       int ret = -ENOENT;
+
+       smp_mb();
+       if (!root->log_root)
+               return -ENOENT;
+
+       mutex_lock(&root->fs_info->tree_log_mutex);
+       if (root->log_root) {
+               ret = 0;
+               atomic_inc(&root->fs_info->tree_log_writers);
+               root->fs_info->tree_log_batch++;
+       }
+       mutex_unlock(&root->fs_info->tree_log_mutex);
+       return ret;
+}
+
+/*
+ * indicate we're done making changes to the log tree
+ * and wake up anyone waiting to do a sync
+ */
+static int end_log_trans(struct btrfs_root *root)
+{
+       atomic_dec(&root->fs_info->tree_log_writers);
+       smp_mb();
+       if (waitqueue_active(&root->fs_info->tree_log_wait))
+               wake_up(&root->fs_info->tree_log_wait);
+       return 0;
+}
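+
+/*
+ * Example (illustrative): log writers bracket their updates as
+ *
+ *     start_log_trans(trans, root);
+ *     ... copy items into root->log_root ...
+ *     end_log_trans(root);
+ *
+ * so the sync code can wait on tree_log_wait until tree_log_writers
+ * drops to zero
+ */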
+
+
+/*
+ * the walk control struct is used to pass state down the chain when
+ * processing the log tree.  The stage field tells us which part
+ * of the log tree processing we are currently doing.  The others
+ * are state fields used for that specific part
+ */
+struct walk_control {
+       /* should we free the extent on disk when done?  This is used
+        * at transaction commit time while freeing a log tree
+        */
+       int free;
+
+       /* should we write out the extent buffer?  This is used
+        * while flushing the log tree to disk during a sync
+        */
+       int write;
+
+       /* should we wait for the extent buffer io to finish?  Also used
+        * while flushing the log tree to disk for a sync
+        */
+       int wait;
+
+       /* pin only walk, we record which extents on disk belong to the
+        * log trees
+        */
+       int pin;
+
+       /* what stage of the replay code we're currently in */
+       int stage;
+
+       /* the root we are currently replaying */
+       struct btrfs_root *replay_dest;
+
+       /* the trans handle for the current replay */
+       struct btrfs_trans_handle *trans;
+
+       /* the function that gets used to process blocks we find in the
+        * tree.  Note the extent_buffer might not be up to date when it is
+        * passed in, and it must be checked or read if you need the data
+        * inside it
+        */
+       int (*process_func)(struct btrfs_root *log, struct extent_buffer *eb,
+                           struct walk_control *wc, u64 gen);
+};
+
+/*
+ * process_func used to pin down extents, write them or wait on them
+ */
+static int process_one_buffer(struct btrfs_root *log,
+                             struct extent_buffer *eb,
+                             struct walk_control *wc, u64 gen)
+{
+       if (wc->pin) {
+               mutex_lock(&log->fs_info->pinned_mutex);
+               btrfs_update_pinned_extents(log->fs_info->extent_root,
+                                           eb->start, eb->len, 1);
+               mutex_unlock(&log->fs_info->pinned_mutex);
+       }
+
+       if (btrfs_buffer_uptodate(eb, gen)) {
+               if (wc->write)
+                       btrfs_write_tree_block(eb);
+               if (wc->wait)
+                       btrfs_wait_tree_block_writeback(eb);
+       }
+       return 0;
+}
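+
+/*
+ * Example (illustrative): a pin-only walk over a log tree would set up
+ * the walk control as
+ *
+ *     struct walk_control wc = {
+ *             .pin = 1,
+ *             .stage = LOG_WALK_PIN_ONLY,
+ *             .process_func = process_one_buffer,
+ *     };
+ */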
+
+/*
+ * Item overwrite used by replay and tree logging.  eb, slot and key all refer
+ * to the src data we are copying out.
+ *
+ * root is the tree we are copying into, and path is a scratch
+ * path for use in this function (it should be released on entry and
+ * will be released on exit).
+ *
+ * If the key is already in the destination tree the existing item is
+ * overwritten.  If the existing item isn't big enough, it is extended.
+ * If it is too large, it is truncated.
+ *
+ * If the key isn't in the destination yet, a new item is inserted.
+ */
+static noinline int overwrite_item(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root,
+                                  struct btrfs_path *path,
+                                  struct extent_buffer *eb, int slot,
+                                  struct btrfs_key *key)
+{
+       int ret;
+       u32 item_size;
+       u64 saved_i_size = 0;
+       int save_old_i_size = 0;
+       unsigned long src_ptr;
+       unsigned long dst_ptr;
+       int overwrite_root = 0;
+
+       if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+               overwrite_root = 1;
+
+       item_size = btrfs_item_size_nr(eb, slot);
+       src_ptr = btrfs_item_ptr_offset(eb, slot);
+
+       /* look for the key in the destination tree */
+       ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
+       if (ret == 0) {
+               char *src_copy;
+               char *dst_copy;
+               u32 dst_size = btrfs_item_size_nr(path->nodes[0],
+                                                 path->slots[0]);
+               if (dst_size != item_size)
+                       goto insert;
+
+               if (item_size == 0) {
+                       btrfs_release_path(root, path);
+                       return 0;
+               }
+               dst_copy = kmalloc(item_size, GFP_NOFS);
+               src_copy = kmalloc(item_size, GFP_NOFS);
+               if (!dst_copy || !src_copy) {
+                       kfree(dst_copy);
+                       kfree(src_copy);
+                       btrfs_release_path(root, path);
+                       return -ENOMEM;
+               }
+
+               read_extent_buffer(eb, src_copy, src_ptr, item_size);
+
+               dst_ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+               read_extent_buffer(path->nodes[0], dst_copy, dst_ptr,
+                                  item_size);
+               ret = memcmp(dst_copy, src_copy, item_size);
+
+               kfree(dst_copy);
+               kfree(src_copy);
+               /*
+                * the contents match; just return.  This saves us
+                * from COWing blocks in the destination tree and doing
+                * extra writes that may not have been done by a
+                * previous sync
+                */
+               if (ret == 0) {
+                       btrfs_release_path(root, path);
+                       return 0;
+               }
+
+       }
+insert:
+       btrfs_release_path(root, path);
+       /* try to insert the key into the destination tree */
+       ret = btrfs_insert_empty_item(trans, root, path,
+                                     key, item_size);
+
+       /* make sure any existing item is the correct size */
+       if (ret == -EEXIST) {
+               u32 found_size;
+               found_size = btrfs_item_size_nr(path->nodes[0],
+                                               path->slots[0]);
+               if (found_size > item_size) {
+                       btrfs_truncate_item(trans, root, path, item_size, 1);
+               } else if (found_size < item_size) {
+                       ret = btrfs_extend_item(trans, root, path,
+                                               item_size - found_size);
+                       BUG_ON(ret);
+               }
+       } else if (ret) {
+               BUG();
+       }
+       dst_ptr = btrfs_item_ptr_offset(path->nodes[0],
+                                       path->slots[0]);
+
+       /* don't overwrite an existing inode if the generation number
+        * was logged as zero.  This is done when the tree logging code
+        * is just logging an inode to make sure it exists after recovery.
+        *
+        * Also, don't overwrite i_size on directories during replay.
+        * log replay inserts and removes directory items based on the
+        * state of the tree found in the subvolume, and i_size is modified
+        * as it goes
+        */
+       if (key->type == BTRFS_INODE_ITEM_KEY && ret == -EEXIST) {
+               struct btrfs_inode_item *src_item;
+               struct btrfs_inode_item *dst_item;
+
+               src_item = (struct btrfs_inode_item *)src_ptr;
+               dst_item = (struct btrfs_inode_item *)dst_ptr;
+
+               if (btrfs_inode_generation(eb, src_item) == 0)
+                       goto no_copy;
+
+               if (overwrite_root &&
+                   S_ISDIR(btrfs_inode_mode(eb, src_item)) &&
+                   S_ISDIR(btrfs_inode_mode(path->nodes[0], dst_item))) {
+                       save_old_i_size = 1;
+                       saved_i_size = btrfs_inode_size(path->nodes[0],
+                                                       dst_item);
+               }
+       }
+
+       copy_extent_buffer(path->nodes[0], eb, dst_ptr,
+                          src_ptr, item_size);
+
+       if (save_old_i_size) {
+               struct btrfs_inode_item *dst_item;
+               dst_item = (struct btrfs_inode_item *)dst_ptr;
+               btrfs_set_inode_size(path->nodes[0], dst_item, saved_i_size);
+       }
+
+       /* make sure the generation is filled in */
+       if (key->type == BTRFS_INODE_ITEM_KEY) {
+               struct btrfs_inode_item *dst_item;
+               dst_item = (struct btrfs_inode_item *)dst_ptr;
+               if (btrfs_inode_generation(path->nodes[0], dst_item) == 0) {
+                       btrfs_set_inode_generation(path->nodes[0], dst_item,
+                                                  trans->transid);
+               }
+       }
+no_copy:
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+       btrfs_release_path(root, path);
+       return 0;
+}
+
+/*
+ * simple helper to read an inode off the disk from a given root
+ * This can only be called for subvolume roots and not for the log
+ */
+static noinline struct inode *read_one_inode(struct btrfs_root *root,
+                                            u64 objectid)
+{
+       struct inode *inode;
+       inode = btrfs_iget_locked(root->fs_info->sb, objectid, root);
+       if (!inode)
+               return NULL;
+       if (inode->i_state & I_NEW) {
+               BTRFS_I(inode)->root = root;
+               BTRFS_I(inode)->location.objectid = objectid;
+               BTRFS_I(inode)->location.type = BTRFS_INODE_ITEM_KEY;
+               BTRFS_I(inode)->location.offset = 0;
+               btrfs_read_locked_inode(inode);
+               unlock_new_inode(inode);
+
+       }
+       if (is_bad_inode(inode)) {
+               iput(inode);
+               inode = NULL;
+       }
+       return inode;
+}
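+
+/*
+ * every caller of read_one_inode() owns a reference on success and
+ * must drop it with iput(); a NULL return means the inode could not
+ * be read or was bad.  For example:
+ *
+ *        inode = read_one_inode(root, objectid);
+ *        if (!inode)
+ *                return -ENOENT;
+ *        ...
+ *        iput(inode);
+ */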
+
+/* replays a single extent in 'eb' at 'slot' with 'key' into the
+ * subvolume 'root'.  path is released on entry and should be released
+ * on exit.
+ *
+ * extents in the log tree have not been allocated out of the extent
+ * tree yet.  So, this completes the allocation, taking a reference
+ * as required if the extent already exists or creating a new extent
+ * if it isn't in the extent allocation tree yet.
+ *
+ * The extent is inserted into the file, dropping any existing extents
+ * from the file that overlap the new one.
+ */
+static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
+                                     struct btrfs_root *root,
+                                     struct btrfs_path *path,
+                                     struct extent_buffer *eb, int slot,
+                                     struct btrfs_key *key)
+{
+       int found_type;
+       u64 mask = root->sectorsize - 1;
+       u64 extent_end;
+       u64 alloc_hint;
+       u64 start = key->offset;
+       u64 saved_nbytes;
+       struct btrfs_file_extent_item *item;
+       struct inode *inode = NULL;
+       unsigned long size;
+       int ret = 0;
+
+       item = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
+       found_type = btrfs_file_extent_type(eb, item);
+
+       if (found_type == BTRFS_FILE_EXTENT_REG ||
+           found_type == BTRFS_FILE_EXTENT_PREALLOC)
+               extent_end = start + btrfs_file_extent_num_bytes(eb, item);
+       else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+               size = btrfs_file_extent_inline_len(eb, item);
+               extent_end = (start + size + mask) & ~mask;
+       } else {
+               ret = 0;
+               goto out;
+       }
+
+       inode = read_one_inode(root, key->objectid);
+       if (!inode) {
+               ret = -EIO;
+               goto out;
+       }
+
+       /*
+        * first check to see if we already have this extent in the
+        * file.  This must be done before the btrfs_drop_extents run
+        * so we don't try to drop this extent.
+        */
+       ret = btrfs_lookup_file_extent(trans, root, path, inode->i_ino,
+                                      start, 0);
+
+       if (ret == 0 &&
+           (found_type == BTRFS_FILE_EXTENT_REG ||
+            found_type == BTRFS_FILE_EXTENT_PREALLOC)) {
+               struct btrfs_file_extent_item cmp1;
+               struct btrfs_file_extent_item cmp2;
+               struct btrfs_file_extent_item *existing;
+               struct extent_buffer *leaf;
+
+               leaf = path->nodes[0];
+               existing = btrfs_item_ptr(leaf, path->slots[0],
+                                         struct btrfs_file_extent_item);
+
+               read_extent_buffer(eb, &cmp1, (unsigned long)item,
+                                  sizeof(cmp1));
+               read_extent_buffer(leaf, &cmp2, (unsigned long)existing,
+                                  sizeof(cmp2));
+
+               /*
+                * we already have a pointer to this exact extent,
+                * we don't have to do anything
+                */
+               if (memcmp(&cmp1, &cmp2, sizeof(cmp1)) == 0) {
+                       btrfs_release_path(root, path);
+                       goto out;
+               }
+       }
+       btrfs_release_path(root, path);
+
+       saved_nbytes = inode_get_bytes(inode);
+       /* drop any overlapping extents */
+       ret = btrfs_drop_extents(trans, root, inode,
+                        start, extent_end, start, &alloc_hint);
+       BUG_ON(ret);
+
+       if (found_type == BTRFS_FILE_EXTENT_REG ||
+           found_type == BTRFS_FILE_EXTENT_PREALLOC) {
+               unsigned long dest_offset;
+               struct btrfs_key ins;
+
+               ret = btrfs_insert_empty_item(trans, root, path, key,
+                                             sizeof(*item));
+               BUG_ON(ret);
+               dest_offset = btrfs_item_ptr_offset(path->nodes[0],
+                                                   path->slots[0]);
+               copy_extent_buffer(path->nodes[0], eb, dest_offset,
+                               (unsigned long)item,  sizeof(*item));
+
+               ins.objectid = btrfs_file_extent_disk_bytenr(eb, item);
+               ins.offset = btrfs_file_extent_disk_num_bytes(eb, item);
+               ins.type = BTRFS_EXTENT_ITEM_KEY;
+
+               if (ins.objectid > 0) {
+                       u64 csum_start;
+                       u64 csum_end;
+                       LIST_HEAD(ordered_sums);
+                       /*
+                        * is this extent already allocated in the extent
+                        * allocation tree?  If so, just add a reference
+                        */
+                       ret = btrfs_lookup_extent(root, ins.objectid,
+                                               ins.offset);
+                       if (ret == 0) {
+                               ret = btrfs_inc_extent_ref(trans, root,
+                                               ins.objectid, ins.offset,
+                                               path->nodes[0]->start,
+                                               root->root_key.objectid,
+                                               trans->transid, key->objectid);
+                       } else {
+                               /*
+                                * insert the extent pointer in the extent
+                                * allocation tree
+                                */
+                               ret = btrfs_alloc_logged_extent(trans, root,
+                                               path->nodes[0]->start,
+                                               root->root_key.objectid,
+                                               trans->transid, key->objectid,
+                                               &ins);
+                               BUG_ON(ret);
+                       }
+                       btrfs_release_path(root, path);
+
+                       if (btrfs_file_extent_compression(eb, item)) {
+                               csum_start = ins.objectid;
+                               csum_end = csum_start + ins.offset;
+                       } else {
+                               csum_start = ins.objectid +
+                                       btrfs_file_extent_offset(eb, item);
+                               csum_end = csum_start +
+                                       btrfs_file_extent_num_bytes(eb, item);
+                       }
+
+                       ret = btrfs_lookup_csums_range(root->log_root,
+                                               csum_start, csum_end - 1,
+                                               &ordered_sums);
+                       BUG_ON(ret);
+                       while (!list_empty(&ordered_sums)) {
+                               struct btrfs_ordered_sum *sums;
+                               sums = list_entry(ordered_sums.next,
+                                               struct btrfs_ordered_sum,
+                                               list);
+                               ret = btrfs_csum_file_blocks(trans,
+                                               root->fs_info->csum_root,
+                                               sums);
+                               BUG_ON(ret);
+                               list_del(&sums->list);
+                               kfree(sums);
+                       }
+               } else {
+                       btrfs_release_path(root, path);
+               }
+       } else if (found_type == BTRFS_FILE_EXTENT_INLINE) {
+               /* inline extents are easy, we just overwrite them */
+               ret = overwrite_item(trans, root, path, eb, slot, key);
+               BUG_ON(ret);
+       }
+
+       inode_set_bytes(inode, saved_nbytes);
+       btrfs_update_inode(trans, root, inode);
+out:
+       if (inode)
+               iput(inode);
+       return ret;
+}
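+
+/*
+ * a note on the checksum copying above: for a compressed extent the
+ * csums cover the whole on-disk extent, [disk_bytenr, disk_bytenr +
+ * disk_num_bytes), while for an uncompressed extent they cover only
+ * the bytes actually referenced, starting at disk_bytenr + offset.
+ * That is why csum_start and csum_end are computed differently in
+ * the two branches.
+ */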
+
+/*
+ * when cleaning up conflicts between the directory names in the
+ * subvolume, directory names in the log and directory names in the
+ * inode back references, we may have to unlink inodes from directories.
+ *
+ * This is a helper function to do the unlink of a specific directory
+ * item
+ */
+static noinline int drop_one_dir_item(struct btrfs_trans_handle *trans,
+                                     struct btrfs_root *root,
+                                     struct btrfs_path *path,
+                                     struct inode *dir,
+                                     struct btrfs_dir_item *di)
+{
+       struct inode *inode;
+       char *name;
+       int name_len;
+       struct extent_buffer *leaf;
+       struct btrfs_key location;
+       int ret;
+
+       leaf = path->nodes[0];
+
+       btrfs_dir_item_key_to_cpu(leaf, di, &location);
+       name_len = btrfs_dir_name_len(leaf, di);
+       name = kmalloc(name_len, GFP_NOFS);
+       if (!name)
+               return -ENOMEM;
+       read_extent_buffer(leaf, name, (unsigned long)(di + 1), name_len);
+       btrfs_release_path(root, path);
+
+       inode = read_one_inode(root, location.objectid);
+       BUG_ON(!inode);
+
+       ret = link_to_fixup_dir(trans, root, path, location.objectid);
+       BUG_ON(ret);
+       ret = btrfs_unlink_inode(trans, root, dir, inode, name, name_len);
+       BUG_ON(ret);
+       kfree(name);
+
+       iput(inode);
+       return ret;
+}
+
+/*
+ * helper function to see if a given name and sequence number found
+ * in an inode back reference are already in a directory and correctly
+ * point to this inode
+ */
+static noinline int inode_in_dir(struct btrfs_root *root,
+                                struct btrfs_path *path,
+                                u64 dirid, u64 objectid, u64 index,
+                                const char *name, int name_len)
+{
+       struct btrfs_dir_item *di;
+       struct btrfs_key location;
+       int match = 0;
+
+       di = btrfs_lookup_dir_index_item(NULL, root, path, dirid,
+                                        index, name, name_len, 0);
+       if (di && !IS_ERR(di)) {
+               btrfs_dir_item_key_to_cpu(path->nodes[0], di, &location);
+               if (location.objectid != objectid)
+                       goto out;
+       } else
+               goto out;
+       btrfs_release_path(root, path);
+
+       di = btrfs_lookup_dir_item(NULL, root, path, dirid, name, name_len, 0);
+       if (di && !IS_ERR(di)) {
+               btrfs_dir_item_key_to_cpu(path->nodes[0], di, &location);
+               if (location.objectid != objectid)
+                       goto out;
+       } else
+               goto out;
+       match = 1;
+out:
+       btrfs_release_path(root, path);
+       return match;
+}
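+
+/*
+ * note that a match requires both lookups above to agree: the
+ * DIR_INDEX item at the given sequence number and the name-hashed
+ * DIR_ITEM must each point at 'objectid'.  If either lookup misses
+ * or points elsewhere, 0 is returned and the name is treated as not
+ * present.
+ */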
+
+/*
+ * helper function to check a log tree for a named back reference in
+ * an inode.  This is used to decide if a back reference that is
+ * found in the subvolume conflicts with what we find in the log.
+ *
+ * inode backreferences may have multiple refs in a single item,
+ * during replay we process one reference at a time, and we don't
+ * want to delete valid links to a file from the subvolume if that
+ * link is also in the log.
+ */
+static noinline int backref_in_log(struct btrfs_root *log,
+                                  struct btrfs_key *key,
+                                  char *name, int namelen)
+{
+       struct btrfs_path *path;
+       struct btrfs_inode_ref *ref;
+       unsigned long ptr;
+       unsigned long ptr_end;
+       unsigned long name_ptr;
+       int found_name_len;
+       int item_size;
+       int ret;
+       int match = 0;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       ret = btrfs_search_slot(NULL, log, key, path, 0, 0);
+       if (ret != 0)
+               goto out;
+
+       item_size = btrfs_item_size_nr(path->nodes[0], path->slots[0]);
+       ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+       ptr_end = ptr + item_size;
+       while (ptr < ptr_end) {
+               ref = (struct btrfs_inode_ref *)ptr;
+               found_name_len = btrfs_inode_ref_name_len(path->nodes[0], ref);
+               if (found_name_len == namelen) {
+                       name_ptr = (unsigned long)(ref + 1);
+                       ret = memcmp_extent_buffer(path->nodes[0], name,
+                                                  name_ptr, namelen);
+                       if (ret == 0) {
+                               match = 1;
+                               goto out;
+                       }
+               }
+               ptr = (unsigned long)(ref + 1) + found_name_len;
+       }
+out:
+       btrfs_free_path(path);
+       return match;
+}
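+
+/*
+ * the walk above relies on the INODE_REF item layout: one item may
+ * hold several names back to back, for example
+ *
+ *        [btrfs_inode_ref][name0][btrfs_inode_ref][name1]...
+ *
+ * each name immediately follows its ref header, so the next ref
+ * starts at (ref + 1) + name_len.
+ */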
+
+/*
+ * replay one inode back reference item found in the log tree.
+ * eb, slot and key refer to the buffer and key found in the log tree.
+ * root is the destination we are replaying into, and path is for temp
+ * use by this function.  (it should be released on return).
+ */
+static noinline int add_inode_ref(struct btrfs_trans_handle *trans,
+                                 struct btrfs_root *root,
+                                 struct btrfs_root *log,
+                                 struct btrfs_path *path,
+                                 struct extent_buffer *eb, int slot,
+                                 struct btrfs_key *key)
+{
+       struct inode *dir;
+       int ret;
+       struct btrfs_key location;
+       struct btrfs_inode_ref *ref;
+       struct btrfs_dir_item *di;
+       struct inode *inode;
+       char *name;
+       int namelen;
+       unsigned long ref_ptr;
+       unsigned long ref_end;
+
+       location.objectid = key->objectid;
+       location.type = BTRFS_INODE_ITEM_KEY;
+       location.offset = 0;
+
+       /*
+        * it is possible that we didn't log all the parent directories
+        * for a given inode.  If we don't find the dir, just don't
+        * copy the back ref in.  The link count fixup code will take
+        * care of the rest
+        */
+       dir = read_one_inode(root, key->offset);
+       if (!dir)
+               return -ENOENT;
+
+       inode = read_one_inode(root, key->objectid);
+       BUG_ON(!inode);
+
+       ref_ptr = btrfs_item_ptr_offset(eb, slot);
+       ref_end = ref_ptr + btrfs_item_size_nr(eb, slot);
+
+again:
+       ref = (struct btrfs_inode_ref *)ref_ptr;
+
+       namelen = btrfs_inode_ref_name_len(eb, ref);
+       name = kmalloc(namelen, GFP_NOFS);
+       BUG_ON(!name);
+
+       read_extent_buffer(eb, name, (unsigned long)(ref + 1), namelen);
+
+       /* if we already have a perfect match, we're done */
+       if (inode_in_dir(root, path, dir->i_ino, inode->i_ino,
+                        btrfs_inode_ref_index(eb, ref),
+                        name, namelen)) {
+               goto out;
+       }
+
+       /*
+        * look for a conflicting back reference in the metadata.
+        * if we find one we have to unlink that name of the file
+        * before we add our new link.  Later on, we overwrite any
+        * existing back reference, and we don't want to create
+        * dangling pointers in the directory.
+        */
+conflict_again:
+       ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
+       if (ret == 0) {
+               char *victim_name;
+               int victim_name_len;
+               struct btrfs_inode_ref *victim_ref;
+               unsigned long ptr;
+               unsigned long ptr_end;
+               struct extent_buffer *leaf = path->nodes[0];
+
+               /* are we trying to overwrite a back ref for the root
+                * directory?  if so, just jump out, we're done
+                */
+               if (key->objectid == key->offset)
+                       goto out_nowrite;
+
+               /* check all the names in this back reference to see
+                * if they are in the log.  if so, we allow them to stay
+                * otherwise they must be unlinked as a conflict
+                */
+               ptr = btrfs_item_ptr_offset(leaf, path->slots[0]);
+               ptr_end = ptr + btrfs_item_size_nr(leaf, path->slots[0]);
+               while (ptr < ptr_end) {
+                       victim_ref = (struct btrfs_inode_ref *)ptr;
+                       victim_name_len = btrfs_inode_ref_name_len(leaf,
+                                                                  victim_ref);
+                       victim_name = kmalloc(victim_name_len, GFP_NOFS);
+                       BUG_ON(!victim_name);
+
+                       read_extent_buffer(leaf, victim_name,
+                                          (unsigned long)(victim_ref + 1),
+                                          victim_name_len);
+
+                       if (!backref_in_log(log, key, victim_name,
+                                           victim_name_len)) {
+                               btrfs_inc_nlink(inode);
+                               btrfs_release_path(root, path);
+                               ret = btrfs_unlink_inode(trans, root, dir,
+                                                        inode, victim_name,
+                                                        victim_name_len);
+                               kfree(victim_name);
+                               btrfs_release_path(root, path);
+                               goto conflict_again;
+                       }
+                       kfree(victim_name);
+                       ptr = (unsigned long)(victim_ref + 1) + victim_name_len;
+               }
+               BUG_ON(ret);
+       }
+       btrfs_release_path(root, path);
+
+       /* look for a conflicting sequence number */
+       di = btrfs_lookup_dir_index_item(trans, root, path, dir->i_ino,
+                                        btrfs_inode_ref_index(eb, ref),
+                                        name, namelen, 0);
+       if (di && !IS_ERR(di)) {
+               ret = drop_one_dir_item(trans, root, path, dir, di);
+               BUG_ON(ret);
+       }
+       btrfs_release_path(root, path);
+
+       /* look for a conflicting name */
+       di = btrfs_lookup_dir_item(trans, root, path, dir->i_ino,
+                                  name, namelen, 0);
+       if (di && !IS_ERR(di)) {
+               ret = drop_one_dir_item(trans, root, path, dir, di);
+               BUG_ON(ret);
+       }
+       btrfs_release_path(root, path);
+
+       /* insert our name */
+       ret = btrfs_add_link(trans, dir, inode, name, namelen, 0,
+                            btrfs_inode_ref_index(eb, ref));
+       BUG_ON(ret);
+
+       btrfs_update_inode(trans, root, inode);
+
+out:
+       ref_ptr = (unsigned long)(ref + 1) + namelen;
+       kfree(name);
+       if (ref_ptr < ref_end)
+               goto again;
+
+       /* finally write the back reference in the inode */
+       ret = overwrite_item(trans, root, path, eb, slot, key);
+       BUG_ON(ret);
+
+out_nowrite:
+       btrfs_release_path(root, path);
+       iput(dir);
+       iput(inode);
+       return 0;
+}
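+
+/*
+ * to summarize the conflict handling above: a perfect match is left
+ * alone.  Otherwise, conflicting back ref names that are not in the
+ * log are unlinked first, then a conflicting sequence number and a
+ * conflicting name are each dropped with drop_one_dir_item(), and
+ * only then is the new link inserted with btrfs_add_link().
+ */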
+
+/*
+ * There are a few corners where the link count of the file can't
+ * be properly maintained during replay.  So, instead of adding
+ * lots of complexity to the log code, we just scan the backrefs
+ * for any file that has been through replay.
+ *
+ * The scan will update the link count on the inode to reflect the
+ * number of back refs found.  If it goes down to zero, the iput
+ * will free the inode.
+ */
+static noinline int fixup_inode_link_count(struct btrfs_trans_handle *trans,
+                                          struct btrfs_root *root,
+                                          struct inode *inode)
+{
+       struct btrfs_path *path;
+       int ret;
+       struct btrfs_key key;
+       u64 nlink = 0;
+       unsigned long ptr;
+       unsigned long ptr_end;
+       int name_len;
+
+       key.objectid = inode->i_ino;
+       key.type = BTRFS_INODE_REF_KEY;
+       key.offset = (u64)-1;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       while (1) {
+               ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+               if (ret < 0)
+                       break;
+               if (ret > 0) {
+                       if (path->slots[0] == 0)
+                               break;
+                       path->slots[0]--;
+               }
+               btrfs_item_key_to_cpu(path->nodes[0], &key,
+                                     path->slots[0]);
+               if (key.objectid != inode->i_ino ||
+                   key.type != BTRFS_INODE_REF_KEY)
+                       break;
+               ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+               ptr_end = ptr + btrfs_item_size_nr(path->nodes[0],
+                                                  path->slots[0]);
+               while (ptr < ptr_end) {
+                       struct btrfs_inode_ref *ref;
+
+                       ref = (struct btrfs_inode_ref *)ptr;
+                       name_len = btrfs_inode_ref_name_len(path->nodes[0],
+                                                           ref);
+                       ptr = (unsigned long)(ref + 1) + name_len;
+                       nlink++;
+               }
+
+               if (key.offset == 0)
+                       break;
+               key.offset--;
+               btrfs_release_path(root, path);
+       }
+       btrfs_free_path(path);
+       if (nlink != inode->i_nlink) {
+               inode->i_nlink = nlink;
+               btrfs_update_inode(trans, root, inode);
+       }
+       BTRFS_I(inode)->index_cnt = (u64)-1;
+
+       return 0;
+}
+
+static noinline int fixup_inode_link_counts(struct btrfs_trans_handle *trans,
+                                           struct btrfs_root *root,
+                                           struct btrfs_path *path)
+{
+       int ret;
+       struct btrfs_key key;
+       struct inode *inode;
+
+       key.objectid = BTRFS_TREE_LOG_FIXUP_OBJECTID;
+       key.type = BTRFS_ORPHAN_ITEM_KEY;
+       key.offset = (u64)-1;
+       while (1) {
+               ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+               if (ret < 0)
+                       break;
+
+               if (ret == 1) {
+                       if (path->slots[0] == 0)
+                               break;
+                       path->slots[0]--;
+               }
+
+               btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+               if (key.objectid != BTRFS_TREE_LOG_FIXUP_OBJECTID ||
+                   key.type != BTRFS_ORPHAN_ITEM_KEY)
+                       break;
+
+               ret = btrfs_del_item(trans, root, path);
+               BUG_ON(ret);
+
+               btrfs_release_path(root, path);
+               inode = read_one_inode(root, key.offset);
+               BUG_ON(!inode);
+
+               ret = fixup_inode_link_count(trans, root, inode);
+               BUG_ON(ret);
+
+               iput(inode);
+
+               if (key.offset == 0)
+                       break;
+               key.offset--;
+       }
+       btrfs_release_path(root, path);
+       return 0;
+}
+
+/*
+ * record a given inode in the fixup dir so we can check its link
+ * count when replay is done.  The link count is incremented here
+ * so the inode won't go away until we check it
+ */
+static noinline int link_to_fixup_dir(struct btrfs_trans_handle *trans,
+                                     struct btrfs_root *root,
+                                     struct btrfs_path *path,
+                                     u64 objectid)
+{
+       struct btrfs_key key;
+       int ret = 0;
+       struct inode *inode;
+
+       inode = read_one_inode(root, objectid);
+       BUG_ON(!inode);
+
+       key.objectid = BTRFS_TREE_LOG_FIXUP_OBJECTID;
+       btrfs_set_key_type(&key, BTRFS_ORPHAN_ITEM_KEY);
+       key.offset = objectid;
+
+       ret = btrfs_insert_empty_item(trans, root, path, &key, 0);
+
+       btrfs_release_path(root, path);
+       if (ret == 0) {
+               btrfs_inc_nlink(inode);
+               btrfs_update_inode(trans, root, inode);
+       } else if (ret == -EEXIST) {
+               ret = 0;
+       } else {
+               BUG();
+       }
+       iput(inode);
+
+       return ret;
+}
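+
+/*
+ * the fixup record inserted above is keyed as
+ *
+ *        (BTRFS_TREE_LOG_FIXUP_OBJECTID, BTRFS_ORPHAN_ITEM_KEY, objectid)
+ *
+ * fixup_inode_link_counts() later finds these keys, deletes each
+ * record and recounts the back refs of the inode it names.
+ */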
+
+/*
+ * when replaying the log for a directory, we only insert names
+ * for inodes that actually exist.  This means an fsync on a directory
+ * does not implicitly fsync all the new files in it
+ */
+static noinline int insert_one_name(struct btrfs_trans_handle *trans,
+                                   struct btrfs_root *root,
+                                   struct btrfs_path *path,
+                                   u64 dirid, u64 index,
+                                   char *name, int name_len, u8 type,
+                                   struct btrfs_key *location)
+{
+       struct inode *inode;
+       struct inode *dir;
+       int ret;
+
+       inode = read_one_inode(root, location->objectid);
+       if (!inode)
+               return -ENOENT;
+
+       dir = read_one_inode(root, dirid);
+       if (!dir) {
+               iput(inode);
+               return -EIO;
+       }
+       ret = btrfs_add_link(trans, dir, inode, name, name_len, 1, index);
+
+       /* FIXME, put inode into FIXUP list */
+
+       iput(inode);
+       iput(dir);
+       return ret;
+}
+
+/*
+ * take a single entry in a log directory item and replay it into
+ * the subvolume.
+ *
+ * if a conflicting item exists in the subdirectory already,
+ * the inode it points to is unlinked and put into the link count
+ * fix up tree.
+ *
+ * If a name from the log points to a file or directory that does
+ * not exist in the FS, it is skipped.  fsyncs on directories
+ * do not force down inodes inside that directory, just changes to the
+ * names or unlinks in a directory.
+ */
+static noinline int replay_one_name(struct btrfs_trans_handle *trans,
+                                   struct btrfs_root *root,
+                                   struct btrfs_path *path,
+                                   struct extent_buffer *eb,
+                                   struct btrfs_dir_item *di,
+                                   struct btrfs_key *key)
+{
+       char *name;
+       int name_len;
+       struct btrfs_dir_item *dst_di;
+       struct btrfs_key found_key;
+       struct btrfs_key log_key;
+       struct inode *dir;
+       u8 log_type;
+       int exists;
+       int ret;
+
+       dir = read_one_inode(root, key->objectid);
+       BUG_ON(!dir);
+
+       name_len = btrfs_dir_name_len(eb, di);
+       name = kmalloc(name_len, GFP_NOFS);
+       if (!name) {
+               iput(dir);
+               return -ENOMEM;
+       }
+       log_type = btrfs_dir_type(eb, di);
+       read_extent_buffer(eb, name, (unsigned long)(di + 1), name_len);
+
+       btrfs_dir_item_key_to_cpu(eb, di, &log_key);
+       exists = (btrfs_lookup_inode(trans, root, path, &log_key, 0) == 0);
+       btrfs_release_path(root, path);
+
+       if (key->type == BTRFS_DIR_ITEM_KEY) {
+               dst_di = btrfs_lookup_dir_item(trans, root, path, key->objectid,
+                                      name, name_len, 1);
+       } else if (key->type == BTRFS_DIR_INDEX_KEY) {
+               dst_di = btrfs_lookup_dir_index_item(trans, root, path,
+                                                    key->objectid,
+                                                    key->offset, name,
+                                                    name_len, 1);
+       } else {
+               BUG();
+       }
+       if (!dst_di || IS_ERR(dst_di)) {
+               /* we need a sequence number to insert, so we only
+                * do inserts for the BTRFS_DIR_INDEX_KEY types
+                */
+               if (key->type != BTRFS_DIR_INDEX_KEY)
+                       goto out;
+               goto insert;
+       }
+
+       btrfs_dir_item_key_to_cpu(path->nodes[0], dst_di, &found_key);
+       /* the existing item matches the logged item */
+       if (found_key.objectid == log_key.objectid &&
+           found_key.type == log_key.type &&
+           found_key.offset == log_key.offset &&
+           btrfs_dir_type(path->nodes[0], dst_di) == log_type) {
+               goto out;
+       }
+
+       /*
+        * don't drop the conflicting directory entry if the inode
+        * for the new entry doesn't exist
+        */
+       if (!exists)
+               goto out;
+
+       ret = drop_one_dir_item(trans, root, path, dir, dst_di);
+       BUG_ON(ret);
+
+       if (key->type == BTRFS_DIR_INDEX_KEY)
+               goto insert;
+out:
+       btrfs_release_path(root, path);
+       kfree(name);
+       iput(dir);
+       return 0;
+
+insert:
+       btrfs_release_path(root, path);
+       ret = insert_one_name(trans, root, path, key->objectid, key->offset,
+                             name, name_len, log_type, &log_key);
+
+       BUG_ON(ret && ret != -ENOENT);
+       goto out;
+}
+
+/*
+ * find all the names in a directory item and reconcile them into
+ * the subvolume.  Only BTRFS_DIR_ITEM_KEY types will have more than
+ * one name in a directory item, but the same code gets used for
+ * both key types (BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY)
+ */
+static noinline int replay_one_dir_item(struct btrfs_trans_handle *trans,
+                                       struct btrfs_root *root,
+                                       struct btrfs_path *path,
+                                       struct extent_buffer *eb, int slot,
+                                       struct btrfs_key *key)
+{
+       int ret;
+       u32 item_size = btrfs_item_size_nr(eb, slot);
+       struct btrfs_dir_item *di;
+       int name_len;
+       unsigned long ptr;
+       unsigned long ptr_end;
+
+       ptr = btrfs_item_ptr_offset(eb, slot);
+       ptr_end = ptr + item_size;
+       while (ptr < ptr_end) {
+               di = (struct btrfs_dir_item *)ptr;
+               name_len = btrfs_dir_name_len(eb, di);
+               ret = replay_one_name(trans, root, path, eb, di, key);
+               BUG_ON(ret);
+               ptr = (unsigned long)(di + 1);
+               ptr += name_len;
+       }
+       return 0;
+}
+
+/*
+ * directory replay has two parts.  There are the standard directory
+ * items in the log copied from the subvolume, and range items
+ * created in the log while the subvolume was logged.
+ *
+ * The range items tell us which parts of the key space the log
+ * is authoritative for.  During replay, if a key in the subvolume
+ * directory falls inside a logged range item but is not actually in
+ * the log, that means it was deleted from the directory before the
+ * fsync and should be removed.
+ */
+static noinline int find_dir_range(struct btrfs_root *root,
+                                  struct btrfs_path *path,
+                                  u64 dirid, int key_type,
+                                  u64 *start_ret, u64 *end_ret)
+{
+       struct btrfs_key key;
+       u64 found_end;
+       struct btrfs_dir_log_item *item;
+       int ret;
+       int nritems;
+
+       if (*start_ret == (u64)-1)
+               return 1;
+
+       key.objectid = dirid;
+       key.type = key_type;
+       key.offset = *start_ret;
+
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto out;
+       if (ret > 0) {
+               if (path->slots[0] == 0)
+                       goto out;
+               path->slots[0]--;
+       }
+       if (ret != 0)
+               btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+       if (key.type != key_type || key.objectid != dirid) {
+               ret = 1;
+               goto next;
+       }
+       item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                             struct btrfs_dir_log_item);
+       found_end = btrfs_dir_log_end(path->nodes[0], item);
+
+       if (*start_ret >= key.offset && *start_ret <= found_end) {
+               ret = 0;
+               *start_ret = key.offset;
+               *end_ret = found_end;
+               goto out;
+       }
+       ret = 1;
+next:
+       /* check the next slot in the tree to see if it is a valid item */
+       nritems = btrfs_header_nritems(path->nodes[0]);
+       if (path->slots[0] >= nritems) {
+               ret = btrfs_next_leaf(root, path);
+               if (ret)
+                       goto out;
+       } else {
+               path->slots[0]++;
+       }
+
+       btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+       if (key.type != key_type || key.objectid != dirid) {
+               ret = 1;
+               goto out;
+       }
+       item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                             struct btrfs_dir_log_item);
+       found_end = btrfs_dir_log_end(path->nodes[0], item);
+       *start_ret = key.offset;
+       *end_ret = found_end;
+       ret = 0;
+out:
+       btrfs_release_path(root, path);
+       return ret;
+}
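+
+/*
+ * for example, with a logged range item at key.offset == 2 whose
+ * dir_log_end is 10, a call with *start_ret == 5 returns 0 and sets
+ * *start_ret = 2 and *end_ret = 10, while a call with
+ * *start_ret == 11 moves on to the next range item, or returns 1 if
+ * there is none.
+ */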
+
+/*
+ * this looks for a given directory item in the log.  If the directory
+ * item is not in the log, the item is removed and the inode it points
+ * to is unlinked
+ */
+static noinline int check_item_in_log(struct btrfs_trans_handle *trans,
+                                     struct btrfs_root *root,
+                                     struct btrfs_root *log,
+                                     struct btrfs_path *path,
+                                     struct btrfs_path *log_path,
+                                     struct inode *dir,
+                                     struct btrfs_key *dir_key)
+{
+       int ret;
+       struct extent_buffer *eb;
+       int slot;
+       u32 item_size;
+       struct btrfs_dir_item *di;
+       struct btrfs_dir_item *log_di;
+       int name_len;
+       unsigned long ptr;
+       unsigned long ptr_end;
+       char *name;
+       struct inode *inode;
+       struct btrfs_key location;
+
+again:
+       eb = path->nodes[0];
+       slot = path->slots[0];
+       item_size = btrfs_item_size_nr(eb, slot);
+       ptr = btrfs_item_ptr_offset(eb, slot);
+       ptr_end = ptr + item_size;
+       while (ptr < ptr_end) {
+               di = (struct btrfs_dir_item *)ptr;
+               name_len = btrfs_dir_name_len(eb, di);
+               name = kmalloc(name_len, GFP_NOFS);
+               if (!name) {
+                       ret = -ENOMEM;
+                       goto out;
+               }
+               read_extent_buffer(eb, name, (unsigned long)(di + 1),
+                                 name_len);
+               log_di = NULL;
+               if (dir_key->type == BTRFS_DIR_ITEM_KEY) {
+                       log_di = btrfs_lookup_dir_item(trans, log, log_path,
+                                                      dir_key->objectid,
+                                                      name, name_len, 0);
+               } else if (dir_key->type == BTRFS_DIR_INDEX_KEY) {
+                       log_di = btrfs_lookup_dir_index_item(trans, log,
+                                                    log_path,
+                                                    dir_key->objectid,
+                                                    dir_key->offset,
+                                                    name, name_len, 0);
+               }
+               if (!log_di || IS_ERR(log_di)) {
+                       btrfs_dir_item_key_to_cpu(eb, di, &location);
+                       btrfs_release_path(root, path);
+                       btrfs_release_path(log, log_path);
+                       inode = read_one_inode(root, location.objectid);
+                       BUG_ON(!inode);
+
+                       ret = link_to_fixup_dir(trans, root,
+                                               path, location.objectid);
+                       BUG_ON(ret);
+                       btrfs_inc_nlink(inode);
+                       ret = btrfs_unlink_inode(trans, root, dir, inode,
+                                                name, name_len);
+                       BUG_ON(ret);
+                       kfree(name);
+                       iput(inode);
+
+                       /* there might still be more names under this key
+                        * check and repeat if required
+                        */
+                       ret = btrfs_search_slot(NULL, root, dir_key, path,
+                                               0, 0);
+                       if (ret == 0)
+                               goto again;
+                       ret = 0;
+                       goto out;
+               }
+               btrfs_release_path(log, log_path);
+               kfree(name);
+
+               ptr = (unsigned long)(di + 1);
+               ptr += name_len;
+       }
+       ret = 0;
+out:
+       btrfs_release_path(root, path);
+       btrfs_release_path(log, log_path);
+       return ret;
+}
+
+/*
+ * deletion replay happens before we copy any new directory items
+ * out of the log or out of backreferences from inodes.  It
+ * scans the log to find ranges of keys that the log is authoritative for,
+ * and then scans the directory to find items in those ranges that are
+ * not present in the log.
+ *
+ * Anything we don't find in the log is unlinked and removed from the
+ * directory.
+ */
+static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
+                                      struct btrfs_root *root,
+                                      struct btrfs_root *log,
+                                      struct btrfs_path *path,
+                                      u64 dirid)
+{
+       u64 range_start;
+       u64 range_end;
+       int key_type = BTRFS_DIR_LOG_ITEM_KEY;
+       int ret = 0;
+       struct btrfs_key dir_key;
+       struct btrfs_key found_key;
+       struct btrfs_path *log_path;
+       struct inode *dir;
+
+       dir_key.objectid = dirid;
+       dir_key.type = BTRFS_DIR_ITEM_KEY;
+       log_path = btrfs_alloc_path();
+       if (!log_path)
+               return -ENOMEM;
+
+       dir = read_one_inode(root, dirid);
+       /* it isn't an error if the inode isn't there, that can happen
+        * because we replay the deletes before we copy in the inode item
+        * from the log
+        */
+       if (!dir) {
+               btrfs_free_path(log_path);
+               return 0;
+       }
+again:
+       range_start = 0;
+       range_end = 0;
+       while (1) {
+               ret = find_dir_range(log, path, dirid, key_type,
+                                    &range_start, &range_end);
+               if (ret != 0)
+                       break;
+
+               dir_key.offset = range_start;
+               while (1) {
+                       int nritems;
+                       ret = btrfs_search_slot(NULL, root, &dir_key, path,
+                                               0, 0);
+                       if (ret < 0)
+                               goto out;
+
+                       nritems = btrfs_header_nritems(path->nodes[0]);
+                       if (path->slots[0] >= nritems) {
+                               ret = btrfs_next_leaf(root, path);
+                               if (ret)
+                                       break;
+                       }
+                       btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                             path->slots[0]);
+                       if (found_key.objectid != dirid ||
+                           found_key.type != dir_key.type)
+                               goto next_type;
+
+                       if (found_key.offset > range_end)
+                               break;
+
+                       ret = check_item_in_log(trans, root, log, path,
+                                               log_path, dir, &found_key);
+                       BUG_ON(ret);
+                       if (found_key.offset == (u64)-1)
+                               break;
+                       dir_key.offset = found_key.offset + 1;
+               }
+               btrfs_release_path(root, path);
+               if (range_end == (u64)-1)
+                       break;
+               range_start = range_end + 1;
+       }
+
+next_type:
+       ret = 0;
+       if (key_type == BTRFS_DIR_LOG_ITEM_KEY) {
+               key_type = BTRFS_DIR_LOG_INDEX_KEY;
+               dir_key.type = BTRFS_DIR_INDEX_KEY;
+               btrfs_release_path(root, path);
+               goto again;
+       }
+out:
+       btrfs_release_path(root, path);
+       btrfs_free_path(log_path);
+       iput(dir);
+       return ret;
+}
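+
+/*
+ * note that the loop above runs twice: first over the
+ * BTRFS_DIR_LOG_ITEM_KEY ranges, checking BTRFS_DIR_ITEM_KEY
+ * entries, then (via the next_type goto) over the
+ * BTRFS_DIR_LOG_INDEX_KEY ranges, checking BTRFS_DIR_INDEX_KEY
+ * entries.
+ */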
+
+/*
+ * the process_func used to replay items from the log tree.  This
+ * gets called in two different stages.  The first stage just looks
+ * for inodes and makes sure they are all copied into the subvolume.
+ *
+ * The second stage copies all the other item types from the log into
+ * the subvolume.  The two-stage approach is slower, but gets rid of
+ * lots of complexity around inodes referencing other inodes that exist
+ * only in the log (references come from either directory items or inode
+ * back refs).
+ */
+static int replay_one_buffer(struct btrfs_root *log, struct extent_buffer *eb,
+                            struct walk_control *wc, u64 gen)
+{
+       int nritems;
+       struct btrfs_path *path;
+       struct btrfs_root *root = wc->replay_dest;
+       struct btrfs_key key;
+       u32 item_size;
+       int level;
+       int i;
+       int ret;
+
+       btrfs_read_buffer(eb, gen);
+
+       level = btrfs_header_level(eb);
+
+       if (level != 0)
+               return 0;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       nritems = btrfs_header_nritems(eb);
+       for (i = 0; i < nritems; i++) {
+               btrfs_item_key_to_cpu(eb, &key, i);
+               item_size = btrfs_item_size_nr(eb, i);
+
+               /* inode keys are done during the first stage */
+               if (key.type == BTRFS_INODE_ITEM_KEY &&
+                   wc->stage == LOG_WALK_REPLAY_INODES) {
+                       struct inode *inode;
+                       struct btrfs_inode_item *inode_item;
+                       u32 mode;
+
+                       inode_item = btrfs_item_ptr(eb, i,
+                                           struct btrfs_inode_item);
+                       mode = btrfs_inode_mode(eb, inode_item);
+                       if (S_ISDIR(mode)) {
+                               ret = replay_dir_deletes(wc->trans,
+                                        root, log, path, key.objectid);
+                               BUG_ON(ret);
+                       }
+                       ret = overwrite_item(wc->trans, root, path,
+                                            eb, i, &key);
+                       BUG_ON(ret);
+
+                       /* for regular files, truncate away
+                        * extents past the new EOF
+                        */
+                       if (S_ISREG(mode)) {
+                               inode = read_one_inode(root,
+                                                      key.objectid);
+                               BUG_ON(!inode);
+
+                               ret = btrfs_truncate_inode_items(wc->trans,
+                                       root, inode, inode->i_size,
+                                       BTRFS_EXTENT_DATA_KEY);
+                               BUG_ON(ret);
+                               iput(inode);
+                       }
+                       ret = link_to_fixup_dir(wc->trans, root,
+                                               path, key.objectid);
+                       BUG_ON(ret);
+               }
+               if (wc->stage < LOG_WALK_REPLAY_ALL)
+                       continue;
+
+               /* these keys are simply copied */
+               if (key.type == BTRFS_XATTR_ITEM_KEY) {
+                       ret = overwrite_item(wc->trans, root, path,
+                                            eb, i, &key);
+                       BUG_ON(ret);
+               } else if (key.type == BTRFS_INODE_REF_KEY) {
+                       ret = add_inode_ref(wc->trans, root, log, path,
+                                           eb, i, &key);
+                       BUG_ON(ret && ret != -ENOENT);
+               } else if (key.type == BTRFS_EXTENT_DATA_KEY) {
+                       ret = replay_one_extent(wc->trans, root, path,
+                                               eb, i, &key);
+                       BUG_ON(ret);
+               } else if (key.type == BTRFS_DIR_ITEM_KEY ||
+                          key.type == BTRFS_DIR_INDEX_KEY) {
+                       ret = replay_one_dir_item(wc->trans, root, path,
+                                                 eb, i, &key);
+                       BUG_ON(ret);
+               }
+       }
+       btrfs_free_path(path);
+       return 0;
+}
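+
+/*
+ * dispatch summary for the loop above: during LOG_WALK_REPLAY_INODES
+ * only BTRFS_INODE_ITEM_KEY items are handled, including directory
+ * delete replay and truncation past the new EOF; during
+ * LOG_WALK_REPLAY_ALL the xattr, inode ref, extent data and
+ * directory items are copied in.
+ */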
+
+static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
+                                  struct btrfs_root *root,
+                                  struct btrfs_path *path, int *level,
+                                  struct walk_control *wc)
+{
+       u64 root_owner;
+       u64 root_gen;
+       u64 bytenr;
+       u64 ptr_gen;
+       struct extent_buffer *next;
+       struct extent_buffer *cur;
+       struct extent_buffer *parent;
+       u32 blocksize;
+       int ret = 0;
+
+       WARN_ON(*level < 0);
+       WARN_ON(*level >= BTRFS_MAX_LEVEL);
+
+       while (*level > 0) {
+               WARN_ON(*level < 0);
+               WARN_ON(*level >= BTRFS_MAX_LEVEL);
+               cur = path->nodes[*level];
+
+               WARN_ON(btrfs_header_level(cur) != *level);
+
+               if (path->slots[*level] >=
+                   btrfs_header_nritems(cur))
+                       break;
+
+               bytenr = btrfs_node_blockptr(cur, path->slots[*level]);
+               ptr_gen = btrfs_node_ptr_generation(cur, path->slots[*level]);
+               blocksize = btrfs_level_size(root, *level - 1);
+
+               parent = path->nodes[*level];
+               root_owner = btrfs_header_owner(parent);
+               root_gen = btrfs_header_generation(parent);
+
+               next = btrfs_find_create_tree_block(root, bytenr, blocksize);
+
+               wc->process_func(root, next, wc, ptr_gen);
+
+               if (*level == 1) {
+                       path->slots[*level]++;
+                       if (wc->free) {
+                               btrfs_read_buffer(next, ptr_gen);
+
+                               btrfs_tree_lock(next);
+                               clean_tree_block(trans, root, next);
+                               btrfs_wait_tree_block_writeback(next);
+                               btrfs_tree_unlock(next);
+
+                               ret = btrfs_drop_leaf_ref(trans, root, next);
+                               BUG_ON(ret);
+
+                               WARN_ON(root_owner !=
+                                       BTRFS_TREE_LOG_OBJECTID);
+                               ret = btrfs_free_reserved_extent(root,
+                                                        bytenr, blocksize);
+                               BUG_ON(ret);
+                       }
+                       free_extent_buffer(next);
+                       continue;
+               }
+               btrfs_read_buffer(next, ptr_gen);
+
+               WARN_ON(*level <= 0);
+               if (path->nodes[*level-1])
+                       free_extent_buffer(path->nodes[*level-1]);
+               path->nodes[*level-1] = next;
+               *level = btrfs_header_level(next);
+               path->slots[*level] = 0;
+               cond_resched();
+       }
+       WARN_ON(*level < 0);
+       WARN_ON(*level >= BTRFS_MAX_LEVEL);
+
+       if (path->nodes[*level] == root->node)
+               parent = path->nodes[*level];
+       else
+               parent = path->nodes[*level + 1];
+
+       bytenr = path->nodes[*level]->start;
+
+       blocksize = btrfs_level_size(root, *level);
+       root_owner = btrfs_header_owner(parent);
+       root_gen = btrfs_header_generation(parent);
+
+       wc->process_func(root, path->nodes[*level], wc,
+                        btrfs_header_generation(path->nodes[*level]));
+
+       if (wc->free) {
+               next = path->nodes[*level];
+               btrfs_tree_lock(next);
+               clean_tree_block(trans, root, next);
+               btrfs_wait_tree_block_writeback(next);
+               btrfs_tree_unlock(next);
+
+               if (*level == 0) {
+                       ret = btrfs_drop_leaf_ref(trans, root, next);
+                       BUG_ON(ret);
+               }
+               WARN_ON(root_owner != BTRFS_TREE_LOG_OBJECTID);
+               ret = btrfs_free_reserved_extent(root, bytenr, blocksize);
+               BUG_ON(ret);
+       }
+       free_extent_buffer(path->nodes[*level]);
+       path->nodes[*level] = NULL;
+       *level += 1;
+
+       cond_resched();
+       return 0;
+}
+
+static noinline int walk_up_log_tree(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                struct btrfs_path *path, int *level,
+                                struct walk_control *wc)
+{
+       u64 root_owner;
+       u64 root_gen;
+       int i;
+       int slot;
+       int ret;
+
+       for (i = *level; i < BTRFS_MAX_LEVEL - 1 && path->nodes[i]; i++) {
+               slot = path->slots[i];
+               if (slot < btrfs_header_nritems(path->nodes[i]) - 1) {
+                       path->slots[i]++;
+                       *level = i;
+                       WARN_ON(*level == 0);
+                       return 0;
+               } else {
+                       struct extent_buffer *parent;
+                       if (path->nodes[*level] == root->node)
+                               parent = path->nodes[*level];
+                       else
+                               parent = path->nodes[*level + 1];
+
+                       root_owner = btrfs_header_owner(parent);
+                       root_gen = btrfs_header_generation(parent);
+                       wc->process_func(root, path->nodes[*level], wc,
+                                btrfs_header_generation(path->nodes[*level]));
+                       if (wc->free) {
+                               struct extent_buffer *next;
+
+                               next = path->nodes[*level];
+
+                               btrfs_tree_lock(next);
+                               clean_tree_block(trans, root, next);
+                               btrfs_wait_tree_block_writeback(next);
+                               btrfs_tree_unlock(next);
+
+                               if (*level == 0) {
+                                       ret = btrfs_drop_leaf_ref(trans, root,
+                                                                 next);
+                                       BUG_ON(ret);
+                               }
+
+                               WARN_ON(root_owner != BTRFS_TREE_LOG_OBJECTID);
+                               ret = btrfs_free_reserved_extent(root,
+                                               path->nodes[*level]->start,
+                                               path->nodes[*level]->len);
+                               BUG_ON(ret);
+                       }
+                       free_extent_buffer(path->nodes[*level]);
+                       path->nodes[*level] = NULL;
+                       *level = i + 1;
+               }
+       }
+       return 1;
+}
+
+/*
+ * drop the reference count on the tree rooted at 'log'.  This traverses
+ * the tree freeing any blocks that have a ref count of zero after being
+ * decremented.
+ */
+static int walk_log_tree(struct btrfs_trans_handle *trans,
+                        struct btrfs_root *log, struct walk_control *wc)
+{
+       int ret = 0;
+       int wret;
+       int level;
+       struct btrfs_path *path;
+       int i;
+       int orig_level;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       level = btrfs_header_level(log->node);
+       orig_level = level;
+       path->nodes[level] = log->node;
+       extent_buffer_get(log->node);
+       path->slots[level] = 0;
+
+       while (1) {
+               wret = walk_down_log_tree(trans, log, path, &level, wc);
+               if (wret > 0)
+                       break;
+               if (wret < 0)
+                       ret = wret;
+
+               wret = walk_up_log_tree(trans, log, path, &level, wc);
+               if (wret > 0)
+                       break;
+               if (wret < 0)
+                       ret = wret;
+       }
+
+       /* was the root node processed? if not, catch it here */
+       if (path->nodes[orig_level]) {
+               wc->process_func(log, path->nodes[orig_level], wc,
+                        btrfs_header_generation(path->nodes[orig_level]));
+               if (wc->free) {
+                       struct extent_buffer *next;
+
+                       next = path->nodes[orig_level];
+
+                       btrfs_tree_lock(next);
+                       clean_tree_block(trans, log, next);
+                       btrfs_wait_tree_block_writeback(next);
+                       btrfs_tree_unlock(next);
+
+                       if (orig_level == 0) {
+                               ret = btrfs_drop_leaf_ref(trans, log,
+                                                         next);
+                               BUG_ON(ret);
+                       }
+                       WARN_ON(log->root_key.objectid !=
+                               BTRFS_TREE_LOG_OBJECTID);
+                       ret = btrfs_free_reserved_extent(log, next->start,
+                                                        next->len);
+                       BUG_ON(ret);
+               }
+       }
+
+       for (i = 0; i <= orig_level; i++) {
+               if (path->nodes[i]) {
+                       free_extent_buffer(path->nodes[i]);
+                       path->nodes[i] = NULL;
+               }
+       }
+       btrfs_free_path(path);
+       if (wc->free)
+               free_extent_buffer(log->node);
+       return ret;
+}
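+
+/*
+ * Hedged illustration, not part of the original patch: walk_log_tree()
+ * drives whatever callback the walk_control carries.  A minimal
+ * process_func matching the call sites above could look like this
+ * (the function name is made up):
+ */
+#if 0
+static int example_print_buffer(struct btrfs_root *root,
+                               struct extent_buffer *eb,
+                               struct walk_control *wc, u64 gen)
+{
+       printk(KERN_INFO "log walk: block %llu gen %llu\n",
+              (unsigned long long)eb->start, (unsigned long long)gen);
+       return 0;
+}
+#endif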
+
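+/*
+ * wait for an in-progress log commit to finish.  The prepare_to_wait/
+ * recheck/schedule pattern avoids missing a wakeup between testing
+ * tree_log_commit and going to sleep, and tree_log_mutex is dropped
+ * across the sleep so the committer can make progress.
+ */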
+static int wait_log_commit(struct btrfs_root *log)
+{
+       DEFINE_WAIT(wait);
+       u64 transid = log->fs_info->tree_log_transid;
+
+       do {
+               prepare_to_wait(&log->fs_info->tree_log_wait, &wait,
+                               TASK_UNINTERRUPTIBLE);
+               mutex_unlock(&log->fs_info->tree_log_mutex);
+               if (atomic_read(&log->fs_info->tree_log_commit))
+                       schedule();
+               finish_wait(&log->fs_info->tree_log_wait, &wait);
+               mutex_lock(&log->fs_info->tree_log_mutex);
+       } while (transid == log->fs_info->tree_log_transid &&
+               atomic_read(&log->fs_info->tree_log_commit));
+       return 0;
+}
+
+/*
+ * btrfs_sync_log sends a given tree log down to the disk and
+ * updates the super blocks to record it.  When this call is done,
+ * you know that any inodes previously logged are safely on disk.
+ */
+int btrfs_sync_log(struct btrfs_trans_handle *trans,
+                  struct btrfs_root *root)
+{
+       int ret;
+       unsigned long batch;
+       struct btrfs_root *log = root->log_root;
+
+       mutex_lock(&log->fs_info->tree_log_mutex);
+       if (atomic_read(&log->fs_info->tree_log_commit)) {
+               wait_log_commit(log);
+               goto out;
+       }
+       atomic_set(&log->fs_info->tree_log_commit, 1);
+
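+       /*
+        * wait for all the current log writers to finish and for the
+        * batch counter to stop changing; only then is the log tree
+        * quiescent enough to write out.
+        */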
+       while (1) {
+               batch = log->fs_info->tree_log_batch;
+               mutex_unlock(&log->fs_info->tree_log_mutex);
+               schedule_timeout_uninterruptible(1);
+               mutex_lock(&log->fs_info->tree_log_mutex);
+
+               while (atomic_read(&log->fs_info->tree_log_writers)) {
+                       DEFINE_WAIT(wait);
+                       prepare_to_wait(&log->fs_info->tree_log_wait, &wait,
+                                       TASK_UNINTERRUPTIBLE);
+                       mutex_unlock(&log->fs_info->tree_log_mutex);
+                       if (atomic_read(&log->fs_info->tree_log_writers))
+                               schedule();
+                       mutex_lock(&log->fs_info->tree_log_mutex);
+                       finish_wait(&log->fs_info->tree_log_wait, &wait);
+               }
+               if (batch == log->fs_info->tree_log_batch)
+                       break;
+       }
+
+       ret = btrfs_write_and_wait_marked_extents(log, &log->dirty_log_pages);
+       BUG_ON(ret);
+       ret = btrfs_write_and_wait_marked_extents(root->fs_info->log_root_tree,
+                              &root->fs_info->log_root_tree->dirty_log_pages);
+       BUG_ON(ret);
+
+       btrfs_set_super_log_root(&root->fs_info->super_for_commit,
+                                log->fs_info->log_root_tree->node->start);
+       btrfs_set_super_log_root_level(&root->fs_info->super_for_commit,
+                      btrfs_header_level(log->fs_info->log_root_tree->node));
+
+       write_ctree_super(trans, log->fs_info->tree_root, 2);
+       log->fs_info->tree_log_transid++;
+       log->fs_info->tree_log_batch = 0;
+       atomic_set(&log->fs_info->tree_log_commit, 0);
+       smp_mb();
+       if (waitqueue_active(&log->fs_info->tree_log_wait))
+               wake_up(&log->fs_info->tree_log_wait);
+out:
+       mutex_unlock(&log->fs_info->tree_log_mutex);
+       return 0;
+}
+
+/*
+ * free all the extents used by the tree log.  This should be called
+ * at commit time of the full transaction
+ */
+int btrfs_free_log(struct btrfs_trans_handle *trans, struct btrfs_root *root)
+{
+       int ret;
+       struct btrfs_root *log;
+       u64 start;
+       u64 end;
+       struct walk_control wc = {
+               .free = 1,
+               .process_func = process_one_buffer
+       };
+
+       if (!root->log_root || root->fs_info->log_root_recovering)
+               return 0;
+
+       log = root->log_root;
+       ret = walk_log_tree(trans, log, &wc);
+       BUG_ON(ret);
+
+       while (1) {
+               ret = find_first_extent_bit(&log->dirty_log_pages,
+                                   0, &start, &end, EXTENT_DIRTY);
+               if (ret)
+                       break;
+
+               clear_extent_dirty(&log->dirty_log_pages,
+                                  start, end, GFP_NOFS);
+       }
+
+       ret = btrfs_del_root(trans, root->fs_info->log_root_tree,
+                            &log->root_key);
+       BUG_ON(ret);
+       root->log_root = NULL;
+       kfree(log);
+       return 0;
+}
+
+/*
+ * helper function to update the item for a given subvolumes log root
+ * in the tree of log roots
+ */
+static int update_log_root(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *log)
+{
+       u64 bytenr = btrfs_root_bytenr(&log->root_item);
+       int ret;
+
+       if (log->node->start == bytenr)
+               return 0;
+
+       btrfs_set_root_bytenr(&log->root_item, log->node->start);
+       btrfs_set_root_generation(&log->root_item, trans->transid);
+       btrfs_set_root_level(&log->root_item, btrfs_header_level(log->node));
+       ret = btrfs_update_root(trans, log->fs_info->log_root_tree,
+                               &log->root_key, &log->root_item);
+       BUG_ON(ret);
+       return ret;
+}
+
+/*
+ * If both a file and directory are logged, and unlinks or renames are
+ * mixed in, we have a few interesting corners:
+ *
+ * create file X in dir Y
+ * link file X to X.link in dir Y
+ * fsync file X
+ * unlink file X but leave X.link
+ * fsync dir Y
+ *
+ * After a crash we would expect only X.link to exist.  But file X
+ * didn't get fsync'd again so the log has back refs for X and X.link.
+ *
+ * We solve this by removing directory entries and inode backrefs from the
+ * log when a file that was logged in the current transaction is
+ * unlinked.  Any later fsync will include the updated log entries, and
+ * we'll be able to reconstruct the proper directory items from backrefs.
+ *
+ * This optimization allows us to avoid relogging the entire inode
+ * or the entire directory.
+ */
+int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                const char *name, int name_len,
+                                struct inode *dir, u64 index)
+{
+       struct btrfs_root *log;
+       struct btrfs_dir_item *di;
+       struct btrfs_path *path;
+       int ret;
+       int bytes_del = 0;
+
+       if (BTRFS_I(dir)->logged_trans < trans->transid)
+               return 0;
+
+       ret = join_running_log_trans(root);
+       if (ret)
+               return 0;
+
+       mutex_lock(&BTRFS_I(dir)->log_mutex);
+
+       log = root->log_root;
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+       di = btrfs_lookup_dir_item(trans, log, path, dir->i_ino,
+                                  name, name_len, -1);
+       if (di && !IS_ERR(di)) {
+               ret = btrfs_delete_one_dir_name(trans, log, path, di);
+               bytes_del += name_len;
+               BUG_ON(ret);
+       }
+       btrfs_release_path(log, path);
+       di = btrfs_lookup_dir_index_item(trans, log, path, dir->i_ino,
+                                        index, name, name_len, -1);
+       if (di && !IS_ERR(di)) {
+               ret = btrfs_delete_one_dir_name(trans, log, path, di);
+               bytes_del += name_len;
+               BUG_ON(ret);
+       }
+
+       /* update the directory size in the log to reflect the names
+        * we have removed
+        */
+       if (bytes_del) {
+               struct btrfs_key key;
+
+               key.objectid = dir->i_ino;
+               key.offset = 0;
+               key.type = BTRFS_INODE_ITEM_KEY;
+               btrfs_release_path(log, path);
+
+               ret = btrfs_search_slot(trans, log, &key, path, 0, 1);
+               if (ret == 0) {
+                       struct btrfs_inode_item *item;
+                       u64 i_size;
+
+                       item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                             struct btrfs_inode_item);
+                       i_size = btrfs_inode_size(path->nodes[0], item);
+                       if (i_size > bytes_del)
+                               i_size -= bytes_del;
+                       else
+                               i_size = 0;
+                       btrfs_set_inode_size(path->nodes[0], item, i_size);
+                       btrfs_mark_buffer_dirty(path->nodes[0]);
+               } else
+                       ret = 0;
+               btrfs_release_path(log, path);
+       }
+
+       btrfs_free_path(path);
+       mutex_unlock(&BTRFS_I(dir)->log_mutex);
+       end_log_trans(root);
+
+       return 0;
+}
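+
+/*
+ * Hedged illustration, not part of the original patch: the corner case
+ * from the comment above as a userspace sequence.  After a crash and
+ * log replay only X.link should remain, because the unlink path calls
+ * btrfs_del_dir_entries_in_log() (and btrfs_del_inode_ref_in_log()
+ * below) to scrub X out of the log:
+ *
+ *     fd = open("Y/X", O_CREAT | O_WRONLY, 0644);
+ *     link("Y/X", "Y/X.link");
+ *     fsync(fd);
+ *     unlink("Y/X");
+ *     fsync(dirfd);           (dirfd is an open fd on directory Y)
+ */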
+
+/* see comments for btrfs_del_dir_entries_in_log */
+int btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              const char *name, int name_len,
+                              struct inode *inode, u64 dirid)
+{
+       struct btrfs_root *log;
+       u64 index;
+       int ret;
+
+       if (BTRFS_I(inode)->logged_trans < trans->transid)
+               return 0;
+
+       ret = join_running_log_trans(root);
+       if (ret)
+               return 0;
+       log = root->log_root;
+       mutex_lock(&BTRFS_I(inode)->log_mutex);
+
+       ret = btrfs_del_inode_ref(trans, log, name, name_len, inode->i_ino,
+                                 dirid, &index);
+       mutex_unlock(&BTRFS_I(inode)->log_mutex);
+       end_log_trans(root);
+
+       return ret;
+}
+
+/*
+ * creates a range item in the log for 'dirid'.  first_offset and
+ * last_offset tell us which parts of the key space the log should
+ * be considered authoritative for.
+ */
+static noinline int insert_dir_log_key(struct btrfs_trans_handle *trans,
+                                      struct btrfs_root *log,
+                                      struct btrfs_path *path,
+                                      int key_type, u64 dirid,
+                                      u64 first_offset, u64 last_offset)
+{
+       int ret;
+       struct btrfs_key key;
+       struct btrfs_dir_log_item *item;
+
+       key.objectid = dirid;
+       key.offset = first_offset;
+       if (key_type == BTRFS_DIR_ITEM_KEY)
+               key.type = BTRFS_DIR_LOG_ITEM_KEY;
+       else
+               key.type = BTRFS_DIR_LOG_INDEX_KEY;
+       ret = btrfs_insert_empty_item(trans, log, path, &key, sizeof(*item));
+       BUG_ON(ret);
+
+       item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                             struct btrfs_dir_log_item);
+       btrfs_set_dir_log_end(path->nodes[0], item, last_offset);
+       btrfs_mark_buffer_dirty(path->nodes[0]);
+       btrfs_release_path(log, path);
+       return 0;
+}
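+
+/*
+ * Hedged example, not in the original patch: after logging dir items
+ * covering offsets 2..57 of directory 256, the log holds a key
+ * (256, BTRFS_DIR_LOG_ITEM_KEY, 2) whose item records dir_log_end == 57,
+ * telling replay that the log is authoritative for exactly that range.
+ */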
+
+/*
+ * log all the items included in the current transaction for a given
+ * directory.  This also creates the range items in the log tree required
+ * to replay anything deleted before the fsync
+ */
+static noinline int log_dir_items(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, struct inode *inode,
+                         struct btrfs_path *path,
+                         struct btrfs_path *dst_path, int key_type,
+                         u64 min_offset, u64 *last_offset_ret)
+{
+       struct btrfs_key min_key;
+       struct btrfs_key max_key;
+       struct btrfs_root *log = root->log_root;
+       struct extent_buffer *src;
+       int ret;
+       int i;
+       int nritems;
+       u64 first_offset = min_offset;
+       u64 last_offset = (u64)-1;
+
+       max_key.objectid = inode->i_ino;
+       max_key.offset = (u64)-1;
+       max_key.type = key_type;
+
+       min_key.objectid = inode->i_ino;
+       min_key.type = key_type;
+       min_key.offset = min_offset;
+
+       path->keep_locks = 1;
+
+       ret = btrfs_search_forward(root, &min_key, &max_key,
+                                  path, 0, trans->transid);
+
+       /*
+        * we didn't find anything from this transaction, see if there
+        * is anything at all
+        */
+       if (ret != 0 || min_key.objectid != inode->i_ino ||
+           min_key.type != key_type) {
+               min_key.objectid = inode->i_ino;
+               min_key.type = key_type;
+               min_key.offset = (u64)-1;
+               btrfs_release_path(root, path);
+               ret = btrfs_search_slot(NULL, root, &min_key, path, 0, 0);
+               if (ret < 0) {
+                       btrfs_release_path(root, path);
+                       return ret;
+               }
+               ret = btrfs_previous_item(root, path, inode->i_ino, key_type);
+
+               /* if ret == 0 there are items for this type,
+                * create a range to tell us the last key of this type.
+                * otherwise, there are no items in this directory after
+                * *min_offset, and we create a range to indicate that.
+                */
+               if (ret == 0) {
+                       struct btrfs_key tmp;
+                       btrfs_item_key_to_cpu(path->nodes[0], &tmp,
+                                             path->slots[0]);
+                       if (key_type == tmp.type)
+                               first_offset = max(min_offset, tmp.offset) + 1;
+               }
+               goto done;
+       }
+
+       /* go backward to find any previous key */
+       ret = btrfs_previous_item(root, path, inode->i_ino, key_type);
+       if (ret == 0) {
+               struct btrfs_key tmp;
+               btrfs_item_key_to_cpu(path->nodes[0], &tmp, path->slots[0]);
+               if (key_type == tmp.type) {
+                       first_offset = tmp.offset;
+                       ret = overwrite_item(trans, log, dst_path,
+                                            path->nodes[0], path->slots[0],
+                                            &tmp);
+               }
+       }
+       btrfs_release_path(root, path);
+
+       /* find the first key from this transaction again */
+       ret = btrfs_search_slot(NULL, root, &min_key, path, 0, 0);
+       if (ret != 0) {
+               WARN_ON(1);
+               goto done;
+       }
+
+       /*
+        * we have a block from this transaction, log every item in it
+        * from our directory
+        */
+       while (1) {
+               struct btrfs_key tmp;
+               src = path->nodes[0];
+               nritems = btrfs_header_nritems(src);
+               for (i = path->slots[0]; i < nritems; i++) {
+                       btrfs_item_key_to_cpu(src, &min_key, i);
+
+                       if (min_key.objectid != inode->i_ino ||
+                           min_key.type != key_type)
+                               goto done;
+                       ret = overwrite_item(trans, log, dst_path, src, i,
+                                            &min_key);
+                       BUG_ON(ret);
+               }
+               path->slots[0] = nritems;
+
+               /*
+                * look ahead to the next item and see if it is also
+                * from this directory and from this transaction
+                */
+               ret = btrfs_next_leaf(root, path);
+               if (ret == 1) {
+                       last_offset = (u64)-1;
+                       goto done;
+               }
+               btrfs_item_key_to_cpu(path->nodes[0], &tmp, path->slots[0]);
+               if (tmp.objectid != inode->i_ino || tmp.type != key_type) {
+                       last_offset = (u64)-1;
+                       goto done;
+               }
+               if (btrfs_header_generation(path->nodes[0]) != trans->transid) {
+                       ret = overwrite_item(trans, log, dst_path,
+                                            path->nodes[0], path->slots[0],
+                                            &tmp);
+
+                       BUG_ON(ret);
+                       last_offset = tmp.offset;
+                       goto done;
+               }
+       }
+done:
+       *last_offset_ret = last_offset;
+       btrfs_release_path(root, path);
+       btrfs_release_path(log, dst_path);
+
+       /* insert the log range keys to indicate where the log is valid */
+       ret = insert_dir_log_key(trans, log, path, key_type, inode->i_ino,
+                                first_offset, last_offset);
+       BUG_ON(ret);
+       return 0;
+}
+
+/*
+ * Logging directories is very similar to logging inodes.  We find all the items
+ * from the current transaction and write them to the log.
+ *
+ * The recovery code scans the directory in the subvolume, and if it finds a
+ * key in the range logged that is not present in the log tree, then it means
+ * that dir entry was unlinked during the transaction.
+ *
+ * In order for that scan to work, we must include one key smaller than
+ * the smallest logged by this transaction and one key larger than the largest
+ * key logged by this transaction.
+ */
+static noinline int log_directory_changes(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, struct inode *inode,
+                         struct btrfs_path *path,
+                         struct btrfs_path *dst_path)
+{
+       u64 min_key;
+       u64 max_key;
+       int ret;
+       int key_type = BTRFS_DIR_ITEM_KEY;
+
+again:
+       min_key = 0;
+       max_key = 0;
+       while (1) {
+               ret = log_dir_items(trans, root, inode, path,
+                                   dst_path, key_type, min_key,
+                                   &max_key);
+               BUG_ON(ret);
+               if (max_key == (u64)-1)
+                       break;
+               min_key = max_key + 1;
+       }
+
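+       /*
+        * directories are logged under both DIR_ITEM and DIR_INDEX keys;
+        * make a second pass for the index keys
+        */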
+       if (key_type == BTRFS_DIR_ITEM_KEY) {
+               key_type = BTRFS_DIR_INDEX_KEY;
+               goto again;
+       }
+       return 0;
+}
+
+/*
+ * a helper function to drop items from the log before we relog an
+ * inode.  max_key_type indicates the highest item type to remove.
+ * This cannot be run for file data extents because it does not
+ * free the extents they point to.
+ */
+static int drop_objectid_items(struct btrfs_trans_handle *trans,
+                                 struct btrfs_root *log,
+                                 struct btrfs_path *path,
+                                 u64 objectid, int max_key_type)
+{
+       int ret;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+
+       key.objectid = objectid;
+       key.type = max_key_type;
+       key.offset = (u64)-1;
+
+       while (1) {
+               ret = btrfs_search_slot(trans, log, &key, path, -1, 1);
+
+               if (ret != 1)
+                       break;
+
+               if (path->slots[0] == 0)
+                       break;
+
+               path->slots[0]--;
+               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                     path->slots[0]);
+
+               if (found_key.objectid != objectid)
+                       break;
+
+               ret = btrfs_del_item(trans, log, path);
+               BUG_ON(ret);
+               btrfs_release_path(log, path);
+       }
+       btrfs_release_path(log, path);
+       return 0;
+}
+
+static noinline int copy_items(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *log,
+                              struct btrfs_path *dst_path,
+                              struct extent_buffer *src,
+                              int start_slot, int nr, int inode_only)
+{
+       unsigned long src_offset;
+       unsigned long dst_offset;
+       struct btrfs_file_extent_item *extent;
+       struct btrfs_inode_item *inode_item;
+       int ret;
+       struct btrfs_key *ins_keys;
+       u32 *ins_sizes;
+       char *ins_data;
+       int i;
+       struct list_head ordered_sums;
+
+       INIT_LIST_HEAD(&ordered_sums);
+
+       ins_data = kmalloc(nr * sizeof(struct btrfs_key) +
+                          nr * sizeof(u32), GFP_NOFS);
+       if (!ins_data)
+               return -ENOMEM;
+       ins_sizes = (u32 *)ins_data;
+       ins_keys = (struct btrfs_key *)(ins_data + nr * sizeof(u32));
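+       /*
+        * ins_data is a single allocation packing nr u32 item sizes
+        * followed by nr struct btrfs_key entries; the two pointers above
+        * carve it into those arrays
+        */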
+
+       for (i = 0; i < nr; i++) {
+               ins_sizes[i] = btrfs_item_size_nr(src, i + start_slot);
+               btrfs_item_key_to_cpu(src, ins_keys + i, i + start_slot);
+       }
+       ret = btrfs_insert_empty_items(trans, log, dst_path,
+                                      ins_keys, ins_sizes, nr);
+       BUG_ON(ret);
+
+       for (i = 0; i < nr; i++) {
+               dst_offset = btrfs_item_ptr_offset(dst_path->nodes[0],
+                                                  dst_path->slots[0]);
+
+               src_offset = btrfs_item_ptr_offset(src, start_slot + i);
+
+               copy_extent_buffer(dst_path->nodes[0], src, dst_offset,
+                                  src_offset, ins_sizes[i]);
+
+               if (inode_only == LOG_INODE_EXISTS &&
+                   ins_keys[i].type == BTRFS_INODE_ITEM_KEY) {
+                       inode_item = btrfs_item_ptr(dst_path->nodes[0],
+                                                   dst_path->slots[0],
+                                                   struct btrfs_inode_item);
+                       btrfs_set_inode_size(dst_path->nodes[0], inode_item, 0);
+
+                       /* set the generation to zero so the recovery code
+                        * can tell the difference between a logging
+                        * just to say 'this inode exists' and a logging
+                        * to say 'update this inode with these values'
+                        */
+                       btrfs_set_inode_generation(dst_path->nodes[0],
+                                                  inode_item, 0);
+               }
+               /* take a reference on file data extents so that truncates
+                * or deletes of this inode don't have to relog the inode
+                * again
+                */
+               if (btrfs_key_type(ins_keys + i) == BTRFS_EXTENT_DATA_KEY) {
+                       int found_type;
+                       extent = btrfs_item_ptr(src, start_slot + i,
+                                               struct btrfs_file_extent_item);
+
+                       found_type = btrfs_file_extent_type(src, extent);
+                       if (found_type == BTRFS_FILE_EXTENT_REG ||
+                           found_type == BTRFS_FILE_EXTENT_PREALLOC) {
+                               u64 ds = btrfs_file_extent_disk_bytenr(src,
+                                                                  extent);
+                               u64 dl = btrfs_file_extent_disk_num_bytes(src,
+                                                                     extent);
+                               u64 cs = btrfs_file_extent_offset(src, extent);
+                               u64 cl = btrfs_file_extent_num_bytes(src,
+                                                                    extent);
+                               if (btrfs_file_extent_compression(src,
+                                                                 extent)) {
+                                       cs = 0;
+                                       cl = dl;
+                               }
+                               /* ds == 0 is a hole */
+                               if (ds != 0) {
+                                       ret = btrfs_inc_extent_ref(trans, log,
+                                                  ds, dl,
+                                                  dst_path->nodes[0]->start,
+                                                  BTRFS_TREE_LOG_OBJECTID,
+                                                  trans->transid,
+                                                  ins_keys[i].objectid);
+                                       BUG_ON(ret);
+                                       ret = btrfs_lookup_csums_range(
+                                                  log->fs_info->csum_root,
+                                                  ds + cs, ds + cs + cl - 1,
+                                                  &ordered_sums);
+                                       BUG_ON(ret);
+                               }
+                       }
+               }
+               dst_path->slots[0]++;
+       }
+
+       btrfs_mark_buffer_dirty(dst_path->nodes[0]);
+       btrfs_release_path(log, dst_path);
+       kfree(ins_data);
+
+       /*
+        * we have to do this after the loop above: inserting the csums
+        * would modify the log tree while we were still copying items
+        * into it.
+        */
+       while (!list_empty(&ordered_sums)) {
+               struct btrfs_ordered_sum *sums = list_entry(ordered_sums.next,
+                                                  struct btrfs_ordered_sum,
+                                                  list);
+               ret = btrfs_csum_file_blocks(trans, log, sums);
+               BUG_ON(ret);
+               list_del(&sums->list);
+               kfree(sums);
+       }
+       return 0;
+}
+
+/* log a single inode in the tree log.
+ * At least one parent directory for this inode must exist in the tree
+ * or be logged already.
+ *
+ * Any items from this inode changed by the current transaction are copied
+ * to the log tree.  An extra reference is taken on any extents in this
+ * file, allowing us to avoid a whole pile of corner cases around logging
+ * blocks that have been removed from the tree.
+ *
+ * See LOG_INODE_ALL and related defines for a description of what inode_only
+ * does.
+ *
+ * This handles both files and directories.
+ */
+static int __btrfs_log_inode(struct btrfs_trans_handle *trans,
+                            struct btrfs_root *root, struct inode *inode,
+                            int inode_only)
+{
+       struct btrfs_path *path;
+       struct btrfs_path *dst_path;
+       struct btrfs_key min_key;
+       struct btrfs_key max_key;
+       struct btrfs_root *log = root->log_root;
+       struct extent_buffer *src = NULL;
+       u32 size;
+       int ret;
+       int nritems;
+       int ins_start_slot = 0;
+       int ins_nr;
+
+       path = btrfs_alloc_path();
+       dst_path = btrfs_alloc_path();
+       BUG_ON(!path || !dst_path);
+
+       min_key.objectid = inode->i_ino;
+       min_key.type = BTRFS_INODE_ITEM_KEY;
+       min_key.offset = 0;
+
+       max_key.objectid = inode->i_ino;
+       if (inode_only == LOG_INODE_EXISTS || S_ISDIR(inode->i_mode))
+               max_key.type = BTRFS_XATTR_ITEM_KEY;
+       else
+               max_key.type = (u8)-1;
+       max_key.offset = (u64)-1;
+
+       /*
+        * if this inode has already been logged and we're in inode_only
+        * mode, we don't want to delete the things that have already
+        * been written to the log.
+        *
+        * But, if the inode has been through an inode_only log,
+        * the logged_trans field is not set.  This allows us to catch
+        * any new names for this inode in the backrefs by logging it
+        * again
+        */
+       if (inode_only == LOG_INODE_EXISTS &&
+           BTRFS_I(inode)->logged_trans == trans->transid) {
+               btrfs_free_path(path);
+               btrfs_free_path(dst_path);
+               goto out;
+       }
+       mutex_lock(&BTRFS_I(inode)->log_mutex);
+
+       /*
+        * a brute force approach to making sure we get the most uptodate
+        * copies of everything.
+        */
+       if (S_ISDIR(inode->i_mode)) {
+               int max_key_type = BTRFS_DIR_LOG_INDEX_KEY;
+
+               if (inode_only == LOG_INODE_EXISTS)
+                       max_key_type = BTRFS_XATTR_ITEM_KEY;
+               ret = drop_objectid_items(trans, log, path,
+                                         inode->i_ino, max_key_type);
+       } else {
+               ret = btrfs_truncate_inode_items(trans, log, inode, 0, 0);
+       }
+       BUG_ON(ret);
+       path->keep_locks = 1;
+
+       while (1) {
+               ins_nr = 0;
+               ret = btrfs_search_forward(root, &min_key, &max_key,
+                                          path, 0, trans->transid);
+               if (ret != 0)
+                       break;
+again:
+               /* note, ins_nr might be > 0 here, cleanup outside the loop */
+               if (min_key.objectid != inode->i_ino)
+                       break;
+               if (min_key.type > max_key.type)
+                       break;
+
+               src = path->nodes[0];
+               size = btrfs_item_size_nr(src, path->slots[0]);
+               if (ins_nr && ins_start_slot + ins_nr == path->slots[0]) {
+                       ins_nr++;
+                       goto next_slot;
+               } else if (!ins_nr) {
+                       ins_start_slot = path->slots[0];
+                       ins_nr = 1;
+                       goto next_slot;
+               }
+
+               ret = copy_items(trans, log, dst_path, src, ins_start_slot,
+                                ins_nr, inode_only);
+               BUG_ON(ret);
+               ins_nr = 1;
+               ins_start_slot = path->slots[0];
+next_slot:
+
+               nritems = btrfs_header_nritems(path->nodes[0]);
+               path->slots[0]++;
+               if (path->slots[0] < nritems) {
+                       btrfs_item_key_to_cpu(path->nodes[0], &min_key,
+                                             path->slots[0]);
+                       goto again;
+               }
+               if (ins_nr) {
+                       ret = copy_items(trans, log, dst_path, src,
+                                        ins_start_slot,
+                                        ins_nr, inode_only);
+                       BUG_ON(ret);
+                       ins_nr = 0;
+               }
+               btrfs_release_path(root, path);
+
+               if (min_key.offset < (u64)-1)
+                       min_key.offset++;
+               else if (min_key.type < (u8)-1)
+                       min_key.type++;
+               else if (min_key.objectid < (u64)-1)
+                       min_key.objectid++;
+               else
+                       break;
+       }
+       if (ins_nr) {
+               ret = copy_items(trans, log, dst_path, src,
+                                ins_start_slot,
+                                ins_nr, inode_only);
+               BUG_ON(ret);
+               ins_nr = 0;
+       }
+       WARN_ON(ins_nr);
+       if (inode_only == LOG_INODE_ALL && S_ISDIR(inode->i_mode)) {
+               btrfs_release_path(root, path);
+               btrfs_release_path(log, dst_path);
+               BTRFS_I(inode)->log_dirty_trans = 0;
+               ret = log_directory_changes(trans, root, inode, path, dst_path);
+               BUG_ON(ret);
+       }
+       BTRFS_I(inode)->logged_trans = trans->transid;
+       mutex_unlock(&BTRFS_I(inode)->log_mutex);
+
+       btrfs_free_path(path);
+       btrfs_free_path(dst_path);
+
+       mutex_lock(&root->fs_info->tree_log_mutex);
+       ret = update_log_root(trans, log);
+       BUG_ON(ret);
+       mutex_unlock(&root->fs_info->tree_log_mutex);
+out:
+       return 0;
+}
+
+int btrfs_log_inode(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, struct inode *inode,
+                   int inode_only)
+{
+       int ret;
+
+       start_log_trans(trans, root);
+       ret = __btrfs_log_inode(trans, root, inode, inode_only);
+       end_log_trans(root);
+       return ret;
+}
+
+/*
+ * helper function around btrfs_log_inode to make sure newly created
+ * parent directories also end up in the log.  Any parent directory
+ * that is newer than the last committed transaction gets a minimal,
+ * inode-and-backref-only logging.
+ */
+int btrfs_log_dentry(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, struct dentry *dentry)
+{
+       int inode_only = LOG_INODE_ALL;
+       struct super_block *sb;
+       int ret;
+
+       start_log_trans(trans, root);
+       sb = dentry->d_inode->i_sb;
+       while (1) {
+               ret = __btrfs_log_inode(trans, root, dentry->d_inode,
+                                       inode_only);
+               BUG_ON(ret);
+               inode_only = LOG_INODE_EXISTS;
+
+               dentry = dentry->d_parent;
+               if (!dentry || !dentry->d_inode || sb != dentry->d_inode->i_sb)
+                       break;
+
+               if (BTRFS_I(dentry->d_inode)->generation <=
+                   root->fs_info->last_trans_committed)
+                       break;
+       }
+       end_log_trans(root);
+       return 0;
+}
+
+/*
+ * it is not safe to log a dentry if the chunk root has added new
+ * chunks.  This returns 0 if the dentry was logged, and 1 otherwise.
+ * If this returns 1, you must commit the transaction to safely get your
+ * data on disk.
+ */
+int btrfs_log_dentry_safe(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, struct dentry *dentry)
+{
+       u64 gen;
+       gen = root->fs_info->last_trans_new_blockgroup;
+       if (gen > root->fs_info->last_trans_committed)
+               return 1;
+       else
+               return btrfs_log_dentry(trans, root, dentry);
+}
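+
+/*
+ * Hedged usage sketch, not part of the original patch: an fsync
+ * implementation might drive this API roughly as below.  The function
+ * name is made up; btrfs_commit_transaction() is used as seen elsewhere
+ * in this patch.
+ */
+#if 0
+static int example_fsync_path(struct btrfs_trans_handle *trans,
+                             struct btrfs_root *root,
+                             struct dentry *dentry)
+{
+       int ret = btrfs_log_dentry_safe(trans, root, dentry);
+
+       if (ret == 0)
+               /* the dentry was logged; syncing the log makes it durable */
+               return btrfs_sync_log(trans, root);
+       /* new chunks were added; only a full commit is safe */
+       return btrfs_commit_transaction(trans, root);
+}
+#endif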
+
+/*
+ * should be called during mount to recover and replay any log trees
+ * from the FS
+ */
+int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       struct btrfs_key tmp_key;
+       struct btrfs_root *log;
+       struct btrfs_fs_info *fs_info = log_root_tree->fs_info;
+       u64 highest_inode;
+       struct walk_control wc = {
+               .process_func = process_one_buffer,
+               .stage = 0,
+       };
+
+       fs_info->log_root_recovering = 1;
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       trans = btrfs_start_transaction(fs_info->tree_root, 1);
+
+       wc.trans = trans;
+       wc.pin = 1;
+
+       walk_log_tree(trans, log_root_tree, &wc);
+
+again:
+       key.objectid = BTRFS_TREE_LOG_OBJECTID;
+       key.offset = (u64)-1;
+       btrfs_set_key_type(&key, BTRFS_ROOT_ITEM_KEY);
+
+       while (1) {
+               ret = btrfs_search_slot(NULL, log_root_tree, &key, path, 0, 0);
+               if (ret < 0)
+                       break;
+               if (ret > 0) {
+                       if (path->slots[0] == 0)
+                               break;
+                       path->slots[0]--;
+               }
+               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                     path->slots[0]);
+               btrfs_release_path(log_root_tree, path);
+               if (found_key.objectid != BTRFS_TREE_LOG_OBJECTID)
+                       break;
+
+               log = btrfs_read_fs_root_no_radix(log_root_tree,
+                                                 &found_key);
+               BUG_ON(!log);
+
+               tmp_key.objectid = found_key.offset;
+               tmp_key.type = BTRFS_ROOT_ITEM_KEY;
+               tmp_key.offset = (u64)-1;
+
+               wc.replay_dest = btrfs_read_fs_root_no_name(fs_info, &tmp_key);
+               BUG_ON(!wc.replay_dest);
+
+               wc.replay_dest->log_root = log;
+               btrfs_record_root_in_trans(wc.replay_dest);
+               ret = walk_log_tree(trans, log, &wc);
+               BUG_ON(ret);
+
+               if (wc.stage == LOG_WALK_REPLAY_ALL) {
+                       ret = fixup_inode_link_counts(trans, wc.replay_dest,
+                                                     path);
+                       BUG_ON(ret);
+               }
+               ret = btrfs_find_highest_inode(wc.replay_dest, &highest_inode);
+               if (ret == 0) {
+                       wc.replay_dest->highest_inode = highest_inode;
+                       wc.replay_dest->last_inode_alloc = highest_inode;
+               }
+
+               key.offset = found_key.offset - 1;
+               wc.replay_dest->log_root = NULL;
+               free_extent_buffer(log->node);
+               kfree(log);
+
+               if (found_key.offset == 0)
+                       break;
+       }
+       btrfs_release_path(log_root_tree, path);
+
+       /* step one is to pin it all, step two is to replay just inodes */
+       if (wc.pin) {
+               wc.pin = 0;
+               wc.process_func = replay_one_buffer;
+               wc.stage = LOG_WALK_REPLAY_INODES;
+               goto again;
+       }
+       /* step three is to replay everything */
+       if (wc.stage < LOG_WALK_REPLAY_ALL) {
+               wc.stage++;
+               goto again;
+       }
+
+       btrfs_free_path(path);
+
+       free_extent_buffer(log_root_tree->node);
+       log_root_tree->log_root = NULL;
+       fs_info->log_root_recovering = 0;
+
+       /* step four: commit the transaction, which also unpins the blocks */
+       btrfs_commit_transaction(trans, fs_info->tree_root);
+
+       kfree(log_root_tree);
+       return 0;
+}
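+
+/*
+ * Hedged summary, not part of the original patch: recovery above walks
+ * the set of log roots in stages.  The first pass runs with wc.pin set
+ * and pins every log block, the second replays just the inode items,
+ * and the later passes replay everything else; the final transaction
+ * commit then unpins the log blocks.
+ */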
diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h
new file mode 100644 (file)
index 0000000..b9409b3
--- /dev/null
@@ -0,0 +1,41 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __TREE_LOG_
+#define __TREE_LOG_
+
+int btrfs_sync_log(struct btrfs_trans_handle *trans,
+                  struct btrfs_root *root);
+int btrfs_free_log(struct btrfs_trans_handle *trans, struct btrfs_root *root);
+int btrfs_log_dentry(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, struct dentry *dentry);
+int btrfs_recover_log_trees(struct btrfs_root *tree_root);
+int btrfs_log_dentry_safe(struct btrfs_trans_handle *trans,
+                         struct btrfs_root *root, struct dentry *dentry);
+int btrfs_log_inode(struct btrfs_trans_handle *trans,
+                   struct btrfs_root *root, struct inode *inode,
+                   int inode_only);
+int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
+                                struct btrfs_root *root,
+                                const char *name, int name_len,
+                                struct inode *dir, u64 index);
+int btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root,
+                              const char *name, int name_len,
+                              struct inode *inode, u64 dirid);
+#endif
diff --git a/fs/btrfs/version.h b/fs/btrfs/version.h
new file mode 100644 (file)
index 0000000..9bf3946
--- /dev/null
@@ -0,0 +1,4 @@
+#ifndef __BTRFS_VERSION_H
+#define __BTRFS_VERSION_H
+#define BTRFS_BUILD_VERSION "Btrfs"
+#endif
diff --git a/fs/btrfs/version.sh b/fs/btrfs/version.sh
new file mode 100644 (file)
index 0000000..1ca1952
--- /dev/null
@@ -0,0 +1,43 @@
+#!/bin/bash
+#
+# determine-version -- report a useful version for releases
+#
+# Copyright 2008, Aron Griffis <agriffis@n01se.net>
+# Copyright 2008, Oracle
+# Released under the GNU GPLv2
+v="v0.16"
+
+which git &> /dev/null
+if [ $? == 0 ]; then
+    git branch >& /dev/null
+    if [ $? == 0 ]; then
+           if head=`git rev-parse --verify HEAD 2>/dev/null`; then
+               if tag=`git describe --tags 2>/dev/null`; then
+                   v="$tag"
+               fi
+
+               # Are there uncommitted changes?
+               git update-index --refresh --unmerged > /dev/null
+               if git diff-index --name-only HEAD | \
+                   grep -v "^scripts/package" \
+                   | read dummy; then
+                   v="$v"-dirty
+               fi
+           fi
+    fi
+fi
+echo "#ifndef __BUILD_VERSION" > .build-version.h
+echo "#define __BUILD_VERSION" >> .build-version.h
+echo "#define BTRFS_BUILD_VERSION \"Btrfs $v\"" >> .build-version.h
+echo "#endif" >> .build-version.h
+
+diff -q version.h .build-version.h >& /dev/null
+
+if [ $? == 0 ]; then
+    rm .build-version.h
+    exit 0
+fi
+
+mv .build-version.h version.h
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
new file mode 100644 (file)
index 0000000..b187b53
--- /dev/null
@@ -0,0 +1,3218 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+#include <linux/sched.h>
+#include <linux/bio.h>
+#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
+#include <linux/random.h>
+#include <linux/version.h>
+#include <asm/div64.h>
+#include "compat.h"
+#include "ctree.h"
+#include "extent_map.h"
+#include "disk-io.h"
+#include "transaction.h"
+#include "print-tree.h"
+#include "volumes.h"
+#include "async-thread.h"
+
+struct map_lookup {
+       u64 type;
+       int io_align;
+       int io_width;
+       int stripe_len;
+       int sector_size;
+       int num_stripes;
+       int sub_stripes;
+       struct btrfs_bio_stripe stripes[];
+};
+
+static int init_first_rw_device(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root,
+                               struct btrfs_device *device);
+static int btrfs_relocate_sys_chunks(struct btrfs_root *root);
+
+#define map_lookup_size(n) (sizeof(struct map_lookup) + \
+                           (sizeof(struct btrfs_bio_stripe) * (n)))
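+
+/*
+ * for example, map_lookup_size(4) is sizeof(struct map_lookup) plus
+ * room for four btrfs_bio_stripe entries in the flexible stripes[]
+ * array
+ */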
+
+static DEFINE_MUTEX(uuid_mutex);
+static LIST_HEAD(fs_uuids);
+
+void btrfs_lock_volumes(void)
+{
+       mutex_lock(&uuid_mutex);
+}
+
+void btrfs_unlock_volumes(void)
+{
+       mutex_unlock(&uuid_mutex);
+}
+
+static void lock_chunks(struct btrfs_root *root)
+{
+       mutex_lock(&root->fs_info->chunk_mutex);
+}
+
+static void unlock_chunks(struct btrfs_root *root)
+{
+       mutex_unlock(&root->fs_info->chunk_mutex);
+}
+
+static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
+{
+       struct btrfs_device *device;
+       WARN_ON(fs_devices->opened);
+       while (!list_empty(&fs_devices->devices)) {
+               device = list_entry(fs_devices->devices.next,
+                                   struct btrfs_device, dev_list);
+               list_del(&device->dev_list);
+               kfree(device->name);
+               kfree(device);
+       }
+       kfree(fs_devices);
+}
+
+int btrfs_cleanup_fs_uuids(void)
+{
+       struct btrfs_fs_devices *fs_devices;
+
+       while (!list_empty(&fs_uuids)) {
+               fs_devices = list_entry(fs_uuids.next,
+                                       struct btrfs_fs_devices, list);
+               list_del(&fs_devices->list);
+               free_fs_devices(fs_devices);
+       }
+       return 0;
+}
+
+static noinline struct btrfs_device *__find_device(struct list_head *head,
+                                                  u64 devid, u8 *uuid)
+{
+       struct btrfs_device *dev;
+       struct list_head *cur;
+
+       list_for_each(cur, head) {
+               dev = list_entry(cur, struct btrfs_device, dev_list);
+               if (dev->devid == devid &&
+                   (!uuid || !memcmp(dev->uuid, uuid, BTRFS_UUID_SIZE))) {
+                       return dev;
+               }
+       }
+       return NULL;
+}
+
+static noinline struct btrfs_fs_devices *find_fsid(u8 *fsid)
+{
+       struct list_head *cur;
+       struct btrfs_fs_devices *fs_devices;
+
+       list_for_each(cur, &fs_uuids) {
+               fs_devices = list_entry(cur, struct btrfs_fs_devices, list);
+               if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) == 0)
+                       return fs_devices;
+       }
+       return NULL;
+}
+
+/*
+ * we try to collect pending bios for a device so we don't get a large
+ * number of procs sending bios down to the same device.  This greatly
+ * improves the scheduler's ability to collect and merge the bios.
+ *
+ * But, it also turns into a long list of bios to process and that is sure
+ * to eventually make the worker thread block.  The solution here is to
+ * make some progress and then put this work struct back at the end of
+ * the list if the block device is congested.  This way, multiple devices
+ * can make progress from a single worker thread.
+ */
+static noinline int run_scheduled_bios(struct btrfs_device *device)
+{
+       struct bio *pending;
+       struct backing_dev_info *bdi;
+       struct btrfs_fs_info *fs_info;
+       struct bio *tail;
+       struct bio *cur;
+       int again = 0;
+       unsigned long num_run = 0;
+       unsigned long limit;
+
+       bdi = device->bdev->bd_inode->i_mapping->backing_dev_info;
+       fs_info = device->dev_root->fs_info;
+       limit = btrfs_async_submit_limit(fs_info);
+       limit = limit * 2 / 3;
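+       /*
+        * async submitters throttled on nr_async_bios are woken in the
+        * loop below once the in-flight count drops under 2/3 of the
+        * limit
+        */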
+
+loop:
+       spin_lock(&device->io_lock);
+
+       /* take all the bios off the list at once and process them
+        * later on (without the lock held).  But, remember the
+        * tail and other pointers so the bios can be properly reinserted
+        * into the list if we hit congestion
+        */
+       pending = device->pending_bios;
+       tail = device->pending_bio_tail;
+       WARN_ON(pending && !tail);
+       device->pending_bios = NULL;
+       device->pending_bio_tail = NULL;
+
+       /*
+        * if pending was null this time around, no bios need processing
+        * at all and we can stop.  Otherwise it'll loop back up again
+        * and do an additional check so no bios are missed.
+        *
+        * device->running_pending is used to synchronize with the
+        * schedule_bio code.
+        */
+       if (pending) {
+               again = 1;
+               device->running_pending = 1;
+       } else {
+               again = 0;
+               device->running_pending = 0;
+       }
+       spin_unlock(&device->io_lock);
+
+       while (pending) {
+               cur = pending;
+               pending = pending->bi_next;
+               cur->bi_next = NULL;
+               atomic_dec(&fs_info->nr_async_bios);
+
+               if (atomic_read(&fs_info->nr_async_bios) < limit &&
+                   waitqueue_active(&fs_info->async_submit_wait))
+                       wake_up(&fs_info->async_submit_wait);
+
+               BUG_ON(atomic_read(&cur->bi_cnt) == 0);
+               bio_get(cur);
+               submit_bio(cur->bi_rw, cur);
+               bio_put(cur);
+               num_run++;
+
+               /*
+                * we made progress, there is more work to do and the bdi
+                * is now congested.  Back off and let other work structs
+                * run instead
+                */
+               if (pending && bdi_write_congested(bdi) &&
+                   fs_info->fs_devices->open_devices > 1) {
+                       struct bio *old_head;
+
+                       spin_lock(&device->io_lock);
+
+                       old_head = device->pending_bios;
+                       device->pending_bios = pending;
+                       if (device->pending_bio_tail)
+                               tail->bi_next = old_head;
+                       else
+                               device->pending_bio_tail = tail;
+
+                       spin_unlock(&device->io_lock);
+                       btrfs_requeue_work(&device->work);
+                       goto done;
+               }
+       }
+       if (again)
+               goto loop;
+done:
+       return 0;
+}
+
+static void pending_bios_fn(struct btrfs_work *work)
+{
+       struct btrfs_device *device;
+
+       device = container_of(work, struct btrfs_device, work);
+       run_scheduled_bios(device);
+}
+
+static noinline int device_list_add(const char *path,
+                          struct btrfs_super_block *disk_super,
+                          u64 devid, struct btrfs_fs_devices **fs_devices_ret)
+{
+       struct btrfs_device *device;
+       struct btrfs_fs_devices *fs_devices;
+       u64 found_transid = btrfs_super_generation(disk_super);
+
+       fs_devices = find_fsid(disk_super->fsid);
+       if (!fs_devices) {
+               fs_devices = kzalloc(sizeof(*fs_devices), GFP_NOFS);
+               if (!fs_devices)
+                       return -ENOMEM;
+               INIT_LIST_HEAD(&fs_devices->devices);
+               INIT_LIST_HEAD(&fs_devices->alloc_list);
+               list_add(&fs_devices->list, &fs_uuids);
+               memcpy(fs_devices->fsid, disk_super->fsid, BTRFS_FSID_SIZE);
+               fs_devices->latest_devid = devid;
+               fs_devices->latest_trans = found_transid;
+               device = NULL;
+       } else {
+               device = __find_device(&fs_devices->devices, devid,
+                                      disk_super->dev_item.uuid);
+       }
+       if (!device) {
+               if (fs_devices->opened)
+                       return -EBUSY;
+
+               device = kzalloc(sizeof(*device), GFP_NOFS);
+               if (!device) {
+                       /* we can safely leave the fs_devices entry around */
+                       return -ENOMEM;
+               }
+               device->devid = devid;
+               device->work.func = pending_bios_fn;
+               memcpy(device->uuid, disk_super->dev_item.uuid,
+                      BTRFS_UUID_SIZE);
+               device->barriers = 1;
+               spin_lock_init(&device->io_lock);
+               device->name = kstrdup(path, GFP_NOFS);
+               if (!device->name) {
+                       kfree(device);
+                       return -ENOMEM;
+               }
+               INIT_LIST_HEAD(&device->dev_alloc_list);
+               list_add(&device->dev_list, &fs_devices->devices);
+               device->fs_devices = fs_devices;
+               fs_devices->num_devices++;
+       }
+
+       if (found_transid > fs_devices->latest_trans) {
+               fs_devices->latest_devid = devid;
+               fs_devices->latest_trans = found_transid;
+       }
+       *fs_devices_ret = fs_devices;
+       return 0;
+}
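+
+/*
+ * Hedged illustration, not part of the original patch: scanning two
+ * devices of one filesystem calls device_list_add() twice.  The first
+ * call allocates the btrfs_fs_devices entry for the fsid, the second
+ * only appends a btrfs_device to it, and latest_devid/latest_trans end
+ * up tracking whichever device carries the newest superblock
+ * generation.
+ */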
+
+static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
+{
+       struct btrfs_fs_devices *fs_devices;
+       struct btrfs_device *device;
+       struct btrfs_device *orig_dev;
+
+       fs_devices = kzalloc(sizeof(*fs_devices), GFP_NOFS);
+       if (!fs_devices)
+               return ERR_PTR(-ENOMEM);
+
+       INIT_LIST_HEAD(&fs_devices->devices);
+       INIT_LIST_HEAD(&fs_devices->alloc_list);
+       INIT_LIST_HEAD(&fs_devices->list);
+       fs_devices->latest_devid = orig->latest_devid;
+       fs_devices->latest_trans = orig->latest_trans;
+       memcpy(fs_devices->fsid, orig->fsid, sizeof(fs_devices->fsid));
+
+       list_for_each_entry(orig_dev, &orig->devices, dev_list) {
+               device = kzalloc(sizeof(*device), GFP_NOFS);
+               if (!device)
+                       goto error;
+
+               device->name = kstrdup(orig_dev->name, GFP_NOFS);
+               if (!device->name)
+                       goto error;
+
+               device->devid = orig_dev->devid;
+               device->work.func = pending_bios_fn;
+               memcpy(device->uuid, orig_dev->uuid, sizeof(device->uuid));
+               device->barriers = 1;
+               spin_lock_init(&device->io_lock);
+               INIT_LIST_HEAD(&device->dev_list);
+               INIT_LIST_HEAD(&device->dev_alloc_list);
+
+               list_add(&device->dev_list, &fs_devices->devices);
+               device->fs_devices = fs_devices;
+               fs_devices->num_devices++;
+       }
+       return fs_devices;
+error:
+       free_fs_devices(fs_devices);
+       return ERR_PTR(-ENOMEM);
+}
+
+int btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices)
+{
+       struct list_head *tmp;
+       struct list_head *cur;
+       struct btrfs_device *device;
+
+       mutex_lock(&uuid_mutex);
+again:
+       list_for_each_safe(cur, tmp, &fs_devices->devices) {
+               device = list_entry(cur, struct btrfs_device, dev_list);
+               if (device->in_fs_metadata)
+                       continue;
+
+               if (device->bdev) {
+                       close_bdev_exclusive(device->bdev, device->mode);
+                       device->bdev = NULL;
+                       fs_devices->open_devices--;
+               }
+               if (device->writeable) {
+                       list_del_init(&device->dev_alloc_list);
+                       device->writeable = 0;
+                       fs_devices->rw_devices--;
+               }
+               list_del_init(&device->dev_list);
+               fs_devices->num_devices--;
+               kfree(device->name);
+               kfree(device);
+       }
+
+       if (fs_devices->seed) {
+               fs_devices = fs_devices->seed;
+               goto again;
+       }
+
+       mutex_unlock(&uuid_mutex);
+       return 0;
+}
+
+static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
+{
+       struct list_head *cur;
+       struct btrfs_device *device;
+
+       if (--fs_devices->opened > 0)
+               return 0;
+
+       list_for_each(cur, &fs_devices->devices) {
+               device = list_entry(cur, struct btrfs_device, dev_list);
+               if (device->bdev) {
+                       close_bdev_exclusive(device->bdev, device->mode);
+                       fs_devices->open_devices--;
+               }
+               if (device->writeable) {
+                       list_del_init(&device->dev_alloc_list);
+                       fs_devices->rw_devices--;
+               }
+
+               device->bdev = NULL;
+               device->writeable = 0;
+               device->in_fs_metadata = 0;
+       }
+       WARN_ON(fs_devices->open_devices);
+       WARN_ON(fs_devices->rw_devices);
+       fs_devices->opened = 0;
+       fs_devices->seeding = 0;
+
+       return 0;
+}
+
+int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
+{
+       struct btrfs_fs_devices *seed_devices = NULL;
+       int ret;
+
+       mutex_lock(&uuid_mutex);
+       ret = __btrfs_close_devices(fs_devices);
+       if (!fs_devices->opened) {
+               seed_devices = fs_devices->seed;
+               fs_devices->seed = NULL;
+       }
+       mutex_unlock(&uuid_mutex);
+
+       while (seed_devices) {
+               fs_devices = seed_devices;
+               seed_devices = fs_devices->seed;
+               __btrfs_close_devices(fs_devices);
+               free_fs_devices(fs_devices);
+       }
+       return ret;
+}
+
+static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
+                               fmode_t flags, void *holder)
+{
+       struct block_device *bdev;
+       struct list_head *head = &fs_devices->devices;
+       struct list_head *cur;
+       struct btrfs_device *device;
+       struct block_device *latest_bdev = NULL;
+       struct buffer_head *bh;
+       struct btrfs_super_block *disk_super;
+       u64 latest_devid = 0;
+       u64 latest_transid = 0;
+       u64 devid;
+       int seeding = 1;
+       int ret = 0;
+
+       list_for_each(cur, head) {
+               device = list_entry(cur, struct btrfs_device, dev_list);
+               if (device->bdev)
+                       continue;
+               if (!device->name)
+                       continue;
+
+               bdev = open_bdev_exclusive(device->name, flags, holder);
+               if (IS_ERR(bdev)) {
+                       printk(KERN_INFO "open %s failed\n", device->name);
+                       goto error;
+               }
+               set_blocksize(bdev, 4096);
+
+               bh = btrfs_read_dev_super(bdev);
+               if (!bh)
+                       goto error_close;
+
+               disk_super = (struct btrfs_super_block *)bh->b_data;
+               devid = le64_to_cpu(disk_super->dev_item.devid);
+               if (devid != device->devid)
+                       goto error_brelse;
+
+               if (memcmp(device->uuid, disk_super->dev_item.uuid,
+                          BTRFS_UUID_SIZE))
+                       goto error_brelse;
+
+               device->generation = btrfs_super_generation(disk_super);
+               if (!latest_transid || device->generation > latest_transid) {
+                       latest_devid = devid;
+                       latest_transid = device->generation;
+                       latest_bdev = bdev;
+               }
+
+               if (btrfs_super_flags(disk_super) & BTRFS_SUPER_FLAG_SEEDING) {
+                       device->writeable = 0;
+               } else {
+                       device->writeable = !bdev_read_only(bdev);
+                       seeding = 0;
+               }
+
+               device->bdev = bdev;
+               device->in_fs_metadata = 0;
+               device->mode = flags;
+
+               fs_devices->open_devices++;
+               if (device->writeable) {
+                       fs_devices->rw_devices++;
+                       list_add(&device->dev_alloc_list,
+                                &fs_devices->alloc_list);
+               }
+               continue;
+
+error_brelse:
+               brelse(bh);
+error_close:
+               close_bdev_exclusive(bdev, flags);
+error:
+               continue;
+       }
+       if (fs_devices->open_devices == 0) {
+               ret = -EIO;
+               goto out;
+       }
+       fs_devices->seeding = seeding;
+       fs_devices->opened = 1;
+       fs_devices->latest_bdev = latest_bdev;
+       fs_devices->latest_devid = latest_devid;
+       fs_devices->latest_trans = latest_transid;
+       fs_devices->total_rw_bytes = 0;
+out:
+       return ret;
+}
+
+int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
+                      fmode_t flags, void *holder)
+{
+       int ret;
+
+       mutex_lock(&uuid_mutex);
+       if (fs_devices->opened) {
+               fs_devices->opened++;
+               ret = 0;
+       } else {
+               ret = __btrfs_open_devices(fs_devices, flags, holder);
+       }
+       mutex_unlock(&uuid_mutex);
+       return ret;
+}
+
+int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
+                         struct btrfs_fs_devices **fs_devices_ret)
+{
+       struct btrfs_super_block *disk_super;
+       struct block_device *bdev;
+       struct buffer_head *bh;
+       int ret;
+       u64 devid;
+       u64 transid;
+
+       mutex_lock(&uuid_mutex);
+
+       bdev = open_bdev_exclusive(path, flags, holder);
+
+       if (IS_ERR(bdev)) {
+               ret = PTR_ERR(bdev);
+               goto error;
+       }
+
+       ret = set_blocksize(bdev, 4096);
+       if (ret)
+               goto error_close;
+       bh = btrfs_read_dev_super(bdev);
+       if (!bh) {
+               ret = -EIO;
+               goto error_close;
+       }
+       disk_super = (struct btrfs_super_block *)bh->b_data;
+       devid = le64_to_cpu(disk_super->dev_item.devid);
+       transid = btrfs_super_generation(disk_super);
+       if (disk_super->label[0])
+               printk(KERN_INFO "device label %s ", disk_super->label);
+       else {
+                       /* FIXME, make a real uuid parser */
+               printk(KERN_INFO "device fsid %llx-%llx ",
+                      *(unsigned long long *)disk_super->fsid,
+                      *(unsigned long long *)(disk_super->fsid + 8));
+       }
+       printk(KERN_INFO "devid %llu transid %llu %s\n",
+              (unsigned long long)devid, (unsigned long long)transid, path);
+       ret = device_list_add(path, disk_super, devid, fs_devices_ret);
+
+       brelse(bh);
+error_close:
+       close_bdev_exclusive(bdev, flags);
+error:
+       mutex_unlock(&uuid_mutex);
+       return ret;
+}
+
+/*
+ * this uses a pretty simple search; the expectation is that it is
+ * called very infrequently and that a given device has a small number
+ * of extents
+ */
+static noinline int find_free_dev_extent(struct btrfs_trans_handle *trans,
+                                        struct btrfs_device *device,
+                                        u64 num_bytes, u64 *start)
+{
+       struct btrfs_key key;
+       struct btrfs_root *root = device->dev_root;
+       struct btrfs_dev_extent *dev_extent = NULL;
+       struct btrfs_path *path;
+       u64 hole_size = 0;
+       u64 last_byte = 0;
+       u64 search_start = 0;
+       u64 search_end = device->total_bytes;
+       int ret;
+       int slot = 0;
+       int start_found;
+       struct extent_buffer *l;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+       path->reada = 2;
+       start_found = 0;
+
+       /* FIXME use last free of some kind */
+
+       /* we don't want to overwrite the superblock on the drive,
+        * so we make sure to start at an offset of at least 1MB
+        */
+       search_start = max((u64)1024 * 1024, search_start);
+
+       if (root->fs_info->alloc_start + num_bytes <= device->total_bytes)
+               search_start = max(root->fs_info->alloc_start, search_start);
+
+       key.objectid = device->devid;
+       key.offset = search_start;
+       key.type = BTRFS_DEV_EXTENT_KEY;
+       ret = btrfs_search_slot(trans, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto error;
+       ret = btrfs_previous_item(root, path, 0, key.type);
+       if (ret < 0)
+               goto error;
+       l = path->nodes[0];
+       btrfs_item_key_to_cpu(l, &key, path->slots[0]);
+       while (1) {
+               l = path->nodes[0];
+               slot = path->slots[0];
+               if (slot >= btrfs_header_nritems(l)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret == 0)
+                               continue;
+                       if (ret < 0)
+                               goto error;
+no_more_items:
+                       if (!start_found) {
+                               if (search_start >= search_end) {
+                                       ret = -ENOSPC;
+                                       goto error;
+                               }
+                               *start = search_start;
+                               start_found = 1;
+                               goto check_pending;
+                       }
+                       *start = last_byte > search_start ?
+                               last_byte : search_start;
+                       if (search_end <= *start) {
+                               ret = -ENOSPC;
+                               goto error;
+                       }
+                       goto check_pending;
+               }
+               btrfs_item_key_to_cpu(l, &key, slot);
+
+               if (key.objectid < device->devid)
+                       goto next;
+
+               if (key.objectid > device->devid)
+                       goto no_more_items;
+
+               if (key.offset >= search_start && key.offset > last_byte &&
+                   start_found) {
+                       if (last_byte < search_start)
+                               last_byte = search_start;
+                       hole_size = key.offset - last_byte;
+                       if (key.offset > last_byte &&
+                           hole_size >= num_bytes) {
+                               *start = last_byte;
+                               goto check_pending;
+                       }
+               }
+               if (btrfs_key_type(&key) != BTRFS_DEV_EXTENT_KEY)
+                       goto next;
+
+               start_found = 1;
+               dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+               last_byte = key.offset + btrfs_dev_extent_length(l, dev_extent);
+next:
+               path->slots[0]++;
+               cond_resched();
+       }
+check_pending:
+       /* we have to make sure we didn't find an extent that has already
+        * been allocated by the map tree or the original allocation
+        */
+       BUG_ON(*start < search_start);
+
+       if (*start + num_bytes > search_end) {
+               ret = -ENOSPC;
+               goto error;
+       }
+       /* check for pending inserts here */
+       ret = 0;
+
+error:
+       btrfs_free_path(path);
+       return ret;
+}
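
The search above is a first-fit walk over the device's extent items in key order: track where the last extent seen ends, and stop at the first gap large enough. A minimal userspace sketch of the same idea, assuming the extents arrive as a sorted array rather than a btree path (all names here are illustrative, not btrfs API):

#include <stdint.h>
#include <stdio.h>

struct dev_extent { uint64_t start, len; };     /* sorted by start */

/* first-fit: find a hole of at least num_bytes at or after min_start */
static int find_hole(const struct dev_extent *ext, int n, uint64_t dev_size,
                     uint64_t min_start, uint64_t num_bytes, uint64_t *hole)
{
        uint64_t last_byte = min_start;
        int i;

        for (i = 0; i < n; i++) {
                uint64_t end = ext[i].start + ext[i].len;
                if (end <= last_byte)
                        continue;               /* extent entirely below the window */
                if (ext[i].start >= last_byte + num_bytes) {
                        *hole = last_byte;      /* gap before this extent is big enough */
                        return 0;
                }
                last_byte = end;                /* skip past this extent, keep looking */
        }
        if (last_byte + num_bytes <= dev_size) {
                *hole = last_byte;              /* trailing free space */
                return 0;
        }
        return -1;                              /* -ENOSPC in the kernel version */
}

int main(void)
{
        struct dev_extent ext[] = { { 1 << 20, 1 << 20 }, { 4 << 20, 2 << 20 } };
        uint64_t hole;

        if (!find_hole(ext, 2, 16ULL << 20, 1 << 20, 2 << 20, &hole))
                printf("hole at %llu\n", (unsigned long long)hole);
        return 0;
}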
+
+static int btrfs_free_dev_extent(struct btrfs_trans_handle *trans,
+                         struct btrfs_device *device,
+                         u64 start)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_root *root = device->dev_root;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       struct extent_buffer *leaf = NULL;
+       struct btrfs_dev_extent *extent = NULL;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       key.objectid = device->devid;
+       key.offset = start;
+       key.type = BTRFS_DEV_EXTENT_KEY;
+
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret > 0) {
+               ret = btrfs_previous_item(root, path, key.objectid,
+                                         BTRFS_DEV_EXTENT_KEY);
+               BUG_ON(ret);
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+               extent = btrfs_item_ptr(leaf, path->slots[0],
+                                       struct btrfs_dev_extent);
+               BUG_ON(found_key.offset > start || found_key.offset +
+                      btrfs_dev_extent_length(leaf, extent) < start);
+               ret = 0;
+       } else if (ret == 0) {
+               leaf = path->nodes[0];
+               extent = btrfs_item_ptr(leaf, path->slots[0],
+                                       struct btrfs_dev_extent);
+       }
+       BUG_ON(ret);
+
+       if (device->bytes_used > 0)
+               device->bytes_used -= btrfs_dev_extent_length(leaf, extent);
+       ret = btrfs_del_item(trans, root, path);
+       BUG_ON(ret);
+
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+                          struct btrfs_device *device,
+                          u64 chunk_tree, u64 chunk_objectid,
+                          u64 chunk_offset, u64 start, u64 num_bytes)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_root *root = device->dev_root;
+       struct btrfs_dev_extent *extent;
+       struct extent_buffer *leaf;
+       struct btrfs_key key;
+
+       WARN_ON(!device->in_fs_metadata);
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       key.objectid = device->devid;
+       key.offset = start;
+       key.type = BTRFS_DEV_EXTENT_KEY;
+       ret = btrfs_insert_empty_item(trans, root, path, &key,
+                                     sizeof(*extent));
+       BUG_ON(ret);
+
+       leaf = path->nodes[0];
+       extent = btrfs_item_ptr(leaf, path->slots[0],
+                               struct btrfs_dev_extent);
+       btrfs_set_dev_extent_chunk_tree(leaf, extent, chunk_tree);
+       btrfs_set_dev_extent_chunk_objectid(leaf, extent, chunk_objectid);
+       btrfs_set_dev_extent_chunk_offset(leaf, extent, chunk_offset);
+
+       write_extent_buffer(leaf, root->fs_info->chunk_tree_uuid,
+                   (unsigned long)btrfs_dev_extent_chunk_tree_uuid(extent),
+                   BTRFS_UUID_SIZE);
+
+       btrfs_set_dev_extent_length(leaf, extent, num_bytes);
+       btrfs_mark_buffer_dirty(leaf);
+       btrfs_free_path(path);
+       return ret;
+}
+
+static noinline int find_next_chunk(struct btrfs_root *root,
+                                   u64 objectid, u64 *offset)
+{
+       struct btrfs_path *path;
+       int ret;
+       struct btrfs_key key;
+       struct btrfs_chunk *chunk;
+       struct btrfs_key found_key;
+
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       key.objectid = objectid;
+       key.offset = (u64)-1;
+       key.type = BTRFS_CHUNK_ITEM_KEY;
+
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto error;
+
+       BUG_ON(ret == 0);
+
+       ret = btrfs_previous_item(root, path, 0, BTRFS_CHUNK_ITEM_KEY);
+       if (ret) {
+               *offset = 0;
+       } else {
+               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                     path->slots[0]);
+               if (found_key.objectid != objectid)
+                       *offset = 0;
+               else {
+                       chunk = btrfs_item_ptr(path->nodes[0], path->slots[0],
+                                              struct btrfs_chunk);
+                       *offset = found_key.offset +
+                               btrfs_chunk_length(path->nodes[0], chunk);
+               }
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       return ret;
+}
+
+static noinline int find_next_devid(struct btrfs_root *root, u64 *objectid)
+{
+       int ret;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       struct btrfs_path *path;
+
+       root = root->fs_info->chunk_root;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
+       key.type = BTRFS_DEV_ITEM_KEY;
+       key.offset = (u64)-1;
+
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto error;
+
+       BUG_ON(ret == 0);
+
+       ret = btrfs_previous_item(root, path, BTRFS_DEV_ITEMS_OBJECTID,
+                                 BTRFS_DEV_ITEM_KEY);
+       if (ret) {
+               *objectid = 1;
+       } else {
+               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                     path->slots[0]);
+               *objectid = found_key.offset + 1;
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       return ret;
+}
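
find_next_chunk() and find_next_devid() share a btree idiom: search for the largest possible offset, (u64)-1, which parks the cursor just past the last item of that type, then step back one slot with btrfs_previous_item() to read the highest existing key. The next free id is that offset plus one (or, for chunks, the end of the last chunk). A userspace analogy, assuming a plain sorted array stands in for the btree path:

#include <stdint.h>
#include <stdio.h>

/* a search for (u64)-1 lands past the end; stepping back one slot is
 * the btrfs_previous_item() analogue */
static uint64_t next_free_id(const uint64_t *ids, int n)
{
        int slot = n;                   /* cursor just past the last item */

        if (slot == 0)
                return 1;               /* no items yet: first devid is 1 */
        slot--;                         /* previous item = highest existing id */
        return ids[slot] + 1;
}

int main(void)
{
        uint64_t ids[] = { 1, 2, 5 };

        printf("next devid: %llu\n", (unsigned long long)next_free_id(ids, 3));
        return 0;
}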
+
+/*
+ * the device information is stored in the chunk root
+ * the btrfs_device struct should be fully filled in
+ */
+int btrfs_add_device(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root,
+                    struct btrfs_device *device)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_dev_item *dev_item;
+       struct extent_buffer *leaf;
+       struct btrfs_key key;
+       unsigned long ptr;
+
+       root = root->fs_info->chunk_root;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
+       key.type = BTRFS_DEV_ITEM_KEY;
+       key.offset = device->devid;
+
+       ret = btrfs_insert_empty_item(trans, root, path, &key,
+                                     sizeof(*dev_item));
+       if (ret)
+               goto out;
+
+       leaf = path->nodes[0];
+       dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
+
+       btrfs_set_device_id(leaf, dev_item, device->devid);
+       btrfs_set_device_generation(leaf, dev_item, 0);
+       btrfs_set_device_type(leaf, dev_item, device->type);
+       btrfs_set_device_io_align(leaf, dev_item, device->io_align);
+       btrfs_set_device_io_width(leaf, dev_item, device->io_width);
+       btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
+       btrfs_set_device_total_bytes(leaf, dev_item, device->total_bytes);
+       btrfs_set_device_bytes_used(leaf, dev_item, device->bytes_used);
+       btrfs_set_device_group(leaf, dev_item, 0);
+       btrfs_set_device_seek_speed(leaf, dev_item, 0);
+       btrfs_set_device_bandwidth(leaf, dev_item, 0);
+       btrfs_set_device_start_offset(leaf, dev_item, 0);
+
+       ptr = (unsigned long)btrfs_device_uuid(dev_item);
+       write_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
+       ptr = (unsigned long)btrfs_device_fsid(dev_item);
+       write_extent_buffer(leaf, root->fs_info->fsid, ptr, BTRFS_UUID_SIZE);
+       btrfs_mark_buffer_dirty(leaf);
+
+       ret = 0;
+out:
+       btrfs_free_path(path);
+       return ret;
+}
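
Each device thus owns exactly one item in the chunk tree, keyed as (BTRFS_DEV_ITEMS_OBJECTID, BTRFS_DEV_ITEM_KEY, devid), so all device items sort together and the devid can be read straight out of key.offset. A hedged sketch of that key triple (the numeric constants here are for illustration; the real definitions live in ctree.h):

#include <stdint.h>
#include <stdio.h>

/* illustrative values only; see ctree.h for the on-disk constants */
#define BTRFS_DEV_ITEMS_OBJECTID 1ULL
#define BTRFS_DEV_ITEM_KEY       216

struct btrfs_key_sketch {
        uint64_t objectid;      /* groups all device items together */
        uint8_t  type;          /* item kind within that group */
        uint64_t offset;        /* the devid itself */
};

int main(void)
{
        struct btrfs_key_sketch key = {
                .objectid = BTRFS_DEV_ITEMS_OBJECTID,
                .type     = BTRFS_DEV_ITEM_KEY,
                .offset   = 3,  /* devid 3 */
        };

        printf("dev item key: (%llu, %u, %llu)\n",
               (unsigned long long)key.objectid, key.type,
               (unsigned long long)key.offset);
        return 0;
}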
+
+static int btrfs_rm_dev_item(struct btrfs_root *root,
+                            struct btrfs_device *device)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct btrfs_trans_handle *trans;
+
+       root = root->fs_info->chunk_root;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       trans = btrfs_start_transaction(root, 1);
+       key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
+       key.type = BTRFS_DEV_ITEM_KEY;
+       key.offset = device->devid;
+       lock_chunks(root);
+
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       if (ret < 0)
+               goto out;
+
+       if (ret > 0) {
+               ret = -ENOENT;
+               goto out;
+       }
+
+       ret = btrfs_del_item(trans, root, path);
+       if (ret)
+               goto out;
+out:
+       btrfs_free_path(path);
+       unlock_chunks(root);
+       btrfs_commit_transaction(trans, root);
+       return ret;
+}
+
+int btrfs_rm_device(struct btrfs_root *root, char *device_path)
+{
+       struct btrfs_device *device;
+       struct btrfs_device *next_device;
+       struct block_device *bdev;
+       struct buffer_head *bh = NULL;
+       struct btrfs_super_block *disk_super;
+       u64 all_avail;
+       u64 devid;
+       u64 num_devices;
+       u8 *dev_uuid;
+       int ret = 0;
+
+       mutex_lock(&uuid_mutex);
+       mutex_lock(&root->fs_info->volume_mutex);
+
+       all_avail = root->fs_info->avail_data_alloc_bits |
+               root->fs_info->avail_system_alloc_bits |
+               root->fs_info->avail_metadata_alloc_bits;
+
+       if ((all_avail & BTRFS_BLOCK_GROUP_RAID10) &&
+           root->fs_info->fs_devices->rw_devices <= 4) {
+               printk(KERN_ERR "btrfs: unable to go below four devices "
+                      "on raid10\n");
+               ret = -EINVAL;
+               goto out;
+       }
+
+       if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
+           root->fs_info->fs_devices->rw_devices <= 2) {
+               printk(KERN_ERR "btrfs: unable to go below two "
+                      "devices on raid1\n");
+               ret = -EINVAL;
+               goto out;
+       }
+
+       if (strcmp(device_path, "missing") == 0) {
+               struct list_head *cur;
+               struct list_head *devices;
+               struct btrfs_device *tmp;
+
+               device = NULL;
+               devices = &root->fs_info->fs_devices->devices;
+               list_for_each(cur, devices) {
+                       tmp = list_entry(cur, struct btrfs_device, dev_list);
+                       if (tmp->in_fs_metadata && !tmp->bdev) {
+                               device = tmp;
+                               break;
+                       }
+               }
+               bdev = NULL;
+               bh = NULL;
+               disk_super = NULL;
+               if (!device) {
+                       printk(KERN_ERR "btrfs: no missing devices found to "
+                              "remove\n");
+                       goto out;
+               }
+       } else {
+               bdev = open_bdev_exclusive(device_path, FMODE_READ,
+                                     root->fs_info->bdev_holder);
+               if (IS_ERR(bdev)) {
+                       ret = PTR_ERR(bdev);
+                       goto out;
+               }
+
+               set_blocksize(bdev, 4096);
+               bh = btrfs_read_dev_super(bdev);
+               if (!bh) {
+                       ret = -EIO;
+                       goto error_close;
+               }
+               disk_super = (struct btrfs_super_block *)bh->b_data;
+               devid = le64_to_cpu(disk_super->dev_item.devid);
+               dev_uuid = disk_super->dev_item.uuid;
+               device = btrfs_find_device(root, devid, dev_uuid,
+                                          disk_super->fsid);
+               if (!device) {
+                       ret = -ENOENT;
+                       goto error_brelse;
+               }
+       }
+
+       if (device->writeable && root->fs_info->fs_devices->rw_devices == 1) {
+               printk(KERN_ERR "btrfs: unable to remove the only writeable "
+                      "device\n");
+               ret = -EINVAL;
+               goto error_brelse;
+       }
+
+       if (device->writeable) {
+               list_del_init(&device->dev_alloc_list);
+               root->fs_info->fs_devices->rw_devices--;
+       }
+
+       ret = btrfs_shrink_device(device, 0);
+       if (ret)
+               goto error_brelse;
+
+       ret = btrfs_rm_dev_item(root->fs_info->chunk_root, device);
+       if (ret)
+               goto error_brelse;
+
+       device->in_fs_metadata = 0;
+       list_del_init(&device->dev_list);
+       device->fs_devices->num_devices--;
+
+       next_device = list_entry(root->fs_info->fs_devices->devices.next,
+                                struct btrfs_device, dev_list);
+       if (device->bdev == root->fs_info->sb->s_bdev)
+               root->fs_info->sb->s_bdev = next_device->bdev;
+       if (device->bdev == root->fs_info->fs_devices->latest_bdev)
+               root->fs_info->fs_devices->latest_bdev = next_device->bdev;
+
+       if (device->bdev) {
+               close_bdev_exclusive(device->bdev, device->mode);
+               device->bdev = NULL;
+               device->fs_devices->open_devices--;
+       }
+
+       num_devices = btrfs_super_num_devices(&root->fs_info->super_copy) - 1;
+       btrfs_set_super_num_devices(&root->fs_info->super_copy, num_devices);
+
+       if (device->fs_devices->open_devices == 0) {
+               struct btrfs_fs_devices *fs_devices;
+               fs_devices = root->fs_info->fs_devices;
+               while (fs_devices) {
+                       if (fs_devices->seed == device->fs_devices)
+                               break;
+                       fs_devices = fs_devices->seed;
+               }
+               fs_devices->seed = device->fs_devices->seed;
+               device->fs_devices->seed = NULL;
+               __btrfs_close_devices(device->fs_devices);
+               free_fs_devices(device->fs_devices);
+       }
+
+       /*
+        * at this point, the device is zero sized.  We want to
+        * remove it from the devices list and zero out the old super
+        */
+       if (device->writeable) {
+               /* make sure this device isn't detected as part of
+                * the FS anymore
+                */
+               memset(&disk_super->magic, 0, sizeof(disk_super->magic));
+               set_buffer_dirty(bh);
+               sync_dirty_buffer(bh);
+       }
+
+       kfree(device->name);
+       kfree(device);
+       ret = 0;
+
+error_brelse:
+       brelse(bh);
+error_close:
+       if (bdev)
+               close_bdev_exclusive(bdev, FMODE_READ);
+out:
+       mutex_unlock(&root->fs_info->volume_mutex);
+       mutex_unlock(&uuid_mutex);
+       return ret;
+}
+
+/*
+ * does all the dirty work required for changing the file system's UUID.
+ */
+static int btrfs_prepare_sprout(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *root)
+{
+       struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+       struct btrfs_fs_devices *old_devices;
+       struct btrfs_fs_devices *seed_devices;
+       struct btrfs_super_block *disk_super = &root->fs_info->super_copy;
+       struct btrfs_device *device;
+       u64 super_flags;
+
+       BUG_ON(!mutex_is_locked(&uuid_mutex));
+       if (!fs_devices->seeding)
+               return -EINVAL;
+
+       seed_devices = kzalloc(sizeof(*fs_devices), GFP_NOFS);
+       if (!seed_devices)
+               return -ENOMEM;
+
+       old_devices = clone_fs_devices(fs_devices);
+       if (IS_ERR(old_devices)) {
+               kfree(seed_devices);
+               return PTR_ERR(old_devices);
+       }
+
+       list_add(&old_devices->list, &fs_uuids);
+
+       memcpy(seed_devices, fs_devices, sizeof(*seed_devices));
+       seed_devices->opened = 1;
+       INIT_LIST_HEAD(&seed_devices->devices);
+       INIT_LIST_HEAD(&seed_devices->alloc_list);
+       list_splice_init(&fs_devices->devices, &seed_devices->devices);
+       list_splice_init(&fs_devices->alloc_list, &seed_devices->alloc_list);
+       list_for_each_entry(device, &seed_devices->devices, dev_list) {
+               device->fs_devices = seed_devices;
+       }
+
+       fs_devices->seeding = 0;
+       fs_devices->num_devices = 0;
+       fs_devices->open_devices = 0;
+       fs_devices->seed = seed_devices;
+
+       generate_random_uuid(fs_devices->fsid);
+       memcpy(root->fs_info->fsid, fs_devices->fsid, BTRFS_FSID_SIZE);
+       memcpy(disk_super->fsid, fs_devices->fsid, BTRFS_FSID_SIZE);
+       super_flags = btrfs_super_flags(disk_super) &
+                     ~BTRFS_SUPER_FLAG_SEEDING;
+       btrfs_set_super_flags(disk_super, super_flags);
+
+       return 0;
+}
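
Note that the sprout never copies per-device state: the freshly allocated btrfs_fs_devices takes over the old identity, the entire device list is spliced onto it in O(1), and the mounted filesystem is reinitialized with a new fsid. A self-contained sketch of that splice-and-reinit pattern, using a trimmed-down rendering of the kernel's list.h (struct names are illustrative):

#include <stdio.h>

/* trimmed-down circular doubly-linked list, after include/linux/list.h */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
        n->next = h->next; n->prev = h;
        h->next->prev = n; h->next = n;
}

/* move every entry from "from" onto "to", leaving "from" empty */
static void list_splice_init(struct list_head *from, struct list_head *to)
{
        if (from->next == from)
                return;
        from->next->prev = to;
        from->prev->next = to->next;
        to->next->prev = from->prev;
        to->next = from->next;
        INIT_LIST_HEAD(from);
}

struct dev { struct list_head link; int id; };  /* link first: cast is safe */

int main(void)
{
        struct list_head mounted, seed;
        struct dev a = { .id = 1 }, b = { .id = 2 };
        struct list_head *p;

        INIT_LIST_HEAD(&mounted);
        INIT_LIST_HEAD(&seed);
        list_add(&a.link, &mounted);
        list_add(&b.link, &mounted);

        /* the sprout: all devices move to the seed list in one splice */
        list_splice_init(&mounted, &seed);

        for (p = seed.next; p != &seed; p = p->next)
                printf("seed devid %d\n", ((struct dev *)p)->id);
        printf("mounted list empty: %d\n", mounted.next == &mounted);
        return 0;
}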
+
+/*
+ * store the expected generation for seed devices in device items.
+ */
+static int btrfs_finish_sprout(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *root)
+{
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_dev_item *dev_item;
+       struct btrfs_device *device;
+       struct btrfs_key key;
+       u8 fs_uuid[BTRFS_UUID_SIZE];
+       u8 dev_uuid[BTRFS_UUID_SIZE];
+       u64 devid;
+       int ret;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       root = root->fs_info->chunk_root;
+       key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
+       key.offset = 0;
+       key.type = BTRFS_DEV_ITEM_KEY;
+
+       while (1) {
+               ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+               if (ret < 0)
+                       goto error;
+
+               leaf = path->nodes[0];
+next_slot:
+               if (path->slots[0] >= btrfs_header_nritems(leaf)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret > 0)
+                               break;
+                       if (ret < 0)
+                               goto error;
+                       leaf = path->nodes[0];
+                       btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+                       btrfs_release_path(root, path);
+                       continue;
+               }
+
+               btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
+               if (key.objectid != BTRFS_DEV_ITEMS_OBJECTID ||
+                   key.type != BTRFS_DEV_ITEM_KEY)
+                       break;
+
+               dev_item = btrfs_item_ptr(leaf, path->slots[0],
+                                         struct btrfs_dev_item);
+               devid = btrfs_device_id(leaf, dev_item);
+               read_extent_buffer(leaf, dev_uuid,
+                                  (unsigned long)btrfs_device_uuid(dev_item),
+                                  BTRFS_UUID_SIZE);
+               read_extent_buffer(leaf, fs_uuid,
+                                  (unsigned long)btrfs_device_fsid(dev_item),
+                                  BTRFS_UUID_SIZE);
+               device = btrfs_find_device(root, devid, dev_uuid, fs_uuid);
+               BUG_ON(!device);
+
+               if (device->fs_devices->seeding) {
+                       btrfs_set_device_generation(leaf, dev_item,
+                                                   device->generation);
+                       btrfs_mark_buffer_dirty(leaf);
+               }
+
+               path->slots[0]++;
+               goto next_slot;
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       return ret;
+}
+
+int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_device *device;
+       struct block_device *bdev;
+       struct list_head *cur;
+       struct list_head *devices;
+       struct super_block *sb = root->fs_info->sb;
+       u64 total_bytes;
+       int seeding_dev = 0;
+       int ret = 0;
+
+       if ((sb->s_flags & MS_RDONLY) && !root->fs_info->fs_devices->seeding)
+               return -EINVAL;
+
+       bdev = open_bdev_exclusive(device_path, 0, root->fs_info->bdev_holder);
+       if (IS_ERR(bdev))
+               return PTR_ERR(bdev);
+
+       if (root->fs_info->fs_devices->seeding) {
+               seeding_dev = 1;
+               down_write(&sb->s_umount);
+               mutex_lock(&uuid_mutex);
+       }
+
+       filemap_write_and_wait(bdev->bd_inode->i_mapping);
+       mutex_lock(&root->fs_info->volume_mutex);
+
+       devices = &root->fs_info->fs_devices->devices;
+       list_for_each(cur, devices) {
+               device = list_entry(cur, struct btrfs_device, dev_list);
+               if (device->bdev == bdev) {
+                       ret = -EEXIST;
+                       goto error;
+               }
+       }
+
+       device = kzalloc(sizeof(*device), GFP_NOFS);
+       if (!device) {
+               /* we can safely leave the fs_devices entry around */
+               ret = -ENOMEM;
+               goto error;
+       }
+
+       device->name = kstrdup(device_path, GFP_NOFS);
+       if (!device->name) {
+               kfree(device);
+               ret = -ENOMEM;
+               goto error;
+       }
+
+       ret = find_next_devid(root, &device->devid);
+       if (ret) {
+               kfree(device);
+               goto error;
+       }
+
+       trans = btrfs_start_transaction(root, 1);
+       lock_chunks(root);
+
+       device->barriers = 1;
+       device->writeable = 1;
+       device->work.func = pending_bios_fn;
+       generate_random_uuid(device->uuid);
+       spin_lock_init(&device->io_lock);
+       device->generation = trans->transid;
+       device->io_width = root->sectorsize;
+       device->io_align = root->sectorsize;
+       device->sector_size = root->sectorsize;
+       device->total_bytes = i_size_read(bdev->bd_inode);
+       device->dev_root = root->fs_info->dev_root;
+       device->bdev = bdev;
+       device->in_fs_metadata = 1;
+       device->mode = 0;
+       set_blocksize(device->bdev, 4096);
+
+       if (seeding_dev) {
+               sb->s_flags &= ~MS_RDONLY;
+               ret = btrfs_prepare_sprout(trans, root);
+               BUG_ON(ret);
+       }
+
+       device->fs_devices = root->fs_info->fs_devices;
+       list_add(&device->dev_list, &root->fs_info->fs_devices->devices);
+       list_add(&device->dev_alloc_list,
+                &root->fs_info->fs_devices->alloc_list);
+       root->fs_info->fs_devices->num_devices++;
+       root->fs_info->fs_devices->open_devices++;
+       root->fs_info->fs_devices->rw_devices++;
+       root->fs_info->fs_devices->total_rw_bytes += device->total_bytes;
+
+       total_bytes = btrfs_super_total_bytes(&root->fs_info->super_copy);
+       btrfs_set_super_total_bytes(&root->fs_info->super_copy,
+                                   total_bytes + device->total_bytes);
+
+       total_bytes = btrfs_super_num_devices(&root->fs_info->super_copy);
+       btrfs_set_super_num_devices(&root->fs_info->super_copy,
+                                   total_bytes + 1);
+
+       if (seeding_dev) {
+               ret = init_first_rw_device(trans, root, device);
+               BUG_ON(ret);
+               ret = btrfs_finish_sprout(trans, root);
+               BUG_ON(ret);
+       } else {
+               ret = btrfs_add_device(trans, root, device);
+       }
+
+       unlock_chunks(root);
+       btrfs_commit_transaction(trans, root);
+
+       if (seeding_dev) {
+               mutex_unlock(&uuid_mutex);
+               up_write(&sb->s_umount);
+
+               ret = btrfs_relocate_sys_chunks(root);
+               BUG_ON(ret);
+       }
+out:
+       mutex_unlock(&root->fs_info->volume_mutex);
+       return ret;
+error:
+       close_bdev_exclusive(bdev, 0);
+       if (seeding_dev) {
+               mutex_unlock(&uuid_mutex);
+               up_write(&sb->s_umount);
+       }
+       goto out;
+}
+
+static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+                                       struct btrfs_device *device)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_root *root;
+       struct btrfs_dev_item *dev_item;
+       struct extent_buffer *leaf;
+       struct btrfs_key key;
+
+       root = device->dev_root->fs_info->chunk_root;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
+       key.type = BTRFS_DEV_ITEM_KEY;
+       key.offset = device->devid;
+
+       ret = btrfs_search_slot(trans, root, &key, path, 0, 1);
+       if (ret < 0)
+               goto out;
+
+       if (ret > 0) {
+               ret = -ENOENT;
+               goto out;
+       }
+
+       leaf = path->nodes[0];
+       dev_item = btrfs_item_ptr(leaf, path->slots[0], struct btrfs_dev_item);
+
+       btrfs_set_device_id(leaf, dev_item, device->devid);
+       btrfs_set_device_type(leaf, dev_item, device->type);
+       btrfs_set_device_io_align(leaf, dev_item, device->io_align);
+       btrfs_set_device_io_width(leaf, dev_item, device->io_width);
+       btrfs_set_device_sector_size(leaf, dev_item, device->sector_size);
+       btrfs_set_device_total_bytes(leaf, dev_item, device->total_bytes);
+       btrfs_set_device_bytes_used(leaf, dev_item, device->bytes_used);
+       btrfs_mark_buffer_dirty(leaf);
+
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+static int __btrfs_grow_device(struct btrfs_trans_handle *trans,
+                     struct btrfs_device *device, u64 new_size)
+{
+       struct btrfs_super_block *super_copy =
+               &device->dev_root->fs_info->super_copy;
+       u64 old_total = btrfs_super_total_bytes(super_copy);
+       u64 diff = new_size - device->total_bytes;
+
+       if (!device->writeable)
+               return -EACCES;
+       if (new_size <= device->total_bytes)
+               return -EINVAL;
+
+       btrfs_set_super_total_bytes(super_copy, old_total + diff);
+       device->fs_devices->total_rw_bytes += diff;
+
+       device->total_bytes = new_size;
+       return btrfs_update_device(trans, device);
+}
+
+int btrfs_grow_device(struct btrfs_trans_handle *trans,
+                     struct btrfs_device *device, u64 new_size)
+{
+       int ret;
+       lock_chunks(device->dev_root);
+       ret = __btrfs_grow_device(trans, device, new_size);
+       unlock_chunks(device->dev_root);
+       return ret;
+}
+
+static int btrfs_free_chunk(struct btrfs_trans_handle *trans,
+                           struct btrfs_root *root,
+                           u64 chunk_tree, u64 chunk_objectid,
+                           u64 chunk_offset)
+{
+       int ret;
+       struct btrfs_path *path;
+       struct btrfs_key key;
+
+       root = root->fs_info->chunk_root;
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       key.objectid = chunk_objectid;
+       key.offset = chunk_offset;
+       key.type = BTRFS_CHUNK_ITEM_KEY;
+
+       ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
+       BUG_ON(ret);
+
+       ret = btrfs_del_item(trans, root, path);
+       BUG_ON(ret);
+
+       btrfs_free_path(path);
+       return 0;
+}
+
+static int btrfs_del_sys_chunk(struct btrfs_root *root, u64 chunk_objectid, u64
+                       chunk_offset)
+{
+       struct btrfs_super_block *super_copy = &root->fs_info->super_copy;
+       struct btrfs_disk_key *disk_key;
+       struct btrfs_chunk *chunk;
+       u8 *ptr;
+       int ret = 0;
+       u32 num_stripes;
+       u32 array_size;
+       u32 len = 0;
+       u32 cur;
+       struct btrfs_key key;
+
+       array_size = btrfs_super_sys_array_size(super_copy);
+
+       ptr = super_copy->sys_chunk_array;
+       cur = 0;
+
+       while (cur < array_size) {
+               disk_key = (struct btrfs_disk_key *)ptr;
+               btrfs_disk_key_to_cpu(&key, disk_key);
+
+               len = sizeof(*disk_key);
+
+               if (key.type == BTRFS_CHUNK_ITEM_KEY) {
+                       chunk = (struct btrfs_chunk *)(ptr + len);
+                       num_stripes = btrfs_stack_chunk_num_stripes(chunk);
+                       len += btrfs_chunk_item_size(num_stripes);
+               } else {
+                       ret = -EIO;
+                       break;
+               }
+               if (key.objectid == chunk_objectid &&
+                   key.offset == chunk_offset) {
+                       memmove(ptr, ptr + len, array_size - (cur + len));
+                       array_size -= len;
+                       btrfs_set_super_sys_array_size(super_copy, array_size);
+               } else {
+                       ptr += len;
+                       cur += len;
+               }
+       }
+       return ret;
+}
+
+static int btrfs_relocate_chunk(struct btrfs_root *root,
+                        u64 chunk_tree, u64 chunk_objectid,
+                        u64 chunk_offset)
+{
+       struct extent_map_tree *em_tree;
+       struct btrfs_root *extent_root;
+       struct btrfs_trans_handle *trans;
+       struct extent_map *em;
+       struct map_lookup *map;
+       int ret;
+       int i;
+
+       printk(KERN_INFO "btrfs relocating chunk %llu\n",
+              (unsigned long long)chunk_offset);
+       root = root->fs_info->chunk_root;
+       extent_root = root->fs_info->extent_root;
+       em_tree = &root->fs_info->mapping_tree.map_tree;
+
+       /* step one, relocate all the extents inside this chunk */
+       ret = btrfs_relocate_block_group(extent_root, chunk_offset);
+       BUG_ON(ret);
+
+       trans = btrfs_start_transaction(root, 1);
+       BUG_ON(!trans);
+
+       lock_chunks(root);
+
+       /*
+        * step two, delete the device extents and the
+        * chunk tree entries
+        */
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, chunk_offset, 1);
+       spin_unlock(&em_tree->lock);
+
+       BUG_ON(em->start > chunk_offset ||
+              em->start + em->len < chunk_offset);
+       map = (struct map_lookup *)em->bdev;
+
+       for (i = 0; i < map->num_stripes; i++) {
+               ret = btrfs_free_dev_extent(trans, map->stripes[i].dev,
+                                           map->stripes[i].physical);
+               BUG_ON(ret);
+
+               if (map->stripes[i].dev) {
+                       ret = btrfs_update_device(trans, map->stripes[i].dev);
+                       BUG_ON(ret);
+               }
+       }
+       ret = btrfs_free_chunk(trans, root, chunk_tree, chunk_objectid,
+                              chunk_offset);
+
+       BUG_ON(ret);
+
+       if (map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
+               ret = btrfs_del_sys_chunk(root, chunk_objectid, chunk_offset);
+               BUG_ON(ret);
+       }
+
+       ret = btrfs_remove_block_group(trans, extent_root, chunk_offset);
+       BUG_ON(ret);
+
+       spin_lock(&em_tree->lock);
+       remove_extent_mapping(em_tree, em);
+       spin_unlock(&em_tree->lock);
+
+       kfree(map);
+       em->bdev = NULL;
+
+       /* once for the tree */
+       free_extent_map(em);
+       /* once for us */
+       free_extent_map(em);
+
+       unlock_chunks(root);
+       btrfs_end_transaction(trans, root);
+       return 0;
+}
+
+static int btrfs_relocate_sys_chunks(struct btrfs_root *root)
+{
+       struct btrfs_root *chunk_root = root->fs_info->chunk_root;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_chunk *chunk;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       u64 chunk_tree = chunk_root->root_key.objectid;
+       u64 chunk_type;
+       int ret;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+       key.offset = (u64)-1;
+       key.type = BTRFS_CHUNK_ITEM_KEY;
+
+       while (1) {
+               ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
+               if (ret < 0)
+                       goto error;
+               BUG_ON(ret == 0);
+
+               ret = btrfs_previous_item(chunk_root, path, key.objectid,
+                                         key.type);
+               if (ret < 0)
+                       goto error;
+               if (ret > 0)
+                       break;
+
+               leaf = path->nodes[0];
+               btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
+
+               chunk = btrfs_item_ptr(leaf, path->slots[0],
+                                      struct btrfs_chunk);
+               chunk_type = btrfs_chunk_type(leaf, chunk);
+               btrfs_release_path(chunk_root, path);
+
+               if (chunk_type & BTRFS_BLOCK_GROUP_SYSTEM) {
+                       ret = btrfs_relocate_chunk(chunk_root, chunk_tree,
+                                                  found_key.objectid,
+                                                  found_key.offset);
+                       BUG_ON(ret);
+               }
+
+               if (found_key.offset == 0)
+                       break;
+               key.offset = found_key.offset - 1;
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       return ret;
+}
+
+static u64 div_factor(u64 num, int factor)
+{
+       if (factor == 10)
+               return num;
+       num *= factor;
+       do_div(num, 10);
+       return num;
+}
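
div_factor() computes num * factor / 10, i.e. factor counts tenths: div_factor(x, 1) is 10% of x (used below to size the balance shuffle and, earlier, to cap a chunk at 10% of writeable space), while factor == 10 is the identity and skips the division. A quick standalone check, with do_div() approximated by plain 64-bit division:

#include <assert.h>
#include <stdint.h>

/* userspace rendering of div_factor(); in the kernel, do_div() divides in place */
static uint64_t div_factor(uint64_t num, int factor)
{
        if (factor == 10)
                return num;
        return num * factor / 10;
}

int main(void)
{
        assert(div_factor(1000, 1) == 100);     /* 10% */
        assert(div_factor(1000, 3) == 300);     /* 30% */
        assert(div_factor(1000, 10) == 1000);   /* identity, no division */
        return 0;
}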
+
+int btrfs_balance(struct btrfs_root *dev_root)
+{
+       int ret;
+       struct list_head *cur;
+       struct list_head *devices = &dev_root->fs_info->fs_devices->devices;
+       struct btrfs_device *device;
+       u64 old_size;
+       u64 size_to_free;
+       struct btrfs_path *path;
+       struct btrfs_key key;
+       struct btrfs_chunk *chunk;
+       struct btrfs_root *chunk_root = dev_root->fs_info->chunk_root;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_key found_key;
+
+       if (dev_root->fs_info->sb->s_flags & MS_RDONLY)
+               return -EROFS;
+
+       mutex_lock(&dev_root->fs_info->volume_mutex);
+       dev_root = dev_root->fs_info->dev_root;
+
+       /* step one, make some room on all the devices */
+       list_for_each(cur, devices) {
+               device = list_entry(cur, struct btrfs_device, dev_list);
+               old_size = device->total_bytes;
+               size_to_free = div_factor(old_size, 1);
+               size_to_free = min(size_to_free, (u64)1 * 1024 * 1024);
+               if (!device->writeable ||
+                   device->total_bytes - device->bytes_used > size_to_free)
+                       continue;
+
+               ret = btrfs_shrink_device(device, old_size - size_to_free);
+               BUG_ON(ret);
+
+               trans = btrfs_start_transaction(dev_root, 1);
+               BUG_ON(!trans);
+
+               ret = btrfs_grow_device(trans, device, old_size);
+               BUG_ON(ret);
+
+               btrfs_end_transaction(trans, dev_root);
+       }
+
+       /* step two, relocate all the chunks */
+       path = btrfs_alloc_path();
+       BUG_ON(!path);
+
+       key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+       key.offset = (u64)-1;
+       key.type = BTRFS_CHUNK_ITEM_KEY;
+
+       while (1) {
+               ret = btrfs_search_slot(NULL, chunk_root, &key, path, 0, 0);
+               if (ret < 0)
+                       goto error;
+
+               /*
+                * this shouldn't happen, it means the last relocate
+                * failed
+                */
+               if (ret == 0)
+                       break;
+
+               ret = btrfs_previous_item(chunk_root, path, 0,
+                                         BTRFS_CHUNK_ITEM_KEY);
+               if (ret)
+                       break;
+
+               btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+                                     path->slots[0]);
+               if (found_key.objectid != key.objectid)
+                       break;
+
+               chunk = btrfs_item_ptr(path->nodes[0],
+                                      path->slots[0],
+                                      struct btrfs_chunk);
+               key.offset = found_key.offset;
+               /* chunk zero is special */
+               if (key.offset == 0)
+                       break;
+
+               btrfs_release_path(chunk_root, path);
+               ret = btrfs_relocate_chunk(chunk_root,
+                                          chunk_root->root_key.objectid,
+                                          found_key.objectid,
+                                          found_key.offset);
+               BUG_ON(ret);
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       mutex_unlock(&dev_root->fs_info->volume_mutex);
+       return ret;
+}
+
+/*
+ * shrinking a device means finding all of the device extents past
+ * the new size, and then following the back refs to the chunks.
+ * The chunk relocation code actually frees the device extent
+ */
+int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
+{
+       struct btrfs_trans_handle *trans;
+       struct btrfs_root *root = device->dev_root;
+       struct btrfs_dev_extent *dev_extent = NULL;
+       struct btrfs_path *path;
+       u64 length;
+       u64 chunk_tree;
+       u64 chunk_objectid;
+       u64 chunk_offset;
+       int ret;
+       int slot;
+       struct extent_buffer *l;
+       struct btrfs_key key;
+       struct btrfs_super_block *super_copy = &root->fs_info->super_copy;
+       u64 old_total = btrfs_super_total_bytes(super_copy);
+       u64 diff = device->total_bytes - new_size;
+
+       if (new_size >= device->total_bytes)
+               return -EINVAL;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       trans = btrfs_start_transaction(root, 1);
+       if (!trans) {
+               ret = -ENOMEM;
+               goto done;
+       }
+
+       path->reada = 2;
+
+       lock_chunks(root);
+
+       device->total_bytes = new_size;
+       if (device->writeable)
+               device->fs_devices->total_rw_bytes -= diff;
+       ret = btrfs_update_device(trans, device);
+       if (ret) {
+               unlock_chunks(root);
+               btrfs_end_transaction(trans, root);
+               goto done;
+       }
+       WARN_ON(diff > old_total);
+       btrfs_set_super_total_bytes(super_copy, old_total - diff);
+       unlock_chunks(root);
+       btrfs_end_transaction(trans, root);
+
+       key.objectid = device->devid;
+       key.offset = (u64)-1;
+       key.type = BTRFS_DEV_EXTENT_KEY;
+
+       while (1) {
+               ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+               if (ret < 0)
+                       goto done;
+
+               ret = btrfs_previous_item(root, path, 0, key.type);
+               if (ret < 0)
+                       goto done;
+               if (ret) {
+                       ret = 0;
+                       goto done;
+               }
+
+               l = path->nodes[0];
+               slot = path->slots[0];
+               btrfs_item_key_to_cpu(l, &key, path->slots[0]);
+
+               if (key.objectid != device->devid)
+                       goto done;
+
+               dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+               length = btrfs_dev_extent_length(l, dev_extent);
+
+               if (key.offset + length <= new_size)
+                       goto done;
+
+               chunk_tree = btrfs_dev_extent_chunk_tree(l, dev_extent);
+               chunk_objectid = btrfs_dev_extent_chunk_objectid(l, dev_extent);
+               chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+               btrfs_release_path(root, path);
+
+               ret = btrfs_relocate_chunk(root, chunk_tree, chunk_objectid,
+                                          chunk_offset);
+               if (ret)
+                       goto done;
+       }
+
+done:
+       btrfs_free_path(path);
+       return ret;
+}
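
The loop above walks the device's extents backwards from offset (u64)-1: each pass finds the last remaining extent, stops once everything ends at or below new_size, and otherwise relocates the owning chunk, which frees that device extent, before searching again. In outline, with an array-based simulation and illustrative helper names:

#include <stdint.h>
#include <stdio.h>

struct dev_extent { uint64_t offset, len; };    /* sorted by offset */

/* illustrative stand-in: relocating a chunk frees its device extent */
static void relocate_chunk(struct dev_extent *ext, int *n, int slot)
{
        int i;
        for (i = slot; i < *n - 1; i++)
                ext[i] = ext[i + 1];
        (*n)--;
}

/* walk backwards from the end, as btrfs_shrink_device() does with a
 * (u64)-1 search plus btrfs_previous_item() */
static int shrink(struct dev_extent *ext, int *n, uint64_t new_size)
{
        while (*n > 0) {
                int last = *n - 1;              /* highest offset */
                if (ext[last].offset + ext[last].len <= new_size)
                        return 0;               /* done: all extents fit */
                relocate_chunk(ext, n, last);
        }
        return 0;
}

int main(void)
{
        struct dev_extent ext[] = { { 0, 4 }, { 4, 4 }, { 8, 4 } };
        int n = 3;

        shrink(ext, &n, 6);
        printf("%d extent(s) remain\n", n);     /* the two ending past 6 are gone */
        return 0;
}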
+
+static int btrfs_add_system_chunk(struct btrfs_trans_handle *trans,
+                          struct btrfs_root *root,
+                          struct btrfs_key *key,
+                          struct btrfs_chunk *chunk, int item_size)
+{
+       struct btrfs_super_block *super_copy = &root->fs_info->super_copy;
+       struct btrfs_disk_key disk_key;
+       u32 array_size;
+       u8 *ptr;
+
+       array_size = btrfs_super_sys_array_size(super_copy);
+       if (array_size + item_size > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE)
+               return -EFBIG;
+
+       ptr = super_copy->sys_chunk_array + array_size;
+       btrfs_cpu_key_to_disk(&disk_key, key);
+       memcpy(ptr, &disk_key, sizeof(disk_key));
+       ptr += sizeof(disk_key);
+       memcpy(ptr, chunk, item_size);
+       item_size += sizeof(disk_key);
+       btrfs_set_super_sys_array_size(super_copy, array_size + item_size);
+       return 0;
+}
+
+static noinline u64 chunk_bytes_by_type(u64 type, u64 calc_size,
+                                       int num_stripes, int sub_stripes)
+{
+       if (type & (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_DUP))
+               return calc_size;
+       else if (type & BTRFS_BLOCK_GROUP_RAID10)
+               return calc_size * (num_stripes / sub_stripes);
+       else
+               return calc_size * num_stripes;
+}
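
chunk_bytes_by_type() converts the per-stripe allocation into the logical bytes the chunk will expose: mirrored profiles (RAID1, DUP) expose a single stripe's worth, RAID10 exposes num_stripes / sub_stripes copies, and everything else (single, RAID0) exposes all stripes. A standalone check, with the block-group flags reduced to an enum for illustration:

#include <assert.h>
#include <stdint.h>

enum profile { SINGLE, RAID0, RAID1, DUP, RAID10 };

static uint64_t chunk_bytes_by_type(enum profile p, uint64_t calc_size,
                                    int num_stripes, int sub_stripes)
{
        if (p == RAID1 || p == DUP)
                return calc_size;               /* mirrors: one copy of the data */
        if (p == RAID10)
                return calc_size * (num_stripes / sub_stripes);
        return calc_size * num_stripes;         /* single/RAID0: all stripes */
}

int main(void)
{
        uint64_t s = 1ULL << 30;                /* 1GB stripes */

        assert(chunk_bytes_by_type(RAID1,  s, 2, 0) == s);      /* 2 disks, 1GB logical */
        assert(chunk_bytes_by_type(RAID10, s, 4, 2) == 2 * s);  /* 4 disks, 2GB logical */
        assert(chunk_bytes_by_type(RAID0,  s, 4, 0) == 4 * s);  /* 4 disks, 4GB logical */
        return 0;
}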
+
+static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
+                              struct btrfs_root *extent_root,
+                              struct map_lookup **map_ret,
+                              u64 *num_bytes, u64 *stripe_size,
+                              u64 start, u64 type)
+{
+       struct btrfs_fs_info *info = extent_root->fs_info;
+       struct btrfs_device *device = NULL;
+       struct btrfs_fs_devices *fs_devices = info->fs_devices;
+       struct list_head *cur;
+       struct map_lookup *map = NULL;
+       struct extent_map_tree *em_tree;
+       struct extent_map *em;
+       struct list_head private_devs;
+       int min_stripe_size = 1 * 1024 * 1024;
+       u64 calc_size = 1024 * 1024 * 1024;
+       u64 max_chunk_size = calc_size;
+       u64 min_free;
+       u64 avail;
+       u64 max_avail = 0;
+       u64 dev_offset;
+       int num_stripes = 1;
+       int min_stripes = 1;
+       int sub_stripes = 0;
+       int looped = 0;
+       int ret;
+       int index;
+       int stripe_len = 64 * 1024;
+
+       if ((type & BTRFS_BLOCK_GROUP_RAID1) &&
+           (type & BTRFS_BLOCK_GROUP_DUP)) {
+               WARN_ON(1);
+               type &= ~BTRFS_BLOCK_GROUP_DUP;
+       }
+       if (list_empty(&fs_devices->alloc_list))
+               return -ENOSPC;
+
+       if (type & (BTRFS_BLOCK_GROUP_RAID0)) {
+               num_stripes = fs_devices->rw_devices;
+               min_stripes = 2;
+       }
+       if (type & (BTRFS_BLOCK_GROUP_DUP)) {
+               num_stripes = 2;
+               min_stripes = 2;
+       }
+       if (type & (BTRFS_BLOCK_GROUP_RAID1)) {
+               num_stripes = min_t(u64, 2, fs_devices->rw_devices);
+               if (num_stripes < 2)
+                       return -ENOSPC;
+               min_stripes = 2;
+       }
+       if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
+               num_stripes = fs_devices->rw_devices;
+               if (num_stripes < 4)
+                       return -ENOSPC;
+               num_stripes &= ~(u32)1;
+               sub_stripes = 2;
+               min_stripes = 4;
+       }
+
+       if (type & BTRFS_BLOCK_GROUP_DATA) {
+               max_chunk_size = 10 * calc_size;
+               min_stripe_size = 64 * 1024 * 1024;
+       } else if (type & BTRFS_BLOCK_GROUP_METADATA) {
+               max_chunk_size = 4 * calc_size;
+               min_stripe_size = 32 * 1024 * 1024;
+       } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
+               calc_size = 8 * 1024 * 1024;
+               max_chunk_size = calc_size * 2;
+               min_stripe_size = 1 * 1024 * 1024;
+       }
+
+       /* we don't want a chunk larger than 10% of writeable space */
+       max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
+                            max_chunk_size);
+
+again:
+       if (!map || map->num_stripes != num_stripes) {
+               kfree(map);
+               map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
+               if (!map)
+                       return -ENOMEM;
+               map->num_stripes = num_stripes;
+       }
+
+       if (calc_size * num_stripes > max_chunk_size) {
+               calc_size = max_chunk_size;
+               do_div(calc_size, num_stripes);
+               do_div(calc_size, stripe_len);
+               calc_size *= stripe_len;
+       }
+       /* we don't want tiny stripes */
+       calc_size = max_t(u64, min_stripe_size, calc_size);
+
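+       /* round the stripe size down to a multiple of stripe_len */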
+       do_div(calc_size, stripe_len);
+       calc_size *= stripe_len;
+
+       cur = fs_devices->alloc_list.next;
+       index = 0;
+
+       if (type & BTRFS_BLOCK_GROUP_DUP)
+               min_free = calc_size * 2;
+       else
+               min_free = calc_size;
+
+       /*
+        * we add 1MB because we never use the first 1MB of the device;
+        * once we've looped, we're likely allocating the last of the
+        * available space, so don't hold the extra megabyte back
+        */
+       if (!looped)
+               min_free += 1024 * 1024;
+
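+       /*
+        * walk the allocation list looking for num_stripes devices with at
+        * least min_free bytes available.  Chosen devices are parked on
+        * private_devs so the same device isn't picked twice (DUP places
+        * both copies on one device on purpose)
+        */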
+       INIT_LIST_HEAD(&private_devs);
+       while (index < num_stripes) {
+               device = list_entry(cur, struct btrfs_device, dev_alloc_list);
+               BUG_ON(!device->writeable);
+               if (device->total_bytes > device->bytes_used)
+                       avail = device->total_bytes - device->bytes_used;
+               else
+                       avail = 0;
+               cur = cur->next;
+
+               if (device->in_fs_metadata && avail >= min_free) {
+                       ret = find_free_dev_extent(trans, device,
+                                                  min_free, &dev_offset);
+                       if (ret == 0) {
+                               list_move_tail(&device->dev_alloc_list,
+                                              &private_devs);
+                               map->stripes[index].dev = device;
+                               map->stripes[index].physical = dev_offset;
+                               index++;
+                               if (type & BTRFS_BLOCK_GROUP_DUP) {
+                                       map->stripes[index].dev = device;
+                                       map->stripes[index].physical =
+                                               dev_offset + calc_size;
+                                       index++;
+                               }
+                       }
+               } else if (device->in_fs_metadata && avail > max_avail)
+                       max_avail = avail;
+               if (cur == &fs_devices->alloc_list)
+                       break;
+       }
+       list_splice(&private_devs, &fs_devices->alloc_list);
+       if (index < num_stripes) {
+               if (index >= min_stripes) {
+                       num_stripes = index;
+                       if (type & (BTRFS_BLOCK_GROUP_RAID10)) {
+                               num_stripes /= sub_stripes;
+                               num_stripes *= sub_stripes;
+                       }
+                       looped = 1;
+                       goto again;
+               }
+               if (!looped && max_avail > 0) {
+                       looped = 1;
+                       calc_size = max_avail;
+                       goto again;
+               }
+               kfree(map);
+               return -ENOSPC;
+       }
+       map->sector_size = extent_root->sectorsize;
+       map->stripe_len = stripe_len;
+       map->io_align = stripe_len;
+       map->io_width = stripe_len;
+       map->type = type;
+       map->num_stripes = num_stripes;
+       map->sub_stripes = sub_stripes;
+
+       *map_ret = map;
+       *stripe_size = calc_size;
+       *num_bytes = chunk_bytes_by_type(type, calc_size,
+                                        num_stripes, sub_stripes);
+
+       em = alloc_extent_map(GFP_NOFS);
+       if (!em) {
+               kfree(map);
+               return -ENOMEM;
+       }
+       em->bdev = (struct block_device *)map;
+       em->start = start;
+       em->len = *num_bytes;
+       em->block_start = 0;
+       em->block_len = em->len;
+
+       em_tree = &extent_root->fs_info->mapping_tree.map_tree;
+       spin_lock(&em_tree->lock);
+       ret = add_extent_mapping(em_tree, em);
+       spin_unlock(&em_tree->lock);
+       BUG_ON(ret);
+       free_extent_map(em);
+
+       ret = btrfs_make_block_group(trans, extent_root, 0, type,
+                                    BTRFS_FIRST_CHUNK_TREE_OBJECTID,
+                                    start, *num_bytes);
+       BUG_ON(ret);
+
+       index = 0;
+       while (index < map->num_stripes) {
+               device = map->stripes[index].dev;
+               dev_offset = map->stripes[index].physical;
+
+               ret = btrfs_alloc_dev_extent(trans, device,
+                               info->chunk_root->root_key.objectid,
+                               BTRFS_FIRST_CHUNK_TREE_OBJECTID,
+                               start, dev_offset, calc_size);
+               BUG_ON(ret);
+               index++;
+       }
+
+       return 0;
+}
+
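+/*
+ * second phase of chunk allocation: bump bytes_used on every device that
+ * received a stripe, build the chunk item and insert it into the chunk
+ * tree, and for SYSTEM chunks also append it to the super block's
+ * sys_chunk_array
+ */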
+static int __finish_chunk_alloc(struct btrfs_trans_handle *trans,
+                               struct btrfs_root *extent_root,
+                               struct map_lookup *map, u64 chunk_offset,
+                               u64 chunk_size, u64 stripe_size)
+{
+       u64 dev_offset;
+       struct btrfs_key key;
+       struct btrfs_root *chunk_root = extent_root->fs_info->chunk_root;
+       struct btrfs_device *device;
+       struct btrfs_chunk *chunk;
+       struct btrfs_stripe *stripe;
+       size_t item_size = btrfs_chunk_item_size(map->num_stripes);
+       int index = 0;
+       int ret;
+
+       chunk = kzalloc(item_size, GFP_NOFS);
+       if (!chunk)
+               return -ENOMEM;
+
+       index = 0;
+       while (index < map->num_stripes) {
+               device = map->stripes[index].dev;
+               device->bytes_used += stripe_size;
+               ret = btrfs_update_device(trans, device);
+               BUG_ON(ret);
+               index++;
+       }
+
+       index = 0;
+       stripe = &chunk->stripe;
+       while (index < map->num_stripes) {
+               device = map->stripes[index].dev;
+               dev_offset = map->stripes[index].physical;
+
+               btrfs_set_stack_stripe_devid(stripe, device->devid);
+               btrfs_set_stack_stripe_offset(stripe, dev_offset);
+               memcpy(stripe->dev_uuid, device->uuid, BTRFS_UUID_SIZE);
+               stripe++;
+               index++;
+       }
+
+       btrfs_set_stack_chunk_length(chunk, chunk_size);
+       btrfs_set_stack_chunk_owner(chunk, extent_root->root_key.objectid);
+       btrfs_set_stack_chunk_stripe_len(chunk, map->stripe_len);
+       btrfs_set_stack_chunk_type(chunk, map->type);
+       btrfs_set_stack_chunk_num_stripes(chunk, map->num_stripes);
+       btrfs_set_stack_chunk_io_align(chunk, map->stripe_len);
+       btrfs_set_stack_chunk_io_width(chunk, map->stripe_len);
+       btrfs_set_stack_chunk_sector_size(chunk, extent_root->sectorsize);
+       btrfs_set_stack_chunk_sub_stripes(chunk, map->sub_stripes);
+
+       key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
+       key.type = BTRFS_CHUNK_ITEM_KEY;
+       key.offset = chunk_offset;
+
+       ret = btrfs_insert_item(trans, chunk_root, &key, chunk, item_size);
+       BUG_ON(ret);
+
+       if (map->type & BTRFS_BLOCK_GROUP_SYSTEM) {
+               ret = btrfs_add_system_chunk(trans, chunk_root, &key, chunk,
+                                            item_size);
+               BUG_ON(ret);
+       }
+       kfree(chunk);
+       return 0;
+}
+
+/*
+ * Chunk allocation falls into two parts. The first part does the work
+ * that makes the newly allocated chunk usable, but does not do any
+ * operation that modifies the chunk tree. The second part does the work
+ * that requires modifying the chunk tree. This division is important for
+ * the bootstrap process of adding storage to a seed btrfs.
+ */
+int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *extent_root, u64 type)
+{
+       u64 chunk_offset;
+       u64 chunk_size;
+       u64 stripe_size;
+       struct map_lookup *map;
+       struct btrfs_root *chunk_root = extent_root->fs_info->chunk_root;
+       int ret;
+
+       ret = find_next_chunk(chunk_root, BTRFS_FIRST_CHUNK_TREE_OBJECTID,
+                             &chunk_offset);
+       if (ret)
+               return ret;
+
+       ret = __btrfs_alloc_chunk(trans, extent_root, &map, &chunk_size,
+                                 &stripe_size, chunk_offset, type);
+       if (ret)
+               return ret;
+
+       ret = __finish_chunk_alloc(trans, extent_root, map, chunk_offset,
+                                  chunk_size, stripe_size);
+       BUG_ON(ret);
+       return 0;
+}
+
+static noinline int init_first_rw_device(struct btrfs_trans_handle *trans,
+                                        struct btrfs_root *root,
+                                        struct btrfs_device *device)
+{
+       u64 chunk_offset;
+       u64 sys_chunk_offset;
+       u64 chunk_size;
+       u64 sys_chunk_size;
+       u64 stripe_size;
+       u64 sys_stripe_size;
+       u64 alloc_profile;
+       struct map_lookup *map;
+       struct map_lookup *sys_map;
+       struct btrfs_fs_info *fs_info = root->fs_info;
+       struct btrfs_root *extent_root = fs_info->extent_root;
+       int ret;
+
+       ret = find_next_chunk(fs_info->chunk_root,
+                             BTRFS_FIRST_CHUNK_TREE_OBJECTID, &chunk_offset);
+       BUG_ON(ret);
+
+       alloc_profile = BTRFS_BLOCK_GROUP_METADATA |
+                       (fs_info->metadata_alloc_profile &
+                        fs_info->avail_metadata_alloc_bits);
+       alloc_profile = btrfs_reduce_alloc_profile(root, alloc_profile);
+
+       ret = __btrfs_alloc_chunk(trans, extent_root, &map, &chunk_size,
+                                 &stripe_size, chunk_offset, alloc_profile);
+       BUG_ON(ret);
+
+       sys_chunk_offset = chunk_offset + chunk_size;
+
+       alloc_profile = BTRFS_BLOCK_GROUP_SYSTEM |
+                       (fs_info->system_alloc_profile &
+                        fs_info->avail_system_alloc_bits);
+       alloc_profile = btrfs_reduce_alloc_profile(root, alloc_profile);
+
+       ret = __btrfs_alloc_chunk(trans, extent_root, &sys_map,
+                                 &sys_chunk_size, &sys_stripe_size,
+                                 sys_chunk_offset, alloc_profile);
+       BUG_ON(ret);
+
+       ret = btrfs_add_device(trans, fs_info->chunk_root, device);
+       BUG_ON(ret);
+
+       /*
+        * Modifying the chunk tree requires allocating new blocks from
+        * both the system block group and the metadata block group, so
+        * we can only do operations that modify the chunk tree after
+        * both block groups have been created.
+        */
+       ret = __finish_chunk_alloc(trans, extent_root, map, chunk_offset,
+                                  chunk_size, stripe_size);
+       BUG_ON(ret);
+
+       ret = __finish_chunk_alloc(trans, extent_root, sys_map,
+                                  sys_chunk_offset, sys_chunk_size,
+                                  sys_stripe_size);
+       BUG_ON(ret);
+       return 0;
+}
+
+int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset)
+{
+       struct extent_map *em;
+       struct map_lookup *map;
+       struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+       int readonly = 0;
+       int i;
+
+       spin_lock(&map_tree->map_tree.lock);
+       em = lookup_extent_mapping(&map_tree->map_tree, chunk_offset, 1);
+       spin_unlock(&map_tree->map_tree.lock);
+       if (!em)
+               return 1;
+
+       map = (struct map_lookup *)em->bdev;
+       for (i = 0; i < map->num_stripes; i++) {
+               if (!map->stripes[i].dev->writeable) {
+                       readonly = 1;
+                       break;
+               }
+       }
+       free_extent_map(em);
+       return readonly;
+}
+
+void btrfs_mapping_init(struct btrfs_mapping_tree *tree)
+{
+       extent_map_tree_init(&tree->map_tree, GFP_NOFS);
+}
+
+void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree)
+{
+       struct extent_map *em;
+
+       while (1) {
+               spin_lock(&tree->map_tree.lock);
+               em = lookup_extent_mapping(&tree->map_tree, 0, (u64)-1);
+               if (em)
+                       remove_extent_mapping(&tree->map_tree, em);
+               spin_unlock(&tree->map_tree.lock);
+               if (!em)
+                       break;
+               kfree(em->bdev);
+               /* once for us */
+               free_extent_map(em);
+               /* once for the tree */
+               free_extent_map(em);
+       }
+}
+
+int btrfs_num_copies(struct btrfs_mapping_tree *map_tree, u64 logical, u64 len)
+{
+       struct extent_map *em;
+       struct map_lookup *map;
+       struct extent_map_tree *em_tree = &map_tree->map_tree;
+       int ret;
+
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, logical, len);
+       spin_unlock(&em_tree->lock);
+       BUG_ON(!em);
+
+       BUG_ON(em->start > logical || em->start + em->len < logical);
+       map = (struct map_lookup *)em->bdev;
+       if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
+               ret = map->num_stripes;
+       else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
+               ret = map->sub_stripes;
+       else
+               ret = 1;
+       free_extent_map(em);
+       return ret;
+}
+
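+/*
+ * pick a mirror to read from: use the optimal stripe if its device is
+ * still present, otherwise the first stripe in the range that is
+ */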
+static int find_live_mirror(struct map_lookup *map, int first, int num,
+                           int optimal)
+{
+       int i;
+       if (map->stripes[optimal].dev->bdev)
+               return optimal;
+       for (i = first; i < first + num; i++) {
+               if (map->stripes[i].dev->bdev)
+                       return i;
+       }
+       /* we couldn't find one that doesn't fail.  Just return something
+        * and the io error handling code will clean up eventually
+        */
+       return optimal;
+}
+
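+/*
+ * core mapping routine: translate a logical byte range into the physical
+ * stripes that back it.  Writes are mapped to every mirror, reads to a
+ * single live copy; when unplug_page is set we only kick each backing
+ * device's unplug function
+ */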
+static int __btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
+                            u64 logical, u64 *length,
+                            struct btrfs_multi_bio **multi_ret,
+                            int mirror_num, struct page *unplug_page)
+{
+       struct extent_map *em;
+       struct map_lookup *map;
+       struct extent_map_tree *em_tree = &map_tree->map_tree;
+       u64 offset;
+       u64 stripe_offset;
+       u64 stripe_nr;
+       int stripes_allocated = 8;
+       int stripes_required = 1;
+       int stripe_index;
+       int i;
+       int num_stripes;
+       int max_errors = 0;
+       struct btrfs_multi_bio *multi = NULL;
+
+       if (multi_ret && !(rw & (1 << BIO_RW)))
+               stripes_allocated = 1;
+again:
+       if (multi_ret) {
+               multi = kzalloc(btrfs_multi_bio_size(stripes_allocated),
+                               GFP_NOFS);
+               if (!multi)
+                       return -ENOMEM;
+
+               atomic_set(&multi->error, 0);
+       }
+
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, logical, *length);
+       spin_unlock(&em_tree->lock);
+
+       if (!em && unplug_page)
+               return 0;
+
+       if (!em) {
+               printk(KERN_CRIT "unable to find logical %llu len %llu\n",
+                      (unsigned long long)logical,
+                      (unsigned long long)*length);
+               BUG();
+       }
+
+       BUG_ON(em->start > logical || em->start + em->len < logical);
+       map = (struct map_lookup *)em->bdev;
+       offset = logical - em->start;
+
+       if (mirror_num > map->num_stripes)
+               mirror_num = 0;
+
+       /* if our multi bio struct is too small, back off and try again */
+       if (rw & (1 << BIO_RW)) {
+               if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+                                BTRFS_BLOCK_GROUP_DUP)) {
+                       stripes_required = map->num_stripes;
+                       max_errors = 1;
+               } else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
+                       stripes_required = map->sub_stripes;
+                       max_errors = 1;
+               }
+       }
+       if (multi_ret && rw == WRITE &&
+           stripes_allocated < stripes_required) {
+               stripes_allocated = map->num_stripes;
+               free_extent_map(em);
+               kfree(multi);
+               goto again;
+       }
+       stripe_nr = offset;
+       /*
+        * stripe_nr counts the total number of stripes we have to stride
+        * to get to this block
+        */
+       do_div(stripe_nr, map->stripe_len);
+
+       stripe_offset = stripe_nr * map->stripe_len;
+       BUG_ON(offset < stripe_offset);
+
+       /* stripe_offset is the offset of this block in its stripe */
+       stripe_offset = offset - stripe_offset;
+
+       if (map->type & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
+                        BTRFS_BLOCK_GROUP_RAID10 |
+                        BTRFS_BLOCK_GROUP_DUP)) {
+               /* we limit the length of each bio to what fits in a stripe */
+               *length = min_t(u64, em->len - offset,
+                             map->stripe_len - stripe_offset);
+       } else {
+               *length = em->len - offset;
+       }
+
+       if (!multi_ret && !unplug_page)
+               goto out;
+
+       num_stripes = 1;
+       stripe_index = 0;
+       if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+               if (unplug_page || (rw & (1 << BIO_RW)))
+                       num_stripes = map->num_stripes;
+               else if (mirror_num)
+                       stripe_index = mirror_num - 1;
+               else {
+                       stripe_index = find_live_mirror(map, 0,
+                                           map->num_stripes,
+                                           current->pid % map->num_stripes);
+               }
+
+       } else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
+               if (rw & (1 << BIO_RW))
+                       num_stripes = map->num_stripes;
+               else if (mirror_num)
+                       stripe_index = mirror_num - 1;
+
+       } else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
+               int factor = map->num_stripes / map->sub_stripes;
+
+               stripe_index = do_div(stripe_nr, factor);
+               stripe_index *= map->sub_stripes;
+
+               if (unplug_page || (rw & (1 << BIO_RW)))
+                       num_stripes = map->sub_stripes;
+               else if (mirror_num)
+                       stripe_index += mirror_num - 1;
+               else {
+                       stripe_index = find_live_mirror(map, stripe_index,
+                                             map->sub_stripes, stripe_index +
+                                             current->pid % map->sub_stripes);
+               }
+       } else {
+               /*
+                * after this do_div call, stripe_nr is the number of stripes
+                * on this device we have to walk to find the data, and
+                * stripe_index is the number of our device in the stripe array
+                */
+               stripe_index = do_div(stripe_nr, map->num_stripes);
+       }
+       BUG_ON(stripe_index >= map->num_stripes);
+
+       for (i = 0; i < num_stripes; i++) {
+               if (unplug_page) {
+                       struct btrfs_device *device;
+                       struct backing_dev_info *bdi;
+
+                       device = map->stripes[stripe_index].dev;
+                       if (device->bdev) {
+                               bdi = blk_get_backing_dev_info(device->bdev);
+                               if (bdi->unplug_io_fn)
+                                       bdi->unplug_io_fn(bdi, unplug_page);
+                       }
+               } else {
+                       multi->stripes[i].physical =
+                               map->stripes[stripe_index].physical +
+                               stripe_offset + stripe_nr * map->stripe_len;
+                       multi->stripes[i].dev = map->stripes[stripe_index].dev;
+               }
+               stripe_index++;
+       }
+       if (multi_ret) {
+               *multi_ret = multi;
+               multi->num_stripes = num_stripes;
+               multi->max_errors = max_errors;
+       }
+out:
+       free_extent_map(em);
+       return 0;
+}
+
+int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
+                     u64 logical, u64 *length,
+                     struct btrfs_multi_bio **multi_ret, int mirror_num)
+{
+       return __btrfs_map_block(map_tree, rw, logical, length, multi_ret,
+                                mirror_num, NULL);
+}
+
+int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
+                    u64 chunk_start, u64 physical, u64 devid,
+                    u64 **logical, int *naddrs, int *stripe_len)
+{
+       struct extent_map_tree *em_tree = &map_tree->map_tree;
+       struct extent_map *em;
+       struct map_lookup *map;
+       u64 *buf;
+       u64 bytenr;
+       u64 length;
+       u64 stripe_nr;
+       int i, j, nr = 0;
+
+       spin_lock(&em_tree->lock);
+       em = lookup_extent_mapping(em_tree, chunk_start, 1);
+       spin_unlock(&em_tree->lock);
+
+       BUG_ON(!em || em->start != chunk_start);
+       map = (struct map_lookup *)em->bdev;
+
+       length = em->len;
+       if (map->type & BTRFS_BLOCK_GROUP_RAID10)
+               do_div(length, map->num_stripes / map->sub_stripes);
+       else if (map->type & BTRFS_BLOCK_GROUP_RAID0)
+               do_div(length, map->num_stripes);
+
+       buf = kzalloc(sizeof(u64) * map->num_stripes, GFP_NOFS);
+       BUG_ON(!buf);
+
+       for (i = 0; i < map->num_stripes; i++) {
+               if (devid && map->stripes[i].dev->devid != devid)
+                       continue;
+               if (map->stripes[i].physical > physical ||
+                   map->stripes[i].physical + length <= physical)
+                       continue;
+
+               stripe_nr = physical - map->stripes[i].physical;
+               do_div(stripe_nr, map->stripe_len);
+
+               if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
+                       stripe_nr = stripe_nr * map->num_stripes + i;
+                       do_div(stripe_nr, map->sub_stripes);
+               } else if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
+                       stripe_nr = stripe_nr * map->num_stripes + i;
+               }
+               bytenr = chunk_start + stripe_nr * map->stripe_len;
+               WARN_ON(nr >= map->num_stripes);
+               for (j = 0; j < nr; j++) {
+                       if (buf[j] == bytenr)
+                               break;
+               }
+               if (j == nr) {
+                       WARN_ON(nr >= map->num_stripes);
+                       buf[nr++] = bytenr;
+               }
+       }
+
+       /* sanity check: each logical address we found must map back here */
+       for (i = 0; i < nr; i++) {
+               struct btrfs_multi_bio *multi;
+               struct btrfs_bio_stripe *stripe;
+               int ret;
+
+               length = 1;
+               ret = btrfs_map_block(map_tree, WRITE, buf[i],
+                                     &length, &multi, 0);
+               BUG_ON(ret);
+
+               stripe = multi->stripes;
+               for (j = 0; j < multi->num_stripes; j++) {
+                       if (physical >= stripe->physical &&
+                           physical < stripe->physical + length)
+                               break;
+                       stripe++;
+               }
+               BUG_ON(j >= multi->num_stripes);
+               kfree(multi);
+       }
+
+       *logical = buf;
+       *naddrs = nr;
+       *stripe_len = map->stripe_len;
+
+       free_extent_map(em);
+       return 0;
+}
+
+int btrfs_unplug_page(struct btrfs_mapping_tree *map_tree,
+                     u64 logical, struct page *page)
+{
+       u64 length = PAGE_CACHE_SIZE;
+       return __btrfs_map_block(map_tree, READ, logical, &length,
+                                NULL, 0, page);
+}
+
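+/*
+ * completion handler shared by the cloned bios of a multi-stripe io.
+ * The original bio is only completed once every stripe has finished, and
+ * errors are swallowed as long as they stay within multi->max_errors
+ */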
+static void end_bio_multi_stripe(struct bio *bio, int err)
+{
+       struct btrfs_multi_bio *multi = bio->bi_private;
+       int is_orig_bio = 0;
+
+       if (err)
+               atomic_inc(&multi->error);
+
+       if (bio == multi->orig_bio)
+               is_orig_bio = 1;
+
+       if (atomic_dec_and_test(&multi->stripes_pending)) {
+               if (!is_orig_bio) {
+                       bio_put(bio);
+                       bio = multi->orig_bio;
+               }
+               bio->bi_private = multi->private;
+               bio->bi_end_io = multi->end_io;
+               /* only send an error to the higher layers if it is
+                * beyond the tolerance of the multi-bio
+                */
+               if (atomic_read(&multi->error) > multi->max_errors) {
+                       err = -EIO;
+               } else if (err) {
+                       /*
+                        * this bio is actually up to date, we didn't
+                        * go over the max number of errors
+                        */
+                       set_bit(BIO_UPTODATE, &bio->bi_flags);
+                       err = 0;
+               }
+               kfree(multi);
+
+               bio_endio(bio, err);
+       } else if (!is_orig_bio) {
+               bio_put(bio);
+       }
+}
+
+struct async_sched {
+       struct bio *bio;
+       int rw;
+       struct btrfs_fs_info *info;
+       struct btrfs_work work;
+};
+
+/*
+ * see run_scheduled_bios for a description of why bios are collected for
+ * async submit.
+ *
+ * This will add one bio to the pending list for a device and make sure
+ * the work struct is scheduled.
+ */
+static noinline int schedule_bio(struct btrfs_root *root,
+                                struct btrfs_device *device,
+                                int rw, struct bio *bio)
+{
+       int should_queue = 1;
+
+       /* don't bother with additional async steps for reads, right now */
+       if (!(rw & (1 << BIO_RW))) {
+               bio_get(bio);
+               submit_bio(rw, bio);
+               bio_put(bio);
+               return 0;
+       }
+
+       /*
+        * nr_async_bios allows us to reliably return congestion to the
+        * higher layers.  Otherwise, the async bio makes it appear we have
+        * made progress against dirty pages when we've really just put it
+        * on a queue for later
+        */
+       atomic_inc(&root->fs_info->nr_async_bios);
+       WARN_ON(bio->bi_next);
+       bio->bi_next = NULL;
+       bio->bi_rw |= rw;
+
+       spin_lock(&device->io_lock);
+
+       if (device->pending_bio_tail)
+               device->pending_bio_tail->bi_next = bio;
+
+       device->pending_bio_tail = bio;
+       if (!device->pending_bios)
+               device->pending_bios = bio;
+       if (device->running_pending)
+               should_queue = 0;
+
+       spin_unlock(&device->io_lock);
+
+       if (should_queue)
+               btrfs_queue_worker(&root->fs_info->submit_workers,
+                                  &device->work);
+       return 0;
+}
+
+int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
+                 int mirror_num, int async_submit)
+{
+       struct btrfs_mapping_tree *map_tree;
+       struct btrfs_device *dev;
+       struct bio *first_bio = bio;
+       u64 logical = (u64)bio->bi_sector << 9;
+       u64 length = 0;
+       u64 map_length;
+       struct btrfs_multi_bio *multi = NULL;
+       int ret;
+       int dev_nr = 0;
+       int total_devs = 1;
+
+       length = bio->bi_size;
+       map_tree = &root->fs_info->mapping_tree;
+       map_length = length;
+
+       ret = btrfs_map_block(map_tree, rw, logical, &map_length, &multi,
+                             mirror_num);
+       BUG_ON(ret);
+
+       total_devs = multi->num_stripes;
+       if (map_length < length) {
+               printk(KERN_CRIT "mapping failed logical %llu bio len %llu "
+                      "len %llu\n", (unsigned long long)logical,
+                      (unsigned long long)length,
+                      (unsigned long long)map_length);
+               BUG();
+       }
+       multi->end_io = first_bio->bi_end_io;
+       multi->private = first_bio->bi_private;
+       multi->orig_bio = first_bio;
+       atomic_set(&multi->stripes_pending, multi->num_stripes);
+
+       while (dev_nr < total_devs) {
+               if (total_devs > 1) {
+                       if (dev_nr < total_devs - 1) {
+                               bio = bio_clone(first_bio, GFP_NOFS);
+                               BUG_ON(!bio);
+                       } else {
+                               bio = first_bio;
+                       }
+                       bio->bi_private = multi;
+                       bio->bi_end_io = end_bio_multi_stripe;
+               }
+               bio->bi_sector = multi->stripes[dev_nr].physical >> 9;
+               dev = multi->stripes[dev_nr].dev;
+               BUG_ON(rw == WRITE && !dev->writeable);
+               if (dev && dev->bdev) {
+                       bio->bi_bdev = dev->bdev;
+                       if (async_submit)
+                               schedule_bio(root, dev, rw, bio);
+                       else
+                               submit_bio(rw, bio);
+               } else {
+                       bio->bi_bdev = root->fs_info->fs_devices->latest_bdev;
+                       bio->bi_sector = logical >> 9;
+                       bio_endio(bio, -EIO);
+               }
+               dev_nr++;
+       }
+       if (total_devs == 1)
+               kfree(multi);
+       return 0;
+}
+
+struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid,
+                                      u8 *uuid, u8 *fsid)
+{
+       struct btrfs_device *device;
+       struct btrfs_fs_devices *cur_devices;
+
+       cur_devices = root->fs_info->fs_devices;
+       while (cur_devices) {
+               if (!fsid ||
+                   !memcmp(cur_devices->fsid, fsid, BTRFS_UUID_SIZE)) {
+                       device = __find_device(&cur_devices->devices,
+                                              devid, uuid);
+                       if (device)
+                               return device;
+               }
+               cur_devices = cur_devices->seed;
+       }
+       return NULL;
+}
+
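+/*
+ * create an in-memory placeholder for a device that the metadata
+ * references but that isn't actually present (degraded mounts)
+ */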
+static struct btrfs_device *add_missing_dev(struct btrfs_root *root,
+                                           u64 devid, u8 *dev_uuid)
+{
+       struct btrfs_device *device;
+       struct btrfs_fs_devices *fs_devices = root->fs_info->fs_devices;
+
+       device = kzalloc(sizeof(*device), GFP_NOFS);
+       if (!device)
+               return NULL;
+       list_add(&device->dev_list,
+                &fs_devices->devices);
+       device->barriers = 1;
+       device->dev_root = root->fs_info->dev_root;
+       device->devid = devid;
+       device->work.func = pending_bios_fn;
+       device->fs_devices = fs_devices;
+       fs_devices->num_devices++;
+       spin_lock_init(&device->io_lock);
+       INIT_LIST_HEAD(&device->dev_alloc_list);
+       memcpy(device->uuid, dev_uuid, BTRFS_UUID_SIZE);
+       return device;
+}
+
+static int read_one_chunk(struct btrfs_root *root, struct btrfs_key *key,
+                         struct extent_buffer *leaf,
+                         struct btrfs_chunk *chunk)
+{
+       struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+       struct map_lookup *map;
+       struct extent_map *em;
+       u64 logical;
+       u64 length;
+       u64 devid;
+       u8 uuid[BTRFS_UUID_SIZE];
+       int num_stripes;
+       int ret;
+       int i;
+
+       logical = key->offset;
+       length = btrfs_chunk_length(leaf, chunk);
+
+       spin_lock(&map_tree->map_tree.lock);
+       em = lookup_extent_mapping(&map_tree->map_tree, logical, 1);
+       spin_unlock(&map_tree->map_tree.lock);
+
+       /* already mapped? */
+       if (em && em->start <= logical && em->start + em->len > logical) {
+               free_extent_map(em);
+               return 0;
+       } else if (em) {
+               free_extent_map(em);
+       }
+
+       em = alloc_extent_map(GFP_NOFS);
+       if (!em)
+               return -ENOMEM;
+       num_stripes = btrfs_chunk_num_stripes(leaf, chunk);
+       map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS);
+       if (!map) {
+               free_extent_map(em);
+               return -ENOMEM;
+       }
+
+       em->bdev = (struct block_device *)map;
+       em->start = logical;
+       em->len = length;
+       em->block_start = 0;
+       em->block_len = em->len;
+
+       map->num_stripes = num_stripes;
+       map->io_width = btrfs_chunk_io_width(leaf, chunk);
+       map->io_align = btrfs_chunk_io_align(leaf, chunk);
+       map->sector_size = btrfs_chunk_sector_size(leaf, chunk);
+       map->stripe_len = btrfs_chunk_stripe_len(leaf, chunk);
+       map->type = btrfs_chunk_type(leaf, chunk);
+       map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
+       for (i = 0; i < num_stripes; i++) {
+               map->stripes[i].physical =
+                       btrfs_stripe_offset_nr(leaf, chunk, i);
+               devid = btrfs_stripe_devid_nr(leaf, chunk, i);
+               read_extent_buffer(leaf, uuid, (unsigned long)
+                                  btrfs_stripe_dev_uuid_nr(chunk, i),
+                                  BTRFS_UUID_SIZE);
+               map->stripes[i].dev = btrfs_find_device(root, devid, uuid,
+                                                       NULL);
+               if (!map->stripes[i].dev && !btrfs_test_opt(root, DEGRADED)) {
+                       kfree(map);
+                       free_extent_map(em);
+                       return -EIO;
+               }
+               if (!map->stripes[i].dev) {
+                       map->stripes[i].dev =
+                               add_missing_dev(root, devid, uuid);
+                       if (!map->stripes[i].dev) {
+                               kfree(map);
+                               free_extent_map(em);
+                               return -EIO;
+                       }
+               }
+               map->stripes[i].dev->in_fs_metadata = 1;
+       }
+
+       spin_lock(&map_tree->map_tree.lock);
+       ret = add_extent_mapping(&map_tree->map_tree, em);
+       spin_unlock(&map_tree->map_tree.lock);
+       BUG_ON(ret);
+       free_extent_map(em);
+
+       return 0;
+}
+
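+/* copy a device item's on-disk fields into the in-memory btrfs_device */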
+static int fill_device_from_item(struct extent_buffer *leaf,
+                                struct btrfs_dev_item *dev_item,
+                                struct btrfs_device *device)
+{
+       unsigned long ptr;
+
+       device->devid = btrfs_device_id(leaf, dev_item);
+       device->total_bytes = btrfs_device_total_bytes(leaf, dev_item);
+       device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
+       device->type = btrfs_device_type(leaf, dev_item);
+       device->io_align = btrfs_device_io_align(leaf, dev_item);
+       device->io_width = btrfs_device_io_width(leaf, dev_item);
+       device->sector_size = btrfs_device_sector_size(leaf, dev_item);
+
+       ptr = (unsigned long)btrfs_device_uuid(dev_item);
+       read_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
+
+       return 0;
+}
+
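+/*
+ * a dev item carrying a foreign fsid belongs to a seed filesystem: find
+ * that filesystem's devices, open them read-only and chain them onto
+ * fs_devices->seed
+ */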
+static int open_seed_devices(struct btrfs_root *root, u8 *fsid)
+{
+       struct btrfs_fs_devices *fs_devices;
+       int ret;
+
+       mutex_lock(&uuid_mutex);
+
+       fs_devices = root->fs_info->fs_devices->seed;
+       while (fs_devices) {
+               if (!memcmp(fs_devices->fsid, fsid, BTRFS_UUID_SIZE)) {
+                       ret = 0;
+                       goto out;
+               }
+               fs_devices = fs_devices->seed;
+       }
+
+       fs_devices = find_fsid(fsid);
+       if (!fs_devices) {
+               ret = -ENOENT;
+               goto out;
+       }
+
+       fs_devices = clone_fs_devices(fs_devices);
+       if (IS_ERR(fs_devices)) {
+               ret = PTR_ERR(fs_devices);
+               goto out;
+       }
+
+       ret = __btrfs_open_devices(fs_devices, FMODE_READ,
+                                  root->fs_info->bdev_holder);
+       if (ret)
+               goto out;
+
+       if (!fs_devices->seeding) {
+               __btrfs_close_devices(fs_devices);
+               free_fs_devices(fs_devices);
+               ret = -EINVAL;
+               goto out;
+       }
+
+       fs_devices->seed = root->fs_info->fs_devices->seed;
+       root->fs_info->fs_devices->seed = fs_devices;
+out:
+       mutex_unlock(&uuid_mutex);
+       return ret;
+}
+
+static int read_one_dev(struct btrfs_root *root,
+                       struct extent_buffer *leaf,
+                       struct btrfs_dev_item *dev_item)
+{
+       struct btrfs_device *device;
+       u64 devid;
+       int ret;
+       u8 fs_uuid[BTRFS_UUID_SIZE];
+       u8 dev_uuid[BTRFS_UUID_SIZE];
+
+       devid = btrfs_device_id(leaf, dev_item);
+       read_extent_buffer(leaf, dev_uuid,
+                          (unsigned long)btrfs_device_uuid(dev_item),
+                          BTRFS_UUID_SIZE);
+       read_extent_buffer(leaf, fs_uuid,
+                          (unsigned long)btrfs_device_fsid(dev_item),
+                          BTRFS_UUID_SIZE);
+
+       if (memcmp(fs_uuid, root->fs_info->fsid, BTRFS_UUID_SIZE)) {
+               ret = open_seed_devices(root, fs_uuid);
+               if (ret && !btrfs_test_opt(root, DEGRADED))
+                       return ret;
+       }
+
+       device = btrfs_find_device(root, devid, dev_uuid, fs_uuid);
+       if (!device || !device->bdev) {
+               if (!btrfs_test_opt(root, DEGRADED))
+                       return -EIO;
+
+               if (!device) {
+                       printk(KERN_WARNING "warning: devid %llu missing\n",
+                              (unsigned long long)devid);
+                       device = add_missing_dev(root, devid, dev_uuid);
+                       if (!device)
+                               return -ENOMEM;
+               }
+       }
+
+       if (device->fs_devices != root->fs_info->fs_devices) {
+               BUG_ON(device->writeable);
+               if (device->generation !=
+                   btrfs_device_generation(leaf, dev_item))
+                       return -EINVAL;
+       }
+
+       fill_device_from_item(leaf, dev_item, device);
+       device->dev_root = root->fs_info->dev_root;
+       device->in_fs_metadata = 1;
+       if (device->writeable)
+               device->fs_devices->total_rw_bytes += device->total_bytes;
+       ret = 0;
+       return ret;
+}
+
+int btrfs_read_super_device(struct btrfs_root *root, struct extent_buffer *buf)
+{
+       struct btrfs_dev_item *dev_item;
+
+       dev_item = (struct btrfs_dev_item *)offsetof(struct btrfs_super_block,
+                                                    dev_item);
+       return read_one_dev(root, buf, dev_item);
+}
+
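+/*
+ * the super block embeds a small packed array of the key/chunk pairs
+ * describing the SYSTEM chunks, so the chunk tree itself can be read;
+ * walk it and add each chunk to the mapping tree
+ */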
+int btrfs_read_sys_array(struct btrfs_root *root)
+{
+       struct btrfs_super_block *super_copy = &root->fs_info->super_copy;
+       struct extent_buffer *sb;
+       struct btrfs_disk_key *disk_key;
+       struct btrfs_chunk *chunk;
+       u8 *ptr;
+       unsigned long sb_ptr;
+       int ret = 0;
+       u32 num_stripes;
+       u32 array_size;
+       u32 len = 0;
+       u32 cur;
+       struct btrfs_key key;
+
+       sb = btrfs_find_create_tree_block(root, BTRFS_SUPER_INFO_OFFSET,
+                                         BTRFS_SUPER_INFO_SIZE);
+       if (!sb)
+               return -ENOMEM;
+       btrfs_set_buffer_uptodate(sb);
+       write_extent_buffer(sb, super_copy, 0, BTRFS_SUPER_INFO_SIZE);
+       array_size = btrfs_super_sys_array_size(super_copy);
+
+       ptr = super_copy->sys_chunk_array;
+       sb_ptr = offsetof(struct btrfs_super_block, sys_chunk_array);
+       cur = 0;
+
+       while (cur < array_size) {
+               disk_key = (struct btrfs_disk_key *)ptr;
+               btrfs_disk_key_to_cpu(&key, disk_key);
+
+               len = sizeof(*disk_key);
+               ptr += len;
+               sb_ptr += len;
+               cur += len;
+
+               if (key.type == BTRFS_CHUNK_ITEM_KEY) {
+                       chunk = (struct btrfs_chunk *)sb_ptr;
+                       ret = read_one_chunk(root, &key, sb, chunk);
+                       if (ret)
+                               break;
+                       num_stripes = btrfs_chunk_num_stripes(sb, chunk);
+                       len = btrfs_chunk_item_size(num_stripes);
+               } else {
+                       ret = -EIO;
+                       break;
+               }
+               ptr += len;
+               sb_ptr += len;
+               cur += len;
+       }
+       free_extent_buffer(sb);
+       return ret;
+}
+
+int btrfs_read_chunk_tree(struct btrfs_root *root)
+{
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       struct btrfs_key key;
+       struct btrfs_key found_key;
+       int ret;
+       int slot;
+
+       root = root->fs_info->chunk_root;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       /* first we search for all of the device items, and then we
+        * read in all of the chunk items.  This way we can create chunk
+        * mappings that reference all of the devices that are found
+        */
+       key.objectid = BTRFS_DEV_ITEMS_OBJECTID;
+       key.offset = 0;
+       key.type = 0;
+again:
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       while (1) {
+               leaf = path->nodes[0];
+               slot = path->slots[0];
+               if (slot >= btrfs_header_nritems(leaf)) {
+                       ret = btrfs_next_leaf(root, path);
+                       if (ret == 0)
+                               continue;
+                       if (ret < 0)
+                               goto error;
+                       break;
+               }
+               btrfs_item_key_to_cpu(leaf, &found_key, slot);
+               if (key.objectid == BTRFS_DEV_ITEMS_OBJECTID) {
+                       if (found_key.objectid != BTRFS_DEV_ITEMS_OBJECTID)
+                               break;
+                       if (found_key.type == BTRFS_DEV_ITEM_KEY) {
+                               struct btrfs_dev_item *dev_item;
+                               dev_item = btrfs_item_ptr(leaf, slot,
+                                                 struct btrfs_dev_item);
+                               ret = read_one_dev(root, leaf, dev_item);
+                               if (ret)
+                                       goto error;
+                       }
+               } else if (found_key.type == BTRFS_CHUNK_ITEM_KEY) {
+                       struct btrfs_chunk *chunk;
+                       chunk = btrfs_item_ptr(leaf, slot, struct btrfs_chunk);
+                       ret = read_one_chunk(root, &found_key, leaf, chunk);
+                       if (ret)
+                               goto error;
+               }
+               path->slots[0]++;
+       }
+       if (key.objectid == BTRFS_DEV_ITEMS_OBJECTID) {
+               key.objectid = 0;
+               btrfs_release_path(root, path);
+               goto again;
+       }
+       ret = 0;
+error:
+       btrfs_free_path(path);
+       return ret;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
new file mode 100644 (file)
index 0000000..86c44e9
--- /dev/null
@@ -0,0 +1,162 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_VOLUMES_
+#define __BTRFS_VOLUMES_
+
+#include <linux/bio.h>
+#include "async-thread.h"
+
+struct buffer_head;
+struct btrfs_device {
+       struct list_head dev_list;
+       struct list_head dev_alloc_list;
+       struct btrfs_fs_devices *fs_devices;
+       struct btrfs_root *dev_root;
+       struct bio *pending_bios;
+       struct bio *pending_bio_tail;
+       int running_pending;
+       u64 generation;
+
+       int barriers;
+       int writeable;
+       int in_fs_metadata;
+
+       spinlock_t io_lock;
+
+       struct block_device *bdev;
+
+       /* the mode sent to open_bdev_exclusive */
+       fmode_t mode;
+
+       char *name;
+
+       /* the internal btrfs device id */
+       u64 devid;
+
+       /* size of the device */
+       u64 total_bytes;
+
+       /* bytes used */
+       u64 bytes_used;
+
+       /* optimal io alignment for this device */
+       u32 io_align;
+
+       /* optimal io width for this device */
+       u32 io_width;
+
+       /* minimal io size for this device */
+       u32 sector_size;
+
+       /* type and info about this device */
+       u64 type;
+
+       /* physical drive uuid (or lvm uuid) */
+       u8 uuid[BTRFS_UUID_SIZE];
+
+       struct btrfs_work work;
+};
+
+struct btrfs_fs_devices {
+       u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
+
+       /* the device with this id has the most recent copy of the super */
+       u64 latest_devid;
+       u64 latest_trans;
+       u64 num_devices;
+       u64 open_devices;
+       u64 rw_devices;
+       u64 total_rw_bytes;
+       struct block_device *latest_bdev;
+       /* all of the devices in the FS */
+       struct list_head devices;
+
+       /* devices not currently being allocated */
+       struct list_head alloc_list;
+       struct list_head list;
+
+       struct btrfs_fs_devices *seed;
+       int seeding;
+
+       int opened;
+};
+
+struct btrfs_bio_stripe {
+       struct btrfs_device *dev;
+       u64 physical;
+};
+
+struct btrfs_multi_bio {
+       atomic_t stripes_pending;
+       bio_end_io_t *end_io;
+       struct bio *orig_bio;
+       void *private;
+       atomic_t error;
+       int max_errors;
+       int num_stripes;
+       struct btrfs_bio_stripe stripes[];
+};
+
+#define btrfs_multi_bio_size(n) (sizeof(struct btrfs_multi_bio) + \
+                           (sizeof(struct btrfs_bio_stripe) * (n)))
+
+int btrfs_alloc_dev_extent(struct btrfs_trans_handle *trans,
+                          struct btrfs_device *device,
+                          u64 chunk_tree, u64 chunk_objectid,
+                          u64 chunk_offset, u64 start, u64 num_bytes);
+int btrfs_map_block(struct btrfs_mapping_tree *map_tree, int rw,
+                   u64 logical, u64 *length,
+                   struct btrfs_multi_bio **multi_ret, int mirror_num);
+int btrfs_rmap_block(struct btrfs_mapping_tree *map_tree,
+                    u64 chunk_start, u64 physical, u64 devid,
+                    u64 **logical, int *naddrs, int *stripe_len);
+int btrfs_read_sys_array(struct btrfs_root *root);
+int btrfs_read_chunk_tree(struct btrfs_root *root);
+int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
+                     struct btrfs_root *extent_root, u64 type);
+void btrfs_mapping_init(struct btrfs_mapping_tree *tree);
+void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree);
+int btrfs_map_bio(struct btrfs_root *root, int rw, struct bio *bio,
+                 int mirror_num, int async_submit);
+int btrfs_read_super_device(struct btrfs_root *root, struct extent_buffer *buf);
+int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
+                      fmode_t flags, void *holder);
+int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
+                         struct btrfs_fs_devices **fs_devices_ret);
+int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
+int btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices);
+int btrfs_add_device(struct btrfs_trans_handle *trans,
+                    struct btrfs_root *root,
+                    struct btrfs_device *device);
+int btrfs_rm_device(struct btrfs_root *root, char *device_path);
+int btrfs_cleanup_fs_uuids(void);
+int btrfs_num_copies(struct btrfs_mapping_tree *map_tree, u64 logical, u64 len);
+int btrfs_unplug_page(struct btrfs_mapping_tree *map_tree,
+                     u64 logical, struct page *page);
+int btrfs_grow_device(struct btrfs_trans_handle *trans,
+                     struct btrfs_device *device, u64 new_size);
+struct btrfs_device *btrfs_find_device(struct btrfs_root *root, u64 devid,
+                                      u8 *uuid, u8 *fsid);
+int btrfs_shrink_device(struct btrfs_device *device, u64 new_size);
+int btrfs_init_new_device(struct btrfs_root *root, char *path);
+int btrfs_balance(struct btrfs_root *dev_root);
+void btrfs_unlock_volumes(void);
+void btrfs_lock_volumes(void);
+int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset);
+#endif
diff --git a/fs/btrfs/xattr.c b/fs/btrfs/xattr.c
new file mode 100644 (file)
index 0000000..7f332e2
--- /dev/null
@@ -0,0 +1,322 @@
+/*
+ * Copyright (C) 2007 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/rwsem.h>
+#include <linux/xattr.h>
+#include "ctree.h"
+#include "btrfs_inode.h"
+#include "transaction.h"
+#include "xattr.h"
+#include "disk-io.h"
+
+
+ssize_t __btrfs_getxattr(struct inode *inode, const char *name,
+                               void *buffer, size_t size)
+{
+       struct btrfs_dir_item *di;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_path *path;
+       struct extent_buffer *leaf;
+       int ret = 0;
+       unsigned long data_ptr;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       /* lookup the xattr by name */
+       di = btrfs_lookup_xattr(NULL, root, path, inode->i_ino, name,
+                               strlen(name), 0);
+       if (!di || IS_ERR(di)) {
+               ret = -ENODATA;
+               goto out;
+       }
+
+       leaf = path->nodes[0];
+       /* if size is 0, that means we want the size of the attr */
+       if (!size) {
+               ret = btrfs_dir_data_len(leaf, di);
+               goto out;
+       }
+
+       /* now get the data out of our dir_item */
+       if (btrfs_dir_data_len(leaf, di) > size) {
+               ret = -ERANGE;
+               goto out;
+       }
+       data_ptr = (unsigned long)((char *)(di + 1) +
+                                  btrfs_dir_name_len(leaf, di));
+       read_extent_buffer(leaf, buffer, data_ptr,
+                          btrfs_dir_data_len(leaf, di));
+       ret = btrfs_dir_data_len(leaf, di);
+
+out:
+       btrfs_free_path(path);
+       return ret;
+}
+
+int __btrfs_setxattr(struct inode *inode, const char *name,
+                           const void *value, size_t size, int flags)
+{
+       struct btrfs_dir_item *di;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_trans_handle *trans;
+       struct btrfs_path *path;
+       int ret = 0, mod = 0;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+
+       trans = btrfs_start_transaction(root, 1);
+       btrfs_set_trans_block_group(trans, inode);
+
+       /* first let's see if we already have this xattr */
+       di = btrfs_lookup_xattr(trans, root, path, inode->i_ino, name,
+                               strlen(name), -1);
+       if (IS_ERR(di)) {
+               ret = PTR_ERR(di);
+               goto out;
+       }
+
+       /* ok we already have this xattr, let's remove it */
+       if (di) {
+               /* if the caller wanted create-only, bail out */
+               if (flags & XATTR_CREATE) {
+                       ret = -EEXIST;
+                       goto out;
+               }
+
+               ret = btrfs_delete_one_dir_name(trans, root, path, di);
+               if (ret)
+                       goto out;
+               btrfs_release_path(root, path);
+
+               /* if we don't have a value then we are removing the xattr */
+               if (!value) {
+                       mod = 1;
+                       goto out;
+               }
+       } else {
+               btrfs_release_path(root, path);
+
+               if (flags & XATTR_REPLACE) {
+                       /* we couldn't find the attr to replace */
+                       ret = -ENODATA;
+                       goto out;
+               }
+       }
+
+       /* ok we have to create a completely new xattr */
+       ret = btrfs_insert_xattr_item(trans, root, name, strlen(name),
+                                     value, size, inode->i_ino);
+       if (ret)
+               goto out;
+       mod = 1;
+
+out:
+       if (mod) {
+               inode->i_ctime = CURRENT_TIME;
+               ret = btrfs_update_inode(trans, root, inode);
+       }
+
+       btrfs_end_transaction(trans, root);
+       btrfs_free_path(path);
+       return ret;
+}
+
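+/*
+ * copy every xattr name attached to this inode into the caller's buffer,
+ * back to back and nul-terminated.  With a zero size we only report how
+ * big the buffer would need to be
+ */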
+ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size)
+{
+       struct btrfs_key key, found_key;
+       struct inode *inode = dentry->d_inode;
+       struct btrfs_root *root = BTRFS_I(inode)->root;
+       struct btrfs_path *path;
+       struct btrfs_item *item;
+       struct extent_buffer *leaf;
+       struct btrfs_dir_item *di;
+       int ret = 0, slot, advance;
+       size_t total_size = 0, size_left = size;
+       unsigned long name_ptr;
+       size_t name_len;
+       u32 nritems;
+
+       /*
+        * ok we want all objects associated with this id.
+        * NOTE: we set key.offset = 0; because we want to start with the
+        * first xattr that we find and walk forward
+        */
+       key.objectid = inode->i_ino;
+       btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
+       key.offset = 0;
+
+       path = btrfs_alloc_path();
+       if (!path)
+               return -ENOMEM;
+       path->reada = 2;
+
+       /* search for our xattrs */
+       ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+       if (ret < 0)
+               goto err;
+       ret = 0;
+       advance = 0;
+       while (1) {
+               leaf = path->nodes[0];
+               nritems = btrfs_header_nritems(leaf);
+               slot = path->slots[0];
+
+               /* this is where we start walking through the path */
+               if (advance || slot >= nritems) {
+                       /*
+                        * if we've reached the last slot in this leaf we need
+                        * to go to the next leaf and reset everything
+                        */
+                       if (slot >= nritems-1) {
+                               ret = btrfs_next_leaf(root, path);
+                               if (ret)
+                                       break;
+                               leaf = path->nodes[0];
+                               nritems = btrfs_header_nritems(leaf);
+                               slot = path->slots[0];
+                       } else {
+                               /*
+                                * just walking through the slots on this leaf
+                                */
+                               slot++;
+                               path->slots[0]++;
+                       }
+               }
+               advance = 1;
+
+               item = btrfs_item_nr(leaf, slot);
+               btrfs_item_key_to_cpu(leaf, &found_key, slot);
+
+               /* check to make sure this item is what we want */
+               if (found_key.objectid != key.objectid)
+                       break;
+               if (btrfs_key_type(&found_key) != BTRFS_XATTR_ITEM_KEY)
+                       break;
+
+               di = btrfs_item_ptr(leaf, slot, struct btrfs_dir_item);
+
+               name_len = btrfs_dir_name_len(leaf, di);
+               total_size += name_len + 1;
+
+               /* we are just looking for how big our buffer needs to be */
+               if (!size)
+                       continue;
+
+               if (!buffer || (name_len + 1) > size_left) {
+                       ret = -ERANGE;
+                       goto err;
+               }
+
+               name_ptr = (unsigned long)(di + 1);
+               read_extent_buffer(leaf, buffer, name_ptr, name_len);
+               buffer[name_len] = '\0';
+
+               size_left -= name_len + 1;
+               buffer += name_len + 1;
+       }
+       ret = total_size;
+
+err:
+       btrfs_free_path(path);
+
+       return ret;
+}
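
The "if (!size) continue" branch above implements the usual listxattr(2) protocol: called with a zero-length buffer, the function only totals up how much space the NUL-separated name list needs. A sketch of the matching two-call pattern from userspace (error handling abbreviated; the probed size can race with concurrent setxattr calls):

#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Probe for the needed size, then fetch the NUL-separated name list. */
static char *get_xattr_names(const char *path, ssize_t *len_out)
{
        ssize_t len = listxattr(path, NULL, 0);  /* size == 0: probe */
        char *buf;

        if (len <= 0)
                return NULL;
        buf = malloc(len);
        if (!buf)
                return NULL;
        len = listxattr(path, buf, len);  /* can race; real code retries */
        if (len < 0) {
                free(buf);
                return NULL;
        }
        *len_out = len;
        return buf;
}
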
+
+/*
+ * List of handlers for synthetic system.* attributes.  All real on-disk
+ * attributes are handled directly.
+ */
+struct xattr_handler *btrfs_xattr_handlers[] = {
+#ifdef CONFIG_FS_POSIX_ACL
+       &btrfs_xattr_acl_access_handler,
+       &btrfs_xattr_acl_default_handler,
+#endif
+       NULL,
+};
+
+/*
+ * Check if the attribute is in a supported namespace.
+ *
+ * This is applied after the check for the synthetic attributes in the system
+ * namespace.
+ */
+static bool btrfs_is_valid_xattr(const char *name)
+{
+       return !strncmp(name, XATTR_SECURITY_PREFIX,
+                       XATTR_SECURITY_PREFIX_LEN) ||
+              !strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN) ||
+              !strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN) ||
+              !strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN);
+}
+
+ssize_t btrfs_getxattr(struct dentry *dentry, const char *name,
+                      void *buffer, size_t size)
+{
+       /*
+        * If this is a request for a synthetic attribute in the system.*
+        * namespace use the generic infrastructure to resolve a handler
+        * for it via sb->s_xattr.
+        */
+       if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
+               return generic_getxattr(dentry, name, buffer, size);
+
+       if (!btrfs_is_valid_xattr(name))
+               return -EOPNOTSUPP;
+       return __btrfs_getxattr(dentry->d_inode, name, buffer, size);
+}
+
+int btrfs_setxattr(struct dentry *dentry, const char *name, const void *value,
+                  size_t size, int flags)
+{
+       /*
+        * If this is a request for a synthetic attribute in the system.*
+        * namespace use the generic infrastructure to resolve a handler
+        * for it via sb->s_xattr.
+        */
+       if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
+               return generic_setxattr(dentry, name, value, size, flags);
+
+       if (!btrfs_is_valid_xattr(name))
+               return -EOPNOTSUPP;
+
+       if (size == 0)
+               value = "";  /* empty EA, do not remove */
+       return __btrfs_setxattr(dentry->d_inode, name, value, size, flags);
+}
+
+int btrfs_removexattr(struct dentry *dentry, const char *name)
+{
+       /*
+        * If this is a request for a synthetic attribute in the system.*
+        * namespace use the generic infrastructure to resolve a handler
+        * for it via sb->s_xattr.
+        */
+       if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
+               return generic_removexattr(dentry, name);
+
+       if (!btrfs_is_valid_xattr(name))
+               return -EOPNOTSUPP;
+       return __btrfs_setxattr(dentry->d_inode, name, NULL, 0, XATTR_REPLACE);
+}
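
All three entry points share one shape: system.* names are routed through the generic handler table resolved via sb->s_xattr, while the other supported prefixes go straight to the on-disk implementation. From userspace the namespace is simply the name's prefix, as in this hedged example (mount point and attribute name are hypothetical):

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
        /*
         * "user.comment" is a made-up attribute in the user namespace;
         * btrfs stores it on disk via __btrfs_setxattr, whereas a name
         * like "system.posix_acl_access" would be routed to the
         * synthetic system.* handlers above instead.
         */
        const char *value = "example";

        if (setxattr("/mnt/btrfs/file", "user.comment", value,
                     strlen(value), 0) != 0)
                perror("setxattr");
        return 0;
}
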
diff --git a/fs/btrfs/xattr.h b/fs/btrfs/xattr.h
new file mode 100644 (file)
index 0000000..5b1d08f
--- /dev/null
@@ -0,0 +1,39 @@
+/*
+ * Copyright (C) 2007 Red Hat.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ */
+
+#ifndef __XATTR__
+#define __XATTR__
+
+#include <linux/xattr.h>
+
+extern struct xattr_handler btrfs_xattr_acl_access_handler;
+extern struct xattr_handler btrfs_xattr_acl_default_handler;
+extern struct xattr_handler *btrfs_xattr_handlers[];
+
+extern ssize_t __btrfs_getxattr(struct inode *inode, const char *name,
+               void *buffer, size_t size);
+extern int __btrfs_setxattr(struct inode *inode, const char *name,
+               const void *value, size_t size, int flags);
+
+extern ssize_t btrfs_getxattr(struct dentry *dentry, const char *name,
+               void *buffer, size_t size);
+extern int btrfs_setxattr(struct dentry *dentry, const char *name,
+               const void *value, size_t size, int flags);
+extern int btrfs_removexattr(struct dentry *dentry, const char *name);
+
+#endif /* __XATTR__ */
diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
new file mode 100644 (file)
index 0000000..ecfbce8
--- /dev/null
@@ -0,0 +1,632 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 02111-1307, USA.
+ *
+ * Based on jffs2 zlib code:
+ * Copyright © 2001-2007 Red Hat, Inc.
+ * Created by David Woodhouse <dwmw2@infradead.org>
+ */
+
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/zlib.h>
+#include <linux/zutil.h>
+#include <linux/vmalloc.h>
+#include <linux/init.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/pagemap.h>
+#include <linux/bio.h>
+#include "compression.h"
+
+/* Plan: call deflate() with avail_in == *sourcelen,
+       avail_out = *dstlen - 12 and flush == Z_FINISH.
+       If it doesn't manage to finish, call it again with
+       avail_in == 0 and avail_out set to the remaining 12
+       bytes for it to clean up.
+   Q: Is 12 bytes sufficient?
+*/
+#define STREAM_END_SPACE 12
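
The plan in the comment maps directly onto zlib's streaming API. A userspace sketch of the same two-step finish, using the ordinary zlib.h entry points rather than the kernel's zlib_* wrappers (buffer management is illustrative only):

#include <string.h>
#include <zlib.h>

/*
 * Compress src into dst, holding back 12 bytes of output space so a
 * final Z_FINISH always has room for the stream trailer.  Assumes
 * *dstlen > 12.  Returns 0 and updates *dstlen on success.
 */
static int compress_with_slack(const unsigned char *src, unsigned long srclen,
                               unsigned char *dst, unsigned long *dstlen)
{
        z_stream s;
        int ret;

        memset(&s, 0, sizeof(s));
        if (deflateInit(&s, 3) != Z_OK)         /* level 3, as below */
                return -1;

        s.next_in = (unsigned char *)src;
        s.avail_in = srclen;
        s.next_out = dst;
        s.avail_out = *dstlen - 12;

        ret = deflate(&s, Z_FINISH);
        if (ret != Z_STREAM_END) {
                /* didn't finish: hand it the reserved 12 bytes */
                s.avail_out += 12;
                ret = deflate(&s, Z_FINISH);
        }
        deflateEnd(&s);
        if (ret != Z_STREAM_END)
                return -1;
        *dstlen = s.total_out;
        return 0;
}
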
+
+struct workspace {
+       z_stream inf_strm;
+       z_stream def_strm;
+       char *buf;
+       struct list_head list;
+};
+
+static LIST_HEAD(idle_workspace);
+static DEFINE_SPINLOCK(workspace_lock);
+static unsigned long num_workspace;
+static atomic_t alloc_workspace = ATOMIC_INIT(0);
+static DECLARE_WAIT_QUEUE_HEAD(workspace_wait);
+
+/*
+ * this finds an available zlib workspace or allocates a new one
+ * an ERR_PTR is returned if things go bad.
+ */
+static struct workspace *find_zlib_workspace(void)
+{
+       struct workspace *workspace;
+       int ret;
+       int cpus = num_online_cpus();
+
+again:
+       spin_lock(&workspace_lock);
+       if (!list_empty(&idle_workspace)) {
+               workspace = list_entry(idle_workspace.next, struct workspace,
+                                      list);
+               list_del(&workspace->list);
+               num_workspace--;
+               spin_unlock(&workspace_lock);
+               return workspace;
+
+       }
+       spin_unlock(&workspace_lock);
+       if (atomic_read(&alloc_workspace) > cpus) {
+               DEFINE_WAIT(wait);
+               prepare_to_wait(&workspace_wait, &wait, TASK_UNINTERRUPTIBLE);
+               if (atomic_read(&alloc_workspace) > cpus)
+                       schedule();
+               finish_wait(&workspace_wait, &wait);
+               goto again;
+       }
+       atomic_inc(&alloc_workspace);
+       workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
+       if (!workspace) {
+               ret = -ENOMEM;
+               goto fail;
+       }
+
+       workspace->def_strm.workspace = vmalloc(zlib_deflate_workspacesize());
+       if (!workspace->def_strm.workspace) {
+               ret = -ENOMEM;
+               goto fail;
+       }
+       workspace->inf_strm.workspace = vmalloc(zlib_inflate_workspacesize());
+       if (!workspace->inf_strm.workspace) {
+               ret = -ENOMEM;
+               goto fail_inflate;
+       }
+       workspace->buf = kmalloc(PAGE_CACHE_SIZE, GFP_NOFS);
+       if (!workspace->buf) {
+               ret = -ENOMEM;
+               goto fail_kmalloc;
+       }
+       return workspace;
+
+fail_kmalloc:
+       vfree(workspace->inf_strm.workspace);
+fail_inflate:
+       vfree(workspace->def_strm.workspace);
+fail:
+       kfree(workspace);
+       atomic_dec(&alloc_workspace);
+       wake_up(&workspace_wait);
+       return ERR_PTR(ret);
+}
+
+/*
+ * put a workspace struct back on the list or free it if we have enough
+ * idle ones sitting around
+ */
+static int free_workspace(struct workspace *workspace)
+{
+       spin_lock(&workspace_lock);
+       if (num_workspace < num_online_cpus()) {
+               list_add_tail(&workspace->list, &idle_workspace);
+               num_workspace++;
+               spin_unlock(&workspace_lock);
+               if (waitqueue_active(&workspace_wait))
+                       wake_up(&workspace_wait);
+               return 0;
+       }
+       spin_unlock(&workspace_lock);
+       vfree(workspace->def_strm.workspace);
+       vfree(workspace->inf_strm.workspace);
+       kfree(workspace->buf);
+       kfree(workspace);
+
+       atomic_dec(&alloc_workspace);
+       if (waitqueue_active(&workspace_wait))
+               wake_up(&workspace_wait);
+       return 0;
+}
+
+/*
+ * cleanup function for module exit
+ */
+static void free_workspaces(void)
+{
+       struct workspace *workspace;
+       while (!list_empty(&idle_workspace)) {
+               workspace = list_entry(idle_workspace.next, struct workspace,
+                                      list);
+               list_del(&workspace->list);
+               vfree(workspace->def_strm.workspace);
+               vfree(workspace->inf_strm.workspace);
+               kfree(workspace->buf);
+               kfree(workspace);
+               atomic_dec(&alloc_workspace);
+       }
+}
+
+/*
+ * given an address space and start/len, compress the bytes.
+ *
+ * pages are allocated to hold the compressed result and stored
+ * in 'pages'
+ *
+ * out_pages is used to return the number of pages allocated.  There
+ * may be pages allocated even if we return an error
+ *
+ * total_in is used to return the number of bytes actually read.  It
+ * may be smaller than len if we had to exit early because we
+ * ran out of room in the pages array or because we crossed the
+ * max_out threshold.
+ *
+ * total_out is used to return the total number of compressed bytes
+ *
+ * max_out tells us the max number of bytes that we're allowed to
+ * stuff into pages
+ */
+int btrfs_zlib_compress_pages(struct address_space *mapping,
+                             u64 start, unsigned long len,
+                             struct page **pages,
+                             unsigned long nr_dest_pages,
+                             unsigned long *out_pages,
+                             unsigned long *total_in,
+                             unsigned long *total_out,
+                             unsigned long max_out)
+{
+       int ret;
+       struct workspace *workspace;
+       char *data_in;
+       char *cpage_out;
+       int nr_pages = 0;
+       struct page *in_page = NULL;
+       struct page *out_page = NULL;
+       int out_written = 0;
+       int in_read = 0;
+       unsigned long bytes_left;
+
+       *out_pages = 0;
+       *total_out = 0;
+       *total_in = 0;
+
+       workspace = find_zlib_workspace();
+       if (IS_ERR(workspace))
+               return PTR_ERR(workspace);
+
+       if (Z_OK != zlib_deflateInit(&workspace->def_strm, 3)) {
+               printk(KERN_WARNING "deflateInit failed\n");
+               ret = -1;
+               goto out;
+       }
+
+       workspace->def_strm.total_in = 0;
+       workspace->def_strm.total_out = 0;
+
+       in_page = find_get_page(mapping, start >> PAGE_CACHE_SHIFT);
+       data_in = kmap(in_page);
+
+       out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+       cpage_out = kmap(out_page);
+       pages[0] = out_page;
+       nr_pages = 1;
+
+       workspace->def_strm.next_in = data_in;
+       workspace->def_strm.next_out = cpage_out;
+       workspace->def_strm.avail_out = PAGE_CACHE_SIZE;
+       workspace->def_strm.avail_in = min(len, PAGE_CACHE_SIZE);
+
+       out_written = 0;
+       in_read = 0;
+
+       while (workspace->def_strm.total_in < len) {
+               ret = zlib_deflate(&workspace->def_strm, Z_SYNC_FLUSH);
+               if (ret != Z_OK) {
+                       printk(KERN_DEBUG "btrfs deflate in loop returned %d\n",
+                              ret);
+                       zlib_deflateEnd(&workspace->def_strm);
+                       ret = -1;
+                       goto out;
+               }
+
+               /* we're making it bigger, give up */
+               if (workspace->def_strm.total_in > 8192 &&
+                   workspace->def_strm.total_in <
+                   workspace->def_strm.total_out) {
+                       ret = -1;
+                       goto out;
+               }
+               /* we need another page for writing out.  Test this
+                * before the total_in so we will pull in a new page for
+                * the stream end if required
+                */
+               if (workspace->def_strm.avail_out == 0) {
+                       kunmap(out_page);
+                       if (nr_pages == nr_dest_pages) {
+                               out_page = NULL;
+                               ret = -1;
+                               goto out;
+                       }
+                       out_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+                       cpage_out = kmap(out_page);
+                       pages[nr_pages] = out_page;
+                       nr_pages++;
+                       workspace->def_strm.avail_out = PAGE_CACHE_SIZE;
+                       workspace->def_strm.next_out = cpage_out;
+               }
+               /* we're all done */
+               if (workspace->def_strm.total_in >= len)
+                       break;
+
+               /* we've read in a full page, get a new one */
+               if (workspace->def_strm.avail_in == 0) {
+                       if (workspace->def_strm.total_out > max_out)
+                               break;
+
+                       bytes_left = len - workspace->def_strm.total_in;
+                       kunmap(in_page);
+                       page_cache_release(in_page);
+
+                       start += PAGE_CACHE_SIZE;
+                       in_page = find_get_page(mapping,
+                                               start >> PAGE_CACHE_SHIFT);
+                       data_in = kmap(in_page);
+                       workspace->def_strm.avail_in = min(bytes_left,
+                                                          PAGE_CACHE_SIZE);
+                       workspace->def_strm.next_in = data_in;
+               }
+       }
+       workspace->def_strm.avail_in = 0;
+       ret = zlib_deflate(&workspace->def_strm, Z_FINISH);
+       zlib_deflateEnd(&workspace->def_strm);
+
+       if (ret != Z_STREAM_END) {
+               ret = -1;
+               goto out;
+       }
+
+       if (workspace->def_strm.total_out >= workspace->def_strm.total_in) {
+               ret = -1;
+               goto out;
+       }
+
+       ret = 0;
+       *total_out = workspace->def_strm.total_out;
+       *total_in = workspace->def_strm.total_in;
+out:
+       *out_pages = nr_pages;
+       if (out_page)
+               kunmap(out_page);
+
+       if (in_page) {
+               kunmap(in_page);
+               page_cache_release(in_page);
+       }
+       free_workspace(workspace);
+       return ret;
+}
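
The loop above is streaming deflate over page-sized windows: refill avail_in one page at a time, grow the output one page at a time, and give up as soon as the "compressed" stream stops being smaller than the input. A simplified userspace model of the same control flow, with flat buffers instead of mapped pages (it assumes the output buffer is large enough):

#include <string.h>
#include <zlib.h>

#define CHUNK 4096

/*
 * Feed input in fixed CHUNK-sized slices, as the page loop above does.
 * Returns the compressed size, or -1 if compression failed or was not
 * worth it.
 */
static long compress_chunked(const unsigned char *in, unsigned long len,
                             unsigned char *out, unsigned long outlen)
{
        z_stream s;
        long total;

        memset(&s, 0, sizeof(s));
        if (deflateInit(&s, 3) != Z_OK)
                return -1;

        s.next_out = out;
        s.avail_out = outlen;

        while (s.total_in < len) {
                unsigned long left = len - s.total_in;

                s.next_in = (unsigned char *)in + s.total_in;
                s.avail_in = left < CHUNK ? left : CHUNK;
                if (deflate(&s, Z_SYNC_FLUSH) != Z_OK)
                        goto fail;
                /* same heuristic as above: output growing, give up */
                if (s.total_in > 8192 && s.total_out > s.total_in)
                        goto fail;
        }

        s.avail_in = 0;
        if (deflate(&s, Z_FINISH) != Z_STREAM_END ||
            s.total_out >= s.total_in)
                goto fail;
        total = (long)s.total_out;
        deflateEnd(&s);
        return total;
fail:
        deflateEnd(&s);
        return -1;
}
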
+
+/*
+ * pages_in is an array of pages with compressed data.
+ *
+ * disk_start is the starting logical offset of this array in the file
+ *
+ * bvec is a bio_vec of pages from the file that we want to decompress into
+ *
+ * vcnt is the count of pages in the biovec
+ *
+ * srclen is the number of bytes in pages_in
+ *
+ * The basic idea is that we have a bio that was created by readpages.
+ * The pages in the bio are for the uncompressed data, and they may not
+ * be contiguous.  They all correspond to the range of bytes covered by
+ * the compressed extent.
+ */
+int btrfs_zlib_decompress_biovec(struct page **pages_in,
+                             u64 disk_start,
+                             struct bio_vec *bvec,
+                             int vcnt,
+                             size_t srclen)
+{
+       int ret = 0;
+       int wbits = MAX_WBITS;
+       struct workspace *workspace;
+       char *data_in;
+       size_t total_out = 0;
+       unsigned long page_bytes_left;
+       unsigned long page_in_index = 0;
+       unsigned long page_out_index = 0;
+       struct page *page_out;
+       unsigned long total_pages_in = (srclen + PAGE_CACHE_SIZE - 1) /
+                                       PAGE_CACHE_SIZE;
+       unsigned long buf_start;
+       unsigned long buf_offset;
+       unsigned long bytes;
+       unsigned long working_bytes;
+       unsigned long pg_offset;
+       unsigned long start_byte;
+       unsigned long current_buf_start;
+       char *kaddr;
+
+       workspace = find_zlib_workspace();
+       if (IS_ERR(workspace))
+               return PTR_ERR(workspace);
+
+       data_in = kmap(pages_in[page_in_index]);
+       workspace->inf_strm.next_in = data_in;
+       workspace->inf_strm.avail_in = min_t(size_t, srclen, PAGE_CACHE_SIZE);
+       workspace->inf_strm.total_in = 0;
+
+       workspace->inf_strm.total_out = 0;
+       workspace->inf_strm.next_out = workspace->buf;
+       workspace->inf_strm.avail_out = PAGE_CACHE_SIZE;
+       page_out = bvec[page_out_index].bv_page;
+       page_bytes_left = PAGE_CACHE_SIZE;
+       pg_offset = 0;
+
+       /* If it's deflate, and it's got no preset dictionary, then
+          we can tell zlib to skip the adler32 check. */
+       if (srclen > 2 && !(data_in[1] & PRESET_DICT) &&
+           ((data_in[0] & 0x0f) == Z_DEFLATED) &&
+           !(((data_in[0]<<8) + data_in[1]) % 31)) {
+
+               wbits = -((data_in[0] >> 4) + 8);
+               workspace->inf_strm.next_in += 2;
+               workspace->inf_strm.avail_in -= 2;
+       }
+
+       if (Z_OK != zlib_inflateInit2(&workspace->inf_strm, wbits)) {
+               printk(KERN_WARNING "inflateInit failed\n");
+               ret = -1;
+               goto out;
+       }
+       while (workspace->inf_strm.total_in < srclen) {
+               ret = zlib_inflate(&workspace->inf_strm, Z_NO_FLUSH);
+               if (ret != Z_OK && ret != Z_STREAM_END)
+                       break;
+               /*
+                * buf_start is the byte offset of the start of our
+                * workspace buffer within the uncompressed data
+                */
+               buf_start = total_out;
+
+               /* total_out marks the end of the data in the workspace buffer */
+               total_out = workspace->inf_strm.total_out;
+
+               working_bytes = total_out - buf_start;
+
+               /*
+                * start byte is the first byte of the page we're currently
+                * copying into relative to the start of the uncompressed data.
+                */
+               start_byte = page_offset(page_out) - disk_start;
+
+               if (working_bytes == 0) {
+                       /* we didn't make progress in this inflate
+                        * call, we're done
+                        */
+                       if (ret != Z_STREAM_END)
+                               ret = -1;
+                       break;
+               }
+
+               /* we haven't yet hit data corresponding to this page */
+               if (total_out <= start_byte)
+                       goto next;
+
+               /*
+                * the start of the data we care about is offset into
+                * the middle of our working buffer
+                */
+               if (total_out > start_byte && buf_start < start_byte) {
+                       buf_offset = start_byte - buf_start;
+                       working_bytes -= buf_offset;
+               } else {
+                       buf_offset = 0;
+               }
+               current_buf_start = buf_start;
+
+               /* copy bytes from the working buffer into the pages */
+               while (working_bytes > 0) {
+                       bytes = min(PAGE_CACHE_SIZE - pg_offset,
+                                   PAGE_CACHE_SIZE - buf_offset);
+                       bytes = min(bytes, working_bytes);
+                       kaddr = kmap_atomic(page_out, KM_USER0);
+                       memcpy(kaddr + pg_offset, workspace->buf + buf_offset,
+                              bytes);
+                       kunmap_atomic(kaddr, KM_USER0);
+                       flush_dcache_page(page_out);
+
+                       pg_offset += bytes;
+                       page_bytes_left -= bytes;
+                       buf_offset += bytes;
+                       working_bytes -= bytes;
+                       current_buf_start += bytes;
+
+                       /* check if we need to pick another page */
+                       if (page_bytes_left == 0) {
+                               page_out_index++;
+                               if (page_out_index >= vcnt) {
+                                       ret = 0;
+                                       goto done;
+                               }
+
+                               page_out = bvec[page_out_index].bv_page;
+                               pg_offset = 0;
+                               page_bytes_left = PAGE_CACHE_SIZE;
+                               start_byte = page_offset(page_out) - disk_start;
+
+                               /*
+                                * make sure our new page is covered by this
+                                * working buffer
+                                */
+                               if (total_out <= start_byte)
+                                       goto next;
+
+                               /* the next page in the biovec might not
+                                * be adjacent to the last page, but it
+                                * might still be found inside this working
+                                * buffer.  bump our offset pointer
+                                */
+                               if (total_out > start_byte &&
+                                   current_buf_start < start_byte) {
+                                       buf_offset = start_byte - buf_start;
+                                       working_bytes = total_out - start_byte;
+                                       current_buf_start = buf_start +
+                                               buf_offset;
+                               }
+                       }
+               }
+next:
+               workspace->inf_strm.next_out = workspace->buf;
+               workspace->inf_strm.avail_out = PAGE_CACHE_SIZE;
+
+               if (workspace->inf_strm.avail_in == 0) {
+                       unsigned long tmp;
+                       kunmap(pages_in[page_in_index]);
+                       page_in_index++;
+                       if (page_in_index >= total_pages_in) {
+                               data_in = NULL;
+                               break;
+                       }
+                       data_in = kmap(pages_in[page_in_index]);
+                       workspace->inf_strm.next_in = data_in;
+                       tmp = srclen - workspace->inf_strm.total_in;
+                       workspace->inf_strm.avail_in = min(tmp,
+                                                          PAGE_CACHE_SIZE);
+               }
+       }
+       if (ret != Z_STREAM_END)
+               ret = -1;
+       else
+               ret = 0;
+done:
+       zlib_inflateEnd(&workspace->inf_strm);
+       if (data_in)
+               kunmap(pages_in[page_in_index]);
+out:
+       free_workspace(workspace);
+       return ret;
+}
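
The header test near the top of this function is the RFC 1950 check: the low nibble of byte 0 is the compression method (8 = deflate), the 16-bit header must be a multiple of 31, and bit 0x20 of byte 1 (PRESET_DICT) flags a preset dictionary. When it passes, a negative windowBits value makes zlib inflate raw deflate data and skip the adler32 verification. As a standalone helper the test looks like this (a sketch; callers going raw must also step past the two header bytes, as the kernel code does):

#include <stddef.h>
#include <zlib.h>

/*
 * Return negative windowBits for a raw-deflate inflateInit2(), or
 * MAX_WBITS if data doesn't carry a plain zlib header.  0x20 is the
 * RFC 1950 FDICT (preset dictionary) bit.
 */
static int pick_wbits(const unsigned char *data, size_t len)
{
        if (len > 2 && !(data[1] & 0x20) &&
            (data[0] & 0x0f) == Z_DEFLATED &&
            ((data[0] << 8) + data[1]) % 31 == 0)
                return -((data[0] >> 4) + 8);   /* CINFO + 8 = log2(window) */
        return MAX_WBITS;
}
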
+
+/*
+ * a less complex decompression routine.  Our compressed data fits in a
+ * single page, and we want to read a single page out of it.
+ * start_byte tells us the offset into the compressed data we're interested in
+ */
+int btrfs_zlib_decompress(unsigned char *data_in,
+                         struct page *dest_page,
+                         unsigned long start_byte,
+                         size_t srclen, size_t destlen)
+{
+       int ret = 0;
+       int wbits = MAX_WBITS;
+       struct workspace *workspace;
+       unsigned long bytes_left = destlen;
+       unsigned long total_out = 0;
+       char *kaddr;
+
+       if (destlen > PAGE_CACHE_SIZE)
+               return -ENOMEM;
+
+       workspace = find_zlib_workspace();
+       if (IS_ERR(workspace))
+               return PTR_ERR(workspace);
+
+       workspace->inf_strm.next_in = data_in;
+       workspace->inf_strm.avail_in = srclen;
+       workspace->inf_strm.total_in = 0;
+
+       workspace->inf_strm.next_out = workspace->buf;
+       workspace->inf_strm.avail_out = PAGE_CACHE_SIZE;
+       workspace->inf_strm.total_out = 0;
+       /* If it's deflate, and it's got no preset dictionary, then
+          we can tell zlib to skip the adler32 check. */
+       if (srclen > 2 && !(data_in[1] & PRESET_DICT) &&
+           ((data_in[0] & 0x0f) == Z_DEFLATED) &&
+           !(((data_in[0]<<8) + data_in[1]) % 31)) {
+
+               wbits = -((data_in[0] >> 4) + 8);
+               workspace->inf_strm.next_in += 2;
+               workspace->inf_strm.avail_in -= 2;
+       }
+
+       if (Z_OK != zlib_inflateInit2(&workspace->inf_strm, wbits)) {
+               printk(KERN_WARNING "inflateInit failed\n");
+               ret = -1;
+               goto out;
+       }
+
+       while (bytes_left > 0) {
+               unsigned long buf_start;
+               unsigned long buf_offset;
+               unsigned long bytes;
+               unsigned long pg_offset = 0;
+
+               ret = zlib_inflate(&workspace->inf_strm, Z_NO_FLUSH);
+               if (ret != Z_OK && ret != Z_STREAM_END)
+                       break;
+
+               buf_start = total_out;
+               total_out = workspace->inf_strm.total_out;
+
+               if (total_out == buf_start) {
+                       ret = -1;
+                       break;
+               }
+
+               if (total_out <= start_byte)
+                       goto next;
+
+               if (total_out > start_byte && buf_start < start_byte)
+                       buf_offset = start_byte - buf_start;
+               else
+                       buf_offset = 0;
+
+               bytes = min(PAGE_CACHE_SIZE - pg_offset,
+                           PAGE_CACHE_SIZE - buf_offset);
+               bytes = min(bytes, bytes_left);
+
+               kaddr = kmap_atomic(dest_page, KM_USER0);
+               memcpy(kaddr + pg_offset, workspace->buf + buf_offset, bytes);
+               kunmap_atomic(kaddr, KM_USER0);
+
+               pg_offset += bytes;
+               bytes_left -= bytes;
+next:
+               workspace->inf_strm.next_out = workspace->buf;
+               workspace->inf_strm.avail_out = PAGE_CACHE_SIZE;
+       }
+
+       if (ret != Z_STREAM_END && bytes_left != 0)
+               ret = -1;
+       else
+               ret = 0;
+
+       zlib_inflateEnd(&workspace->inf_strm);
+out:
+       free_workspace(workspace);
+       return ret;
+}
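
For the userspace mirror of this routine: initialise with the windowBits chosen by the header check, then run inflate() to Z_STREAM_END. A minimal single-shot sketch, assuming the destination buffer is large enough for the whole output:

#include <string.h>
#include <zlib.h>

/* Single-shot inflate; assumes dst can hold the entire output. */
static int inflate_buf(const unsigned char *src, unsigned long srclen,
                       unsigned char *dst, unsigned long *dstlen, int wbits)
{
        z_stream s;
        int ret;

        memset(&s, 0, sizeof(s));
        if (inflateInit2(&s, wbits) != Z_OK)
                return -1;

        /* with negative wbits (raw deflate) the caller has already
         * stepped past the 2-byte zlib header, as the kernel code does */
        s.next_in = (unsigned char *)src;
        s.avail_in = srclen;
        s.next_out = dst;
        s.avail_out = *dstlen;

        ret = inflate(&s, Z_FINISH);
        inflateEnd(&s);
        if (ret != Z_STREAM_END)
                return -1;
        *dstlen = s.total_out;
        return 0;
}
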
+
+void btrfs_zlib_exit(void)
+{
+       free_workspaces();
+}
index 5198ada67398f83019066002a733fdc54a5b8f90..6d720243f5f4a29a43a7b50dd005bbff32a7b0aa 100644 (file)
@@ -334,6 +334,7 @@ void delete_partition(struct gendisk *disk, int partno)
 
        blk_free_devt(part_devt(part));
        rcu_assign_pointer(ptbl->part[partno], NULL);
+       rcu_assign_pointer(ptbl->last_lookup, NULL);
        kobject_put(part->holder_dir);
        device_del(part_to_dev(part));
 
index 3e8aeb8b61ce251601f8795d3f6bd4fd423beed0..cd53ff838498de03973494d6a70ddae9f193a1ce 100644 (file)
@@ -41,8 +41,6 @@ do {                                          \
        (vmi)->used = 0;                        \
        (vmi)->largest_chunk = 0;               \
 } while(0)
-
-extern int nommu_vma_show(struct seq_file *, struct vm_area_struct *);
 #endif
 
 extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns,
index b1675c4e66dae952d5b1dfbf0eade72f83077b7f..43d23948384addca6930aa65acbf49804226827f 100644 (file)
@@ -73,6 +73,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
                "HighFree:       %8lu kB\n"
                "LowTotal:       %8lu kB\n"
                "LowFree:        %8lu kB\n"
+#endif
+#ifndef CONFIG_MMU
+               "MmapCopy:       %8lu kB\n"
 #endif
                "SwapTotal:      %8lu kB\n"
                "SwapFree:       %8lu kB\n"
@@ -115,6 +118,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
                K(i.freehigh),
                K(i.totalram-i.totalhigh),
                K(i.freeram-i.freehigh),
+#endif
+#ifndef CONFIG_MMU
+               K((unsigned long) atomic_read(&mmap_pages_allocated)),
 #endif
                K(i.totalswap),
                K(i.freeswap),
index 3f87d2632947371b47df580c32d041fb39bb91ba..b446d7ad0b0da469f02521c451b98ec8cc1bed32 100644 (file)
 #include "internal.h"
 
 /*
- * display a single VMA to a sequenced file
+ * display a single region to a sequenced file
  */
-int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
+static int nommu_region_show(struct seq_file *m, struct vm_region *region)
 {
        unsigned long ino = 0;
        struct file *file;
        dev_t dev = 0;
        int flags, len;
 
-       flags = vma->vm_flags;
-       file = vma->vm_file;
+       flags = region->vm_flags;
+       file = region->vm_file;
 
        if (file) {
-               struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+               struct inode *inode = region->vm_file->f_path.dentry->d_inode;
                dev = inode->i_sb->s_dev;
                ino = inode->i_ino;
        }
 
        seq_printf(m,
                   "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
-                  vma->vm_start,
-                  vma->vm_end,
+                  region->vm_start,
+                  region->vm_end,
                   flags & VM_READ ? 'r' : '-',
                   flags & VM_WRITE ? 'w' : '-',
                   flags & VM_EXEC ? 'x' : '-',
                   flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
-                  ((loff_t)vma->vm_pgoff) << PAGE_SHIFT,
+                  ((loff_t)region->vm_pgoff) << PAGE_SHIFT,
                   MAJOR(dev), MINOR(dev), ino, &len);
 
        if (file) {
@@ -75,61 +75,54 @@ int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
 }
 
 /*
- * display a list of all the VMAs the kernel knows about
+ * display a list of all the REGIONs the kernel knows about
  * - nommu kernels have a single flat list
  */
-static int nommu_vma_list_show(struct seq_file *m, void *v)
+static int nommu_region_list_show(struct seq_file *m, void *_p)
 {
-       struct vm_area_struct *vma;
+       struct rb_node *p = _p;
 
-       vma = rb_entry((struct rb_node *) v, struct vm_area_struct, vm_rb);
-       return nommu_vma_show(m, vma);
+       return nommu_region_show(m, rb_entry(p, struct vm_region, vm_rb));
 }
 
-static void *nommu_vma_list_start(struct seq_file *m, loff_t *_pos)
+static void *nommu_region_list_start(struct seq_file *m, loff_t *_pos)
 {
-       struct rb_node *_rb;
+       struct rb_node *p;
        loff_t pos = *_pos;
-       void *next = NULL;
 
-       down_read(&nommu_vma_sem);
+       down_read(&nommu_region_sem);
 
-       for (_rb = rb_first(&nommu_vma_tree); _rb; _rb = rb_next(_rb)) {
-               if (pos == 0) {
-                       next = _rb;
-                       break;
-               }
-               pos--;
-       }
-
-       return next;
+       for (p = rb_first(&nommu_region_tree); p; p = rb_next(p))
+               if (pos-- == 0)
+                       return p;
+       return NULL;
 }
 
-static void nommu_vma_list_stop(struct seq_file *m, void *v)
+static void nommu_region_list_stop(struct seq_file *m, void *v)
 {
-       up_read(&nommu_vma_sem);
+       up_read(&nommu_region_sem);
 }
 
-static void *nommu_vma_list_next(struct seq_file *m, void *v, loff_t *pos)
+static void *nommu_region_list_next(struct seq_file *m, void *v, loff_t *pos)
 {
        (*pos)++;
        return rb_next((struct rb_node *) v);
 }
 
-static const struct seq_operations proc_nommu_vma_list_seqop = {
-       .start  = nommu_vma_list_start,
-       .next   = nommu_vma_list_next,
-       .stop   = nommu_vma_list_stop,
-       .show   = nommu_vma_list_show
+static struct seq_operations proc_nommu_region_list_seqop = {
+       .start  = nommu_region_list_start,
+       .next   = nommu_region_list_next,
+       .stop   = nommu_region_list_stop,
+       .show   = nommu_region_list_show
 };
 
-static int proc_nommu_vma_list_open(struct inode *inode, struct file *file)
+static int proc_nommu_region_list_open(struct inode *inode, struct file *file)
 {
-       return seq_open(file, &proc_nommu_vma_list_seqop);
+       return seq_open(file, &proc_nommu_region_list_seqop);
 }
 
-static const struct file_operations proc_nommu_vma_list_operations = {
-       .open    = proc_nommu_vma_list_open,
+static const struct file_operations proc_nommu_region_list_operations = {
+       .open    = proc_nommu_region_list_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = seq_release,
@@ -137,7 +130,7 @@ static const struct file_operations proc_nommu_vma_list_operations = {
 
 static int __init proc_nommu_init(void)
 {
-       proc_create("maps", S_IRUGO, NULL, &proc_nommu_vma_list_operations);
+       proc_create("maps", S_IRUGO, NULL, &proc_nommu_region_list_operations);
        return 0;
 }
 
index d4a8be32b902279a3f2c84b1d4d137636dd3f7de..343ea1216bc8e05ddc84e75af47cfbab73fee529 100644 (file)
  */
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
-       struct vm_list_struct *vml;
-       unsigned long bytes = 0, sbytes = 0, slack = 0;
+       struct vm_area_struct *vma;
+       struct vm_region *region;
+       struct rb_node *p;
+       unsigned long bytes = 0, sbytes = 0, slack = 0, size;
         
        down_read(&mm->mmap_sem);
-       for (vml = mm->context.vmlist; vml; vml = vml->next) {
-               if (!vml->vma)
-                       continue;
+       for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
+               vma = rb_entry(p, struct vm_area_struct, vm_rb);
+
+               bytes += kobjsize(vma);
+
+               region = vma->vm_region;
+               if (region) {
+                       size = kobjsize(region);
+                       size += region->vm_end - region->vm_start;
+               } else {
+                       size = vma->vm_end - vma->vm_start;
+               }
 
-               bytes += kobjsize(vml);
                if (atomic_read(&mm->mm_count) > 1 ||
-                   atomic_read(&vml->vma->vm_usage) > 1
-                   ) {
-                       sbytes += kobjsize((void *) vml->vma->vm_start);
-                       sbytes += kobjsize(vml->vma);
+                   vma->vm_flags & VM_MAYSHARE) {
+                       sbytes += size;
                } else {
-                       bytes += kobjsize((void *) vml->vma->vm_start);
-                       bytes += kobjsize(vml->vma);
-                       slack += kobjsize((void *) vml->vma->vm_start) -
-                               (vml->vma->vm_end - vml->vma->vm_start);
+                       bytes += size;
+                       if (region)
+                               slack = region->vm_end - vma->vm_end;
                }
        }
 
@@ -70,13 +77,14 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 
 unsigned long task_vsize(struct mm_struct *mm)
 {
-       struct vm_list_struct *tbp;
+       struct vm_area_struct *vma;
+       struct rb_node *p;
        unsigned long vsize = 0;
 
        down_read(&mm->mmap_sem);
-       for (tbp = mm->context.vmlist; tbp; tbp = tbp->next) {
-               if (tbp->vma)
-                       vsize += kobjsize((void *) tbp->vma->vm_start);
+       for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
+               vma = rb_entry(p, struct vm_area_struct, vm_rb);
+               vsize += vma->vm_end - vma->vm_start;
        }
        up_read(&mm->mmap_sem);
        return vsize;
@@ -85,15 +93,19 @@ unsigned long task_vsize(struct mm_struct *mm)
 int task_statm(struct mm_struct *mm, int *shared, int *text,
               int *data, int *resident)
 {
-       struct vm_list_struct *tbp;
+       struct vm_area_struct *vma;
+       struct vm_region *region;
+       struct rb_node *p;
        int size = kobjsize(mm);
 
        down_read(&mm->mmap_sem);
-       for (tbp = mm->context.vmlist; tbp; tbp = tbp->next) {
-               size += kobjsize(tbp);
-               if (tbp->vma) {
-                       size += kobjsize(tbp->vma);
-                       size += kobjsize((void *) tbp->vma->vm_start);
+       for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
+               vma = rb_entry(p, struct vm_area_struct, vm_rb);
+               size += kobjsize(vma);
+               region = vma->vm_region;
+               if (region) {
+                       size += kobjsize(region);
+                       size += region->vm_end - region->vm_start;
                }
        }
 
@@ -104,21 +116,63 @@ int task_statm(struct mm_struct *mm, int *shared, int *text,
        return size;
 }
 
+/*
+ * display a single VMA to a sequenced file
+ */
+static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
+{
+       unsigned long ino = 0;
+       struct file *file;
+       dev_t dev = 0;
+       int flags, len;
+
+       flags = vma->vm_flags;
+       file = vma->vm_file;
+
+       if (file) {
+               struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+               dev = inode->i_sb->s_dev;
+               ino = inode->i_ino;
+       }
+
+       seq_printf(m,
+                  "%08lx-%08lx %c%c%c%c %08lx %02x:%02x %lu %n",
+                  vma->vm_start,
+                  vma->vm_end,
+                  flags & VM_READ ? 'r' : '-',
+                  flags & VM_WRITE ? 'w' : '-',
+                  flags & VM_EXEC ? 'x' : '-',
+                  flags & VM_MAYSHARE ? flags & VM_SHARED ? 'S' : 's' : 'p',
+                  vma->vm_pgoff << PAGE_SHIFT,
+                  MAJOR(dev), MINOR(dev), ino, &len);
+
+       if (file) {
+               len = 25 + sizeof(void *) * 6 - len;
+               if (len < 1)
+                       len = 1;
+               seq_printf(m, "%*c", len, ' ');
+               seq_path(m, &file->f_path, "");
+       }
+
+       seq_putc(m, '\n');
+       return 0;
+}
+
 /*
  * display mapping lines for a particular process's /proc/pid/maps
  */
-static int show_map(struct seq_file *m, void *_vml)
+static int show_map(struct seq_file *m, void *_p)
 {
-       struct vm_list_struct *vml = _vml;
+       struct rb_node *p = _p;
 
-       return nommu_vma_show(m, vml->vma);
+       return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb));
 }
 
 static void *m_start(struct seq_file *m, loff_t *pos)
 {
        struct proc_maps_private *priv = m->private;
-       struct vm_list_struct *vml;
        struct mm_struct *mm;
+       struct rb_node *p;
        loff_t n = *pos;
 
        /* pin the task and mm whilst we play with them */
@@ -134,9 +188,9 @@ static void *m_start(struct seq_file *m, loff_t *pos)
        }
 
        /* start from the Nth VMA */
-       for (vml = mm->context.vmlist; vml; vml = vml->next)
+       for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
                if (n-- == 0)
-                       return vml;
+                       return p;
        return NULL;
 }
 
@@ -152,12 +206,12 @@ static void m_stop(struct seq_file *m, void *_vml)
        }
 }
 
-static void *m_next(struct seq_file *m, void *_vml, loff_t *pos)
+static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
 {
-       struct vm_list_struct *vml = _vml;
+       struct rb_node *p = _p;
 
        (*pos)++;
-       return vml ? vml->next : NULL;
+       return p ? rb_next(p) : NULL;
 }
 
 static const struct seq_operations proc_pid_maps_ops = {
index 76acdbc3461144512208755d4b69946d7eda76e3..b9b567a28376313a33b74ad9668eda99134b70bc 100644 (file)
@@ -262,11 +262,11 @@ unsigned long ramfs_nommu_get_unmapped_area(struct file *file,
        ret = -ENOMEM;
        pages = kzalloc(lpages * sizeof(struct page *), GFP_KERNEL);
        if (!pages)
-               goto out;
+               goto out_free;
 
        nr = find_get_pages(inode->i_mapping, pgoff, lpages, pages);
        if (nr != lpages)
-               goto out; /* leave if some pages were missing */
+               goto out_free_pages; /* leave if some pages were missing */
 
        /* check the pages for physical adjacency */
        ptr = pages;
@@ -274,19 +274,18 @@ unsigned long ramfs_nommu_get_unmapped_area(struct file *file,
        page++;
        for (loop = lpages; loop > 1; loop--)
                if (*ptr++ != page++)
-                       goto out;
+                       goto out_free_pages;
 
        /* okay - all conditions fulfilled */
        ret = (unsigned long) page_address(pages[0]);
 
- out:
-       if (pages) {
-               ptr = pages;
-               for (loop = lpages; loop > 0; loop--)
-                       put_page(*ptr++);
-               kfree(pages);
-       }
-
+out_free_pages:
+       ptr = pages;
+       for (loop = nr; loop > 0; loop--)
+               put_page(*ptr++);
+out_free:
+       kfree(pages);
+out:
        return ret;
 }
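
The rewritten error path above replaces one catch-all label, which could put_page() pages that were never successfully looked up, with one label per acquired resource, unwound in reverse order of acquisition. The idiom in isolation (a generic sketch, not kernel API):

#include <stdlib.h>

/*
 * Each label undoes exactly what was acquired before the goto sites
 * that reach it, mirroring the out_free_pages/out_free split above.
 */
static void *acquire_pair(void **second)
{
        void *a = malloc(64);

        if (!a)
                goto out;
        *second = malloc(64);
        if (!*second)
                goto out_free_a;
        return a;               /* success: caller owns both */

out_free_a:
        free(a);
out:
        return NULL;
}
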
 
diff --git a/fs/squashfs/Makefile b/fs/squashfs/Makefile
new file mode 100644 (file)
index 0000000..8258cf9
--- /dev/null
@@ -0,0 +1,8 @@
+#
+# Makefile for the linux squashfs routines.
+#
+
+obj-$(CONFIG_SQUASHFS) += squashfs.o
+squashfs-y += block.o cache.o dir.o export.o file.o fragment.o id.o inode.o
+squashfs-y += namei.o super.o symlink.o
+#squashfs-y += squashfs2_0.o
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
new file mode 100644 (file)
index 0000000..c837dfc
--- /dev/null
@@ -0,0 +1,274 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * block.c
+ */
+
+/*
+ * This file implements the low-level routines to read and decompress
+ * datablocks and metadata blocks.
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/buffer_head.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Read the metadata block length; this is stored in the first two
+ * bytes of the metadata block.
+ */
+static struct buffer_head *get_block_length(struct super_block *sb,
+                       u64 *cur_index, int *offset, int *length)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       struct buffer_head *bh;
+
+       bh = sb_bread(sb, *cur_index);
+       if (bh == NULL)
+               return NULL;
+
+       if (msblk->devblksize - *offset == 1) {
+               *length = (unsigned char) bh->b_data[*offset];
+               put_bh(bh);
+               bh = sb_bread(sb, ++(*cur_index));
+               if (bh == NULL)
+                       return NULL;
+               *length |= (unsigned char) bh->b_data[0] << 8;
+               *offset = 1;
+       } else {
+               *length = (unsigned char) bh->b_data[*offset] |
+                       (unsigned char) bh->b_data[*offset + 1] << 8;
+               *offset += 2;
+       }
+
+       return bh;
+}
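
The two bytes assembled here are little-endian, and, as the comment before squashfs_read_data explains, one bit of the resulting word records whether the block was stored uncompressed. A hedged decode, assuming the conventional squashfs layout in which bit 15 plays that role for metadata blocks (the real SQUASHFS_COMPRESSED* macros live in squashfs_fs.h):

/*
 * Hypothetical decode of a squashfs metadata length word; assumes the
 * usual convention in which a set bit 15 means "stored uncompressed".
 */
struct meta_len {
        int size;               /* on-disk size of the block, in bytes */
        int compressed;         /* non-zero if the block needs inflating */
};

static struct meta_len decode_meta_length(unsigned char b0, unsigned char b1)
{
        int word = b0 | (b1 << 8);              /* little-endian */
        struct meta_len m = {
                .size = word & ~(1 << 15),
                .compressed = !(word & (1 << 15)),
        };
        return m;
}
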
+
+
+/*
+ * Read and decompress a metadata block or datablock.  Length is non-zero
+ * if a datablock is being read (the size is stored elsewhere in the
+ * filesystem), otherwise the length is obtained from the first two bytes of
+ * the metadata block.  A bit in the length field indicates if the block
+ * is stored uncompressed in the filesystem (usually because compression
+ * generated a larger block - this does occasionally happen with zlib).
+ */
+int squashfs_read_data(struct super_block *sb, void **buffer, u64 index,
+                       int length, u64 *next_index, int srclength)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       struct buffer_head **bh;
+       int offset = index & ((1 << msblk->devblksize_log2) - 1);
+       u64 cur_index = index >> msblk->devblksize_log2;
+       int bytes, compressed, b = 0, k = 0, page = 0, avail;
+
+
+       bh = kcalloc((msblk->block_size >> msblk->devblksize_log2) + 1,
+                               sizeof(*bh), GFP_KERNEL);
+       if (bh == NULL)
+               return -ENOMEM;
+
+       if (length) {
+               /*
+                * Datablock.
+                */
+               bytes = -offset;
+               compressed = SQUASHFS_COMPRESSED_BLOCK(length);
+               length = SQUASHFS_COMPRESSED_SIZE_BLOCK(length);
+               if (next_index)
+                       *next_index = index + length;
+
+               TRACE("Block @ 0x%llx, %scompressed size %d, src size %d\n",
+                       index, compressed ? "" : "un", length, srclength);
+
+               if (length < 0 || length > srclength ||
+                               (index + length) > msblk->bytes_used)
+                       goto read_failure;
+
+               for (b = 0; bytes < length; b++, cur_index++) {
+                       bh[b] = sb_getblk(sb, cur_index);
+                       if (bh[b] == NULL)
+                               goto block_release;
+                       bytes += msblk->devblksize;
+               }
+               ll_rw_block(READ, b, bh);
+       } else {
+               /*
+                * Metadata block.
+                */
+               if ((index + 2) > msblk->bytes_used)
+                       goto read_failure;
+
+               bh[0] = get_block_length(sb, &cur_index, &offset, &length);
+               if (bh[0] == NULL)
+                       goto read_failure;
+               b = 1;
+
+               bytes = msblk->devblksize - offset;
+               compressed = SQUASHFS_COMPRESSED(length);
+               length = SQUASHFS_COMPRESSED_SIZE(length);
+               if (next_index)
+                       *next_index = index + length + 2;
+
+               TRACE("Block @ 0x%llx, %scompressed size %d\n", index,
+                               compressed ? "" : "un", length);
+
+               if (length < 0 || length > srclength ||
+                                       (index + length) > msblk->bytes_used)
+                       goto block_release;
+
+               for (; bytes < length; b++) {
+                       bh[b] = sb_getblk(sb, ++cur_index);
+                       if (bh[b] == NULL)
+                               goto block_release;
+                       bytes += msblk->devblksize;
+               }
+               ll_rw_block(READ, b - 1, bh + 1);
+       }
+
+       if (compressed) {
+               int zlib_err = 0, zlib_init = 0;
+
+               /*
+                * Uncompress block.
+                */
+
+               mutex_lock(&msblk->read_data_mutex);
+
+               msblk->stream.avail_out = 0;
+               msblk->stream.avail_in = 0;
+
+               bytes = length;
+               do {
+                       if (msblk->stream.avail_in == 0 && k < b) {
+                               avail = min(bytes, msblk->devblksize - offset);
+                               bytes -= avail;
+                               wait_on_buffer(bh[k]);
+                               if (!buffer_uptodate(bh[k]))
+                                       goto release_mutex;
+
+                               if (avail == 0) {
+                                       offset = 0;
+                                       put_bh(bh[k++]);
+                                       continue;
+                               }
+
+                               msblk->stream.next_in = bh[k]->b_data + offset;
+                               msblk->stream.avail_in = avail;
+                               offset = 0;
+                       }
+
+                       if (msblk->stream.avail_out == 0) {
+                               msblk->stream.next_out = buffer[page++];
+                               msblk->stream.avail_out = PAGE_CACHE_SIZE;
+                       }
+
+                       if (!zlib_init) {
+                               zlib_err = zlib_inflateInit(&msblk->stream);
+                               if (zlib_err != Z_OK) {
+                                       ERROR("zlib_inflateInit returned"
+                                               " unexpected result 0x%x,"
+                                               " srclength %d\n", zlib_err,
+                                               srclength);
+                                       goto release_mutex;
+                               }
+                               zlib_init = 1;
+                       }
+
+                       zlib_err = zlib_inflate(&msblk->stream, Z_NO_FLUSH);
+
+                       if (msblk->stream.avail_in == 0 && k < b)
+                               put_bh(bh[k++]);
+               } while (zlib_err == Z_OK);
+
+               if (zlib_err != Z_STREAM_END) {
+                       ERROR("zlib_inflate returned unexpected result"
+                               " 0x%x, srclength %d, avail_in %d,"
+                               " avail_out %d\n", zlib_err, srclength,
+                               msblk->stream.avail_in,
+                               msblk->stream.avail_out);
+                       goto release_mutex;
+               }
+
+               zlib_err = zlib_inflateEnd(&msblk->stream);
+               if (zlib_err != Z_OK) {
+                       ERROR("zlib_inflateEnd returned unexpected result 0x%x,"
+                               " srclength %d\n", zlib_err, srclength);
+                       goto release_mutex;
+               }
+               length = msblk->stream.total_out;
+               mutex_unlock(&msblk->read_data_mutex);
+       } else {
+               /*
+                * Block is uncompressed.
+                */
+               int i, in, pg_offset = 0;
+
+               for (i = 0; i < b; i++) {
+                       wait_on_buffer(bh[i]);
+                       if (!buffer_uptodate(bh[i]))
+                               goto block_release;
+               }
+
+               for (bytes = length; k < b; k++) {
+                       in = min(bytes, msblk->devblksize - offset);
+                       bytes -= in;
+                       while (in) {
+                               if (pg_offset == PAGE_CACHE_SIZE) {
+                                       page++;
+                                       pg_offset = 0;
+                               }
+                               avail = min_t(int, in, PAGE_CACHE_SIZE -
+                                               pg_offset);
+                               memcpy(buffer[page] + pg_offset,
+                                               bh[k]->b_data + offset, avail);
+                               in -= avail;
+                               pg_offset += avail;
+                               offset += avail;
+                       }
+                       offset = 0;
+                       put_bh(bh[k]);
+               }
+       }
+
+       kfree(bh);
+       return length;
+
+release_mutex:
+       mutex_unlock(&msblk->read_data_mutex);
+
+block_release:
+       for (; k < b; k++)
+               put_bh(bh[k]);
+
+read_failure:
+       ERROR("sb_bread failed reading block 0x%llx\n", cur_index);
+       kfree(bh);
+       return -EIO;
+}
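
Worth noting how the byte-addressed index is split at the top of this function: the low devblksize_log2 bits are the offset into a device block and the remaining bits select the block itself. A tiny illustration, assuming a 1024-byte device block size (devblksize_log2 == 10):

#include <stdint.h>

/*
 * Split a byte-addressed metadata offset into (device block, offset
 * within block).  e.g. index 0x1234 with 1024-byte device blocks
 * yields block 4, offset 0x234.
 */
static void split_index(uint64_t index, int devblksize_log2,
                        uint64_t *blk, int *off)
{
        *off = (int)(index & ((1u << devblksize_log2) - 1));
        *blk = index >> devblksize_log2;
}
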
diff --git a/fs/squashfs/cache.c b/fs/squashfs/cache.c
new file mode 100644 (file)
index 0000000..f29eda1
--- /dev/null
@@ -0,0 +1,412 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * cache.c
+ */
+
+/*
+ * Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
+ * recently accessed data Squashfs uses two small metadata and fragment caches.
+ *
+ * This file implements a generic cache used for both, plus functions
+ * layered on top of it to access the metadata and fragment caches.
+ *
+ * To avoid out of memory and fragmentation issues with vmalloc the cache
+ * uses sequences of kmalloced PAGE_CACHE_SIZE buffers.
+ *
+ * It should be noted that the cache is not used for file datablocks; these
+ * are decompressed and cached in the page-cache in the normal way.  The
+ * cache is only used to temporarily cache fragment and metadata blocks
+ * which have been read as a result of a metadata (i.e. inode or
+ * directory) or fragment access.  Because metadata and fragments are packed
+ * together into blocks (to gain greater compression) the read of a particular
+ * piece of metadata or fragment will retrieve other metadata/fragments which
+ * have been packed with it; because of locality of reference these may be
+ * read in the near future.  Temporarily caching them ensures they are
+ * available for near-future access without requiring an additional read
+ * and decompress.
+ */
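
The eviction policy described above, refcounted entries with a round-robin victim scan, reduces to a few lines once the locking and I/O are stripped away. A simplified model (names and types are illustrative, not the squashfs structures):

/*
 * Round-robin victim selection over a fixed entry table, skipping
 * entries that are still referenced.  Returns the chosen slot, or -1
 * if every entry is pinned and the caller must sleep.
 */
struct entry {
        long long block;        /* key: on-disk block number */
        int refcount;           /* entry can't be evicted while > 0 */
};

static int pick_victim(struct entry *e, int n, int *next_blk)
{
        int i = *next_blk, tries;

        for (tries = 0; tries < n; tries++, i = (i + 1) % n) {
                if (e[i].refcount == 0) {
                        *next_blk = (i + 1) % n;
                        return i;
                }
        }
        return -1;
}
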
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+#include <linux/zlib.h>
+#include <linux/pagemap.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Look-up block in cache, and increment usage count.  If not in cache, read
+ * and decompress it from disk.
+ */
+struct squashfs_cache_entry *squashfs_cache_get(struct super_block *sb,
+       struct squashfs_cache *cache, u64 block, int length)
+{
+       int i, n;
+       struct squashfs_cache_entry *entry;
+
+       spin_lock(&cache->lock);
+
+       while (1) {
+               for (i = 0; i < cache->entries; i++)
+                       if (cache->entry[i].block == block)
+                               break;
+
+               if (i == cache->entries) {
+                       /*
+                        * Block not in cache.  If all cache entries are
+                        * used, go to sleep waiting for one to become
+                        * available.
+                        */
+                       if (cache->unused == 0) {
+                               cache->num_waiters++;
+                               spin_unlock(&cache->lock);
+                               wait_event(cache->wait_queue, cache->unused);
+                               spin_lock(&cache->lock);
+                               cache->num_waiters--;
+                               continue;
+                       }
+
+                       /*
+                        * At least one unused cache entry.  A simple
+                        * round-robin strategy is used to choose the entry to
+                        * be evicted from the cache.
+                        */
+                       i = cache->next_blk;
+                       for (n = 0; n < cache->entries; n++) {
+                               if (cache->entry[i].refcount == 0)
+                                       break;
+                               i = (i + 1) % cache->entries;
+                       }
+
+                       cache->next_blk = (i + 1) % cache->entries;
+                       entry = &cache->entry[i];
+
+                       /*
+                        * Initialise chosen cache entry, and fill it in from
+                        * disk.
+                        */
+                       cache->unused--;
+                       entry->block = block;
+                       entry->refcount = 1;
+                       entry->pending = 1;
+                       entry->num_waiters = 0;
+                       entry->error = 0;
+                       spin_unlock(&cache->lock);
+
+                       entry->length = squashfs_read_data(sb, entry->data,
+                               block, length, &entry->next_index,
+                               cache->block_size);
+
+                       spin_lock(&cache->lock);
+
+                       if (entry->length < 0)
+                               entry->error = entry->length;
+
+                       entry->pending = 0;
+
+                       /*
+                        * While filling this entry, one or more other
+                        * processes may have looked it up in the cache and
+                        * slept waiting for it to become available.
+                        */
+                       if (entry->num_waiters) {
+                               spin_unlock(&cache->lock);
+                               wake_up_all(&entry->wait_queue);
+                       } else
+                               spin_unlock(&cache->lock);
+
+                       goto out;
+               }
+
+               /*
+                * Block already in cache.  Increment refcount so it doesn't
+                * get reused until we're finished with it.  If it was
+                * previously unused there's now one less cache entry
+                * available for reuse.
+                */
+               entry = &cache->entry[i];
+               if (entry->refcount == 0)
+                       cache->unused--;
+               entry->refcount++;
+
+               /*
+                * If the entry is currently being filled in by another process
+                * go to sleep waiting for it to become available.
+                */
+               if (entry->pending) {
+                       entry->num_waiters++;
+                       spin_unlock(&cache->lock);
+                       wait_event(entry->wait_queue, !entry->pending);
+               } else
+                       spin_unlock(&cache->lock);
+
+               goto out;
+       }
+
+out:
+       TRACE("Got %s %d, start block %lld, refcount %d, error %d\n",
+               cache->name, i, entry->block, entry->refcount, entry->error);
+
+       if (entry->error)
+               ERROR("Unable to read %s cache entry [%llx]\n", cache->name,
+                                                       block);
+       return entry;
+}
+
+
+/*
+ * Release cache entry; once its usage count is zero it can be reused.
+ */
+void squashfs_cache_put(struct squashfs_cache_entry *entry)
+{
+       struct squashfs_cache *cache = entry->cache;
+
+       spin_lock(&cache->lock);
+       entry->refcount--;
+       if (entry->refcount == 0) {
+               cache->unused++;
+               /*
+                * If there are any processes waiting for a block to become
+                * available, wake one up.
+                */
+               if (cache->num_waiters) {
+                       spin_unlock(&cache->lock);
+                       wake_up(&cache->wait_queue);
+                       return;
+               }
+       }
+       spin_unlock(&cache->lock);
+}
+
+/*
+ * Delete the cache, reclaiming all kmalloced buffers.
+ */
+void squashfs_cache_delete(struct squashfs_cache *cache)
+{
+       int i, j;
+
+       if (cache == NULL)
+               return;
+
+       for (i = 0; i < cache->entries; i++) {
+               if (cache->entry[i].data) {
+                       for (j = 0; j < cache->pages; j++)
+                               kfree(cache->entry[i].data[j]);
+                       kfree(cache->entry[i].data);
+               }
+       }
+
+       kfree(cache->entry);
+       kfree(cache);
+}
+
+
+/*
+ * Initialise the cache, allocating the specified number of entries, each of
+ * size block_size.  To avoid vmalloc fragmentation issues each entry
+ * is allocated as a sequence of kmalloced PAGE_CACHE_SIZE buffers.
+ */
+struct squashfs_cache *squashfs_cache_init(char *name, int entries,
+       int block_size)
+{
+       int i, j;
+       struct squashfs_cache *cache = kzalloc(sizeof(*cache), GFP_KERNEL);
+
+       if (cache == NULL) {
+               ERROR("Failed to allocate %s cache\n", name);
+               return NULL;
+       }
+
+       cache->entry = kcalloc(entries, sizeof(*(cache->entry)), GFP_KERNEL);
+       if (cache->entry == NULL) {
+               ERROR("Failed to allocate %s cache\n", name);
+               goto cleanup;
+       }
+
+       cache->next_blk = 0;
+       cache->unused = entries;
+       cache->entries = entries;
+       cache->block_size = block_size;
+       cache->pages = block_size >> PAGE_CACHE_SHIFT;
+       cache->name = name;
+       cache->num_waiters = 0;
+       spin_lock_init(&cache->lock);
+       init_waitqueue_head(&cache->wait_queue);
+
+       for (i = 0; i < entries; i++) {
+               struct squashfs_cache_entry *entry = &cache->entry[i];
+
+               init_waitqueue_head(&cache->entry[i].wait_queue);
+               entry->cache = cache;
+               entry->block = SQUASHFS_INVALID_BLK;
+               entry->data = kcalloc(cache->pages, sizeof(void *), GFP_KERNEL);
+               if (entry->data == NULL) {
+                       ERROR("Failed to allocate %s cache entry\n", name);
+                       goto cleanup;
+               }
+
+               for (j = 0; j < cache->pages; j++) {
+                       entry->data[j] = kmalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
+                       if (entry->data[j] == NULL) {
+                               ERROR("Failed to allocate %s buffer\n", name);
+                               goto cleanup;
+                       }
+               }
+       }
+
+       return cache;
+
+cleanup:
+       squashfs_cache_delete(cache);
+       return NULL;
+}
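+
+/*
+ * An illustrative pairing (a sketch; the actual call sites live in the
+ * mount and unmount paths, not in this file):
+ *
+ *     msblk->block_cache = squashfs_cache_init("metadata",
+ *             SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE);
+ *     ...
+ *     squashfs_cache_delete(msblk->block_cache);
+ */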
+
+
+/*
+ * Copy up to length bytes from the cache entry to the buffer, starting at
+ * offset bytes into the cache entry.  If fewer than length bytes are
+ * available then copy the number of bytes available.  In all cases return
+ * the number of bytes copied.
+ */
+int squashfs_copy_data(void *buffer, struct squashfs_cache_entry *entry,
+               int offset, int length)
+{
+       int remaining = length;
+
+       if (length == 0)
+               return 0;
+       else if (buffer == NULL)
+               return min(length, entry->length - offset);
+
+       while (offset < entry->length) {
+               void *buff = entry->data[offset / PAGE_CACHE_SIZE]
+                               + (offset % PAGE_CACHE_SIZE);
+               int bytes = min_t(int, entry->length - offset,
+                               PAGE_CACHE_SIZE - (offset % PAGE_CACHE_SIZE));
+
+               if (bytes >= remaining) {
+                       memcpy(buffer, buff, remaining);
+                       remaining = 0;
+                       break;
+               }
+
+               memcpy(buffer, buff, bytes);
+               buffer += bytes;
+               remaining -= bytes;
+               offset += bytes;
+       }
+
+       return length - remaining;
+}
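+
+/*
+ * Worked example, assuming PAGE_CACHE_SIZE is 4096: with entry->length
+ * 6000, offset 4000 and length 3000, the first iteration copies the 96
+ * bytes remaining in page 0, the second copies the 1904 bytes held in
+ * page 1, and the loop then stops at the end of the entry, returning
+ * 2000 (entry->length - offset) rather than the 3000 requested.
+ */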
+
+
+/*
+ * Read length bytes from metadata position <block, offset> (block is the
+ * start of the compressed block on disk, and offset is the offset into
+ * the block once decompressed).  Data is packed into consecutive blocks,
+ * and length bytes may require reading more than one block.
+ */
+int squashfs_read_metadata(struct super_block *sb, void *buffer,
+               u64 *block, int *offset, int length)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       int bytes, copied = length;
+       struct squashfs_cache_entry *entry;
+
+       TRACE("Entered squashfs_read_metadata [%llx:%x]\n", *block, *offset);
+
+       while (length) {
+               entry = squashfs_cache_get(sb, msblk->block_cache, *block, 0);
+               if (entry->error)
+                       return entry->error;
+               else if (*offset >= entry->length)
+                       return -EIO;
+
+               bytes = squashfs_copy_data(buffer, entry, *offset, length);
+               if (buffer)
+                       buffer += bytes;
+               length -= bytes;
+               *offset += bytes;
+
+               if (*offset == entry->length) {
+                       *block = entry->next_index;
+                       *offset = 0;
+               }
+
+               squashfs_cache_put(entry);
+       }
+
+       return copied;
+}
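+
+/*
+ * Typical call pattern (mirroring the directory code in dir.c): <block,
+ * offset> act as a cursor which this function advances past the data just
+ * read, moving on to the next compressed block when the current one is
+ * exhausted:
+ *
+ *     err = squashfs_read_metadata(sb, &dirh, &block, &offset,
+ *                     sizeof(dirh));
+ */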
+
+
+/*
+ * Look up in the fragment cache the fragment located at <start_block> in the
+ * filesystem.  If necessary read and decompress it from disk.
+ */
+struct squashfs_cache_entry *squashfs_get_fragment(struct super_block *sb,
+                               u64 start_block, int length)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+
+       return squashfs_cache_get(sb, msblk->fragment_cache, start_block,
+               length);
+}
+
+
+/*
+ * Read and decompress the datablock located at <start_block> in the
+ * filesystem.  The cache is used here to avoid duplicating locking and
+ * read/decompress code.
+ */
+struct squashfs_cache_entry *squashfs_get_datablock(struct super_block *sb,
+                               u64 start_block, int length)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+
+       return squashfs_cache_get(sb, msblk->read_page, start_block, length);
+}
+
+
+/*
+ * Read a filesystem table (uncompressed sequence of bytes) from disk
+ */
+int squashfs_read_table(struct super_block *sb, void *buffer, u64 block,
+       int length)
+{
+       int pages = (length + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+       int i, res;
+       void **data = kcalloc(pages, sizeof(void *), GFP_KERNEL);
+       if (data == NULL)
+               return -ENOMEM;
+
+       for (i = 0; i < pages; i++, buffer += PAGE_CACHE_SIZE)
+               data[i] = buffer;
+       res = squashfs_read_data(sb, data, block, length |
+               SQUASHFS_COMPRESSED_BIT_BLOCK, NULL, length);
+       kfree(data);
+       return res;
+}
diff --git a/fs/squashfs/dir.c b/fs/squashfs/dir.c
new file mode 100644
index 0000000..566b0ea
--- /dev/null
+++ b/fs/squashfs/dir.c
@@ -0,0 +1,235 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * dir.c
+ */
+
+/*
+ * This file implements code to read directories from disk.
+ *
+ * See namei.c for a description of directory organisation on disk.
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/slab.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+static const unsigned char squashfs_filetype_table[] = {
+       DT_UNKNOWN, DT_DIR, DT_REG, DT_LNK, DT_BLK, DT_CHR, DT_FIFO, DT_SOCK
+};
+
+/*
+ * Look up the offset (f_pos) in the directory index, returning the
+ * metadata block containing it.
+ *
+ * If we get an error reading the index then return the part of the index
+ * (if any) we have managed to read - the index isn't essential, just
+ * quicker.
+ */
+static int get_dir_index_using_offset(struct super_block *sb,
+       u64 *next_block, int *next_offset, u64 index_start, int index_offset,
+       int i_count, u64 f_pos)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       int err, i, index, length = 0;
+       struct squashfs_dir_index dir_index;
+
+       TRACE("Entered get_dir_index_using_offset, i_count %d, f_pos %lld\n",
+                                       i_count, f_pos);
+
+       /*
+        * Translate from external f_pos to the internal f_pos.  This
+        * is offset by 3 because we invent "." and ".." entries which are
+        * not actually stored in the directory.
+        */
+       if (f_pos < 3)
+               return f_pos;
+       f_pos -= 3;
+
+       for (i = 0; i < i_count; i++) {
+               err = squashfs_read_metadata(sb, &dir_index, &index_start,
+                               &index_offset, sizeof(dir_index));
+               if (err < 0)
+                       break;
+
+               index = le32_to_cpu(dir_index.index);
+               if (index > f_pos)
+                       /*
+                        * Found the index we're looking for.
+                        */
+                       break;
+
+               err = squashfs_read_metadata(sb, NULL, &index_start,
+                               &index_offset, le32_to_cpu(dir_index.size) + 1);
+               if (err < 0)
+                       break;
+
+               length = index;
+               *next_block = le32_to_cpu(dir_index.start_block) +
+                                       msblk->directory_table;
+       }
+
+       *next_offset = (length + *next_offset) % SQUASHFS_METADATA_SIZE;
+
+       /*
+        * Translate back from internal f_pos to external f_pos.
+        */
+       return length + 3;
+}
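+
+/*
+ * Worked example of the f_pos translation: the invented "." (length 1)
+ * and ".." (length 2) entries consume external positions 0-2, so
+ * external f_pos 3 corresponds to internal (on-disk) position 0, the
+ * first real directory entry.
+ */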
+
+
+static int squashfs_readdir(struct file *file, void *dirent, filldir_t filldir)
+{
+       struct inode *inode = file->f_dentry->d_inode;
+       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
+       u64 block = squashfs_i(inode)->start + msblk->directory_table;
+       int offset = squashfs_i(inode)->offset, length = 0, dir_count, size,
+                               type, err;
+       unsigned int inode_number;
+       struct squashfs_dir_header dirh;
+       struct squashfs_dir_entry *dire;
+
+       TRACE("Entered squashfs_readdir [%llx:%x]\n", block, offset);
+
+       dire = kmalloc(sizeof(*dire) + SQUASHFS_NAME_LEN + 1, GFP_KERNEL);
+       if (dire == NULL) {
+               ERROR("Failed to allocate squashfs_dir_entry\n");
+               goto finish;
+       }
+
+       /*
+        * Return "." and ".." entries as the first two filenames in the
+        * directory.  To maximise compression these two entries are not
+        * stored in the directory, and so we invent them here.
+        *
+        * It also means that the external f_pos is offset by 3 from the
+        * on-disk directory f_pos.
+        */
+       while (file->f_pos < 3) {
+               char *name;
+               int i_ino;
+
+               if (file->f_pos == 0) {
+                       name = ".";
+                       size = 1;
+                       i_ino = inode->i_ino;
+               } else {
+                       name = "..";
+                       size = 2;
+                       i_ino = squashfs_i(inode)->parent;
+               }
+
+               TRACE("Calling filldir(%p, %s, %d, %lld, %d, %d)\n",
+                               dirent, name, size, file->f_pos, i_ino,
+                               squashfs_filetype_table[1]);
+
+               if (filldir(dirent, name, size, file->f_pos, i_ino,
+                               squashfs_filetype_table[1]) < 0) {
+                       TRACE("Filldir returned less than 0\n");
+                       goto finish;
+               }
+
+               file->f_pos += size;
+       }
+
+       length = get_dir_index_using_offset(inode->i_sb, &block, &offset,
+                               squashfs_i(inode)->dir_idx_start,
+                               squashfs_i(inode)->dir_idx_offset,
+                               squashfs_i(inode)->dir_idx_cnt,
+                               file->f_pos);
+
+       while (length < i_size_read(inode)) {
+               /*
+                * Read directory header
+                */
+               err = squashfs_read_metadata(inode->i_sb, &dirh, &block,
+                                       &offset, sizeof(dirh));
+               if (err < 0)
+                       goto failed_read;
+
+               length += sizeof(dirh);
+
+               dir_count = le32_to_cpu(dirh.count) + 1;
+               while (dir_count--) {
+                       /*
+                        * Read directory entry.
+                        */
+                       err = squashfs_read_metadata(inode->i_sb, dire, &block,
+                                       &offset, sizeof(*dire));
+                       if (err < 0)
+                               goto failed_read;
+
+                       size = le16_to_cpu(dire->size) + 1;
+
+                       err = squashfs_read_metadata(inode->i_sb, dire->name,
+                                       &block, &offset, size);
+                       if (err < 0)
+                               goto failed_read;
+
+                       length += sizeof(*dire) + size;
+
+                       if (file->f_pos >= length)
+                               continue;
+
+                       dire->name[size] = '\0';
+                       inode_number = le32_to_cpu(dirh.inode_number) +
+                               ((short) le16_to_cpu(dire->inode_number));
+                       type = le16_to_cpu(dire->type);
+
+                       TRACE("Calling filldir(%p, %s, %d, %lld, %x:%x, %d, %d)"
+                                       "\n", dirent, dire->name, size,
+                                       file->f_pos,
+                                       le32_to_cpu(dirh.start_block),
+                                       le16_to_cpu(dire->offset),
+                                       inode_number,
+                                       squashfs_filetype_table[type]);
+
+                       if (filldir(dirent, dire->name, size, file->f_pos,
+                                       inode_number,
+                                       squashfs_filetype_table[type]) < 0) {
+                               TRACE("Filldir returned less than 0\n");
+                               goto finish;
+                       }
+
+                       file->f_pos = length;
+               }
+       }
+
+finish:
+       kfree(dire);
+       return 0;
+
+failed_read:
+       ERROR("Unable to read directory block [%llx:%x]\n", block, offset);
+       kfree(dire);
+       return 0;
+}
+
+
+const struct file_operations squashfs_dir_ops = {
+       .read = generic_read_dir,
+       .readdir = squashfs_readdir
+};
diff --git a/fs/squashfs/export.c b/fs/squashfs/export.c
new file mode 100644
index 0000000..69e971d
--- /dev/null
+++ b/fs/squashfs/export.c
@@ -0,0 +1,155 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * export.c
+ */
+
+/*
+ * This file implements code to make Squashfs filesystems exportable (NFS etc.)
+ *
+ * The export code uses an inode lookup table to map inode numbers passed in
+ * filehandles to an inode location on disk.  This table is stored compressed
+ * into metadata blocks.  A second index table is used to locate these.  This
+ * second index table, for speed of access (and because it is small), is read
+ * at mount time and cached in memory.
+ *
+ * The inode lookup table is used only by the export code; inode disk
+ * locations are directly encoded in directories, enabling direct access
+ * without an intermediate lookup for all operations except the export ops.
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/dcache.h>
+#include <linux/exportfs.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Look-up inode number (ino) in table, returning the inode location.
+ */
+static long long squashfs_inode_lookup(struct super_block *sb, int ino_num)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       int blk = SQUASHFS_LOOKUP_BLOCK(ino_num - 1);
+       int offset = SQUASHFS_LOOKUP_BLOCK_OFFSET(ino_num - 1);
+       u64 start = le64_to_cpu(msblk->inode_lookup_table[blk]);
+       __le64 ino;
+       int err;
+
+       TRACE("Entered squashfs_inode_lookup, inode_number = %d\n", ino_num);
+
+       err = squashfs_read_metadata(sb, &ino, &start, &offset, sizeof(ino));
+       if (err < 0)
+               return err;
+
+       TRACE("squashfs_inode_lookup, inode = 0x%llx\n",
+               (u64) le64_to_cpu(ino));
+
+       return le64_to_cpu(ino);
+}
+
+
+static struct dentry *squashfs_export_iget(struct super_block *sb,
+       unsigned int ino_num)
+{
+       long long ino;
+       struct dentry *dentry = ERR_PTR(-ENOENT);
+
+       TRACE("Entered squashfs_export_iget\n");
+
+       ino = squashfs_inode_lookup(sb, ino_num);
+       if (ino >= 0)
+               dentry = d_obtain_alias(squashfs_iget(sb, ino, ino_num));
+
+       return dentry;
+}
+
+
+static struct dentry *squashfs_fh_to_dentry(struct super_block *sb,
+               struct fid *fid, int fh_len, int fh_type)
+{
+       if ((fh_type != FILEID_INO32_GEN && fh_type != FILEID_INO32_GEN_PARENT)
+                       || fh_len < 2)
+               return NULL;
+
+       return squashfs_export_iget(sb, fid->i32.ino);
+}
+
+
+static struct dentry *squashfs_fh_to_parent(struct super_block *sb,
+               struct fid *fid, int fh_len, int fh_type)
+{
+       if (fh_type != FILEID_INO32_GEN_PARENT || fh_len < 4)
+               return NULL;
+
+       return squashfs_export_iget(sb, fid->i32.parent_ino);
+}
+
+
+static struct dentry *squashfs_get_parent(struct dentry *child)
+{
+       struct inode *inode = child->d_inode;
+       unsigned int parent_ino = squashfs_i(inode)->parent;
+
+       return squashfs_export_iget(inode->i_sb, parent_ino);
+}
+
+
+/*
+ * Read uncompressed inode lookup table indexes off disk into memory
+ */
+__le64 *squashfs_read_inode_lookup_table(struct super_block *sb,
+               u64 lookup_table_start, unsigned int inodes)
+{
+       unsigned int length = SQUASHFS_LOOKUP_BLOCK_BYTES(inodes);
+       __le64 *inode_lookup_table;
+       int err;
+
+       TRACE("In read_inode_lookup_table, length %d\n", length);
+
+       /* Allocate inode lookup table indexes */
+       inode_lookup_table = kmalloc(length, GFP_KERNEL);
+       if (inode_lookup_table == NULL) {
+               ERROR("Failed to allocate inode lookup table\n");
+               return ERR_PTR(-ENOMEM);
+       }
+
+       err = squashfs_read_table(sb, inode_lookup_table, lookup_table_start,
+                       length);
+       if (err < 0) {
+               ERROR("unable to read inode lookup table\n");
+               kfree(inode_lookup_table);
+               return ERR_PTR(err);
+       }
+
+       return inode_lookup_table;
+}
+
+
+const struct export_operations squashfs_export_ops = {
+       .fh_to_dentry = squashfs_fh_to_dentry,
+       .fh_to_parent = squashfs_fh_to_parent,
+       .get_parent = squashfs_get_parent
+};
diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
new file mode 100644
index 0000000..717767d
--- /dev/null
+++ b/fs/squashfs/file.c
@@ -0,0 +1,502 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * file.c
+ */
+
+/*
+ * This file contains code for handling regular files.  A regular file
+ * consists of a sequence of contiguous compressed blocks, and/or a
+ * compressed fragment block (tail-end packed block).  The compressed size
+ * of each datablock is stored in a block list contained within the
+ * file inode (itself stored in one or more compressed metadata blocks).
+ *
+ * To speed up access to datablocks when reading 'large' files (256 Mbytes or
+ * larger), the code implements an index cache that caches the mapping from
+ * block index to datablock location on disk.
+ *
+ * The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
+ * retaining a simple and space-efficient block list on disk.  The cache
+ * is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
+ * Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
+ * The index cache is designed to be memory efficient, and by default uses
+ * 16 KiB.
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/pagemap.h>
+#include <linux/mutex.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Locate cache slot in range [offset, index] for specified inode.  If
+ * there's more than one, return the slot closest to index.
+ */
+static struct meta_index *locate_meta_index(struct inode *inode, int offset,
+                               int index)
+{
+       struct meta_index *meta = NULL;
+       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
+       int i;
+
+       mutex_lock(&msblk->meta_index_mutex);
+
+       TRACE("locate_meta_index: index %d, offset %d\n", index, offset);
+
+       if (msblk->meta_index == NULL)
+               goto not_allocated;
+
+       for (i = 0; i < SQUASHFS_META_SLOTS; i++) {
+               if (msblk->meta_index[i].inode_number == inode->i_ino &&
+                               msblk->meta_index[i].offset >= offset &&
+                               msblk->meta_index[i].offset <= index &&
+                               msblk->meta_index[i].locked == 0) {
+                       TRACE("locate_meta_index: entry %d, offset %d\n", i,
+                                       msblk->meta_index[i].offset);
+                       meta = &msblk->meta_index[i];
+                       offset = meta->offset;
+               }
+       }
+
+       if (meta)
+               meta->locked = 1;
+
+not_allocated:
+       mutex_unlock(&msblk->meta_index_mutex);
+
+       return meta;
+}
+
+
+/*
+ * Find and initialise an empty cache slot for index offset.
+ */
+static struct meta_index *empty_meta_index(struct inode *inode, int offset,
+                               int skip)
+{
+       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
+       struct meta_index *meta = NULL;
+       int i;
+
+       mutex_lock(&msblk->meta_index_mutex);
+
+       TRACE("empty_meta_index: offset %d, skip %d\n", offset, skip);
+
+       if (msblk->meta_index == NULL) {
+               /*
+                * First time the cache index has been used; allocate and
+                * initialise it.  The cache index could be allocated at
+                * mount time but doing it here means it is allocated only
+                * if a 'large' file is read.
+                */
+               msblk->meta_index = kcalloc(SQUASHFS_META_SLOTS,
+                       sizeof(*(msblk->meta_index)), GFP_KERNEL);
+               if (msblk->meta_index == NULL) {
+                       ERROR("Failed to allocate meta_index\n");
+                       goto failed;
+               }
+               for (i = 0; i < SQUASHFS_META_SLOTS; i++) {
+                       msblk->meta_index[i].inode_number = 0;
+                       msblk->meta_index[i].locked = 0;
+               }
+               msblk->next_meta_index = 0;
+       }
+
+       for (i = SQUASHFS_META_SLOTS; i &&
+                       msblk->meta_index[msblk->next_meta_index].locked; i--)
+               msblk->next_meta_index = (msblk->next_meta_index + 1) %
+                       SQUASHFS_META_SLOTS;
+
+       if (i == 0) {
+               TRACE("empty_meta_index: failed!\n");
+               goto failed;
+       }
+
+       TRACE("empty_meta_index: returned meta entry %d, %p\n",
+                       msblk->next_meta_index,
+                       &msblk->meta_index[msblk->next_meta_index]);
+
+       meta = &msblk->meta_index[msblk->next_meta_index];
+       msblk->next_meta_index = (msblk->next_meta_index + 1) %
+                       SQUASHFS_META_SLOTS;
+
+       meta->inode_number = inode->i_ino;
+       meta->offset = offset;
+       meta->skip = skip;
+       meta->entries = 0;
+       meta->locked = 1;
+
+failed:
+       mutex_unlock(&msblk->meta_index_mutex);
+       return meta;
+}
+
+
+static void release_meta_index(struct inode *inode, struct meta_index *meta)
+{
+       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
+       mutex_lock(&msblk->meta_index_mutex);
+       meta->locked = 0;
+       mutex_unlock(&msblk->meta_index_mutex);
+}
+
+
+/*
+ * Read the next n blocks from the block list, starting from
+ * metadata block <start_block, offset>.
+ */
+static long long read_indexes(struct super_block *sb, int n,
+                               u64 *start_block, int *offset)
+{
+       int err, i;
+       long long block = 0;
+       __le32 *blist = kmalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
+
+       if (blist == NULL) {
+               ERROR("read_indexes: Failed to allocate block_list\n");
+               return -ENOMEM;
+       }
+
+       while (n) {
+               int blocks = min_t(int, n, PAGE_CACHE_SIZE >> 2);
+
+               err = squashfs_read_metadata(sb, blist, start_block,
+                               offset, blocks << 2);
+               if (err < 0) {
+                       ERROR("read_indexes: reading block [%llx:%x]\n",
+                               *start_block, *offset);
+                       goto failure;
+               }
+
+               for (i = 0; i < blocks; i++) {
+                       int size = le32_to_cpu(blist[i]);
+                       block += SQUASHFS_COMPRESSED_SIZE_BLOCK(size);
+               }
+               n -= blocks;
+       }
+
+       kfree(blist);
+       return block;
+
+failure:
+       kfree(blist);
+       return err;
+}
+
+
+/*
+ * Each cache index slot has SQUASHFS_META_ENTRIES entries, each of which
+ * can cache one index -> datablock/blocklist-block mapping.  We wish
+ * to distribute these over the length of the file: entry[0] maps index x,
+ * entry[1] maps index x + skip, entry[2] maps index x + 2 * skip, and so on.
+ * The larger the file, the greater the skip factor.  The skip factor is
+ * limited to the size of the metadata cache (SQUASHFS_CACHED_BLKS) to ensure
+ * the number of metadata blocks that need to be read fits into the cache.
+ * If the skip factor is limited in this way then the file will use multiple
+ * slots.
+ */
+static inline int calculate_skip(int blocks)
+{
+       int skip = blocks / ((SQUASHFS_META_ENTRIES + 1)
+                * SQUASHFS_META_INDEXES);
+       return min(SQUASHFS_CACHED_BLKS - 1, skip + 1);
+}
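+
+/*
+ * Worked example, assuming SQUASHFS_META_INDEXES is 2048 (8 KiB metadata
+ * blocks of 4-byte block list entries) and SQUASHFS_META_ENTRIES is 127:
+ * with a skip of 1 a single slot maps 127 * 2048 blocks, roughly 32 GiB
+ * of file at the default 128 KiB block size; the clamp to
+ * SQUASHFS_CACHED_BLKS - 1 (7) gives the ~224 GiB per-slot limit quoted
+ * at the top of this file.
+ */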
+
+
+/*
+ * Search and grow the index cache for the specified inode, returning the
+ * on-disk locations of the datablock and block list metadata block
+ * <index_block, index_offset> for index (scaled to nearest cache index).
+ */
+static int fill_meta_index(struct inode *inode, int index,
+               u64 *index_block, int *index_offset, u64 *data_block)
+{
+       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
+       int skip = calculate_skip(i_size_read(inode) >> msblk->block_log);
+       int offset = 0;
+       struct meta_index *meta;
+       struct meta_entry *meta_entry;
+       u64 cur_index_block = squashfs_i(inode)->block_list_start;
+       int cur_offset = squashfs_i(inode)->offset;
+       u64 cur_data_block = squashfs_i(inode)->start;
+       int err, i;
+
+       /*
+        * Scale index to cache index (cache slot entry)
+        */
+       index /= SQUASHFS_META_INDEXES * skip;
+
+       while (offset < index) {
+               meta = locate_meta_index(inode, offset + 1, index);
+
+               if (meta == NULL) {
+                       meta = empty_meta_index(inode, offset + 1, skip);
+                       if (meta == NULL)
+                               goto all_done;
+               } else {
+                       offset = index < meta->offset + meta->entries ? index :
+                               meta->offset + meta->entries - 1;
+                       meta_entry = &meta->meta_entry[offset - meta->offset];
+                       cur_index_block = meta_entry->index_block +
+                               msblk->inode_table;
+                       cur_offset = meta_entry->offset;
+                       cur_data_block = meta_entry->data_block;
+                       TRACE("get_meta_index: offset %d, meta->offset %d, "
+                               "meta->entries %d\n", offset, meta->offset,
+                               meta->entries);
+                       TRACE("get_meta_index: index_block 0x%llx, offset 0x%x"
+                               " data_block 0x%llx\n", cur_index_block,
+                               cur_offset, cur_data_block);
+               }
+
+               /*
+                * If necessary, grow the cache slot by reading the block
+                * list.  The slot is extended up to index or to the end
+                * of the slot, in which case further slots will be used.
+                */
+               for (i = meta->offset + meta->entries; i <= index &&
+                               i < meta->offset + SQUASHFS_META_ENTRIES; i++) {
+                       int blocks = skip * SQUASHFS_META_INDEXES;
+                       long long res = read_indexes(inode->i_sb, blocks,
+                                       &cur_index_block, &cur_offset);
+
+                       if (res < 0) {
+                               if (meta->entries == 0)
+                                       /*
+                                        * Don't leave an empty slot on read
+                                        * error allocated to this inode...
+                                        */
+                                       meta->inode_number = 0;
+                               err = res;
+                               goto failed;
+                       }
+
+                       cur_data_block += res;
+                       meta_entry = &meta->meta_entry[i - meta->offset];
+                       meta_entry->index_block = cur_index_block -
+                               msblk->inode_table;
+                       meta_entry->offset = cur_offset;
+                       meta_entry->data_block = cur_data_block;
+                       meta->entries++;
+                       offset++;
+               }
+
+               TRACE("get_meta_index: meta->offset %d, meta->entries %d\n",
+                               meta->offset, meta->entries);
+
+               release_meta_index(inode, meta);
+       }
+
+all_done:
+       *index_block = cur_index_block;
+       *index_offset = cur_offset;
+       *data_block = cur_data_block;
+
+       /*
+        * Scale cache index (cache slot entry) to index
+        */
+       return offset * SQUASHFS_META_INDEXES * skip;
+
+failed:
+       release_meta_index(inode, meta);
+       return err;
+}
+
+
+/*
+ * Get the on-disk location and compressed size of the datablock
+ * specified by index.  Fill_meta_index() does most of the work.
+ */
+static int read_blocklist(struct inode *inode, int index, u64 *block)
+{
+       u64 start;
+       long long blks;
+       int offset;
+       __le32 size;
+       int res = fill_meta_index(inode, index, &start, &offset, block);
+
+       TRACE("read_blocklist: res %d, index %d, start 0x%llx, offset"
+                      " 0x%x, block 0x%llx\n", res, index, start, offset,
+                       *block);
+
+       if (res < 0)
+               return res;
+
+       /*
+        * res contains the index of the mapping returned by
+        * fill_meta_index(); this will likely be less than the desired
+        * index (because the meta_index cache works at a higher
+        * granularity).  Read any
+        * extra block indexes needed.
+        */
+       if (res < index) {
+               blks = read_indexes(inode->i_sb, index - res, &start, &offset);
+               if (blks < 0)
+                       return (int) blks;
+               *block += blks;
+       }
+
+       /*
+        * Read length of block specified by index.
+        */
+       res = squashfs_read_metadata(inode->i_sb, &size, &start, &offset,
+                       sizeof(size));
+       if (res < 0)
+               return res;
+       return le32_to_cpu(size);
+}
+
+
+static int squashfs_readpage(struct file *file, struct page *page)
+{
+       struct inode *inode = page->mapping->host;
+       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
+       int bytes, i, offset = 0, sparse = 0;
+       struct squashfs_cache_entry *buffer = NULL;
+       void *pageaddr;
+
+       int mask = (1 << (msblk->block_log - PAGE_CACHE_SHIFT)) - 1;
+       int index = page->index >> (msblk->block_log - PAGE_CACHE_SHIFT);
+       int start_index = page->index & ~mask;
+       int end_index = start_index | mask;
+       int file_end = i_size_read(inode) >> msblk->block_log;
+
+       TRACE("Entered squashfs_readpage, page index %lx, start block %llx\n",
+                               page->index, squashfs_i(inode)->start);
+
+       if (page->index >= ((i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+                                       PAGE_CACHE_SHIFT))
+               goto out;
+
+       if (index < file_end || squashfs_i(inode)->fragment_block ==
+                                       SQUASHFS_INVALID_BLK) {
+               /*
+                * Reading a datablock from disk.  Need to read block list
+                * to get location and block size.
+                */
+               u64 block = 0;
+               int bsize = read_blocklist(inode, index, &block);
+               if (bsize < 0)
+                       goto error_out;
+
+               if (bsize == 0) { /* hole */
+                       bytes = index == file_end ?
+                               (i_size_read(inode) & (msblk->block_size - 1)) :
+                                msblk->block_size;
+                       sparse = 1;
+               } else {
+                       /*
+                        * Read and decompress datablock.
+                        */
+                       buffer = squashfs_get_datablock(inode->i_sb,
+                                                               block, bsize);
+                       if (buffer->error) {
+                               ERROR("Unable to read page, block %llx, size %x"
+                                       "\n", block, bsize);
+                               squashfs_cache_put(buffer);
+                               goto error_out;
+                       }
+                       bytes = buffer->length;
+               }
+       } else {
+               /*
+                * Datablock is stored inside a fragment (tail-end packed
+                * block).
+                */
+               buffer = squashfs_get_fragment(inode->i_sb,
+                               squashfs_i(inode)->fragment_block,
+                               squashfs_i(inode)->fragment_size);
+
+               if (buffer->error) {
+                       ERROR("Unable to read page, block %llx, size %x\n",
+                               squashfs_i(inode)->fragment_block,
+                               squashfs_i(inode)->fragment_size);
+                       squashfs_cache_put(buffer);
+                       goto error_out;
+               }
+               bytes = i_size_read(inode) & (msblk->block_size - 1);
+               offset = squashfs_i(inode)->fragment_offset;
+       }
+
+       /*
+        * Loop copying datablock into pages.  As the datablock likely covers
+        * many PAGE_CACHE_SIZE pages (default block size is 128 KiB), explicitly
+        * grab the pages from the page cache, except for the page that we've
+        * been called to fill.
+        */
+       for (i = start_index; i <= end_index && bytes > 0; i++,
+                       bytes -= PAGE_CACHE_SIZE, offset += PAGE_CACHE_SIZE) {
+               struct page *push_page;
+               int avail = sparse ? 0 : min_t(int, bytes, PAGE_CACHE_SIZE);
+
+               TRACE("bytes %d, i %d, available_bytes %d\n", bytes, i, avail);
+
+               push_page = (i == page->index) ? page :
+                       grab_cache_page_nowait(page->mapping, i);
+
+               if (!push_page)
+                       continue;
+
+               if (PageUptodate(push_page))
+                       goto skip_page;
+
+               pageaddr = kmap_atomic(push_page, KM_USER0);
+               squashfs_copy_data(pageaddr, buffer, offset, avail);
+               memset(pageaddr + avail, 0, PAGE_CACHE_SIZE - avail);
+               kunmap_atomic(pageaddr, KM_USER0);
+               flush_dcache_page(push_page);
+               SetPageUptodate(push_page);
+skip_page:
+               unlock_page(push_page);
+               if (i != page->index)
+                       page_cache_release(push_page);
+       }
+
+       if (!sparse)
+               squashfs_cache_put(buffer);
+
+       return 0;
+
+error_out:
+       SetPageError(page);
+out:
+       pageaddr = kmap_atomic(page, KM_USER0);
+       memset(pageaddr, 0, PAGE_CACHE_SIZE);
+       kunmap_atomic(pageaddr, KM_USER0);
+       flush_dcache_page(page);
+       if (!PageError(page))
+               SetPageUptodate(page);
+       unlock_page(page);
+
+       return 0;
+}
+
+
+const struct address_space_operations squashfs_aops = {
+       .readpage = squashfs_readpage
+};
diff --git a/fs/squashfs/fragment.c b/fs/squashfs/fragment.c
new file mode 100644
index 0000000..b5a2c15
--- /dev/null
+++ b/fs/squashfs/fragment.c
@@ -0,0 +1,98 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * fragment.c
+ */
+
+/*
+ * This file implements code to handle compressed fragments (tail-end packed
+ * datablocks).
+ *
+ * Regular files contain a fragment index, which is mapped to the fragment's
+ * location on disk and its compressed size using a fragment lookup table.
+ * Like everything in Squashfs this fragment lookup table is itself stored
+ * compressed into metadata blocks.  A second index table is used to locate
+ * these.  This second index table, for speed of access (and because it
+ * is small), is read at mount time and cached in memory.
+ */
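+
+/*
+ * In outline, the two-level look-up implemented below: the fragment index
+ * selects an entry in the in-memory fragment_index[] table, which gives
+ * the start of the metadata block holding the on-disk
+ * squashfs_fragment_entry; reading that entry yields the fragment's start
+ * block and compressed size.
+ */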
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/slab.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Look up the fragment using the fragment index table.  Return the on-disk
+ * location of the fragment and its compressed size.
+ */
+int squashfs_frag_lookup(struct super_block *sb, unsigned int fragment,
+                               u64 *fragment_block)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       int block = SQUASHFS_FRAGMENT_INDEX(fragment);
+       int offset = SQUASHFS_FRAGMENT_INDEX_OFFSET(fragment);
+       u64 start_block = le64_to_cpu(msblk->fragment_index[block]);
+       struct squashfs_fragment_entry fragment_entry;
+       int size;
+
+       size = squashfs_read_metadata(sb, &fragment_entry, &start_block,
+                                       &offset, sizeof(fragment_entry));
+       if (size < 0)
+               return size;
+
+       *fragment_block = le64_to_cpu(fragment_entry.start_block);
+       size = le32_to_cpu(fragment_entry.size);
+
+       return size;
+}
+
+
+/*
+ * Read the uncompressed fragment lookup table indexes off disk into memory
+ */
+__le64 *squashfs_read_fragment_index_table(struct super_block *sb,
+       u64 fragment_table_start, unsigned int fragments)
+{
+       unsigned int length = SQUASHFS_FRAGMENT_INDEX_BYTES(fragments);
+       __le64 *fragment_index;
+       int err;
+
+       /* Allocate fragment lookup table indexes */
+       fragment_index = kmalloc(length, GFP_KERNEL);
+       if (fragment_index == NULL) {
+               ERROR("Failed to allocate fragment index table\n");
+               return ERR_PTR(-ENOMEM);
+       }
+
+       err = squashfs_read_table(sb, fragment_index, fragment_table_start,
+                       length);
+       if (err < 0) {
+               ERROR("unable to read fragment index table\n");
+               kfree(fragment_index);
+               return ERR_PTR(err);
+       }
+
+       return fragment_index;
+}
diff --git a/fs/squashfs/id.c b/fs/squashfs/id.c
new file mode 100644
index 0000000..3795b83
--- /dev/null
+++ b/fs/squashfs/id.c
@@ -0,0 +1,94 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * id.c
+ */
+
+/*
+ * This file implements code to handle uids and gids.
+ *
+ * For space efficiency regular files store uid and gid indexes, which are
+ * converted to 32-bit uids/gids using an id lookup table.  This table is
+ * stored compressed into metadata blocks.  A second index table is used to
+ * locate these.  This second index table, for speed of access (and because
+ * it is small), is read at mount time and cached in memory.
+ */
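+
+/*
+ * Illustrative look-up, mirroring squashfs_get_id() below: a uid index of,
+ * say, 5 selects id_table[SQUASHFS_ID_BLOCK(5)], the start of the
+ * metadata block whose decompressed contents hold the packed __le32 ids,
+ * and SQUASHFS_ID_BLOCK_OFFSET(5) is the byte offset of id 5 within that
+ * block.
+ */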
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/slab.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Map uid/gid index into real 32-bit uid/gid using the id lookup table
+ */
+int squashfs_get_id(struct super_block *sb, unsigned int index,
+                                       unsigned int *id)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       int block = SQUASHFS_ID_BLOCK(index);
+       int offset = SQUASHFS_ID_BLOCK_OFFSET(index);
+       u64 start_block = le64_to_cpu(msblk->id_table[block]);
+       __le32 disk_id;
+       int err;
+
+       err = squashfs_read_metadata(sb, &disk_id, &start_block, &offset,
+                                                       sizeof(disk_id));
+       if (err < 0)
+               return err;
+
+       *id = le32_to_cpu(disk_id);
+       return 0;
+}
+
+
+/*
+ * Read uncompressed id lookup table indexes from disk into memory
+ */
+__le64 *squashfs_read_id_index_table(struct super_block *sb,
+                       u64 id_table_start, unsigned short no_ids)
+{
+       unsigned int length = SQUASHFS_ID_BLOCK_BYTES(no_ids);
+       __le64 *id_table;
+       int err;
+
+       TRACE("In read_id_index_table, length %d\n", length);
+
+       /* Allocate id lookup table indexes */
+       id_table = kmalloc(length, GFP_KERNEL);
+       if (id_table == NULL) {
+               ERROR("Failed to allocate id index table\n");
+               return ERR_PTR(-ENOMEM);
+       }
+
+       err = squashfs_read_table(sb, id_table, id_table_start, length);
+       if (err < 0) {
+               ERROR("unable to read id index table\n");
+               kfree(id_table);
+               return ERR_PTR(err);
+       }
+
+       return id_table;
+}
diff --git a/fs/squashfs/inode.c b/fs/squashfs/inode.c
new file mode 100644
index 0000000..7a63398
--- /dev/null
+++ b/fs/squashfs/inode.c
@@ -0,0 +1,346 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * inode.c
+ */
+
+/*
+ * This file implements code to create and read inodes from disk.
+ *
+ * Inodes in Squashfs are identified by a 48-bit inode number which encodes the
+ * location of the compressed metadata block containing the inode, and the byte
+ * offset into that block where the inode is placed (<block, offset>).
+ *
+ * To maximise compression there are different inodes for each file type
+ * (regular file, directory, device, etc.), the inode contents and length
+ * varying with the type.
+ *
+ * To further maximise compression, two types of regular file inode and
+ * directory inode are defined: inodes optimised for frequently occurring
+ * regular files and directories, and extended types where extra
+ * information has to be stored.
+ */
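+
+/*
+ * Sketch of the <block, offset> decoding (macro names as used below; the
+ * exact bit split is defined in squashfs_fs.h):
+ *
+ *     block  = SQUASHFS_INODE_BLK(ino) + msblk->inode_table;
+ *     offset = SQUASHFS_INODE_OFFSET(ino);
+ */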
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Initialise VFS inode with the base inode information common to all
+ * Squashfs inode types.  Sqsh_ino contains the unswapped base inode read
+ * off disk.
+ */
+static int squashfs_new_inode(struct super_block *sb, struct inode *inode,
+                               struct squashfs_base_inode *sqsh_ino)
+{
+       int err;
+
+       err = squashfs_get_id(sb, le16_to_cpu(sqsh_ino->uid), &inode->i_uid);
+       if (err)
+               return err;
+
+       err = squashfs_get_id(sb, le16_to_cpu(sqsh_ino->guid), &inode->i_gid);
+       if (err)
+               return err;
+
+       inode->i_ino = le32_to_cpu(sqsh_ino->inode_number);
+       inode->i_mtime.tv_sec = le32_to_cpu(sqsh_ino->mtime);
+       inode->i_atime.tv_sec = inode->i_mtime.tv_sec;
+       inode->i_ctime.tv_sec = inode->i_mtime.tv_sec;
+       inode->i_mode = le16_to_cpu(sqsh_ino->mode);
+       inode->i_size = 0;
+
+       return err;
+}
+
+
+struct inode *squashfs_iget(struct super_block *sb, long long ino,
+                               unsigned int ino_number)
+{
+       struct inode *inode = iget_locked(sb, ino_number);
+       int err;
+
+       TRACE("Entered squashfs_iget\n");
+
+       if (!inode)
+               return ERR_PTR(-ENOMEM);
+       if (!(inode->i_state & I_NEW))
+               return inode;
+
+       err = squashfs_read_inode(inode, ino);
+       if (err) {
+               iget_failed(inode);
+               return ERR_PTR(err);
+       }
+
+       unlock_new_inode(inode);
+       return inode;
+}
+
+
+/*
+ * Initialise VFS inode by reading inode from inode table (compressed
+ * metadata).  The format and amount of data read depends on type.
+ */
+int squashfs_read_inode(struct inode *inode, long long ino)
+{
+       struct super_block *sb = inode->i_sb;
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       u64 block = SQUASHFS_INODE_BLK(ino) + msblk->inode_table;
+       int err, type, offset = SQUASHFS_INODE_OFFSET(ino);
+       union squashfs_inode squashfs_ino;
+       struct squashfs_base_inode *sqshb_ino = &squashfs_ino.base;
+
+       TRACE("Entered squashfs_read_inode\n");
+
+       /*
+        * Read inode base common to all inode types.
+        */
+       err = squashfs_read_metadata(sb, sqshb_ino, &block,
+                               &offset, sizeof(*sqshb_ino));
+       if (err < 0)
+               goto failed_read;
+
+       err = squashfs_new_inode(sb, inode, sqshb_ino);
+       if (err)
+               goto failed_read;
+
+       block = SQUASHFS_INODE_BLK(ino) + msblk->inode_table;
+       offset = SQUASHFS_INODE_OFFSET(ino);
+
+       type = le16_to_cpu(sqshb_ino->inode_type);
+       switch (type) {
+       case SQUASHFS_REG_TYPE: {
+               unsigned int frag_offset, frag;
+               /* signed: squashfs_frag_lookup() returns a negative errno */
+               int frag_size;
+               u64 frag_blk;
+               struct squashfs_reg_inode *sqsh_ino = &squashfs_ino.reg;
+
+               err = squashfs_read_metadata(sb, sqsh_ino, &block, &offset,
+                                                       sizeof(*sqsh_ino));
+               if (err < 0)
+                       goto failed_read;
+
+               frag = le32_to_cpu(sqsh_ino->fragment);
+               if (frag != SQUASHFS_INVALID_FRAG) {
+                       frag_offset = le32_to_cpu(sqsh_ino->offset);
+                       frag_size = squashfs_frag_lookup(sb, frag, &frag_blk);
+                       if (frag_size < 0) {
+                               err = frag_size;
+                               goto failed_read;
+                       }
+               } else {
+                       frag_blk = SQUASHFS_INVALID_BLK;
+                       frag_size = 0;
+                       frag_offset = 0;
+               }
+
+               inode->i_nlink = 1;
+               inode->i_size = le32_to_cpu(sqsh_ino->file_size);
+               inode->i_fop = &generic_ro_fops;
+               inode->i_mode |= S_IFREG;
+               inode->i_blocks = ((inode->i_size - 1) >> 9) + 1;
+               squashfs_i(inode)->fragment_block = frag_blk;
+               squashfs_i(inode)->fragment_size = frag_size;
+               squashfs_i(inode)->fragment_offset = frag_offset;
+               squashfs_i(inode)->start = le32_to_cpu(sqsh_ino->start_block);
+               squashfs_i(inode)->block_list_start = block;
+               squashfs_i(inode)->offset = offset;
+               inode->i_data.a_ops = &squashfs_aops;
+
+               TRACE("File inode %x:%x, start_block %llx, block_list_start "
+                       "%llx, offset %x\n", SQUASHFS_INODE_BLK(ino),
+                       offset, squashfs_i(inode)->start, block, offset);
+               break;
+       }
+       case SQUASHFS_LREG_TYPE: {
+               unsigned int frag_offset, frag;
+               int frag_size;
+               u64 frag_blk;
+               struct squashfs_lreg_inode *sqsh_ino = &squashfs_ino.lreg;
+
+               err = squashfs_read_metadata(sb, sqsh_ino, &block, &offset,
+                                                       sizeof(*sqsh_ino));
+               if (err < 0)
+                       goto failed_read;
+
+               frag = le32_to_cpu(sqsh_ino->fragment);
+               if (frag != SQUASHFS_INVALID_FRAG) {
+                       frag_offset = le32_to_cpu(sqsh_ino->offset);
+                       frag_size = squashfs_frag_lookup(sb, frag, &frag_blk);
+                       if (frag_size < 0) {
+                               err = frag_size;
+                               goto failed_read;
+                       }
+               } else {
+                       frag_blk = SQUASHFS_INVALID_BLK;
+                       frag_size = 0;
+                       frag_offset = 0;
+               }
+
+               inode->i_nlink = le32_to_cpu(sqsh_ino->nlink);
+               inode->i_size = le64_to_cpu(sqsh_ino->file_size);
+               inode->i_fop = &generic_ro_fops;
+               inode->i_mode |= S_IFREG;
+               inode->i_blocks = ((inode->i_size -
+                               le64_to_cpu(sqsh_ino->sparse) - 1) >> 9) + 1;
+
+               squashfs_i(inode)->fragment_block = frag_blk;
+               squashfs_i(inode)->fragment_size = frag_size;
+               squashfs_i(inode)->fragment_offset = frag_offset;
+               squashfs_i(inode)->start = le64_to_cpu(sqsh_ino->start_block);
+               squashfs_i(inode)->block_list_start = block;
+               squashfs_i(inode)->offset = offset;
+               inode->i_data.a_ops = &squashfs_aops;
+
+               TRACE("File inode %x:%x, start_block %llx, block_list_start "
+                       "%llx, offset %x\n", SQUASHFS_INODE_BLK(ino),
+                       offset, squashfs_i(inode)->start, block, offset);
+               break;
+       }
+       case SQUASHFS_DIR_TYPE: {
+               struct squashfs_dir_inode *sqsh_ino = &squashfs_ino.dir;
+
+               err = squashfs_read_metadata(sb, sqsh_ino, &block, &offset,
+                               sizeof(*sqsh_ino));
+               if (err < 0)
+                       goto failed_read;
+
+               inode->i_nlink = le32_to_cpu(sqsh_ino->nlink);
+               inode->i_size = le16_to_cpu(sqsh_ino->file_size);
+               inode->i_op = &squashfs_dir_inode_ops;
+               inode->i_fop = &squashfs_dir_ops;
+               inode->i_mode |= S_IFDIR;
+               squashfs_i(inode)->start = le32_to_cpu(sqsh_ino->start_block);
+               squashfs_i(inode)->offset = le16_to_cpu(sqsh_ino->offset);
+               squashfs_i(inode)->dir_idx_cnt = 0;
+               squashfs_i(inode)->parent = le32_to_cpu(sqsh_ino->parent_inode);
+
+               TRACE("Directory inode %x:%x, start_block %llx, offset %x\n",
+                               SQUASHFS_INODE_BLK(ino), offset,
+                               squashfs_i(inode)->start,
+                               le16_to_cpu(sqsh_ino->offset));
+               break;
+       }
+       case SQUASHFS_LDIR_TYPE: {
+               struct squashfs_ldir_inode *sqsh_ino = &squashfs_ino.ldir;
+
+               err = squashfs_read_metadata(sb, sqsh_ino, &block, &offset,
+                               sizeof(*sqsh_ino));
+               if (err < 0)
+                       goto failed_read;
+
+               inode->i_nlink = le32_to_cpu(sqsh_ino->nlink);
+               inode->i_size = le32_to_cpu(sqsh_ino->file_size);
+               inode->i_op = &squashfs_dir_inode_ops;
+               inode->i_fop = &squashfs_dir_ops;
+               inode->i_mode |= S_IFDIR;
+               squashfs_i(inode)->start = le32_to_cpu(sqsh_ino->start_block);
+               squashfs_i(inode)->offset = le16_to_cpu(sqsh_ino->offset);
+               squashfs_i(inode)->dir_idx_start = block;
+               squashfs_i(inode)->dir_idx_offset = offset;
+               squashfs_i(inode)->dir_idx_cnt = le16_to_cpu(sqsh_ino->i_count);
+               squashfs_i(inode)->parent = le32_to_cpu(sqsh_ino->parent_inode);
+
+               TRACE("Long directory inode %x:%x, start_block %llx, offset "
+                               "%x\n", SQUASHFS_INODE_BLK(ino), offset,
+                               squashfs_i(inode)->start,
+                               le16_to_cpu(sqsh_ino->offset));
+               break;
+       }
+       case SQUASHFS_SYMLINK_TYPE:
+       case SQUASHFS_LSYMLINK_TYPE: {
+               struct squashfs_symlink_inode *sqsh_ino = &squashfs_ino.symlink;
+
+               err = squashfs_read_metadata(sb, sqsh_ino, &block, &offset,
+                               sizeof(*sqsh_ino));
+               if (err < 0)
+                       goto failed_read;
+
+               inode->i_nlink = le32_to_cpu(sqsh_ino->nlink);
+               inode->i_size = le32_to_cpu(sqsh_ino->symlink_size);
+               inode->i_op = &page_symlink_inode_operations;
+               inode->i_data.a_ops = &squashfs_symlink_aops;
+               inode->i_mode |= S_IFLNK;
+               squashfs_i(inode)->start = block;
+               squashfs_i(inode)->offset = offset;
+
+               TRACE("Symbolic link inode %x:%x, start_block %llx, offset "
+                               "%x\n", SQUASHFS_INODE_BLK(ino), offset,
+                               block, offset);
+               break;
+       }
+       case SQUASHFS_BLKDEV_TYPE:
+       case SQUASHFS_CHRDEV_TYPE:
+       case SQUASHFS_LBLKDEV_TYPE:
+       case SQUASHFS_LCHRDEV_TYPE: {
+               struct squashfs_dev_inode *sqsh_ino = &squashfs_ino.dev;
+               unsigned int rdev;
+
+               err = squashfs_read_metadata(sb, sqsh_ino, &block, &offset,
+                               sizeof(*sqsh_ino));
+               if (err < 0)
+                       goto failed_read;
+
+               if (type == SQUASHFS_CHRDEV_TYPE)
+                       inode->i_mode |= S_IFCHR;
+               else
+                       inode->i_mode |= S_IFBLK;
+               inode->i_nlink = le32_to_cpu(sqsh_ino->nlink);
+               rdev = le32_to_cpu(sqsh_ino->rdev);
+               init_special_inode(inode, inode->i_mode, new_decode_dev(rdev));
+
+               TRACE("Device inode %x:%x, rdev %x\n",
+                               SQUASHFS_INODE_BLK(ino), offset, rdev);
+               break;
+       }
+       case SQUASHFS_FIFO_TYPE:
+       case SQUASHFS_SOCKET_TYPE:
+       case SQUASHFS_LFIFO_TYPE:
+       case SQUASHFS_LSOCKET_TYPE: {
+               struct squashfs_ipc_inode *sqsh_ino = &squashfs_ino.ipc;
+
+               err = squashfs_read_metadata(sb, sqsh_ino, &block, &offset,
+                               sizeof(*sqsh_ino));
+               if (err < 0)
+                       goto failed_read;
+
+               if (type == SQUASHFS_FIFO_TYPE)
+                       inode->i_mode |= S_IFIFO;
+               else
+                       inode->i_mode |= S_IFSOCK;
+               inode->i_nlink = le32_to_cpu(sqsh_ino->nlink);
+               init_special_inode(inode, inode->i_mode, 0);
+               break;
+       }
+       default:
+               ERROR("Unknown inode type %d in squashfs_read_inode!\n", type);
+               return -EINVAL;
+       }
+
+       return 0;
+
+failed_read:
+       ERROR("Unable to read inode 0x%llx\n", ino);
+       return err;
+}
diff --git a/fs/squashfs/namei.c b/fs/squashfs/namei.c
new file mode 100644
index 0000000..9e39865
--- /dev/null
@@ -0,0 +1,242 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * namei.c
+ */
+
+/*
+ * This file implements code to do filename lookup in directories.
+ *
+ * Like inodes, directories are packed into compressed metadata blocks, stored
+ * in a directory table.  Directories are accessed using the start address of
+ * the metablock containing the directory and the offset into the
+ * decompressed block (<block, offset>).
+ *
+ * Directories are organised in a slightly complex way, and are not simply
+ * a list of file names.  The organisation takes advantage of the
+ * fact that (in most cases) the inodes of the files in a directory will be
+ * in the same compressed metadata block, and can therefore share the same
+ * start block.  Directories are therefore organised as a two-level list: a
+ * directory header containing the shared start block value, followed by a
+ * sequence of directory entries, each of which shares that start block.  A
+ * new directory header is written whenever the inode start block changes.
+ * The directory header/directory entry list is repeated as many times as
+ * necessary.
+ *
+ * Directories are sorted, and can contain a directory index to speed up
+ * file lookup.  Directory indexes store one entry per metablock, each entry
+ * storing the index/filename mapping to the first directory header
+ * in each metadata block.  Directories are sorted in alphabetical order,
+ * and at lookup the index is scanned linearly looking for the first filename
+ * alphabetically larger than the filename being looked up.  At this point the
+ * location of the metadata block the filename is in has been found.
+ * The general idea of the index is to ensure that only one metadata block
+ * needs to be decompressed to do a lookup irrespective of the length of the
+ * directory.
+ * This scheme has the advantage that it doesn't require extra memory overhead
+ * and doesn't require much extra storage on disk.
+ */
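+
+/*
+ * For illustration, a directory holding "aardvark" and "badger" whose
+ * inodes share one metadata block is laid out on disk roughly as follows
+ * (a sketch using the structures from squashfs_fs.h, not a byte-exact
+ * dump):
+ *
+ *     squashfs_dir_header { count = 1, start_block = X, inode_number }
+ *     squashfs_dir_entry  { offset, inode_number, type, size = 7 } "aardvark"
+ *     squashfs_dir_entry  { offset, inode_number, type, size = 5 } "badger"
+ *
+ * count and size are stored off-by-one (count + 1 entries follow, a name
+ * is size + 1 bytes), which is why the readers below add one.
+ */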
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/dcache.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+/*
+ * Lookup name in the directory index, returning the location of the metadata
+ * block containing it, and the directory index this represents.
+ *
+ * If we get an error reading the index then return the part of the index
+ * (if any) we have managed to read - the index isn't essential, just
+ * quicker.
+ */
+static int get_dir_index_using_name(struct super_block *sb,
+                       u64 *next_block, int *next_offset, u64 index_start,
+                       int index_offset, int i_count, const char *name,
+                       int len)
+{
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       int i, size, length = 0, err;
+       struct squashfs_dir_index *index;
+       char *str;
+
+       TRACE("Entered get_dir_index_using_name, i_count %d\n", i_count);
+
+       index = kmalloc(sizeof(*index) + SQUASHFS_NAME_LEN * 2 + 2, GFP_KERNEL);
+       if (index == NULL) {
+               ERROR("Failed to allocate squashfs_dir_index\n");
+               goto out;
+       }
+
+       str = &index->name[SQUASHFS_NAME_LEN + 1];
+       strncpy(str, name, len);
+       str[len] = '\0';
+
+       for (i = 0; i < i_count; i++) {
+               err = squashfs_read_metadata(sb, index, &index_start,
+                                       &index_offset, sizeof(*index));
+               if (err < 0)
+                       break;
+
+               size = le32_to_cpu(index->size) + 1;
+
+               err = squashfs_read_metadata(sb, index->name, &index_start,
+                                       &index_offset, size);
+               if (err < 0)
+                       break;
+
+               index->name[size] = '\0';
+
+               if (strcmp(index->name, str) > 0)
+                       break;
+
+               length = le32_to_cpu(index->index);
+               *next_block = le32_to_cpu(index->start_block) +
+                                       msblk->directory_table;
+       }
+
+       *next_offset = (length + *next_offset) % SQUASHFS_METADATA_SIZE;
+       kfree(index);
+
+out:
+       /*
+        * Return index (f_pos) of the looked up metadata block.  Translate
+        * from internal f_pos to external f_pos which is offset by 3 because
+        * we invent "." and ".." entries which are not actually stored in the
+        * directory.
+        */
+       return length + 3;
+}
+
+
+static struct dentry *squashfs_lookup(struct inode *dir, struct dentry *dentry,
+                                struct nameidata *nd)
+{
+       const unsigned char *name = dentry->d_name.name;
+       int len = dentry->d_name.len;
+       struct inode *inode = NULL;
+       struct squashfs_sb_info *msblk = dir->i_sb->s_fs_info;
+       struct squashfs_dir_header dirh;
+       struct squashfs_dir_entry *dire;
+       u64 block = squashfs_i(dir)->start + msblk->directory_table;
+       int offset = squashfs_i(dir)->offset;
+       int err, length = 0, dir_count, size;
+
+       TRACE("Entered squashfs_lookup [%llx:%x]\n", block, offset);
+
+       dire = kmalloc(sizeof(*dire) + SQUASHFS_NAME_LEN + 1, GFP_KERNEL);
+       if (dire == NULL) {
+               ERROR("Failed to allocate squashfs_dir_entry\n");
+               return ERR_PTR(-ENOMEM);
+       }
+
+       if (len > SQUASHFS_NAME_LEN) {
+               err = -ENAMETOOLONG;
+               goto failed;
+       }
+
+       length = get_dir_index_using_name(dir->i_sb, &block, &offset,
+                               squashfs_i(dir)->dir_idx_start,
+                               squashfs_i(dir)->dir_idx_offset,
+                               squashfs_i(dir)->dir_idx_cnt, name, len);
+
+       while (length < i_size_read(dir)) {
+               /*
+                * Read directory header.
+                */
+               err = squashfs_read_metadata(dir->i_sb, &dirh, &block,
+                               &offset, sizeof(dirh));
+               if (err < 0)
+                       goto read_failure;
+
+               length += sizeof(dirh);
+
+               dir_count = le32_to_cpu(dirh.count) + 1;
+               while (dir_count--) {
+                       /*
+                        * Read directory entry.
+                        */
+                       err = squashfs_read_metadata(dir->i_sb, dire, &block,
+                                       &offset, sizeof(*dire));
+                       if (err < 0)
+                               goto read_failure;
+
+                       size = le16_to_cpu(dire->size) + 1;
+
+                       err = squashfs_read_metadata(dir->i_sb, dire->name,
+                                       &block, &offset, size);
+                       if (err < 0)
+                               goto read_failure;
+
+                       length += sizeof(*dire) + size;
+
+                       if (name[0] < dire->name[0])
+                               goto exit_lookup;
+
+                       if (len == size && !strncmp(name, dire->name, len)) {
+                               unsigned int blk, off, ino_num;
+                               long long ino;
+                               blk = le32_to_cpu(dirh.start_block);
+                               off = le16_to_cpu(dire->offset);
+                               ino_num = le32_to_cpu(dirh.inode_number) +
+                                       (short) le16_to_cpu(dire->inode_number);
+                               ino = SQUASHFS_MKINODE(blk, off);
+
+                               TRACE("calling squashfs_iget for directory "
+                                       "entry %s, inode  %x:%x, %d\n", name,
+                                       blk, off, ino_num);
+
+                               inode = squashfs_iget(dir->i_sb, ino, ino_num);
+                               if (IS_ERR(inode)) {
+                                       err = PTR_ERR(inode);
+                                       goto failed;
+                               }
+
+                               goto exit_lookup;
+                       }
+               }
+       }
+
+exit_lookup:
+       kfree(dire);
+       if (inode)
+               return d_splice_alias(inode, dentry);
+       d_add(dentry, inode);
+       return ERR_PTR(0);
+
+read_failure:
+       ERROR("Unable to read directory block [%llx:%x]\n",
+               squashfs_i(dir)->start + msblk->directory_table,
+               squashfs_i(dir)->offset);
+failed:
+       kfree(dire);
+       return ERR_PTR(err);
+}
+
+
+const struct inode_operations squashfs_dir_inode_ops = {
+       .lookup = squashfs_lookup
+};
diff --git a/fs/squashfs/squashfs.h b/fs/squashfs/squashfs.h
new file mode 100644
index 0000000..6b2515d
--- /dev/null
@@ -0,0 +1,90 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * squashfs.h
+ */
+
+#define TRACE(s, args...)      pr_debug("SQUASHFS: "s, ## args)
+
+#define ERROR(s, args...)      pr_err("SQUASHFS error: "s, ## args)
+
+#define WARNING(s, args...)    pr_warning("SQUASHFS: "s, ## args)
+
+static inline struct squashfs_inode_info *squashfs_i(struct inode *inode)
+{
+       return container_of(inode, struct squashfs_inode_info, vfs_inode);
+}
+
+/* block.c */
+extern int squashfs_read_data(struct super_block *, void **, u64, int, u64 *,
+                               int);
+
+/* cache.c */
+extern struct squashfs_cache *squashfs_cache_init(char *, int, int);
+extern void squashfs_cache_delete(struct squashfs_cache *);
+extern struct squashfs_cache_entry *squashfs_cache_get(struct super_block *,
+                               struct squashfs_cache *, u64, int);
+extern void squashfs_cache_put(struct squashfs_cache_entry *);
+extern int squashfs_copy_data(void *, struct squashfs_cache_entry *, int, int);
+extern int squashfs_read_metadata(struct super_block *, void *, u64 *,
+                               int *, int);
+extern struct squashfs_cache_entry *squashfs_get_fragment(struct super_block *,
+                               u64, int);
+extern struct squashfs_cache_entry *squashfs_get_datablock(struct super_block *,
+                               u64, int);
+extern int squashfs_read_table(struct super_block *, void *, u64, int);
+
+/* export.c */
+extern __le64 *squashfs_read_inode_lookup_table(struct super_block *, u64,
+                               unsigned int);
+
+/* fragment.c */
+extern int squashfs_frag_lookup(struct super_block *, unsigned int, u64 *);
+extern __le64 *squashfs_read_fragment_index_table(struct super_block *,
+                               u64, unsigned int);
+
+/* id.c */
+extern int squashfs_get_id(struct super_block *, unsigned int, unsigned int *);
+extern __le64 *squashfs_read_id_index_table(struct super_block *, u64,
+                               unsigned short);
+
+/* inode.c */
+extern struct inode *squashfs_iget(struct super_block *, long long,
+                               unsigned int);
+extern int squashfs_read_inode(struct inode *, long long);
+
+/*
+ * Inodes and files operations
+ */
+
+/* dir.c */
+extern const struct file_operations squashfs_dir_ops;
+
+/* export.c */
+extern const struct export_operations squashfs_export_ops;
+
+/* file.c */
+extern const struct address_space_operations squashfs_aops;
+
+/* namei.c */
+extern const struct inode_operations squashfs_dir_inode_ops;
+
+/* symlink.c */
+extern const struct address_space_operations squashfs_symlink_aops;
diff --git a/fs/squashfs/squashfs_fs.h b/fs/squashfs/squashfs_fs.h
new file mode 100644
index 0000000..6840da1
--- /dev/null
@@ -0,0 +1,381 @@
+#ifndef SQUASHFS_FS
+#define SQUASHFS_FS
+/*
+ * Squashfs
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * squashfs_fs.h
+ */
+
+#define SQUASHFS_CACHED_FRAGMENTS      CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE
+#define SQUASHFS_MAJOR                 4
+#define SQUASHFS_MINOR                 0
+#define SQUASHFS_MAGIC                 0x73717368
+#define SQUASHFS_START                 0
+
+/* size of metadata (inode and directory) blocks */
+#define SQUASHFS_METADATA_SIZE         8192
+#define SQUASHFS_METADATA_LOG          13
+
+/* default size of data blocks */
+#define SQUASHFS_FILE_SIZE             131072
+#define SQUASHFS_FILE_LOG              17
+
+#define SQUASHFS_FILE_MAX_SIZE         1048576
+#define SQUASHFS_FILE_MAX_LOG          20
+
+/* Max number of uids and gids */
+#define SQUASHFS_IDS                   65536
+
+/* Max length of filename (not 255) */
+#define SQUASHFS_NAME_LEN              256
+
+#define SQUASHFS_INVALID_FRAG          (0xffffffffU)
+#define SQUASHFS_INVALID_BLK           (-1LL)
+
+/* Filesystem flags */
+#define SQUASHFS_NOI                   0
+#define SQUASHFS_NOD                   1
+#define SQUASHFS_NOF                   3
+#define SQUASHFS_NO_FRAG               4
+#define SQUASHFS_ALWAYS_FRAG           5
+#define SQUASHFS_DUPLICATE             6
+#define SQUASHFS_EXPORT                        7
+
+#define SQUASHFS_BIT(flag, bit)                (((flag) >> (bit)) & 1)
+
+#define SQUASHFS_UNCOMPRESSED_INODES(flags)    SQUASHFS_BIT(flags, \
+                                               SQUASHFS_NOI)
+
+#define SQUASHFS_UNCOMPRESSED_DATA(flags)      SQUASHFS_BIT(flags, \
+                                               SQUASHFS_NOD)
+
+#define SQUASHFS_UNCOMPRESSED_FRAGMENTS(flags) SQUASHFS_BIT(flags, \
+                                               SQUASHFS_NOF)
+
+#define SQUASHFS_NO_FRAGMENTS(flags)           SQUASHFS_BIT(flags, \
+                                               SQUASHFS_NO_FRAG)
+
+#define SQUASHFS_ALWAYS_FRAGMENTS(flags)       SQUASHFS_BIT(flags, \
+                                               SQUASHFS_ALWAYS_FRAG)
+
+#define SQUASHFS_DUPLICATES(flags)             SQUASHFS_BIT(flags, \
+                                               SQUASHFS_DUPLICATE)
+
+#define SQUASHFS_EXPORTABLE(flags)             SQUASHFS_BIT(flags, \
+                                               SQUASHFS_EXPORT)
+
+/* Inode type values, including the extended types */
+#define SQUASHFS_DIR_TYPE              1
+#define SQUASHFS_REG_TYPE              2
+#define SQUASHFS_SYMLINK_TYPE          3
+#define SQUASHFS_BLKDEV_TYPE           4
+#define SQUASHFS_CHRDEV_TYPE           5
+#define SQUASHFS_FIFO_TYPE             6
+#define SQUASHFS_SOCKET_TYPE           7
+#define SQUASHFS_LDIR_TYPE             8
+#define SQUASHFS_LREG_TYPE             9
+#define SQUASHFS_LSYMLINK_TYPE         10
+#define SQUASHFS_LBLKDEV_TYPE          11
+#define SQUASHFS_LCHRDEV_TYPE          12
+#define SQUASHFS_LFIFO_TYPE            13
+#define SQUASHFS_LSOCKET_TYPE          14
+
+/*
+ * Flag whether block is compressed or uncompressed, bit is set if block is
+ * uncompressed
+ */
+#define SQUASHFS_COMPRESSED_BIT                (1 << 15)
+
+#define SQUASHFS_COMPRESSED_SIZE(B)    (((B) & ~SQUASHFS_COMPRESSED_BIT) ? \
+               (B) & ~SQUASHFS_COMPRESSED_BIT :  SQUASHFS_COMPRESSED_BIT)
+
+#define SQUASHFS_COMPRESSED(B)         (!((B) & SQUASHFS_COMPRESSED_BIT))
+
+#define SQUASHFS_COMPRESSED_BIT_BLOCK  (1 << 24)
+
+#define SQUASHFS_COMPRESSED_SIZE_BLOCK(B)      ((B) & \
+                                               ~SQUASHFS_COMPRESSED_BIT_BLOCK)
+
+#define SQUASHFS_COMPRESSED_BLOCK(B)   (!((B) & SQUASHFS_COMPRESSED_BIT_BLOCK))
+
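+/*
+ * Worked example (illustrative values): a metadata block length word of
+ * 0x0010 describes 16 bytes of compressed data, while 0x8010 (top bit
+ * set) describes 16 bytes stored uncompressed:
+ *
+ *     SQUASHFS_COMPRESSED(0x0010) == 1, SQUASHFS_COMPRESSED_SIZE == 16
+ *     SQUASHFS_COMPRESSED(0x8010) == 0, SQUASHFS_COMPRESSED_SIZE == 16
+ */
+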
+/*
+ * Inode number ops.  Inodes consist of a compressed block number, and an
+ * uncompressed offset within that block
+ */
+#define SQUASHFS_INODE_BLK(A)          ((unsigned int) ((A) >> 16))
+
+#define SQUASHFS_INODE_OFFSET(A)       ((unsigned int) ((A) & 0xffff))
+
+#define SQUASHFS_MKINODE(A, B)         ((long long)(((long long) (A)\
+                                       << 16) + (B)))
+
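+/*
+ * Worked example (illustrative values): an inode stored 0x1a0 bytes into
+ * the uncompressed contents of the metadata block that starts at byte
+ * 0x2000 of the inode table has inode reference
+ *
+ *     SQUASHFS_MKINODE(0x2000, 0x1a0)    == 0x200001a0
+ *     SQUASHFS_INODE_BLK(0x200001a0)     == 0x2000
+ *     SQUASHFS_INODE_OFFSET(0x200001a0)  == 0x1a0
+ */
+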
+/* Translate between VFS mode and squashfs mode */
+#define SQUASHFS_MODE(A)               ((A) & 0xfff)
+
+/* fragment and fragment table defines */
+#define SQUASHFS_FRAGMENT_BYTES(A)     \
+                               ((A) * sizeof(struct squashfs_fragment_entry))
+
+#define SQUASHFS_FRAGMENT_INDEX(A)     (SQUASHFS_FRAGMENT_BYTES(A) / \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_FRAGMENT_INDEX_OFFSET(A)      (SQUASHFS_FRAGMENT_BYTES(A) % \
+                                               SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_FRAGMENT_INDEXES(A)   ((SQUASHFS_FRAGMENT_BYTES(A) + \
+                                       SQUASHFS_METADATA_SIZE - 1) / \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_FRAGMENT_INDEX_BYTES(A)       (SQUASHFS_FRAGMENT_INDEXES(A) *\
+                                               sizeof(u64))
+
+/* inode lookup table defines */
+#define SQUASHFS_LOOKUP_BYTES(A)       ((A) * sizeof(u64))
+
+#define SQUASHFS_LOOKUP_BLOCK(A)       (SQUASHFS_LOOKUP_BYTES(A) / \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_LOOKUP_BLOCK_OFFSET(A)        (SQUASHFS_LOOKUP_BYTES(A) % \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_LOOKUP_BLOCKS(A)      ((SQUASHFS_LOOKUP_BYTES(A) + \
+                                       SQUASHFS_METADATA_SIZE - 1) / \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_LOOKUP_BLOCK_BYTES(A) (SQUASHFS_LOOKUP_BLOCKS(A) *\
+                                       sizeof(u64))
+
+/* uid/gid lookup table defines */
+#define SQUASHFS_ID_BYTES(A)           ((A) * sizeof(unsigned int))
+
+#define SQUASHFS_ID_BLOCK(A)           (SQUASHFS_ID_BYTES(A) / \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_ID_BLOCK_OFFSET(A)    (SQUASHFS_ID_BYTES(A) % \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_ID_BLOCKS(A)          ((SQUASHFS_ID_BYTES(A) + \
+                                       SQUASHFS_METADATA_SIZE - 1) / \
+                                       SQUASHFS_METADATA_SIZE)
+
+#define SQUASHFS_ID_BLOCK_BYTES(A)     (SQUASHFS_ID_BLOCKS(A) *\
+                                       sizeof(u64))
+
+/* cached data constants for filesystem */
+#define SQUASHFS_CACHED_BLKS           8
+
+#define SQUASHFS_MAX_FILE_SIZE_LOG     64
+
+#define SQUASHFS_MAX_FILE_SIZE         (1LL << \
+                                       (SQUASHFS_MAX_FILE_SIZE_LOG - 2))
+
+#define SQUASHFS_MARKER_BYTE           0xff
+
+/* meta index cache */
+#define SQUASHFS_META_INDEXES  (SQUASHFS_METADATA_SIZE / sizeof(unsigned int))
+#define SQUASHFS_META_ENTRIES  127
+#define SQUASHFS_META_SLOTS    8
+
+struct meta_entry {
+       u64                     data_block;
+       unsigned int            index_block;
+       unsigned short          offset;
+       unsigned short          pad;
+};
+
+struct meta_index {
+       unsigned int            inode_number;
+       unsigned int            offset;
+       unsigned short          entries;
+       unsigned short          skip;
+       unsigned short          locked;
+       unsigned short          pad;
+       struct meta_entry       meta_entry[SQUASHFS_META_ENTRIES];
+};
+
+
+/*
+ * definitions for structures on disk
+ */
+#define ZLIB_COMPRESSION        1
+
+struct squashfs_super_block {
+       __le32                  s_magic;
+       __le32                  inodes;
+       __le32                  mkfs_time;
+       __le32                  block_size;
+       __le32                  fragments;
+       __le16                  compression;
+       __le16                  block_log;
+       __le16                  flags;
+       __le16                  no_ids;
+       __le16                  s_major;
+       __le16                  s_minor;
+       __le64                  root_inode;
+       __le64                  bytes_used;
+       __le64                  id_table_start;
+       __le64                  xattr_table_start;
+       __le64                  inode_table_start;
+       __le64                  directory_table_start;
+       __le64                  fragment_table_start;
+       __le64                  lookup_table_start;
+};
+
+struct squashfs_dir_index {
+       __le32                  index;
+       __le32                  start_block;
+       __le32                  size;
+       unsigned char           name[0];
+};
+
+struct squashfs_base_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+};
+
+struct squashfs_ipc_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+       __le32                  nlink;
+};
+
+struct squashfs_dev_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+       __le32                  nlink;
+       __le32                  rdev;
+};
+
+struct squashfs_symlink_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+       __le32                  nlink;
+       __le32                  symlink_size;
+       char                    symlink[0];
+};
+
+struct squashfs_reg_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+       __le32                  start_block;
+       __le32                  fragment;
+       __le32                  offset;
+       __le32                  file_size;
+       __le16                  block_list[0];
+};
+
+struct squashfs_lreg_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+       __le64                  start_block;
+       __le64                  file_size;
+       __le64                  sparse;
+       __le32                  nlink;
+       __le32                  fragment;
+       __le32                  offset;
+       __le32                  xattr;
+       __le16                  block_list[0];
+};
+
+struct squashfs_dir_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+       __le32                  start_block;
+       __le32                  nlink;
+       __le16                  file_size;
+       __le16                  offset;
+       __le32                  parent_inode;
+};
+
+struct squashfs_ldir_inode {
+       __le16                  inode_type;
+       __le16                  mode;
+       __le16                  uid;
+       __le16                  guid;
+       __le32                  mtime;
+       __le32                  inode_number;
+       __le32                  nlink;
+       __le32                  file_size;
+       __le32                  start_block;
+       __le32                  parent_inode;
+       __le16                  i_count;
+       __le16                  offset;
+       __le32                  xattr;
+       struct squashfs_dir_index       index[0];
+};
+
+union squashfs_inode {
+       struct squashfs_base_inode              base;
+       struct squashfs_dev_inode               dev;
+       struct squashfs_symlink_inode           symlink;
+       struct squashfs_reg_inode               reg;
+       struct squashfs_lreg_inode              lreg;
+       struct squashfs_dir_inode               dir;
+       struct squashfs_ldir_inode              ldir;
+       struct squashfs_ipc_inode               ipc;
+};
+
+struct squashfs_dir_entry {
+       __le16                  offset;
+       __le16                  inode_number;
+       __le16                  type;
+       __le16                  size;
+       char                    name[0];
+};
+
+struct squashfs_dir_header {
+       __le32                  count;
+       __le32                  start_block;
+       __le32                  inode_number;
+};
+
+struct squashfs_fragment_entry {
+       __le64                  start_block;
+       __le32                  size;
+       unsigned int            unused;
+};
+
+#endif
diff --git a/fs/squashfs/squashfs_fs_i.h b/fs/squashfs/squashfs_fs_i.h
new file mode 100644
index 0000000..fbfca30
--- /dev/null
@@ -0,0 +1,45 @@
+#ifndef SQUASHFS_FS_I
+#define SQUASHFS_FS_I
+/*
+ * Squashfs
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * squashfs_fs_i.h
+ */
+
+struct squashfs_inode_info {
+       u64             start;
+       int             offset;
+       union {
+               struct {
+                       u64             fragment_block;
+                       int             fragment_size;
+                       int             fragment_offset;
+                       u64             block_list_start;
+               };
+               struct {
+                       u64             dir_idx_start;
+                       int             dir_idx_offset;
+                       int             dir_idx_cnt;
+                       int             parent;
+               };
+       };
+       struct inode    vfs_inode;
+};
+#endif
diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
new file mode 100644
index 0000000..c8c6561
--- /dev/null
@@ -0,0 +1,76 @@
+#ifndef SQUASHFS_FS_SB
+#define SQUASHFS_FS_SB
+/*
+ * Squashfs
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * squashfs_fs_sb.h
+ */
+
+#include "squashfs_fs.h"
+
+struct squashfs_cache {
+       char                    *name;
+       int                     entries;
+       int                     next_blk;
+       int                     num_waiters;
+       int                     unused;
+       int                     block_size;
+       int                     pages;
+       spinlock_t              lock;
+       wait_queue_head_t       wait_queue;
+       struct squashfs_cache_entry *entry;
+};
+
+struct squashfs_cache_entry {
+       u64                     block;
+       int                     length;
+       int                     refcount;
+       u64                     next_index;
+       int                     pending;
+       int                     error;
+       int                     num_waiters;
+       wait_queue_head_t       wait_queue;
+       struct squashfs_cache   *cache;
+       void                    **data;
+};
+
+struct squashfs_sb_info {
+       int                     devblksize;
+       int                     devblksize_log2;
+       struct squashfs_cache   *block_cache;
+       struct squashfs_cache   *fragment_cache;
+       struct squashfs_cache   *read_page;
+       int                     next_meta_index;
+       __le64                  *id_table;
+       __le64                  *fragment_index;
+       unsigned int            *fragment_index_2;
+       struct mutex            read_data_mutex;
+       struct mutex            meta_index_mutex;
+       struct meta_index       *meta_index;
+       z_stream                stream;
+       __le64                  *inode_lookup_table;
+       u64                     inode_table;
+       u64                     directory_table;
+       unsigned int            block_size;
+       unsigned short          block_log;
+       long long               bytes_used;
+       unsigned int            inodes;
+};
+#endif
diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
new file mode 100644
index 0000000..a0466d7
--- /dev/null
@@ -0,0 +1,440 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * super.c
+ */
+
+/*
+ * This file implements code to read the superblock, read and initialise
+ * in-memory structures at mount time, and all the VFS glue code to register
+ * the filesystem.
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/slab.h>
+#include <linux/mutex.h>
+#include <linux/pagemap.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+static struct file_system_type squashfs_fs_type;
+static struct super_operations squashfs_super_ops;
+
+static int supported_squashfs_filesystem(short major, short minor, short comp)
+{
+       if (major < SQUASHFS_MAJOR) {
+               ERROR("Major/Minor mismatch, older Squashfs %d.%d "
+                       "filesystems are unsupported\n", major, minor);
+               return -EINVAL;
+       } else if (major > SQUASHFS_MAJOR || minor > SQUASHFS_MINOR) {
+               ERROR("Major/Minor mismatch, trying to mount newer "
+                       "%d.%d filesystem\n", major, minor);
+               ERROR("Please update your kernel\n");
+               return -EINVAL;
+       }
+
+       if (comp != ZLIB_COMPRESSION)
+               return -EINVAL;
+
+       return 0;
+}
+
+
+static int squashfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+       struct squashfs_sb_info *msblk;
+       struct squashfs_super_block *sblk = NULL;
+       char b[BDEVNAME_SIZE];
+       struct inode *root;
+       long long root_inode;
+       unsigned short flags;
+       unsigned int fragments;
+       u64 lookup_table_start;
+       int err;
+
+       TRACE("Entered squashfs_fill_super\n");
+
+       sb->s_fs_info = kzalloc(sizeof(*msblk), GFP_KERNEL);
+       if (sb->s_fs_info == NULL) {
+               ERROR("Failed to allocate squashfs_sb_info\n");
+               return -ENOMEM;
+       }
+       msblk = sb->s_fs_info;
+
+       msblk->stream.workspace = kmalloc(zlib_inflate_workspacesize(),
+               GFP_KERNEL);
+       if (msblk->stream.workspace == NULL) {
+               ERROR("Failed to allocate zlib workspace\n");
+               goto failure;
+       }
+
+       sblk = kzalloc(sizeof(*sblk), GFP_KERNEL);
+       if (sblk == NULL) {
+               ERROR("Failed to allocate squashfs_super_block\n");
+               goto failure;
+       }
+
+       msblk->devblksize = sb_min_blocksize(sb, BLOCK_SIZE);
+       msblk->devblksize_log2 = ffz(~msblk->devblksize);
+
+       mutex_init(&msblk->read_data_mutex);
+       mutex_init(&msblk->meta_index_mutex);
+
+       /*
+        * msblk->bytes_used is checked in squashfs_read_table to ensure reads
+        * are not beyond filesystem end.  But as we're using
+        * squashfs_read_table here to read the superblock (including the value
+        * of bytes_used) we need to set it to an initial sensible dummy value
+        */
+       msblk->bytes_used = sizeof(*sblk);
+       err = squashfs_read_table(sb, sblk, SQUASHFS_START, sizeof(*sblk));
+
+       if (err < 0) {
+               ERROR("unable to read squashfs_super_block\n");
+               goto failed_mount;
+       }
+
+       /* Check it is a SQUASHFS superblock */
+       sb->s_magic = le32_to_cpu(sblk->s_magic);
+       if (sb->s_magic != SQUASHFS_MAGIC) {
+               if (!silent)
+                       ERROR("Can't find a SQUASHFS superblock on %s\n",
+                                               bdevname(sb->s_bdev, b));
+               err = -EINVAL;
+               goto failed_mount;
+       }
+
+       /* Check the MAJOR & MINOR versions and compression type */
+       err = supported_squashfs_filesystem(le16_to_cpu(sblk->s_major),
+                       le16_to_cpu(sblk->s_minor),
+                       le16_to_cpu(sblk->compression));
+       if (err < 0)
+               goto failed_mount;
+
+       err = -EINVAL;
+
+       /*
+        * Check if there's xattrs in the filesystem.  These are not
+        * supported in this version, so warn that they will be ignored.
+        */
+       if (le64_to_cpu(sblk->xattr_table_start) != SQUASHFS_INVALID_BLK)
+               ERROR("Xattrs in filesystem, these will be ignored\n");
+
+       /*
+        * Check the filesystem does not extend beyond the end of the
+        * block device
+        */
+       msblk->bytes_used = le64_to_cpu(sblk->bytes_used);
+       if (msblk->bytes_used < 0 || msblk->bytes_used >
+                       i_size_read(sb->s_bdev->bd_inode))
+               goto failed_mount;
+
+       /* Check block size for sanity */
+       msblk->block_size = le32_to_cpu(sblk->block_size);
+       if (msblk->block_size > SQUASHFS_FILE_MAX_SIZE)
+               goto failed_mount;
+
+       msblk->block_log = le16_to_cpu(sblk->block_log);
+       if (msblk->block_log > SQUASHFS_FILE_MAX_LOG)
+               goto failed_mount;
+
+       /* Check the root inode for sanity */
+       root_inode = le64_to_cpu(sblk->root_inode);
+       if (SQUASHFS_INODE_OFFSET(root_inode) > SQUASHFS_METADATA_SIZE)
+               goto failed_mount;
+
+       msblk->inode_table = le64_to_cpu(sblk->inode_table_start);
+       msblk->directory_table = le64_to_cpu(sblk->directory_table_start);
+       msblk->inodes = le32_to_cpu(sblk->inodes);
+       flags = le16_to_cpu(sblk->flags);
+
+       TRACE("Found valid superblock on %s\n", bdevname(sb->s_bdev, b));
+       TRACE("Inodes are %scompressed\n", SQUASHFS_UNCOMPRESSED_INODES(flags)
+                               ? "un" : "");
+       TRACE("Data is %scompressed\n", SQUASHFS_UNCOMPRESSED_DATA(flags)
+                               ? "un" : "");
+       TRACE("Filesystem size %lld bytes\n", msblk->bytes_used);
+       TRACE("Block size %d\n", msblk->block_size);
+       TRACE("Number of inodes %d\n", msblk->inodes);
+       TRACE("Number of fragments %d\n", le32_to_cpu(sblk->fragments));
+       TRACE("Number of ids %d\n", le16_to_cpu(sblk->no_ids));
+       TRACE("sblk->inode_table_start %llx\n", msblk->inode_table);
+       TRACE("sblk->directory_table_start %llx\n", msblk->directory_table);
+       TRACE("sblk->fragment_table_start %llx\n",
+               (u64) le64_to_cpu(sblk->fragment_table_start));
+       TRACE("sblk->id_table_start %llx\n",
+               (u64) le64_to_cpu(sblk->id_table_start));
+
+       sb->s_maxbytes = MAX_LFS_FILESIZE;
+       sb->s_flags |= MS_RDONLY;
+       sb->s_op = &squashfs_super_ops;
+
+       err = -ENOMEM;
+
+       msblk->block_cache = squashfs_cache_init("metadata",
+                       SQUASHFS_CACHED_BLKS, SQUASHFS_METADATA_SIZE);
+       if (msblk->block_cache == NULL)
+               goto failed_mount;
+
+       /* Allocate read_page block */
+       msblk->read_page = squashfs_cache_init("data", 1, msblk->block_size);
+       if (msblk->read_page == NULL) {
+               ERROR("Failed to allocate read_page block\n");
+               goto failed_mount;
+       }
+
+       /* Allocate and read id index table */
+       msblk->id_table = squashfs_read_id_index_table(sb,
+               le64_to_cpu(sblk->id_table_start), le16_to_cpu(sblk->no_ids));
+       if (IS_ERR(msblk->id_table)) {
+               err = PTR_ERR(msblk->id_table);
+               msblk->id_table = NULL;
+               goto failed_mount;
+       }
+
+       fragments = le32_to_cpu(sblk->fragments);
+       if (fragments == 0)
+               goto allocate_lookup_table;
+
+       msblk->fragment_cache = squashfs_cache_init("fragment",
+               SQUASHFS_CACHED_FRAGMENTS, msblk->block_size);
+       if (msblk->fragment_cache == NULL) {
+               err = -ENOMEM;
+               goto failed_mount;
+       }
+
+       /* Allocate and read fragment index table */
+       msblk->fragment_index = squashfs_read_fragment_index_table(sb,
+               le64_to_cpu(sblk->fragment_table_start), fragments);
+       if (IS_ERR(msblk->fragment_index)) {
+               err = PTR_ERR(msblk->fragment_index);
+               msblk->fragment_index = NULL;
+               goto failed_mount;
+       }
+
+allocate_lookup_table:
+       lookup_table_start = le64_to_cpu(sblk->lookup_table_start);
+       if (lookup_table_start == SQUASHFS_INVALID_BLK)
+               goto allocate_root;
+
+       /* Allocate and read inode lookup table */
+       msblk->inode_lookup_table = squashfs_read_inode_lookup_table(sb,
+               lookup_table_start, msblk->inodes);
+       if (IS_ERR(msblk->inode_lookup_table)) {
+               err = PTR_ERR(msblk->inode_lookup_table);
+               msblk->inode_lookup_table = NULL;
+               goto failed_mount;
+       }
+
+       sb->s_export_op = &squashfs_export_ops;
+
+allocate_root:
+       root = new_inode(sb);
+       if (!root) {
+               err = -ENOMEM;
+               goto failed_mount;
+       }
+
+       err = squashfs_read_inode(root, root_inode);
+       if (err) {
+               iget_failed(root);
+               goto failed_mount;
+       }
+       insert_inode_hash(root);
+
+       sb->s_root = d_alloc_root(root);
+       if (sb->s_root == NULL) {
+               ERROR("Root inode create failed\n");
+               err = -ENOMEM;
+               iput(root);
+               goto failed_mount;
+       }
+
+       TRACE("Leaving squashfs_fill_super\n");
+       kfree(sblk);
+       return 0;
+
+failed_mount:
+       squashfs_cache_delete(msblk->block_cache);
+       squashfs_cache_delete(msblk->fragment_cache);
+       squashfs_cache_delete(msblk->read_page);
+       kfree(msblk->inode_lookup_table);
+       kfree(msblk->fragment_index);
+       kfree(msblk->id_table);
+       kfree(msblk->stream.workspace);
+       kfree(sb->s_fs_info);
+       sb->s_fs_info = NULL;
+       kfree(sblk);
+       return err;
+
+failure:
+       kfree(msblk->stream.workspace);
+       kfree(sb->s_fs_info);
+       sb->s_fs_info = NULL;
+       return -ENOMEM;
+}
+
+
+static int squashfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+       struct squashfs_sb_info *msblk = dentry->d_sb->s_fs_info;
+
+       TRACE("Entered squashfs_statfs\n");
+
+       buf->f_type = SQUASHFS_MAGIC;
+       buf->f_bsize = msblk->block_size;
+       buf->f_blocks = ((msblk->bytes_used - 1) >> msblk->block_log) + 1;
+       buf->f_bfree = buf->f_bavail = 0;
+       buf->f_files = msblk->inodes;
+       buf->f_ffree = 0;
+       buf->f_namelen = SQUASHFS_NAME_LEN;
+
+       return 0;
+}
+
+
+static int squashfs_remount(struct super_block *sb, int *flags, char *data)
+{
+       *flags |= MS_RDONLY;
+       return 0;
+}
+
+
+static void squashfs_put_super(struct super_block *sb)
+{
+       if (sb->s_fs_info) {
+               struct squashfs_sb_info *sbi = sb->s_fs_info;
+               squashfs_cache_delete(sbi->block_cache);
+               squashfs_cache_delete(sbi->fragment_cache);
+               squashfs_cache_delete(sbi->read_page);
+               kfree(sbi->id_table);
+               kfree(sbi->fragment_index);
+               kfree(sbi->meta_index);
+               kfree(sbi->stream.workspace);
+               kfree(sb->s_fs_info);
+               sb->s_fs_info = NULL;
+       }
+}
+
+
+static int squashfs_get_sb(struct file_system_type *fs_type, int flags,
+                               const char *dev_name, void *data,
+                               struct vfsmount *mnt)
+{
+       return get_sb_bdev(fs_type, flags, dev_name, data, squashfs_fill_super,
+                               mnt);
+}
+
+
+static struct kmem_cache *squashfs_inode_cachep;
+
+
+static void init_once(void *foo)
+{
+       struct squashfs_inode_info *ei = foo;
+
+       inode_init_once(&ei->vfs_inode);
+}
+
+
+static int __init init_inodecache(void)
+{
+       squashfs_inode_cachep = kmem_cache_create("squashfs_inode_cache",
+               sizeof(struct squashfs_inode_info), 0,
+               SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, init_once);
+
+       return squashfs_inode_cachep ? 0 : -ENOMEM;
+}
+
+
+static void destroy_inodecache(void)
+{
+       kmem_cache_destroy(squashfs_inode_cachep);
+}
+
+
+static int __init init_squashfs_fs(void)
+{
+       int err = init_inodecache();
+
+       if (err)
+               return err;
+
+       err = register_filesystem(&squashfs_fs_type);
+       if (err) {
+               destroy_inodecache();
+               return err;
+       }
+
+       printk(KERN_INFO "squashfs: version 4.0 (2009/01/03) "
+               "Phillip Lougher\n");
+
+       return 0;
+}
+
+
+static void __exit exit_squashfs_fs(void)
+{
+       unregister_filesystem(&squashfs_fs_type);
+       destroy_inodecache();
+}
+
+
+static struct inode *squashfs_alloc_inode(struct super_block *sb)
+{
+       struct squashfs_inode_info *ei =
+               kmem_cache_alloc(squashfs_inode_cachep, GFP_KERNEL);
+
+       return ei ? &ei->vfs_inode : NULL;
+}
+
+
+static void squashfs_destroy_inode(struct inode *inode)
+{
+       kmem_cache_free(squashfs_inode_cachep, squashfs_i(inode));
+}
+
+
+static struct file_system_type squashfs_fs_type = {
+       .owner = THIS_MODULE,
+       .name = "squashfs",
+       .get_sb = squashfs_get_sb,
+       .kill_sb = kill_block_super,
+       .fs_flags = FS_REQUIRES_DEV
+};
+
+static struct super_operations squashfs_super_ops = {
+       .alloc_inode = squashfs_alloc_inode,
+       .destroy_inode = squashfs_destroy_inode,
+       .statfs = squashfs_statfs,
+       .put_super = squashfs_put_super,
+       .remount_fs = squashfs_remount
+};
+
+module_init(init_squashfs_fs);
+module_exit(exit_squashfs_fs);
+MODULE_DESCRIPTION("squashfs 4.0, a compressed read-only filesystem");
+MODULE_AUTHOR("Phillip Lougher <phillip@lougher.demon.co.uk>");
+MODULE_LICENSE("GPL");
diff --git a/fs/squashfs/symlink.c b/fs/squashfs/symlink.c
new file mode 100644
index 0000000..83d8788
--- /dev/null
@@ -0,0 +1,118 @@
+/*
+ * Squashfs - a compressed read only filesystem for Linux
+ *
+ * Copyright (c) 2002, 2003, 2004, 2005, 2006, 2007, 2008
+ * Phillip Lougher <phillip@lougher.demon.co.uk>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2,
+ * or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * symlink.c
+ */
+
+/*
+ * This file implements code to handle symbolic links.
+ *
+ * The data contents of symbolic links are stored inside the symbolic
+ * link inode within the inode table.  This allows the normally small symbolic
+ * link to be compressed as part of the inode table, achieving much greater
+ * compression than if the symbolic link were compressed individually.
+ */
+
+#include <linux/fs.h>
+#include <linux/vfs.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/pagemap.h>
+#include <linux/zlib.h>
+
+#include "squashfs_fs.h"
+#include "squashfs_fs_sb.h"
+#include "squashfs_fs_i.h"
+#include "squashfs.h"
+
+static int squashfs_symlink_readpage(struct file *file, struct page *page)
+{
+       struct inode *inode = page->mapping->host;
+       struct super_block *sb = inode->i_sb;
+       struct squashfs_sb_info *msblk = sb->s_fs_info;
+       int index = page->index << PAGE_CACHE_SHIFT;
+       u64 block = squashfs_i(inode)->start;
+       int offset = squashfs_i(inode)->offset;
+       int length = min_t(int, i_size_read(inode) - index, PAGE_CACHE_SIZE);
+       int bytes, copied;
+       void *pageaddr;
+       struct squashfs_cache_entry *entry;
+
+       TRACE("Entered squashfs_symlink_readpage, page index %ld, start block "
+                       "%llx, offset %x\n", page->index, block, offset);
+
+       /*
+        * Skip index bytes into symlink metadata.
+        */
+       if (index) {
+               bytes = squashfs_read_metadata(sb, NULL, &block, &offset,
+                                                               index);
+               if (bytes < 0) {
+                       ERROR("Unable to read symlink [%llx:%x]\n",
+                               squashfs_i(inode)->start,
+                               squashfs_i(inode)->offset);
+                       goto error_out;
+               }
+       }
+
+       /*
+        * Read length bytes from symlink metadata.  squashfs_read_metadata()
+        * is not used here because it can sleep and we want to use
+        * kmap_atomic to map the page.  Instead call the underlying
+        * squashfs_cache_get routine.  As length bytes may overlap metadata
+        * blocks, we may need to call squashfs_cache_get multiple times.
+        */
+       for (bytes = 0; bytes < length; offset = 0, bytes += copied) {
+               entry = squashfs_cache_get(sb, msblk->block_cache, block, 0);
+               if (entry->error) {
+                       ERROR("Unable to read symlink [%llx:%x]\n",
+                               squashfs_i(inode)->start,
+                               squashfs_i(inode)->offset);
+                       squashfs_cache_put(entry);
+                       goto error_out;
+               }
+
+               pageaddr = kmap_atomic(page, KM_USER0);
+               copied = squashfs_copy_data(pageaddr + bytes, entry, offset,
+                                                               length - bytes);
+               if (copied == length - bytes)
+                       memset(pageaddr + length, 0, PAGE_CACHE_SIZE - length);
+               else
+                       block = entry->next_index;
+               kunmap_atomic(pageaddr, KM_USER0);
+               squashfs_cache_put(entry);
+       }
+
+       flush_dcache_page(page);
+       SetPageUptodate(page);
+       unlock_page(page);
+       return 0;
+
+error_out:
+       SetPageError(page);
+       unlock_page(page);
+       return 0;
+}
+
+
+const struct address_space_operations squashfs_symlink_aops = {
+       .readpage = squashfs_symlink_readpage
+};
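For reference, ->readpage() is handed a locked page; the implementation must
fill it, mark it up to date (or flag an error) and unlock it, exactly as the
success and error paths above do. A minimal sketch of that contract, where
fill_data() is a hypothetical helper standing in for the filesystem's read
logic:

	/* sketch of the ->readpage() contract; fill_data() is hypothetical */
	static int example_readpage(struct file *file, struct page *page)
	{
		void *addr = kmap_atomic(page, KM_USER0);
		int err = fill_data(addr, page->mapping->host, page->index);

		kunmap_atomic(addr, KM_USER0);
		if (err) {
			SetPageError(page);
		} else {
			flush_dcache_page(page);
			SetPageUptodate(page);
		}
		unlock_page(page);
		return 0;
	}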
index 22c03714fb144d30b4c1bfd76fb30556eac373fa..86ca0e86e7d2579c0c096f7a57037627cf9dfa94 100644 (file)
@@ -22,7 +22,6 @@ typedef struct {
        unsigned long   dtlb_ptd_mapping;       /* [DAMR5] PTD mapping for dtlb cached PGE */
 
 #else
-       struct vm_list_struct   *vmlist;
        unsigned long           end_brk;
 
 #endif
index d9bd724479cf0a078ea9b92fa4f73045468a8234..150cb92bb6664e0ead1795a81e93ebed8f33f579 100644 (file)
@@ -4,7 +4,6 @@
 #if !defined(CONFIG_MMU)
 
 typedef struct {
-       struct vm_list_struct   *vmlist;
        unsigned long           end_brk;
 } mm_context_t;
 
index 1ee9488ca2e40589adf8634c5131c00424a7696f..79ca2da81c87a9710f790241dc494f67ed003dae 100644 (file)
@@ -31,6 +31,10 @@ struct backlight_device;
 struct fb_info;
 
 struct backlight_ops {
+       unsigned int options;
+
+#define BL_CORE_SUSPENDRESUME  (1 << 0)
+
        /* Notify the backlight driver some property has changed */
        int (*update_status)(struct backlight_device *);
        /* Return the current backlight brightness (accounting for power,
@@ -51,7 +55,19 @@ struct backlight_properties {
           modes; 4: full off), see FB_BLANK_XXX */
        int power;
        /* FB Blanking active? (values as for power) */
+       /* Due to be removed, please use (state & BL_CORE_FBBLANK) */
        int fb_blank;
+       /* Flags used to signal drivers of state changes */
+       /* Upper 4 bits are reserved for driver internal use */
+       unsigned int state;
+
+#define BL_CORE_SUSPENDED      (1 << 0)        /* backlight is suspended */
+#define BL_CORE_FBBLANK                (1 << 1)        /* backlight is under an fb blank event */
+#define BL_CORE_DRIVER4                (1 << 28)       /* reserved for driver specific use */
+#define BL_CORE_DRIVER3                (1 << 29)       /* reserved for driver specific use */
+#define BL_CORE_DRIVER2                (1 << 30)       /* reserved for driver specific use */
+#define BL_CORE_DRIVER1                (1 << 31)       /* reserved for driver specific use */
+
 };
 
 struct backlight_device {
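A driver that opts in to the new core-managed state would typically consult
props.state in its update hook and advertise BL_CORE_SUSPENDRESUME in options.
A hedged sketch, in which everything other than the BL_CORE_* flags and the
backlight_properties fields is illustrative:

	/* sketch: honour core-managed suspend/blank state */
	static int example_update_status(struct backlight_device *bd)
	{
		int brightness = bd->props.brightness;

		if (bd->props.state & (BL_CORE_SUSPENDED | BL_CORE_FBBLANK))
			brightness = 0;	/* core says suspended or fb-blanked */

		return example_set_hw_brightness(bd, brightness);
	}

	static struct backlight_ops example_ops = {
		.options	= BL_CORE_SUSPENDRESUME, /* core handles PM */
		.update_status	= example_update_status,
	};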
index 81b4207deb957dcdcacafefa3d26cd89343f95c3..96eea90f01a87711440b6fd3162e2d40d38269b4 100644 (file)
@@ -15,6 +15,7 @@
 #define __LINUX_PCA9532_H
 
 #include <linux/leds.h>
+#include <linux/workqueue.h>
 
 enum pca9532_state {
        PCA9532_OFF  = 0x0,
@@ -31,6 +32,7 @@ struct pca9532_led {
        struct i2c_client *client;
        char *name;
        struct led_classdev ldev;
+       struct work_struct work;
        enum pca9532_type type;
        enum pca9532_state state;
 };
index d3a73f5a48c3f0514508da1395914e7846b488c5..24489da701e331b1a4d1810b40afd7209775e265 100644 (file)
@@ -32,7 +32,10 @@ struct led_classdev {
        int                      brightness;
        int                      flags;
 
+       /* Lower 16 bits reflect status */
 #define LED_SUSPENDED          (1 << 0)
+       /* Upper 16 bits reflect control information */
+#define LED_CORE_SUSPENDRESUME (1 << 16)
 
        /* Set LED brightness level */
        /* Must not sleep, use a workqueue if needed */
@@ -62,7 +65,7 @@ struct led_classdev {
 
 extern int led_classdev_register(struct device *parent,
                                 struct led_classdev *led_cdev);
-extern void led_classdev_unregister(struct led_classdev *lcd);
+extern void led_classdev_unregister(struct led_classdev *led_cdev);
 extern void led_classdev_suspend(struct led_classdev *led_cdev);
 extern void led_classdev_resume(struct led_classdev *led_cdev);
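The "must not sleep" rule on brightness_set is why the pca9532 change above
threads a work_struct through each LED: an I2C transfer can sleep, so the hook
only records the value and defers the bus access. A hedged sketch of that
pattern (the example_* names and the PCA9532_ON value are assumptions, not
taken from this patch):

	/* sketch: deferred I2C write, safe to sleep in process context */
	static void example_work(struct work_struct *work)
	{
		struct pca9532_led *led = container_of(work,
						       struct pca9532_led, work);

		example_i2c_write(led->client, led->state);
	}

	/* called from the LED core; must not sleep, so just queue the work */
	static void example_brightness_set(struct led_classdev *led_cdev,
					   enum led_brightness value)
	{
		struct pca9532_led *led = container_of(led_cdev,
						       struct pca9532_led, ldev);

		led->state = value ? PCA9532_ON : PCA9532_OFF;
		schedule_work(&led->work);
	}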
 
index 96acbfc8aa12f2c607a97e110af29ce370389297..be3264e286e0199588ee651c772ad16552ac70a5 100644 (file)
 #ifndef __LINUX_MFD_WM8350_PMIC_H
 #define __LINUX_MFD_WM8350_PMIC_H
 
+#include <linux/platform_device.h>
+#include <linux/leds.h>
+#include <linux/regulator/machine.h>
+
 /*
  * Register values.
  */
@@ -700,6 +704,33 @@ struct wm8350;
 struct platform_device;
 struct regulator_init_data;
 
+/*
+ * WM8350 LED platform data
+ */
+struct wm8350_led_platform_data {
+       const char *name;
+       const char *default_trigger;
+       int max_uA;
+};
+
+struct wm8350_led {
+       struct platform_device *pdev;
+       struct mutex mutex;
+       struct work_struct work;
+       spinlock_t value_lock;
+       enum led_brightness value;
+       struct led_classdev cdev;
+       int max_uA_index;
+       int enabled;
+
+       struct regulator *isink;
+       struct regulator_consumer_supply isink_consumer;
+       struct regulator_init_data isink_init;
+       struct regulator *dcdc;
+       struct regulator_consumer_supply dcdc_consumer;
+       struct regulator_init_data dcdc_init;
+};
+
 struct wm8350_pmic {
        /* Number of regulators of each type on this device */
        int max_dcdc;
@@ -717,10 +748,15 @@ struct wm8350_pmic {
 
        /* regulator devices */
        struct platform_device *pdev[NUM_WM8350_REGULATORS];
+
+       /* LED devices */
+       struct wm8350_led led[2];
 };
 
 int wm8350_register_regulator(struct wm8350 *wm8350, int reg,
                              struct regulator_init_data *initdata);
+int wm8350_register_led(struct wm8350 *wm8350, int lednum, int dcdc, int isink,
+                       struct wm8350_led_platform_data *pdata);
 
 /*
  * Additional DCDC control not supported via regulator API
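Board code would describe each LED once and hand it to the new helper. A
hedged sketch, in which the LED number, DCDC/ISINK selections, trigger name
and current limit are placeholders rather than values from this patch:

	/* sketch: board-level registration of a WM8350-driven LED */
	static struct wm8350_led_platform_data board_led = {
		.name		 = "board:white:keypad",
		.default_trigger = "backlight",
		.max_uA		 = 20000,
	};

	static int board_init_leds(struct wm8350 *wm8350)
	{
		/* LED 0, powered from DCDC 1, sunk on ISINK 0 (placeholders) */
		return wm8350_register_led(wm8350, 0, 1, 0, &board_led);
	}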
index 4a3d28c86443f8dfdfe5e4ee49a47ef1d27a0f44..b91a73fd1bcc2d22f64d5387b5c2dfe7473456c8 100644 (file)
@@ -56,19 +56,9 @@ extern unsigned long mmap_min_addr;
 
 extern struct kmem_cache *vm_area_cachep;
 
-/*
- * This struct defines the per-mm list of VMAs for uClinux. If CONFIG_MMU is
- * disabled, then there's a single shared list of VMAs maintained by the
- * system, and mm's subscribe to these individually
- */
-struct vm_list_struct {
-       struct vm_list_struct   *next;
-       struct vm_area_struct   *vma;
-};
-
 #ifndef CONFIG_MMU
-extern struct rb_root nommu_vma_tree;
-extern struct rw_semaphore nommu_vma_sem;
+extern struct rb_root nommu_region_tree;
+extern struct rw_semaphore nommu_region_sem;
 
 extern unsigned int kobjsize(const void *objp);
 #endif
@@ -1061,6 +1051,7 @@ extern void memmap_init_zone(unsigned long, int, unsigned long,
                                unsigned long, enum memmap_context);
 extern void setup_per_zone_pages_min(void);
 extern void mem_init(void);
+extern void __init mmap_init(void);
 extern void show_mem(void);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
@@ -1072,6 +1063,9 @@ extern void setup_per_cpu_pageset(void);
 static inline void setup_per_cpu_pageset(void) {}
 #endif
 
+/* nommu.c */
+extern atomic_t mmap_pages_allocated;
+
 /* prio_tree.c */
 void vma_prio_tree_add(struct vm_area_struct *, struct vm_area_struct *old);
 void vma_prio_tree_insert(struct vm_area_struct *, struct prio_tree_root *);
index 9cfc9b627fdd745d4702a903e29db36f2676aa34..92915e81443ff028afc2ac8a188a7bdbf94721ee 100644 (file)
@@ -96,6 +96,23 @@ struct page {
 #endif /* WANT_PAGE_VIRTUAL */
 };
 
+/*
+ * A region containing a mapping of a non-memory backed file under NOMMU
+ * conditions.  These are held in a global tree and are pinned by the VMAs that
+ * map parts of them.
+ */
+struct vm_region {
+       struct rb_node  vm_rb;          /* link in global region tree */
+       unsigned long   vm_flags;       /* VMA vm_flags */
+       unsigned long   vm_start;       /* start address of region */
+       unsigned long   vm_end;         /* region initialised to here */
+       unsigned long   vm_top;         /* region allocated to here */
+       unsigned long   vm_pgoff;       /* the offset in vm_file corresponding to vm_start */
+       struct file     *vm_file;       /* the backing file or NULL */
+
+       atomic_t        vm_usage;       /* region usage count */
+};
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -152,7 +169,7 @@ struct vm_area_struct {
        unsigned long vm_truncate_count;/* truncate_count or restart_addr */
 
 #ifndef CONFIG_MMU
-       atomic_t vm_usage;              /* refcount (VMAs shared if !MMU) */
+       struct vm_region *vm_region;    /* NOMMU mapping region */
 #endif
 #ifdef CONFIG_NUMA
        struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
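With this change a NOMMU VMA no longer carries its own usage count; it points
at a shared, refcounted vm_region, which may be allocated larger than
requested (hence the separate vm_end, the initialised extent, and vm_top, the
allocated extent). A hedged sketch of walking that link, with an illustrative
print format:

	/* sketch: dump a VMA together with its backing NOMMU region */
	static void example_dump_region(struct vm_area_struct *vma)
	{
		struct vm_region *region = vma->vm_region;

		printk(KERN_DEBUG "vma %lx-%lx region %lx-%lx top %lx users %d\n",
		       vma->vm_start, vma->vm_end,
		       region->vm_start, region->vm_end, region->vm_top,
		       atomic_read(&region->vm_usage));
	}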
index 1ce9fe572e514ef60a985d1ea050c36f48edb21b..1d9518bc4c58f25426898d5e7f12c9a3766663c0 100644 (file)
@@ -164,4 +164,22 @@ void oprofile_put_buff(unsigned long *buf, unsigned int start,
 unsigned long oprofile_get_cpu_buffer_size(void);
 void oprofile_cpu_buffer_inc_smpl_lost(void);
  
+/* cpu buffer functions */
+
+struct op_sample;
+
+struct op_entry {
+       struct ring_buffer_event *event;
+       struct op_sample *sample;
+       unsigned long irq_flags;
+       unsigned long size;
+       unsigned long *data;
+};
+
+void oprofile_write_reserve(struct op_entry *entry,
+                           struct pt_regs * const regs,
+                           unsigned long pc, int code, int size);
+int oprofile_add_data(struct op_entry *entry, unsigned long val);
+int oprofile_write_commit(struct op_entry *entry);
+
 #endif /* OPROFILE_H */
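The three new calls form a reserve/fill/commit sequence against the per-CPU
buffer. A hedged sketch of a caller (the code value, payload and the
entry.event failure check are assumptions about the API, not taken from this
header):

	/* sketch: log a two-word sample through the new cpu-buffer API */
	static void example_log_sample(struct pt_regs *regs, unsigned long pc,
				       unsigned long val0, unsigned long val1)
	{
		struct op_entry entry;

		oprofile_write_reserve(&entry, regs, pc, 0, 2);
		if (!entry.event)	/* assumed failure convention */
			return;
		oprofile_add_data(&entry, val0);
		oprofile_add_data(&entry, val1);
		oprofile_write_commit(&entry);
	}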
diff --git a/include/linux/spi/tdo24m.h b/include/linux/spi/tdo24m.h
new file mode 100644 (file)
index 0000000..7572d4e
--- /dev/null
@@ -0,0 +1,13 @@
+#ifndef __TDO24M_H__
+#define __TDO24M_H__
+
+enum tdo24m_model {
+       TDO24M,
+       TDO35S,
+};
+
+struct tdo24m_platform_data {
+       enum tdo24m_model model;
+};
+
+#endif /* __TDO24M_H__ */
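A board file would attach this platform data to the panel's SPI device. A
hedged sketch (the modalias string, bus number, chip select and clock rate are
assumptions, not values from this patch):

	/* sketch: wire tdo24m platform data into a board's SPI table */
	static struct tdo24m_platform_data board_lcd_pdata = {
		.model = TDO35S,
	};

	static struct spi_board_info board_spi_devices[] __initdata = {
		{
			.modalias	= "tdo24m",
			.max_speed_hz	= 1000000,
			.bus_num	= 1,
			.chip_select	= 0,
			.platform_data	= &board_lcd_pdata,
		},
	};

	/* in the board's init code:
	 * spi_register_board_info(board_spi_devices,
	 *			   ARRAY_SIZE(board_spi_devices));
	 */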
index a7c748fa977a44c2887ad9fcaee3beddf8d4f45e..0f0f0cf3ba9aa97c34d88824fbf4644ca76b0fac 100644 (file)
@@ -9,6 +9,7 @@
 #include <linux/string.h>
 
 #include "do_mounts.h"
+#include "../fs/squashfs/squashfs_fs.h"
 
 int __initdata rd_prompt = 1;/* 1 = prompt for RAM disk, 0 = don't prompt */
 
@@ -41,6 +42,7 @@ static int __init crd_load(int in_fd, int out_fd);
  *     ext2
  *     romfs
  *     cramfs
+ *     squashfs
  *     gzip
  */
 static int __init 
@@ -51,6 +53,7 @@ identify_ramdisk_image(int fd, int start_block)
        struct ext2_super_block *ext2sb;
        struct romfs_super_block *romfsb;
        struct cramfs_super *cramfsb;
+       struct squashfs_super_block *squashfsb;
        int nblocks = -1;
        unsigned char *buf;
 
@@ -62,6 +65,7 @@ identify_ramdisk_image(int fd, int start_block)
        ext2sb = (struct ext2_super_block *) buf;
        romfsb = (struct romfs_super_block *) buf;
        cramfsb = (struct cramfs_super *) buf;
+       squashfsb = (struct squashfs_super_block *) buf;
        memset(buf, 0xe5, size);
 
        /*
@@ -99,6 +103,16 @@ identify_ramdisk_image(int fd, int start_block)
                goto done;
        }
 
+       /* squashfs is at block zero too */
+       if (le32_to_cpu(squashfsb->s_magic) == SQUASHFS_MAGIC) {
+               printk(KERN_NOTICE
+                      "RAMDISK: squashfs filesystem found at block %d\n",
+                      start_block);
+               nblocks = (le64_to_cpu(squashfsb->bytes_used) + BLOCK_SIZE - 1)
+                        >> BLOCK_SIZE_BITS;
+               goto done;
+       }
+
        /*
         * Read block 1 to test for minix and ext2 superblock
         */
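Worked through with the standard 1 KiB ramdisk block size (BLOCK_SIZE_BITS is
10): an image whose superblock reports bytes_used == 1,049,000 yields
(1,049,000 + 1023) >> 10 == 1025 blocks, rounding the partial final block up.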
index 4f5ba75aaa7c67d1d80754e6b0be67f68151dfda..d9c941c0c3cacd26fb051059540673b5e011d967 100644 (file)
@@ -317,6 +317,7 @@ static int __init do_name(void)
                        if (wfd >= 0) {
                                sys_fchown(wfd, uid, gid);
                                sys_fchmod(wfd, mode);
+                               sys_ftruncate(wfd, body_len);
                                vcollected = kstrdup(collected, GFP_KERNEL);
                                state = CopyFile;
                        }
index b125b560240ec1ceb204f5ed0ff6f63ba97081d8..d0ab5527bf45607ae5ba8abc66219c218fab694a 100644 (file)
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -990,6 +990,7 @@ asmlinkage long sys_shmdt(char __user *shmaddr)
         */
        vma = find_vma(mm, addr);
 
+#ifdef CONFIG_MMU
        while (vma) {
                next = vma->vm_next;
 
@@ -1034,6 +1035,17 @@ asmlinkage long sys_shmdt(char __user *shmaddr)
                vma = next;
        }
 
+#else /* CONFIG_MMU */
+       /* under NOMMU conditions, the exact address to be destroyed must be
+        * given */
+       retval = -EINVAL;
+       if (vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
+               do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
+               retval = 0;
+       }
+
+#endif
+
        up_write(&mm->mmap_sem);
        return retval;
 }
index 043f78c133c46ccbfe4a56bcbf92b35e57f92018..3a039189d70748b294082301d9636d467c019c3b 100644 (file)
@@ -372,7 +372,8 @@ int commit_creds(struct cred *new)
            old->fsuid != new->fsuid ||
            old->fsgid != new->fsgid ||
            !cap_issubset(new->cap_permitted, old->cap_permitted)) {
-               set_dumpable(task->mm, suid_dumpable);
+               if (task->mm)
+                       set_dumpable(task->mm, suid_dumpable);
                task->pdeath_signal = 0;
                smp_wmb();
        }
index 4018308048cf8f52db8202cf5969c14a919830d1..1d68f1255dd824cf2fd1c06a1e6464bac718d26f 100644 (file)
@@ -1481,12 +1481,10 @@ void __init proc_caches_init(void)
        fs_cachep = kmem_cache_create("fs_cache",
                        sizeof(struct fs_struct), 0,
                        SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
-       vm_area_cachep = kmem_cache_create("vm_area_struct",
-                       sizeof(struct vm_area_struct), 0,
-                       SLAB_PANIC, NULL);
        mm_cachep = kmem_cache_create("mm_struct",
                        sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
                        SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
+       mmap_init();
 }
 
 /*
index 92f6e5bc3c24f2700b88e31a689f642a83a29a4b..89d74436318ce9bd2db36deadfbd7fbe850944da 100644 (file)
@@ -82,6 +82,9 @@ extern int percpu_pagelist_fraction;
 extern int compat_log;
 extern int latencytop_enabled;
 extern int sysctl_nr_open_min, sysctl_nr_open_max;
+#ifndef CONFIG_MMU
+extern int sysctl_nr_trim_pages;
+#endif
 #ifdef CONFIG_RCU_TORTURE_TEST
 extern int rcutorture_runnable;
 #endif /* #ifdef CONFIG_RCU_TORTURE_TEST */
@@ -1102,6 +1105,17 @@ static struct ctl_table vm_table[] = {
                .mode           = 0644,
                .proc_handler   = &proc_dointvec
        },
+#else
+       {
+               .ctl_name       = CTL_UNNUMBERED,
+               .procname       = "nr_trim_pages",
+               .data           = &sysctl_nr_trim_pages,
+               .maxlen         = sizeof(sysctl_nr_trim_pages),
+               .mode           = 0644,
+               .proc_handler   = &proc_dointvec_minmax,
+               .strategy       = &sysctl_intvec,
+               .extra1         = &zero,
+       },
 #endif
        {
                .ctl_name       = VM_LAPTOP_MODE,
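The new knob surfaces as /proc/sys/vm/nr_trim_pages: 0 disables trimming of
the excess pages from power-of-two NOMMU allocations, 1 (the default set in
mm/nommu.c below) trims any excess, and larger values trim only when at least
that many pages would be recovered. A userspace sketch of raising the
threshold (the value 16 is illustrative):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/nr_trim_pages", "w");

		if (!f)
			return 1;
		fprintf(f, "16\n");	/* only trim excesses of >= 16 pages */
		return fclose(f) ? 1 : 0;
	}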
index a9d9760dc7b62ae7d17894f815067ced748db2cd..8b0daf0662ef45e7ad6ad9df71e52dfd3c3459ba 100644 (file)
@@ -168,7 +168,13 @@ rb_event_length(struct ring_buffer_event *event)
  */
 unsigned ring_buffer_event_length(struct ring_buffer_event *event)
 {
-       return rb_event_length(event);
+       unsigned length = rb_event_length(event);
+       if (event->type != RINGBUF_TYPE_DATA)
+               return length;
+       length -= RB_EVNT_HDR_SIZE;
+       if (length > RB_MAX_SMALL_DATA + sizeof(event->array[0]))
+                length -= sizeof(event->array[0]);
+       return length;
 }
 EXPORT_SYMBOL_GPL(ring_buffer_event_length);
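In effect the exported length now reports only the payload: the event header
word is subtracted, and for data events too large to encode their size in the
header (the comparison against RB_MAX_SMALL_DATA), the extra length word
carried in array[0] is subtracted as well.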
 
index 2e75478e9c696bc6933ca8952c0a013dd30eb78a..d0a32aab03ff66cbda68d3d60b30980d006d86ba 100644 (file)
@@ -512,6 +512,13 @@ config DEBUG_VIRTUAL
 
          If unsure, say N.
 
+config DEBUG_NOMMU_REGIONS
+       bool "Debug the global anon/private NOMMU mapping region tree"
+       depends on DEBUG_KERNEL && !MMU
+       help
+         This option causes the global tree of anonymous and private mapping
+         regions to be regularly checked for invalid topology.
+
 config DEBUG_WRITECOUNT
        bool "Debug filesystem writers count"
        depends on DEBUG_KERNEL
index a910c045cfd4241601306257c8822bda7f1ee05a..749623196cb96facba001c1dc55ae3ab235ff67b 100644 (file)
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2472,3 +2472,13 @@ void mm_drop_all_locks(struct mm_struct *mm)
 
        mutex_unlock(&mm_all_locks_mutex);
 }
+
+/*
+ * initialise the VMA slab
+ */
+void __init mmap_init(void)
+{
+       vm_area_cachep = kmem_cache_create("vm_area_struct",
+                       sizeof(struct vm_area_struct), 0,
+                       SLAB_PANIC, NULL);
+}
index 1c28ea3a4e9c9054975683a19237d2d823f98ead..60ed8375c986f56604e1551836fcf0f48d1273a3 100644 (file)
@@ -6,11 +6,11 @@
  *
  *  See Documentation/nommu-mmap.txt
  *
- *  Copyright (c) 2004-2005 David Howells <dhowells@redhat.com>
+ *  Copyright (c) 2004-2008 David Howells <dhowells@redhat.com>
  *  Copyright (c) 2000-2003 David McCullough <davidm@snapgear.com>
  *  Copyright (c) 2000-2001 D Jeff Dionne <jeff@uClinux.org>
  *  Copyright (c) 2002      Greg Ungerer <gerg@snapgear.com>
- *  Copyright (c) 2007      Paul Mundt <lethal@linux-sh.org>
+ *  Copyright (c) 2007-2008 Paul Mundt <lethal@linux-sh.org>
  */
 
 #include <linux/module.h>
 #include <asm/uaccess.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
+#include "internal.h"
+
+static inline __attribute__((format(printf, 1, 2)))
+void no_printk(const char *fmt, ...)
+{
+}
+
+#if 0
+#define kenter(FMT, ...) \
+       printk(KERN_DEBUG "==> %s("FMT")\n", __func__, ##__VA_ARGS__)
+#define kleave(FMT, ...) \
+       printk(KERN_DEBUG "<== %s()"FMT"\n", __func__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) \
+       printk(KERN_DEBUG "xxx" FMT"yyy\n", ##__VA_ARGS__)
+#else
+#define kenter(FMT, ...) \
+       no_printk(KERN_DEBUG "==> %s("FMT")\n", __func__, ##__VA_ARGS__)
+#define kleave(FMT, ...) \
+       no_printk(KERN_DEBUG "<== %s()"FMT"\n", __func__, ##__VA_ARGS__)
+#define kdebug(FMT, ...) \
+       no_printk(KERN_DEBUG FMT"\n", ##__VA_ARGS__)
+#endif
 
 #include "internal.h"
 
@@ -40,19 +62,22 @@ void *high_memory;
 struct page *mem_map;
 unsigned long max_mapnr;
 unsigned long num_physpages;
-unsigned long askedalloc, realalloc;
 atomic_long_t vm_committed_space = ATOMIC_LONG_INIT(0);
 int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
 int sysctl_overcommit_ratio = 50; /* default is 50% */
 int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
+int sysctl_nr_trim_pages = 1; /* page trimming behaviour */
 int heap_stack_gap = 0;
 
+atomic_t mmap_pages_allocated;
+
 EXPORT_SYMBOL(mem_map);
 EXPORT_SYMBOL(num_physpages);
 
-/* list of shareable VMAs */
-struct rb_root nommu_vma_tree = RB_ROOT;
-DECLARE_RWSEM(nommu_vma_sem);
+/* list of mapped, potentially shareable regions */
+static struct kmem_cache *vm_region_jar;
+struct rb_root nommu_region_tree = RB_ROOT;
+DECLARE_RWSEM(nommu_region_sem);
 
 struct vm_operations_struct generic_file_vm_ops = {
 };
@@ -123,6 +148,20 @@ unsigned int kobjsize(const void *objp)
        if (PageSlab(page))
                return ksize(objp);
 
+       /*
+        * If it's not a compound page, see if we have a matching VMA
+        * region. This test is intentionally done in reverse order,
+        * so if there's no VMA, we still fall through and hand back
+        * PAGE_SIZE for 0-order pages.
+        */
+       if (!PageCompound(page)) {
+               struct vm_area_struct *vma;
+
+               vma = find_vma(current->mm, (unsigned long)objp);
+               if (vma)
+                       return vma->vm_end - vma->vm_start;
+       }
+
        /*
         * The ksize() function is only guaranteed to work for pointers
         * returned by kmalloc(). So handle arbitrary pointers here.
@@ -401,129 +440,178 @@ asmlinkage unsigned long sys_brk(unsigned long brk)
        return mm->brk = brk;
 }
 
-#ifdef DEBUG
-static void show_process_blocks(void)
+/*
+ * initialise the VMA and region record slabs
+ */
+void __init mmap_init(void)
 {
-       struct vm_list_struct *vml;
-
-       printk("Process blocks %d:", current->pid);
-
-       for (vml = &current->mm->context.vmlist; vml; vml = vml->next) {
-               printk(" %p: %p", vml, vml->vma);
-               if (vml->vma)
-                       printk(" (%d @%lx #%d)",
-                              kobjsize((void *) vml->vma->vm_start),
-                              vml->vma->vm_start,
-                              atomic_read(&vml->vma->vm_usage));
-               printk(vml->next ? " ->" : ".\n");
-       }
+       vm_region_jar = kmem_cache_create("vm_region_jar",
+                                         sizeof(struct vm_region), 0,
+                                         SLAB_PANIC, NULL);
+       vm_area_cachep = kmem_cache_create("vm_area_struct",
+                                          sizeof(struct vm_area_struct), 0,
+                                          SLAB_PANIC, NULL);
 }
-#endif /* DEBUG */
 
 /*
- * add a VMA into a process's mm_struct in the appropriate place in the list
- * - should be called with mm->mmap_sem held writelocked
+ * validate the region tree
+ * - the caller must hold the region lock
  */
-static void add_vma_to_mm(struct mm_struct *mm, struct vm_list_struct *vml)
+#ifdef CONFIG_DEBUG_NOMMU_REGIONS
+static noinline void validate_nommu_regions(void)
 {
-       struct vm_list_struct **ppv;
-
-       for (ppv = &current->mm->context.vmlist; *ppv; ppv = &(*ppv)->next)
-               if ((*ppv)->vma->vm_start > vml->vma->vm_start)
-                       break;
-
-       vml->next = *ppv;
-       *ppv = vml;
+       struct vm_region *region, *last;
+       struct rb_node *p, *lastp;
+
+       lastp = rb_first(&nommu_region_tree);
+       if (!lastp)
+               return;
+
+       last = rb_entry(lastp, struct vm_region, vm_rb);
+       if (unlikely(last->vm_end <= last->vm_start))
+               BUG();
+       if (unlikely(last->vm_top < last->vm_end))
+               BUG();
+
+       while ((p = rb_next(lastp))) {
+               region = rb_entry(p, struct vm_region, vm_rb);
+               last = rb_entry(lastp, struct vm_region, vm_rb);
+
+               if (unlikely(region->vm_end <= region->vm_start))
+                       BUG();
+               if (unlikely(region->vm_top < region->vm_end))
+                       BUG();
+               if (unlikely(region->vm_start < last->vm_top))
+                       BUG();
+
+               lastp = p;
+       }
 }
+#else
+#define validate_nommu_regions() do {} while (0)
+#endif
 
 /*
- * look up the first VMA in which addr resides, NULL if none
- * - should be called with mm->mmap_sem at least held readlocked
+ * add a region into the global tree
  */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+static void add_nommu_region(struct vm_region *region)
 {
-       struct vm_list_struct *loop, *vml;
+       struct vm_region *pregion;
+       struct rb_node **p, *parent;
 
-       /* search the vm_start ordered list */
-       vml = NULL;
-       for (loop = mm->context.vmlist; loop; loop = loop->next) {
-               if (loop->vma->vm_start > addr)
-                       break;
-               vml = loop;
+       validate_nommu_regions();
+
+       BUG_ON(region->vm_start & ~PAGE_MASK);
+
+       parent = NULL;
+       p = &nommu_region_tree.rb_node;
+       while (*p) {
+               parent = *p;
+               pregion = rb_entry(parent, struct vm_region, vm_rb);
+               if (region->vm_start < pregion->vm_start)
+                       p = &(*p)->rb_left;
+               else if (region->vm_start > pregion->vm_start)
+                       p = &(*p)->rb_right;
+               else if (pregion == region)
+                       return;
+               else
+                       BUG();
        }
 
-       if (vml && vml->vma->vm_end > addr)
-               return vml->vma;
+       rb_link_node(&region->vm_rb, parent, p);
+       rb_insert_color(&region->vm_rb, &nommu_region_tree);
 
-       return NULL;
+       validate_nommu_regions();
 }
-EXPORT_SYMBOL(find_vma);
 
 /*
- * find a VMA
- * - we don't extend stack VMAs under NOMMU conditions
+ * delete a region from the global tree
  */
-struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr)
+static void delete_nommu_region(struct vm_region *region)
 {
-       return find_vma(mm, addr);
-}
+       BUG_ON(!nommu_region_tree.rb_node);
 
-int expand_stack(struct vm_area_struct *vma, unsigned long address)
-{
-       return -ENOMEM;
+       validate_nommu_regions();
+       rb_erase(&region->vm_rb, &nommu_region_tree);
+       validate_nommu_regions();
 }
 
 /*
- * look up the first VMA exactly that exactly matches addr
- * - should be called with mm->mmap_sem at least held readlocked
+ * free a contiguous series of pages
  */
-static inline struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
-                                                   unsigned long addr)
+static void free_page_series(unsigned long from, unsigned long to)
 {
-       struct vm_list_struct *vml;
-
-       /* search the vm_start ordered list */
-       for (vml = mm->context.vmlist; vml; vml = vml->next) {
-               if (vml->vma->vm_start == addr)
-                       return vml->vma;
-               if (vml->vma->vm_start > addr)
-                       break;
+       for (; from < to; from += PAGE_SIZE) {
+               struct page *page = virt_to_page(from);
+
+               kdebug("- free %lx", from);
+               atomic_dec(&mmap_pages_allocated);
+               if (page_count(page) != 1)
+                       kdebug("free page %p [%d]", page, page_count(page));
+               put_page(page);
        }
-
-       return NULL;
 }
 
 /*
- * find a VMA in the global tree
+ * release a reference to a region
+ * - the caller must hold the region semaphore, which this releases
+ * - the region may not have been added to the tree yet, in which case vm_top
+ *   will equal vm_start
  */
-static inline struct vm_area_struct *find_nommu_vma(unsigned long start)
+static void __put_nommu_region(struct vm_region *region)
+       __releases(nommu_region_sem)
 {
-       struct vm_area_struct *vma;
-       struct rb_node *n = nommu_vma_tree.rb_node;
+       kenter("%p{%d}", region, atomic_read(&region->vm_usage));
 
-       while (n) {
-               vma = rb_entry(n, struct vm_area_struct, vm_rb);
+       BUG_ON(!nommu_region_tree.rb_node);
 
-               if (start < vma->vm_start)
-                       n = n->rb_left;
-               else if (start > vma->vm_start)
-                       n = n->rb_right;
-               else
-                       return vma;
+       if (atomic_dec_and_test(&region->vm_usage)) {
+               if (region->vm_top > region->vm_start)
+                       delete_nommu_region(region);
+               up_write(&nommu_region_sem);
+
+               if (region->vm_file)
+                       fput(region->vm_file);
+
+               /* IO memory and memory shared directly out of the pagecache
+                * from ramfs/tmpfs mustn't be released here */
+               if (region->vm_flags & VM_MAPPED_COPY) {
+                       kdebug("free series");
+                       free_page_series(region->vm_start, region->vm_top);
+               }
+               kmem_cache_free(vm_region_jar, region);
+       } else {
+               up_write(&nommu_region_sem);
        }
+}
 
-       return NULL;
+/*
+ * release a reference to a region
+ */
+static void put_nommu_region(struct vm_region *region)
+{
+       down_write(&nommu_region_sem);
+       __put_nommu_region(region);
 }
 
 /*
- * add a VMA in the global tree
+ * add a VMA into a process's mm_struct in the appropriate place in the list
+ * and tree, and also add it to the address space's page tree if it is not an
+ * anonymous mapping
+ * - should be called with mm->mmap_sem held writelocked
  */
-static void add_nommu_vma(struct vm_area_struct *vma)
+static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 {
-       struct vm_area_struct *pvma;
+       struct vm_area_struct *pvma, **pp;
        struct address_space *mapping;
-       struct rb_node **p = &nommu_vma_tree.rb_node;
-       struct rb_node *parent = NULL;
+       struct rb_node **p, *parent;
+
+       kenter(",%p", vma);
+
+       BUG_ON(!vma->vm_region);
+
+       mm->map_count++;
+       vma->vm_mm = mm;
 
        /* add the VMA to the mapping */
        if (vma->vm_file) {
@@ -534,42 +622,62 @@ static void add_nommu_vma(struct vm_area_struct *vma)
                flush_dcache_mmap_unlock(mapping);
        }
 
-       /* add the VMA to the master list */
+       /* add the VMA to the tree */
+       parent = NULL;
+       p = &mm->mm_rb.rb_node;
        while (*p) {
                parent = *p;
                pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
 
-               if (vma->vm_start < pvma->vm_start) {
+               /* sort by: start addr, end addr, VMA struct addr in that order
+                * (the latter is necessary as we may get identical VMAs) */
+               if (vma->vm_start < pvma->vm_start)
                        p = &(*p)->rb_left;
-               }
-               else if (vma->vm_start > pvma->vm_start) {
+               else if (vma->vm_start > pvma->vm_start)
                        p = &(*p)->rb_right;
-               }
-               else {
-                       /* mappings are at the same address - this can only
-                        * happen for shared-mem chardevs and shared file
-                        * mappings backed by ramfs/tmpfs */
-                       BUG_ON(!(pvma->vm_flags & VM_SHARED));
-
-                       if (vma < pvma)
-                               p = &(*p)->rb_left;
-                       else if (vma > pvma)
-                               p = &(*p)->rb_right;
-                       else
-                               BUG();
-               }
+               else if (vma->vm_end < pvma->vm_end)
+                       p = &(*p)->rb_left;
+               else if (vma->vm_end > pvma->vm_end)
+                       p = &(*p)->rb_right;
+               else if (vma < pvma)
+                       p = &(*p)->rb_left;
+               else if (vma > pvma)
+                       p = &(*p)->rb_right;
+               else
+                       BUG();
        }
 
        rb_link_node(&vma->vm_rb, parent, p);
-       rb_insert_color(&vma->vm_rb, &nommu_vma_tree);
+       rb_insert_color(&vma->vm_rb, &mm->mm_rb);
+
+       /* add VMA to the VMA list also */
+       for (pp = &mm->mmap; (pvma = *pp); pp = &(*pp)->vm_next) {
+               if (pvma->vm_start > vma->vm_start)
+                       break;
+               if (pvma->vm_start < vma->vm_start)
+                       continue;
+               if (pvma->vm_end < vma->vm_end)
+                       break;
+       }
+
+       vma->vm_next = *pp;
+       *pp = vma;
 }
 
 /*
- * delete a VMA from the global list
+ * delete a VMA from its owning mm_struct and address space
  */
-static void delete_nommu_vma(struct vm_area_struct *vma)
+static void delete_vma_from_mm(struct vm_area_struct *vma)
 {
+       struct vm_area_struct **pp;
        struct address_space *mapping;
+       struct mm_struct *mm = vma->vm_mm;
+
+       kenter("%p", vma);
+
+       mm->map_count--;
+       if (mm->mmap_cache == vma)
+               mm->mmap_cache = NULL;
 
        /* remove the VMA from the mapping */
        if (vma->vm_file) {
@@ -580,8 +688,115 @@ static void delete_nommu_vma(struct vm_area_struct *vma)
                flush_dcache_mmap_unlock(mapping);
        }
 
-       /* remove from the master list */
-       rb_erase(&vma->vm_rb, &nommu_vma_tree);
+       /* remove from the MM's tree and list */
+       rb_erase(&vma->vm_rb, &mm->mm_rb);
+       for (pp = &mm->mmap; *pp; pp = &(*pp)->vm_next) {
+               if (*pp == vma) {
+                       *pp = vma->vm_next;
+                       break;
+               }
+       }
+
+       vma->vm_mm = NULL;
+}
+
+/*
+ * destroy a VMA record
+ */
+static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+       kenter("%p", vma);
+       if (vma->vm_ops && vma->vm_ops->close)
+               vma->vm_ops->close(vma);
+       if (vma->vm_file) {
+               fput(vma->vm_file);
+               if (vma->vm_flags & VM_EXECUTABLE)
+                       removed_exe_file_vma(mm);
+       }
+       put_nommu_region(vma->vm_region);
+       kmem_cache_free(vm_area_cachep, vma);
+}
+
+/*
+ * look up the first VMA in which addr resides, NULL if none
+ * - should be called with mm->mmap_sem at least held readlocked
+ */
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+       struct vm_area_struct *vma;
+       struct rb_node *n = mm->mm_rb.rb_node;
+
+       /* check the cache first */
+       vma = mm->mmap_cache;
+       if (vma && vma->vm_start <= addr && vma->vm_end > addr)
+               return vma;
+
+       /* trawl the tree (there may be multiple mappings in which addr
+        * resides) */
+       for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) {
+               vma = rb_entry(n, struct vm_area_struct, vm_rb);
+               if (vma->vm_start > addr)
+                       return NULL;
+               if (vma->vm_end > addr) {
+                       mm->mmap_cache = vma;
+                       return vma;
+               }
+       }
+
+       return NULL;
+}
+EXPORT_SYMBOL(find_vma);
+
+/*
+ * find a VMA
+ * - we don't extend stack VMAs under NOMMU conditions
+ */
+struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr)
+{
+       return find_vma(mm, addr);
+}
+
+/*
+ * expand a stack to a given address
+ * - not supported under NOMMU conditions
+ */
+int expand_stack(struct vm_area_struct *vma, unsigned long address)
+{
+       return -ENOMEM;
+}
+
+/*
+ * look up the first VMA that exactly matches addr
+ * - should be called with mm->mmap_sem at least held readlocked
+ */
+static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
+                                            unsigned long addr,
+                                            unsigned long len)
+{
+       struct vm_area_struct *vma;
+       struct rb_node *n = mm->mm_rb.rb_node;
+       unsigned long end = addr + len;
+
+       /* check the cache first */
+       vma = mm->mmap_cache;
+       if (vma && vma->vm_start == addr && vma->vm_end == end)
+               return vma;
+
+       /* trawl the tree (there may be multiple mappings in which addr
+        * resides) */
+       for (n = rb_first(&mm->mm_rb); n; n = rb_next(n)) {
+               vma = rb_entry(n, struct vm_area_struct, vm_rb);
+               if (vma->vm_start < addr)
+                       continue;
+               if (vma->vm_start > addr)
+                       return NULL;
+               if (vma->vm_end == end) {
+                       mm->mmap_cache = vma;
+                       return vma;
+               }
+       }
+
+       return NULL;
 }
 
 /*
@@ -596,7 +811,7 @@ static int validate_mmap_request(struct file *file,
                                 unsigned long pgoff,
                                 unsigned long *_capabilities)
 {
-       unsigned long capabilities;
+       unsigned long capabilities, rlen;
        unsigned long reqprot = prot;
        int ret;
 
@@ -616,12 +831,12 @@ static int validate_mmap_request(struct file *file,
                return -EINVAL;
 
        /* Careful about overflows.. */
-       len = PAGE_ALIGN(len);
-       if (!len || len > TASK_SIZE)
+       rlen = PAGE_ALIGN(len);
+       if (!rlen || rlen > TASK_SIZE)
                return -ENOMEM;
 
        /* offset overflow? */
-       if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
+       if ((pgoff + (rlen >> PAGE_SHIFT)) < pgoff)
                return -EOVERFLOW;
 
        if (file) {
@@ -795,13 +1010,18 @@ static unsigned long determine_vm_flags(struct file *file,
 }
 
 /*
- * set up a shared mapping on a file
+ * set up a shared mapping on a file (the driver or filesystem provides and
+ * pins the storage)
  */
-static int do_mmap_shared_file(struct vm_area_struct *vma, unsigned long len)
+static int do_mmap_shared_file(struct vm_area_struct *vma)
 {
        int ret;
 
        ret = vma->vm_file->f_op->mmap(vma->vm_file, vma);
+       if (ret == 0) {
+               vma->vm_region->vm_top = vma->vm_region->vm_end;
+               return ret;
+       }
        if (ret != -ENOSYS)
                return ret;
 
@@ -815,10 +1035,14 @@ static int do_mmap_shared_file(struct vm_area_struct *vma, unsigned long len)
 /*
  * set up a private mapping or an anonymous shared mapping
  */
-static int do_mmap_private(struct vm_area_struct *vma, unsigned long len)
+static int do_mmap_private(struct vm_area_struct *vma,
+                          struct vm_region *region,
+                          unsigned long len)
 {
+       struct page *pages;
+       unsigned long total, point, n, rlen;
        void *base;
-       int ret;
+       int ret, order;
 
        /* invoke the file's mapping function so that it can keep track of
         * shared mappings on devices or memory
@@ -826,34 +1050,63 @@ static int do_mmap_private(struct vm_area_struct *vma, unsigned long len)
         */
        if (vma->vm_file) {
                ret = vma->vm_file->f_op->mmap(vma->vm_file, vma);
-               if (ret != -ENOSYS) {
+               if (ret == 0) {
                        /* shouldn't return success if we're not sharing */
-                       BUG_ON(ret == 0 && !(vma->vm_flags & VM_MAYSHARE));
-                       return ret; /* success or a real error */
+                       BUG_ON(!(vma->vm_flags & VM_MAYSHARE));
+                       vma->vm_region->vm_top = vma->vm_region->vm_end;
+                       return ret;
                }
+               if (ret != -ENOSYS)
+                       return ret;
 
                /* getting an ENOSYS error indicates that direct mmap isn't
                 * possible (as opposed to tried but failed) so we'll try to
                 * make a private copy of the data and map that instead */
        }
 
+       rlen = PAGE_ALIGN(len);
+
        /* allocate some memory to hold the mapping
         * - note that this may not return a page-aligned address if the object
         *   we're allocating is smaller than a page
         */
-       base = kmalloc(len, GFP_KERNEL|__GFP_COMP);
-       if (!base)
+       order = get_order(rlen);
+       kdebug("alloc order %d for %lx", order, len);
+
+       pages = alloc_pages(GFP_KERNEL, order);
+       if (!pages)
                goto enomem;
 
-       vma->vm_start = (unsigned long) base;
-       vma->vm_end = vma->vm_start + len;
-       vma->vm_flags |= VM_MAPPED_COPY;
+       total = 1 << order;
+       atomic_add(total, &mmap_pages_allocated);
+
+       point = rlen >> PAGE_SHIFT;
+
+       /* we allocated a power-of-2 sized page set, so we may want to trim off
+        * the excess */
+       if (sysctl_nr_trim_pages && total - point >= sysctl_nr_trim_pages) {
+               while (total > point) {
+                       order = ilog2(total - point);
+                       n = 1 << order;
+                       kdebug("shave %lu/%lu @%lu", n, total - point, total);
+                       atomic_sub(n, &mmap_pages_allocated);
+                       total -= n;
+                       set_page_refcounted(pages + total);
+                       __free_pages(pages + total, order);
+               }
+       }
+
+       for (point = 1; point < total; point++)
+               set_page_refcounted(&pages[point]);
 
-#ifdef WARN_ON_SLACK
-       if (len + WARN_ON_SLACK <= kobjsize(result))
-               printk("Allocation of %lu bytes from process %d has %lu bytes of slack\n",
-                      len, current->pid, kobjsize(result) - len);
-#endif
+       base = page_address(pages);
+       region->vm_flags = vma->vm_flags |= VM_MAPPED_COPY;
+       region->vm_start = (unsigned long) base;
+       region->vm_end   = region->vm_start + rlen;
+       region->vm_top   = region->vm_start + (total << PAGE_SHIFT);
+
+       vma->vm_start = region->vm_start;
+       vma->vm_end   = region->vm_start + len;
 
        if (vma->vm_file) {
                /* read the contents of a file into the copy */
@@ -865,26 +1118,28 @@ static int do_mmap_private(struct vm_area_struct *vma, unsigned long len)
 
                old_fs = get_fs();
                set_fs(KERNEL_DS);
-               ret = vma->vm_file->f_op->read(vma->vm_file, base, len, &fpos);
+               ret = vma->vm_file->f_op->read(vma->vm_file, base, rlen, &fpos);
                set_fs(old_fs);
 
                if (ret < 0)
                        goto error_free;
 
                /* clear the last little bit */
-               if (ret < len)
-                       memset(base + ret, 0, len - ret);
+               if (ret < rlen)
+                       memset(base + ret, 0, rlen - ret);
 
        } else {
                /* if it's an anonymous mapping, then just clear it */
-               memset(base, 0, len);
+               memset(base, 0, rlen);
        }
 
        return 0;
 
 error_free:
-       kfree(base);
-       vma->vm_start = 0;
+       free_page_series(region->vm_start, region->vm_end);
+       region->vm_start = vma->vm_start = 0;
+       region->vm_end   = vma->vm_end = 0;
+       region->vm_top   = 0;
        return ret;
 
 enomem:
@@ -904,13 +1159,14 @@ unsigned long do_mmap_pgoff(struct file *file,
                            unsigned long flags,
                            unsigned long pgoff)
 {
-       struct vm_list_struct *vml = NULL;
-       struct vm_area_struct *vma = NULL;
+       struct vm_area_struct *vma;
+       struct vm_region *region;
        struct rb_node *rb;
-       unsigned long capabilities, vm_flags;
-       void *result;
+       unsigned long capabilities, vm_flags, result;
        int ret;
 
+       kenter(",%lx,%lx,%lx,%lx,%lx", addr, len, prot, flags, pgoff);
+
        if (!(flags & MAP_FIXED))
                addr = round_hint_to_min(addr);
 
@@ -918,73 +1174,120 @@ unsigned long do_mmap_pgoff(struct file *file,
         * mapping */
        ret = validate_mmap_request(file, addr, len, prot, flags, pgoff,
                                    &capabilities);
-       if (ret < 0)
+       if (ret < 0) {
+               kleave(" = %d [val]", ret);
                return ret;
+       }
 
        /* we've determined that we can make the mapping, now translate what we
         * now know into VMA flags */
        vm_flags = determine_vm_flags(file, prot, flags, capabilities);
 
-       /* we're going to need to record the mapping if it works */
-       vml = kzalloc(sizeof(struct vm_list_struct), GFP_KERNEL);
-       if (!vml)
-               goto error_getting_vml;
+       /* we're going to need to record the mapping */
+       region = kmem_cache_zalloc(vm_region_jar, GFP_KERNEL);
+       if (!region)
+               goto error_getting_region;
+
+       vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+       if (!vma)
+               goto error_getting_vma;
+
+       atomic_set(&region->vm_usage, 1);
+       region->vm_flags = vm_flags;
+       region->vm_pgoff = pgoff;
+
+       INIT_LIST_HEAD(&vma->anon_vma_node);
+       vma->vm_flags = vm_flags;
+       vma->vm_pgoff = pgoff;
 
-       down_write(&nommu_vma_sem);
+       if (file) {
+               region->vm_file = file;
+               get_file(file);
+               vma->vm_file = file;
+               get_file(file);
+               if (vm_flags & VM_EXECUTABLE) {
+                       added_exe_file_vma(current->mm);
+                       vma->vm_mm = current->mm;
+               }
+       }
 
-       /* if we want to share, we need to check for VMAs created by other
+       down_write(&nommu_region_sem);
+
+       /* if we want to share, we need to check for regions created by other
         * mmap() calls that overlap with our proposed mapping
-        * - we can only share with an exact match on most regular files
+        * - we can only share with a superset match on most regular files
         * - shared mappings on character devices and memory backed files are
         *   permitted to overlap inexactly as far as we are concerned for in
         *   these cases, sharing is handled in the driver or filesystem rather
         *   than here
         */
        if (vm_flags & VM_MAYSHARE) {
-               unsigned long pglen = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-               unsigned long vmpglen;
+               struct vm_region *pregion;
+               unsigned long pglen, rpglen, pgend, rpgend, start;
 
-               /* suppress VMA sharing for shared regions */
-               if (vm_flags & VM_SHARED &&
-                   capabilities & BDI_CAP_MAP_DIRECT)
-                       goto dont_share_VMAs;
+               pglen = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+               pgend = pgoff + pglen;
 
-               for (rb = rb_first(&nommu_vma_tree); rb; rb = rb_next(rb)) {
-                       vma = rb_entry(rb, struct vm_area_struct, vm_rb);
+               for (rb = rb_first(&nommu_region_tree); rb; rb = rb_next(rb)) {
+                       pregion = rb_entry(rb, struct vm_region, vm_rb);
 
-                       if (!(vma->vm_flags & VM_MAYSHARE))
+                       if (!(pregion->vm_flags & VM_MAYSHARE))
                                continue;
 
                        /* search for overlapping mappings on the same file */
-                       if (vma->vm_file->f_path.dentry->d_inode != file->f_path.dentry->d_inode)
+                       if (pregion->vm_file->f_path.dentry->d_inode !=
+                           file->f_path.dentry->d_inode)
                                continue;
 
-                       if (vma->vm_pgoff >= pgoff + pglen)
+                       if (pregion->vm_pgoff >= pgend)
                                continue;
 
-                       vmpglen = vma->vm_end - vma->vm_start + PAGE_SIZE - 1;
-                       vmpglen >>= PAGE_SHIFT;
-                       if (pgoff >= vma->vm_pgoff + vmpglen)
+                       rpglen = pregion->vm_end - pregion->vm_start;
+                       rpglen = (rpglen + PAGE_SIZE - 1) >> PAGE_SHIFT;
+                       rpgend = pregion->vm_pgoff + rpglen;
+                       if (pgoff >= rpgend)
                                continue;
 
-                       /* handle inexactly overlapping matches between mappings */
-                       if (vma->vm_pgoff != pgoff || vmpglen != pglen) {
+                       /* handle inexactly overlapping matches between
+                        * mappings */
+                       if ((pregion->vm_pgoff != pgoff || rpglen != pglen) &&
+                           !(pgoff >= pregion->vm_pgoff && pgend <= rpgend)) {
+                               /* new mapping is not a subset of the region */
                                if (!(capabilities & BDI_CAP_MAP_DIRECT))
                                        goto sharing_violation;
                                continue;
                        }
 
-                       /* we've found a VMA we can share */
-                       atomic_inc(&vma->vm_usage);
-
-                       vml->vma = vma;
-                       result = (void *) vma->vm_start;
-                       goto shared;
+                       /* we've found a region we can share */
+                       atomic_inc(&pregion->vm_usage);
+                       vma->vm_region = pregion;
+                       start = pregion->vm_start;
+                       start += (pgoff - pregion->vm_pgoff) << PAGE_SHIFT;
+                       vma->vm_start = start;
+                       vma->vm_end = start + len;
+
+                       if (pregion->vm_flags & VM_MAPPED_COPY) {
+                               kdebug("share copy");
+                               vma->vm_flags |= VM_MAPPED_COPY;
+                       } else {
+                               kdebug("share mmap");
+                               ret = do_mmap_shared_file(vma);
+                               if (ret < 0) {
+                                       vma->vm_region = NULL;
+                                       vma->vm_start = 0;
+                                       vma->vm_end = 0;
+                                       atomic_dec(&pregion->vm_usage);
+                                       pregion = NULL;
+                                       goto error_just_free;
+                               }
+                       }
+                       fput(region->vm_file);
+                       kmem_cache_free(vm_region_jar, region);
+                       region = pregion;
+                       result = start;
+                       goto share;
                }
 
-       dont_share_VMAs:
-               vma = NULL;
-
                /* obtain the address at which to make a shared mapping
                 * - this is the hook for quasi-memory character devices to
                 *   tell us the location of a shared mapping
@@ -995,113 +1298,93 @@ unsigned long do_mmap_pgoff(struct file *file,
                        if (IS_ERR((void *) addr)) {
                                ret = addr;
                                if (ret != (unsigned long) -ENOSYS)
-                                       goto error;
+                                       goto error_just_free;
 
                                /* the driver refused to tell us where to site
                                 * the mapping so we'll have to attempt to copy
                                 * it */
                                ret = (unsigned long) -ENODEV;
                                if (!(capabilities & BDI_CAP_MAP_COPY))
-                                       goto error;
+                                       goto error_just_free;
 
                                capabilities &= ~BDI_CAP_MAP_DIRECT;
+                       } else {
+                               vma->vm_start = region->vm_start = addr;
+                               vma->vm_end = region->vm_end = addr + len;
                        }
                }
        }
 
-       /* we're going to need a VMA struct as well */
-       vma = kzalloc(sizeof(struct vm_area_struct), GFP_KERNEL);
-       if (!vma)
-               goto error_getting_vma;
-
-       INIT_LIST_HEAD(&vma->anon_vma_node);
-       atomic_set(&vma->vm_usage, 1);
-       if (file) {
-               get_file(file);
-               if (vm_flags & VM_EXECUTABLE) {
-                       added_exe_file_vma(current->mm);
-                       vma->vm_mm = current->mm;
-               }
-       }
-       vma->vm_file    = file;
-       vma->vm_flags   = vm_flags;
-       vma->vm_start   = addr;
-       vma->vm_end     = addr + len;
-       vma->vm_pgoff   = pgoff;
-
-       vml->vma = vma;
+       vma->vm_region = region;
 
        /* set up the mapping */
        if (file && vma->vm_flags & VM_SHARED)
-               ret = do_mmap_shared_file(vma, len);
+               ret = do_mmap_shared_file(vma);
        else
-               ret = do_mmap_private(vma, len);
+               ret = do_mmap_private(vma, region, len);
        if (ret < 0)
-               goto error;
-
-       /* okay... we have a mapping; now we have to register it */
-       result = (void *) vma->vm_start;
+               goto error_put_region;
 
-       if (vma->vm_flags & VM_MAPPED_COPY) {
-               realalloc += kobjsize(result);
-               askedalloc += len;
-       }
+       add_nommu_region(region);
 
-       realalloc += kobjsize(vma);
-       askedalloc += sizeof(*vma);
+       /* okay... we have a mapping; now we have to register it */
+       result = vma->vm_start;
 
        current->mm->total_vm += len >> PAGE_SHIFT;
 
-       add_nommu_vma(vma);
-
- shared:
-       realalloc += kobjsize(vml);
-       askedalloc += sizeof(*vml);
-
-       add_vma_to_mm(current->mm, vml);
+share:
+       add_vma_to_mm(current->mm, vma);
 
-       up_write(&nommu_vma_sem);
+       up_write(&nommu_region_sem);
 
        if (prot & PROT_EXEC)
-               flush_icache_range((unsigned long) result,
-                                  (unsigned long) result + len);
+               flush_icache_range(result, result + len);
 
-#ifdef DEBUG
-       printk("do_mmap:\n");
-       show_process_blocks();
-#endif
-
-       return (unsigned long) result;
+       kleave(" = %lx", result);
+       return result;
 
- error:
-       up_write(&nommu_vma_sem);
-       kfree(vml);
+error_put_region:
+       __put_nommu_region(region);
        if (vma) {
                if (vma->vm_file) {
                        fput(vma->vm_file);
                        if (vma->vm_flags & VM_EXECUTABLE)
                                removed_exe_file_vma(vma->vm_mm);
                }
-               kfree(vma);
+               kmem_cache_free(vm_area_cachep, vma);
        }
+       kleave(" = %d [pr]", ret);
        return ret;
 
- sharing_violation:
-       up_write(&nommu_vma_sem);
-       printk("Attempt to share mismatched mappings\n");
-       kfree(vml);
-       return -EINVAL;
+error_just_free:
+       up_write(&nommu_region_sem);
+error:
+       /* vm_file may be NULL for anonymous mappings; fput(NULL) would oops */
+       if (region->vm_file)
+               fput(region->vm_file);
+       kmem_cache_free(vm_region_jar, region);
+       if (vma->vm_file)
+               fput(vma->vm_file);
+       if (vma->vm_flags & VM_EXECUTABLE)
+               removed_exe_file_vma(vma->vm_mm);
+       kmem_cache_free(vm_area_cachep, vma);
+       kleave(" = %d", ret);
+       return ret;
 
- error_getting_vma:
-       up_write(&nommu_vma_sem);
-       kfree(vml);
-       printk("Allocation of vma for %lu byte allocation from process %d failed\n",
+sharing_violation:
+       up_write(&nommu_region_sem);
+       printk(KERN_WARNING "Attempt to share mismatched mappings\n");
+       ret = -EINVAL;
+       goto error;
+
+error_getting_vma:
+       kmem_cache_free(vm_region_jar, region);
+       printk(KERN_WARNING "Allocation of vma for %lu byte allocation"
+              " from process %d failed\n",
               len, current->pid);
        show_free_areas();
        return -ENOMEM;
 
- error_getting_vml:
-       printk("Allocation of vml for %lu byte allocation from process %d failed\n",
+error_getting_region:
+       printk(KERN_WARNING "Allocation of vm region for %lu byte allocation"
+              " from process %d failed\n",
               len, current->pid);
        show_free_areas();
        return -ENOMEM;
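
The reworked failure handling above is a textbook kernel goto ladder: the mapping-address lookup encodes errors in the returned address (tested with IS_ERR()), and each exit label undoes strictly less state than the one before it, with error_just_free dropping the region semaphore before falling through to the common frees. Here is a minimal userspace sketch of the same unwinding idiom, with hypothetical names standing in for the region and VMA allocations:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical two-stage setup mirroring the allocation order in
     * do_mmap_pgoff(): a failure jumps to the first label whose cleanups
     * are still needed, and the labels fall through so each one runs
     * every cleanup below it. */
    static int setup(void)
    {
            void *region, *vma;

            region = malloc(64);
            if (!region)
                    goto err_nothing;       /* nothing to undo yet */

            vma = malloc(64);
            if (!vma)
                    goto err_free_region;   /* only the region exists */

            /* further setup failures would free vma themselves, then
             * unwind through err_free_region as well */
            free(vma);
            free(region);
            return 0;

    err_free_region:
            free(region);
    err_nothing:
            fprintf(stderr, "setup failed\n");
            return -1;
    }

    int main(void)
    {
            return setup() ? EXIT_FAILURE : EXIT_SUCCESS;
    }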
@@ -1109,85 +1392,183 @@ unsigned long do_mmap_pgoff(struct file *file,
 EXPORT_SYMBOL(do_mmap_pgoff);
 
 /*
- * handle mapping disposal for uClinux
+ * split a vma into two pieces at address 'addr'; a new vma is allocated for
+ * either the first part or the tail.
  */
-static void put_vma(struct mm_struct *mm, struct vm_area_struct *vma)
+int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
+             unsigned long addr, int new_below)
 {
-       if (vma) {
-               down_write(&nommu_vma_sem);
+       struct vm_area_struct *new;
+       struct vm_region *region;
+       unsigned long npages;
 
-               if (atomic_dec_and_test(&vma->vm_usage)) {
-                       delete_nommu_vma(vma);
+       kenter("");
 
-                       if (vma->vm_ops && vma->vm_ops->close)
-                               vma->vm_ops->close(vma);
+       /* we're only permitted to split anonymous regions that have a single
+        * owner */
+       if (vma->vm_file ||
+           atomic_read(&vma->vm_region->vm_usage) != 1)
+               return -ENOMEM;
 
-                       /* IO memory and memory shared directly out of the pagecache from
-                        * ramfs/tmpfs mustn't be released here */
-                       if (vma->vm_flags & VM_MAPPED_COPY) {
-                               realalloc -= kobjsize((void *) vma->vm_start);
-                               askedalloc -= vma->vm_end - vma->vm_start;
-                               kfree((void *) vma->vm_start);
-                       }
+       if (mm->map_count >= sysctl_max_map_count)
+               return -ENOMEM;
 
-                       realalloc -= kobjsize(vma);
-                       askedalloc -= sizeof(*vma);
+       region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
+       if (!region)
+               return -ENOMEM;
 
-                       if (vma->vm_file) {
-                               fput(vma->vm_file);
-                               if (vma->vm_flags & VM_EXECUTABLE)
-                                       removed_exe_file_vma(mm);
-                       }
-                       kfree(vma);
-               }
+       new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+       if (!new) {
+               kmem_cache_free(vm_region_jar, region);
+               return -ENOMEM;
+       }
+
+       /* most fields are the same; copy them all, then fix up what differs */
+       *new = *vma;
+       *region = *vma->vm_region;
+       new->vm_region = region;
+
+       npages = (addr - vma->vm_start) >> PAGE_SHIFT;
 
-               up_write(&nommu_vma_sem);
+       if (new_below) {
+               region->vm_top = region->vm_end = new->vm_end = addr;
+       } else {
+               region->vm_start = new->vm_start = addr;
+               region->vm_pgoff = new->vm_pgoff += npages;
+       }
+
+       if (new->vm_ops && new->vm_ops->open)
+               new->vm_ops->open(new);
+
+       delete_vma_from_mm(vma);
+       down_write(&nommu_region_sem);
+       delete_nommu_region(vma->vm_region);
+       if (new_below) {
+               vma->vm_region->vm_start = vma->vm_start = addr;
+               vma->vm_region->vm_pgoff = vma->vm_pgoff += npages;
+       } else {
+               vma->vm_region->vm_end = vma->vm_end = addr;
+               vma->vm_region->vm_top = addr;
        }
+       add_nommu_region(vma->vm_region);
+       add_nommu_region(new->vm_region);
+       up_write(&nommu_region_sem);
+       add_vma_to_mm(mm, vma);
+       add_vma_to_mm(mm, new);
+       return 0;
 }
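
To make the split arithmetic concrete: with 4KB pages (PAGE_SHIFT == 12 is an assumption here; it is arch-dependent), splitting a VMA covering 0x8000-0x10000 at 0xc000 with new_below set gives the new VMA the bottom piece and advances the survivor's vm_pgoff by the four pages it gave up. A hedged userspace model using a simplified stand-in struct:

    #include <assert.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12   /* assumed; arch-dependent in the kernel */

    /* Simplified stand-in for vm_area_struct: only the split-relevant
     * fields. */
    struct toy_vma {
            unsigned long vm_start, vm_end, vm_pgoff;
    };

    /* Model of the new_below case of split_vma(): 'new' takes the bottom
     * part, the original is trimmed to start at 'addr'. */
    static void toy_split_below(struct toy_vma *vma, struct toy_vma *new,
                                unsigned long addr)
    {
            unsigned long npages = (addr - vma->vm_start) >> PAGE_SHIFT;

            *new = *vma;
            new->vm_end = addr;             /* bottom piece: [vm_start, addr) */

            vma->vm_start = addr;           /* original keeps [addr, vm_end) */
            vma->vm_pgoff += npages;        /* and skips the pages given away */
    }

    int main(void)
    {
            struct toy_vma vma = { 0x8000, 0x10000, 0 }, new;

            toy_split_below(&vma, &new, 0xc000);
            assert(new.vm_start == 0x8000 && new.vm_end == 0xc000);
            assert(vma.vm_start == 0xc000 && vma.vm_pgoff == 4);
            printf("split ok: new=[%#lx,%#lx) vma=[%#lx,%#lx) pgoff=%lu\n",
                   new.vm_start, new.vm_end, vma.vm_start, vma.vm_end,
                   vma.vm_pgoff);
            return 0;
    }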
 
 /*
- * release a mapping
- * - under NOMMU conditions the parameters must match exactly to the mapping to
- *   be removed
+ * shrink a VMA by removing the specified chunk from either the beginning or
+ * the end
  */
-int do_munmap(struct mm_struct *mm, unsigned long addr, size_t len)
+static int shrink_vma(struct mm_struct *mm,
+                     struct vm_area_struct *vma,
+                     unsigned long from, unsigned long to)
 {
-       struct vm_list_struct *vml, **parent;
-       unsigned long end = addr + len;
+       struct vm_region *region;
 
-#ifdef DEBUG
-       printk("do_munmap:\n");
-#endif
+       kenter("");
 
-       for (parent = &mm->context.vmlist; *parent; parent = &(*parent)->next) {
-               if ((*parent)->vma->vm_start > addr)
-                       break;
-               if ((*parent)->vma->vm_start == addr &&
-                   ((len == 0) || ((*parent)->vma->vm_end == end)))
-                       goto found;
+       /* adjust the VMA's pointers, which may reposition it in the MM's tree
+        * and list */
+       delete_vma_from_mm(vma);
+       if (from > vma->vm_start)
+               vma->vm_end = from;
+       else
+               vma->vm_start = to;
+       add_vma_to_mm(mm, vma);
+
+       /* cut the backing region down to size */
+       region = vma->vm_region;
+       BUG_ON(atomic_read(&region->vm_usage) != 1);
+
+       down_write(&nommu_region_sem);
+       delete_nommu_region(region);
+       if (from > region->vm_start) {
+               to = region->vm_top;
+               region->vm_top = region->vm_end = from;
+       } else {
+               region->vm_start = to;
        }
+       add_nommu_region(region);
+       up_write(&nommu_region_sem);
 
-       printk("munmap of non-mmaped memory by process %d (%s): %p\n",
-              current->pid, current->comm, (void *) addr);
-       return -EINVAL;
+       free_page_series(from, to);
+       return 0;
+}
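
Note that shrink_vma() never punches a hole: do_munmap() splits first, so the chunk [from, to) is guaranteed to touch exactly one end of the VMA. The trim decision then reduces to a single comparison, modelled here under that assumption:

    #include <assert.h>

    struct toy_vma { unsigned long vm_start, vm_end; };

    /* Model of shrink_vma()'s trim: the chunk [from, to) is assumed to
     * abut exactly one end of the VMA, as do_munmap() guarantees. */
    static void toy_shrink(struct toy_vma *vma, unsigned long from,
                           unsigned long to)
    {
            if (from > vma->vm_start)
                    vma->vm_end = from;     /* chunk at the top: cut the tail */
            else
                    vma->vm_start = to;     /* chunk at the bottom: cut the head */
    }

    int main(void)
    {
            struct toy_vma a = { 0x8000, 0x10000 };
            struct toy_vma b = { 0x8000, 0x10000 };

            toy_shrink(&a, 0xc000, 0x10000);        /* unmap the tail */
            assert(a.vm_start == 0x8000 && a.vm_end == 0xc000);

            toy_shrink(&b, 0x8000, 0xa000);         /* unmap the head */
            assert(b.vm_start == 0xa000 && b.vm_end == 0x10000);
            return 0;
    }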
 
- found:
-       vml = *parent;
+/*
+ * release a mapping
+ * - under NOMMU conditions the chunk to be unmapped must be backed by a single
+ *   VMA, though it need not cover the whole VMA
+ */
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
+{
+       struct vm_area_struct *vma;
+       struct rb_node *rb;
+       unsigned long end = start + len;
+       int ret;
 
-       put_vma(mm, vml->vma);
+       kenter(",%lx,%zx", start, len);
 
-       *parent = vml->next;
-       realalloc -= kobjsize(vml);
-       askedalloc -= sizeof(*vml);
-       kfree(vml);
+       if (len == 0)
+               return -EINVAL;
 
-       update_hiwater_vm(mm);
-       mm->total_vm -= len >> PAGE_SHIFT;
+       /* find the first potentially overlapping VMA */
+       vma = find_vma(mm, start);
+       if (!vma) {
+               printk(KERN_WARNING
+                      "munmap of memory not mmapped by process %d (%s):"
+                      " 0x%lx-0x%lx\n",
+                      current->pid, current->comm, start, start + len - 1);
+               return -EINVAL;
+       }
 
-#ifdef DEBUG
-       show_process_blocks();
-#endif
+       /* we're allowed to split an anonymous VMA but not a file-backed one */
+       if (vma->vm_file) {
+               do {
+                       if (start > vma->vm_start) {
+                               kleave(" = -EINVAL [miss]");
+                               return -EINVAL;
+                       }
+                       if (end == vma->vm_end)
+                               goto erase_whole_vma;
+                       rb = rb_next(&vma->vm_rb);
+                       vma = rb_entry(rb, struct vm_area_struct, vm_rb);
+               } while (rb);
+               kleave(" = -EINVAL [split file]");
+               return -EINVAL;
+       } else {
+               /* the chunk must be a subset of the VMA found */
+               if (start == vma->vm_start && end == vma->vm_end)
+                       goto erase_whole_vma;
+               if (start < vma->vm_start || end > vma->vm_end) {
+                       kleave(" = -EINVAL [superset]");
+                       return -EINVAL;
+               }
+               if (start & ~PAGE_MASK) {
+                       kleave(" = -EINVAL [unaligned start]");
+                       return -EINVAL;
+               }
+               if (end != vma->vm_end && end & ~PAGE_MASK) {
+                       kleave(" = -EINVAL [unaligned split]");
+                       return -EINVAL;
+               }
+               if (start != vma->vm_start && end != vma->vm_end) {
+                       ret = split_vma(mm, vma, start, 1);
+                       if (ret < 0) {
+                               kleave(" = %d [split]", ret);
+                               return ret;
+                       }
+               }
+               return shrink_vma(mm, vma, start, end);
+       }
 
+erase_whole_vma:
+       delete_vma_from_mm(vma);
+       delete_vma(mm, vma);
+       kleave(" = 0");
        return 0;
 }
 EXPORT_SYMBOL(do_munmap);
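
The new do_munmap() thus distinguishes five outcomes: an exact-match erase, rejection of partial unmaps of file-backed VMAs, rejection of supersets and unaligned requests, a shrink when the chunk abuts one end, and a split followed by a shrink when it sits in the middle. A hedged restatement of those checks as a userspace classifier (single-VMA case only; the file-backed walk over adjacent VMAs is elided):

    #include <stdio.h>

    #define PAGE_MASK (~0xfffUL)    /* assumed 4KB pages */

    struct toy_vma { unsigned long vm_start, vm_end; int file_backed; };

    /* Mirrors the order of do_munmap()'s checks for a request
     * [start, end) against one VMA. */
    static const char *classify(const struct toy_vma *vma,
                                unsigned long start, unsigned long end)
    {
            if (vma->file_backed)
                    return (start == vma->vm_start && end == vma->vm_end)
                            ? "erase whole vma" : "reject: can't split file map";
            if (start == vma->vm_start && end == vma->vm_end)
                    return "erase whole vma";
            if (start < vma->vm_start || end > vma->vm_end)
                    return "reject: superset";
            if (start & ~PAGE_MASK || (end != vma->vm_end && end & ~PAGE_MASK))
                    return "reject: unaligned";
            if (start != vma->vm_start && end != vma->vm_end)
                    return "split, then shrink";
            return "shrink one end";
    }

    int main(void)
    {
            struct toy_vma vma = { 0x8000, 0x10000, 0 };

            printf("%s\n", classify(&vma, 0xa000, 0xc000)); /* split, then shrink */
            printf("%s\n", classify(&vma, 0x8000, 0xc000)); /* shrink one end */
            return 0;
    }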
@@ -1204,32 +1585,26 @@ asmlinkage long sys_munmap(unsigned long addr, size_t len)
 }
 
 /*
- * Release all mappings
+ * release all the mappings made in a process's VM space
  */
-void exit_mmap(struct mm_struct * mm)
+void exit_mmap(struct mm_struct *mm)
 {
-       struct vm_list_struct *tmp;
-
-       if (mm) {
-#ifdef DEBUG
-               printk("Exit_mmap:\n");
-#endif
+       struct vm_area_struct *vma;
 
-               mm->total_vm = 0;
+       if (!mm)
+               return;
 
-               while ((tmp = mm->context.vmlist)) {
-                       mm->context.vmlist = tmp->next;
-                       put_vma(mm, tmp->vma);
+       kenter("");
 
-                       realalloc -= kobjsize(tmp);
-                       askedalloc -= sizeof(*tmp);
-                       kfree(tmp);
-               }
+       mm->total_vm = 0;
 
-#ifdef DEBUG
-               show_process_blocks();
-#endif
+       while ((vma = mm->mmap)) {
+               mm->mmap = vma->vm_next;
+               delete_vma_from_mm(vma);
+               delete_vma(mm, vma);
        }
+
+       kleave("");
 }
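
exit_mmap() now tears the address space down with the usual consume-the-head loop: detach the first VMA from the list, delete it, and repeat until the list is empty. The same shape in a hedged userspace sketch:

    #include <stdlib.h>

    struct toy_vma { struct toy_vma *vm_next; };

    /* Hedged model of exit_mmap(): always unlink the list head before
     * freeing it, so the list stays consistent at every step. */
    static void toy_exit_mmap(struct toy_vma **mmap)
    {
            struct toy_vma *vma;

            while ((vma = *mmap)) {
                    *mmap = vma->vm_next;
                    free(vma);
            }
    }

    int main(void)
    {
            struct toy_vma *head = calloc(1, sizeof(*head));

            if (!head)
                    return 1;
            head->vm_next = calloc(1, sizeof(*head));
            toy_exit_mmap(&head);
            return head ? 1 : 0;    /* head is NULL once the list is empty */
    }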
 
 unsigned long do_brk(unsigned long addr, unsigned long len)
@@ -1242,8 +1617,8 @@ unsigned long do_brk(unsigned long addr, unsigned long len)
  * time (controlled by the MREMAP_MAYMOVE flag and available VM space)
  *
  * under NOMMU conditions, we only permit changing a mapping's size, and only
- * as long as it stays within the hole allocated by the kmalloc() call in
- * do_mmap_pgoff() and the block is not shareable
+ * as long as it stays within the region allocated by do_mmap_private() and the
+ * block is not shareable
  *
  * MREMAP_FIXED is not supported under NOMMU conditions
  */
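
Concretely, the NOMMU mremap() can only resize a mapping in place: vm_end may move anywhere inside the region that do_mmap_private() originally carved out, but a shareable mapping or a request exceeding that region is refused. A hedged predicate summarising the checks in the hunk below (4KB pages assumed; region_size stands in for vm_region->vm_end - vm_region->vm_start):

    #include <errno.h>

    static int can_mremap(unsigned long addr, unsigned long old_len,
                          unsigned long new_len, int may_share,
                          unsigned long region_size)
    {
            if (old_len == 0 || new_len == 0)
                    return -EINVAL;
            if (addr & 0xfff)               /* assumed 4KB page alignment */
                    return -EINVAL;
            if (may_share)
                    return -EPERM;          /* shared maps can't be resized */
            if (new_len > region_size)
                    return -ENOMEM;         /* must fit the existing region */
            return 0;
    }

    int main(void)
    {
            /* resizing within an 8-page region succeeds... */
            if (can_mremap(0x8000, 0x4000, 0x6000, 0, 0x8000))
                    return 1;
            /* ...but growing past it reports -ENOMEM */
            return can_mremap(0x8000, 0x4000, 0x9000, 0, 0x8000) == -ENOMEM
                    ? 0 : 1;
    }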
@@ -1254,13 +1629,16 @@ unsigned long do_mremap(unsigned long addr,
        struct vm_area_struct *vma;
 
        /* insanity checks first */
-       if (new_len == 0)
+       if (old_len == 0 || new_len == 0)
                return (unsigned long) -EINVAL;
 
+       if (addr & ~PAGE_MASK)
+               return -EINVAL;
+
        if (flags & MREMAP_FIXED && new_addr != addr)
                return (unsigned long) -EINVAL;
 
-       vma = find_vma_exact(current->mm, addr);
+       vma = find_vma_exact(current->mm, addr, old_len);
        if (!vma)
                return (unsigned long) -EINVAL;
 
@@ -1270,22 +1648,19 @@ unsigned long do_mremap(unsigned long addr,
        if (vma->vm_flags & VM_MAYSHARE)
                return (unsigned long) -EPERM;
 
-       if (new_len > kobjsize((void *) addr))
+       if (new_len > vma->vm_region->vm_end - vma->vm_region->vm_start)
                return (unsigned long) -ENOMEM;
 
        /* all checks complete - do it */
        vma->vm_end = vma->vm_start + new_len;
-
-       askedalloc -= old_len;
-       askedalloc += new_len;
-
        return vma->vm_start;
 }
 EXPORT_SYMBOL(do_mremap);
 
-asmlinkage unsigned long sys_mremap(unsigned long addr,
-       unsigned long old_len, unsigned long new_len,
-       unsigned long flags, unsigned long new_addr)
+asmlinkage
+unsigned long sys_mremap(unsigned long addr,
+                        unsigned long old_len, unsigned long new_len,
+                        unsigned long flags, unsigned long new_addr)
 {
        unsigned long ret;