aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs/volumes.h
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: detect wether a device supports discardJosef Bacik2011-08-291-0/+2
| | | | | | | | | | | | | | | | | | commit d5e2003c2bcda93a8f2e668eb4642d70c9c38301 upstream. We have a problem where if a user specifies discard but doesn't actually support it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem because this gets called (in a fashion) from the tree log recovery code, which has a nice little BUG_ON(ret) after it, which causes us to fail the tree log replay. So instead detect wether our devices support discard when we're adding them and then don't issue discards if we know that the device doesn't support it. And just for good measure set ret = 0 in btrfs_issue_discard just in case we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* Merge branch 'cleanups_and_fixes' into inode_numbersChris Mason2011-05-231-0/+2
|\ | | | | | | | | | | | | | | Conflicts: fs/btrfs/tree-log.c fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * Btrfs: using rcu lock in the reader side of devices listXiao Guangrong2011-05-231-0/+2
| | | | | | | | | | | | | | | | fs_devices->devices is only updated on remove and add device paths, so we can use rcu to protect it in the reader side Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | Merge branch 'for-chris' of ↵Chris Mason2011-05-231-0/+6
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers Conflicts: fs/btrfs/Makefile fs/btrfs/ctree.h fs/btrfs/volumes.h Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | btrfs: scrubArne Jansen2011-05-121-0/+6
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds an initial implementation for scrub. It works quite straightforward. The usermode issues an ioctl for each device in the fs. For each device, it enumerates the allocated device chunks. For each chunk, the contained extents are enumerated and the data checksums fetched. The extents are read sequentially and the checksums verified. If an error occurs (checksum or EIO), a good copy is searched for. If one is found, the bad copy will be rewritten. All enumerations happen from the commit roots. During a transaction commit, the scrubs get paused and afterwards continue from the new roots. This commit is based on the series originally posted to linux-btrfs with some improvements that resulted from comments from David Sterba, Ilya Dryomov and Jan Schmidt. Signed-off-by: Arne Jansen <sensille@gmx.net>
* | Merge branch 'allocator' of ↵Chris Mason2011-05-221-15/+1
|\ \ | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * | btrfs: quasi-round-robin for chunk allocationArne Jansen2011-05-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a multi device setup, the chunk allocator currently always allocates chunks on the devices in the same order. This leads to a very uneven distribution, especially with RAID1 or RAID10 and an uneven number of devices. This patch always sorts the devices before allocating, and allocates the stripes on the devices with the most available space, as long as there is enough space available. In a low space situation, it first tries to maximize striping. The patch also simplifies the allocator and reduces the checks for corner cases. The simplification is done by several means. First, it defines the properties of each RAID type upfront. These properties are used afterwards instead of differentiating cases in several places. Second, the old allocator defined a minimum stripe size for each block group type, tried to find a large enough chunk, and if this fails just allocates a smaller one. This is now done in one step. The largest possible chunk (up to max_chunk_size) is searched and allocated. Because we now have only one pass, the allocation of the map (struct map_lookup) is moved down to the point where the number of stripes is already known. This way we avoid reallocation of the map. We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
| * | btrfs: move btrfs_cmp_device_free_bytes to super.cArne Jansen2011-05-131-15/+0
| |/ | | | | | | | | this function won't be used here anymore, so move it super.c where it is used for df-calculation
* | btrfs: remove all unused functionsDavid Sterba2011-05-061-3/+0
| | | | | | | | | | | | | | | | | | | | | | Remove static and global declarations and/or definitions. Reduces size of btrfs.ko by ~3.4kB. text data bss dec hex filename 402081 7464 200 409745 64091 btrfs.ko.base 398620 7144 200 405964 631cc btrfs.ko.remove-all Signed-off-by: David Sterba <dsterba@suse.cz>
* | btrfs: remove unused function prototypesDavid Sterba2011-05-041-2/+0
|/ | | | | | function prototypes without a body Signed-off-by: David Sterba <dsterba@suse.cz>
* Btrfs: make btrfs_map_block() return entire free extent for each device of ↵Li Dongyang2011-03-281-0/+1
| | | | | | | | | | | RAID0/1/10/DUP btrfs_map_block() will only return a single stripe length, but we want the full extent be mapped to each disk when we are trimming the extent, so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD. Signed-off-by: Li Dongyang <lidongyang@novell.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: add initial tracepoint support for btrfsliubo2011-03-281-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tracepoints can provide insight into why btrfs hits bugs and be greatly helpful for debugging, e.g dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0 dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0 btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0) btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0) btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8 flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0) flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0) flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0) btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0) Here is what I have added: 1) ordere_extent: btrfs_ordered_extent_add btrfs_ordered_extent_remove btrfs_ordered_extent_start btrfs_ordered_extent_put These provide critical information to understand how ordered_extents are updated. 2) extent_map: btrfs_get_extent extent_map is used in both read and write cases, and it is useful for tracking how btrfs specific IO is running. 3) writepage: __extent_writepage btrfs_writepage_end_io_hook Pages are cirtical resourses and produce a lot of corner cases during writeback, so it is valuable to know how page is written to disk. 4) inode: btrfs_inode_new btrfs_inode_request btrfs_inode_evict These can show where and when a inode is created, when a inode is evicted. 5) sync: btrfs_sync_file btrfs_sync_fs These show sync arguments. 6) transaction: btrfs_transaction_commit In transaction based filesystem, it will be useful to know the generation and who does commit. 7) back reference and cow: btrfs_delayed_tree_ref btrfs_delayed_data_ref btrfs_delayed_ref_head btrfs_cow_block Btrfs natively supports back references, these tracepoints are helpful on understanding btrfs's COW mechanism. 8) chunk: btrfs_chunk_alloc btrfs_chunk_free Chunk is a link between physical offset and logical offset, and stands for space infomation in btrfs, and these are helpful on tracing space things. 9) reserved_extent: btrfs_reserved_extent_alloc btrfs_reserved_extent_free These can show how btrfs uses its space. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2011-01-171-0/+27
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits) Btrfs: forced readonly mounts on errors btrfs: Require CAP_SYS_ADMIN for filesystem rebalance Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check btrfs: Fix memory leak in btrfs_read_fs_root_no_radix() btrfs: check NULL or not btrfs: Don't pass NULL ptr to func that may deref it. btrfs: mount failure return value fix btrfs: Mem leak in btrfs_get_acl() btrfs: fix wrong free space information of btrfs btrfs: make the chunk allocator utilize the devices better btrfs: restructure find_free_dev_extent() btrfs: fix wrong calculation of stripe size btrfs: try to reclaim some space when chunk allocation fails btrfs: fix wrong data space statistics fs/btrfs: Fix build of ctree Btrfs: fix off by one while setting block groups readonly Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls Btrfs: Add readonly snapshots support Btrfs: Refactor btrfs_ioctl_snap_create() btrfs: Extract duplicate decompress code ...
| * btrfs: fix wrong free space information of btrfsMiao Xie2011-01-161-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we store data by raid profile in btrfs with two or more different size disks, df command shows there is some free space in the filesystem, but the user can not write any data in fact, df command shows the wrong free space information of btrfs. # mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10 # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 28.00KB devid 1 size 5.01GB used 2.03GB path /dev/sda9 devid 2 size 10.00GB used 2.01GB path /dev/sda10 # btrfs device scan /dev/sda9 /dev/sda10 # mount /dev/sda9 /mnt # dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999 (fill the filesystem) # sync # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt # btrfs-show Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64 Total devices 2 FS bytes used 3.99GB devid 1 size 5.01GB used 5.01GB path /dev/sda9 devid 2 size 10.00GB used 4.99GB path /dev/sda10 It is because btrfs cannot allocate chunks when one of the pairing disks has no space, the free space on the other disks can not be used for ever, and should be subtracted from the total space, but btrfs doesn't subtract this space from the total. It is strange to the user. This patch fixes it by calcing the free space that can be used to allocate chunks. Implementation: 1. get all the devices free space, and align them by stripe length. 2. sort the devices by the free space. 3. check the free space of the devices, 3.1. if it is not zero, and then check the number of the devices that has more free space than this device, if the number of the devices is beyond the min stripe number, the free space can be used, and add into total free space. if the number of the devices is below the min stripe number, we can not use the free space, the check ends. 3.2. if the free space is zero, check the next devices, goto 3.1 This implementation is just likely fake chunk allocation. After appling this patch, df can show correct space information: # df -TH Filesystem Type Size Used Avail Use% Mounted on /dev/sda9 btrfs 17G 8.6G 0 100% /mnt Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
| * btrfs: make the chunk allocator utilize the devices betterMiao Xie2011-01-161-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With this patch, we change the handling method when we can not get enough free extents with default size. Implementation: 1. Look up the suitable free extent on each device and keep the search result. If not find a suitable free extent, keep the max free extent 2. If we get enough suitable free extents with default size, chunk allocation succeeds. 3. If we can not get enough free extents, but the number of the extent with default size is >= min_stripes, we just change the mapping information (reduce the number of stripes in the extent map), and chunk allocation succeeds. 4. If the number of the extent with default size is < min_stripes, sort the devices by its max free extent's size descending 5. Use the size of the max free extent on the (num_stripes - 1)th device as the stripe size to allocate the device space By this way, the chunk allocator can allocate chunks as large as possible when the devices' space is not enough and make full use of the devices. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | Merge branch 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2011-01-131-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits) block: ensure that completion error gets properly traced blktrace: add missing probe argument to block_bio_complete block cfq: don't use atomic_t for cfq_group block cfq: don't use atomic_t for cfq_queue block: trace event block fix unassigned field block: add internal hd part table references block: fix accounting bug on cross partition merges kref: add kref_test_and_get bio-integrity: mark kintegrityd_wq highpri and CPU intensive block: make kblockd_workqueue smarter Revert "sd: implement sd_check_events()" block: Clean up exit_io_context() source code. Fix compile warnings due to missing removal of a 'ret' variable fs/block: type signature of major_to_index(int) to major_to_index(unsigned) block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p) cfq-iosched: don't check cfqg in choose_service_tree() fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors cdrom: export cdrom_check_events() sd: implement sd_check_events() sr: implement sr_check_events() ...
| * | block: clean up blkdev_get() wrappers and their usersTejun Heo2010-11-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After recent blkdev_get() modifications, open_by_devnum() and open_bdev_exclusive() are simple wrappers around blkdev_get(). Replace them with blkdev_get_by_dev() and blkdev_get_by_path(). blkdev_get_by_dev() is identical to open_by_devnum(). blkdev_get_by_path() is slightly different in that it doesn't automatically add %FMODE_EXCL to @mode. All users are converted. Most conversions are mechanical and don't introduce any behavior difference. There are several exceptions. * btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no reason to OR it explicitly on blkdev_put(). * gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in sb->s_mode. * With the above changes, sb->s_mode now always should contain FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect errors. The new blkdev_get_*() functions are with proper docbook comments. While at it, add function description to blkdev_get() too. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Joern Engel <joern@lazybastard.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: reiserfs-devel@vger.kernel.org Cc: xfs-masters@oss.sgi.com Cc: Alexander Viro <viro@zeniv.linux.org.uk>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstableLinus Torvalds2010-12-141-0/+2
|\ \ \ | |/ / |/| / | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: prevent RAID level downgrades when space is low Btrfs: account for missing devices in RAID allocation profiles Btrfs: EIO when we fail to read tree roots Btrfs: fix compiler warnings Btrfs: Make async snapshot ioctl more generic Btrfs: pwrite blocked when writing from the mmaped buffer of the same page Btrfs: Fix a crash when mounting a subvolume Btrfs: fix sync subvol/snapshot creation Btrfs: Fix page leak in compressed writeback path Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots Btrfs: fixup return code for btrfs_del_orphan_item Btrfs: do not do fast caching if we are allocating blocks for tree_root Btrfs: deal with space cache errors better Btrfs: fix use after free in O_DIRECT
| * Btrfs: account for missing devices in RAID allocation profilesChris Mason2010-12-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we mount in RAID degraded mode without adding a new device to replace the failed one, we can end up using the wrong RAID flags for allocations. This results in strange combinations of block groups (raid1 in a raid10 filesystem) and corruptions when we try to allocate blocks from single spindle chunks on drives that are actually missing. The first device has two small 4MB chunks in it that mkfs creates and these are usually unused in a raid1 or raid10 setup. But, in -o degraded, the allocator will fall back to these because the mask of desired raid groups isn't correct. The fix here is to count the missing devices as we build up the list of devices in the system. This count is used when picking the raid level to make sure we continue using the same levels that were in place before we lost a drive. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* | btrfs: replace barriers with explicit flush / FUA usageChristoph Hellwig2010-09-101-1/+0
|/ | | | | | | | | | Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP detection for barriers and stop setting the barrier flag for discards. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Btrfs: make balance code choose more wisely when relocatingJosef Bacik2009-09-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, we can panic the box if the first block group we go to move is of a type where there is no space left to move those extents. For example, if we fill the disk up with data, and then we try to balance and we have no room to move the data nor room to allocate new chunks, we will panic. Change this by checking to see if we have room to move this chunk around, and if not, return -ENOSPC and move on to the next chunk. This will make sure we remove block groups that are moveable, like if we have alot of empty metadata block groups, and then that way we make room to be able to balance our data chunks as well. Tested this with an fs that would panic on btrfs-vol -b normally, but no longer panics with this patch. V1->V2: -actually search for a free extent on the device to make sure we can allocate a chunk if need be. -fix btrfs_shrink_device to make sure we actually try to relocate all the chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r we don't remove the device with data still on it. -check to make sure the block group we are going to relocate isn't the last one in that particular space -fix a bug in btrfs_shrink_device where we would change the device's size and not fix it if we fail to do our relocate Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: avoid races between super writeout and device list updatesChris Mason2009-06-101-1/+6
| | | | | | | | | | | | | | | On multi-device filesystems, btrfs writes supers to all of the devices before considering a sync complete. There wasn't any additional locking between super writeout and the device list management code because device management was done inside a transaction and super writeout only happened with no transation writers running. With the btrfs fsync log and other async transaction updates, this has been racey for some time. This adds a mutex to protect the device list. The existing volume mutex could not be reused due to transaction lock ordering requirements. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: autodetect SSD devicesChris Mason2009-06-101-0/+5
| | | | | | | | | | | During mount, btrfs will check the queue nonrot flag for all the devices found in the FS. If they are all non-rotating, SSD mode is enabled by default. If the FS was mounted with -o nossd, the non-rotating flag is ignored. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: When shrinking, only update disk size on successChris Ball2009-04-271-0/+3
| | | | | | | | | | Previously, we updated a device's size prior to attempting a shrink operation. This patch moves the device resizing logic to only happen if the shrink completes successfully. In the process, it introduces a new field to btrfs_device -- disk_total_bytes -- to track the on-disk size. Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: use WRITE_SYNC for synchronous writesChris Mason2009-04-201-2/+11
| | | | | | | | | | | | | | | | | Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for writes we plan on waiting on in the near future. This patch mirrors recent changes in other filesystems and the generic code to use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for other latency critical writes. Btrfs uses async worker threads for checksumming before the write is done, and then again to actually submit the bios. The bio submission code just runs a per-device list of bios that need to be sent down the pipe. This list is split into low priority and high priority lists so the WRITE_SYNC IO happens first. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: fix typos in commentsWu Fengguang2009-04-021-1/+1
| | | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: shared seed deviceYan Zheng2008-12-121-2/+1
| | | | | | | | | | This patch makes seed device possible to be shared by multiple mounted file systems. The sharing is achieved by cloning seed device's btrfs_fs_devices structure. Thanks you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: superblock duplicationYan Zheng2008-12-081-2/+4
| | | | | | | | | | | This patch implements superblock duplication. Superblocks are stored at offset 16K, 64M and 256G on every devices. Spaces used by superblocks are preserved by the allocator, which uses a reverse mapping function to find the logical addresses that correspond to superblocks. Thank you, Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: corret fmode_t annotationsChristoph Hellwig2008-12-021-2/+2
| | | | | | | | Make sure to propagate fmode_t properly and use the right constants for it. Signed-off-by: Christoph Hellwig <hch@lst.de>
* Btrfs: Fixes for 2.6.28-rc API changesChris Mason2008-11-191-0/+3
| | | | | | | | * open/close_bdev_excl -> open/close_bdev_exclusive * blkdev_issue_discard takes a GFP mask now * Fix blkdev_issue_discard usage now that it is enabled Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Seed device supportYan Zheng2008-11-171-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Seed device is a special btrfs with SEEDING super flag set and can only be mounted in read-only mode. Seed devices allow people to create new btrfs on top of it. The new FS contains the same contents as the seed device, but it can be mounted in read-write mode. This patch does the following: 1) split code in btrfs_alloc_chunk into two parts. The first part does makes the newly allocated chunk usable, but does not do any operation that modifies the chunk tree. The second part does the the chunk tree modifications. This division is for the bootstrap step of adding storage to the seed device. 2) Update device management code to handle seed device. The basic idea is: For an FS grown from seed devices, its seed devices are put into a list. Seed devices are opened on demand at mounting time. If any seed device is missing or has been changed, btrfs kernel module will refuse to mount the FS. 3) make btrfs_find_block_group not return NULL when all block groups are read-only. Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
* Btrfs: Fix the multi-bio code to save the original bio for completionChris Mason2008-09-251-0/+1
| | | | | | | | | | | | | | | | The multi-bio code is responsible for duplicating blocks in raid1 and single spindle duplication. It has counters to make sure all of the locations for a given extent are properly written before io completion is returned to the higher layers. But, it didn't always complete the same bio it was given, sometimes a clone was completed instead. This lead to problems with the async work queues because they saved a pointer to the bio in a struct off bi_private. The fix is to remember the original bio and only complete that one. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add async worker threads for pre and post IO checksummingChris Mason2008-09-251-3/+7
| | | | | | | | | | | | | | | | | | | | Btrfs has been using workqueues to spread the checksumming load across other CPUs in the system. But, workqueues only schedule work on the same CPU that queued the work, giving them a limited benefit for systems with higher CPU counts. This code adds a generic facility to schedule work with pools of kthreads, and changes the bio submission code to queue bios up. The queueing is important to make sure large numbers of procs on the system don't turn streaming workloads into random workloads by sending IO down concurrently. The end result of all of this is much higher performance (and CPU usage) when doing checksumming on large machines. Two worker pools are created, one for writes and one for endio processing. The two could deadlock if we tried to service both from a single pool. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Fix btrfs_open_devices to deal with changes since the scan ioctlsChris Mason2008-09-251-2/+2
| | | | | | | Devices can change after the scan ioctls are done, and btrfs_open_devices needs to be able to verify them as they are opened and used by the FS. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add mount -o degraded to allow mounts to continue with missing devicesChris Mason2008-09-251-0/+3
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for online device removalChris Mason2008-09-251-0/+3
| | | | | | | | | | | | | This required a few structural changes to the code that manages bdev pointers: The VFS super block now gets an anon-bdev instead of a pointer to the lowest bdev. This allows us to avoid swapping the super block bdev pointer around at run time. The code to read in the super block no longer goes through the extent buffer interface. Things got ugly keeping the mapping constant. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Deal with failed writes in mirrored configurationsChris Mason2008-09-251-1/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add balance ioctl to restripe the chunksChris Mason2008-09-251-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add new ioctl to add devicesChris Mason2008-09-251-0/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Make the resizer work based on shrinking and growing devicesChris Mason2008-09-251-0/+5
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add a special device list for chunk allocationsChris Mason2008-09-251-0/+7
| | | | | | | This allows other code that needs to walk every device in the FS to do so without locking against allocations. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Make an unplug function that doesn't unplug every spindleChris Mason2008-09-251-0/+2
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add chunk uuids and update multi-device back referencesChris Mason2008-09-251-2/+4
| | | | | | | | | | | | | | | | | | | | Block headers now store the chunk tree uuid Chunk items records the device uuid for each stripes Device extent items record better back refs to the chunk tree Block groups record better back refs to the chunk tree The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY used to be the logical offset of the chunk. Now it is a chunk tree id, with the logical offset being stored in the offset field of the key. This allows a single chunk tree to record multiple logical address spaces, upping the number of bytes indexed by a chunk tree from 2^64 to 2^128. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Write out all super blocks on commit, and bring back proper barrier ↵Chris Mason2008-09-251-0/+3
| | | | | | support Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Retry metadata reads in the face of checksum failuresChris Mason2008-09-251-2/+4
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Change btrfs_map_block to return a structure with mappings for all stripesChris Mason2008-09-251-3/+22
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for mirroring across drivesChris Mason2008-09-251-5/+6
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Add support for device scanning and detection ioctlsChris Mason2008-09-251-0/+25
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Bring back mount -o ssd optimizationsChris Mason2008-09-251-0/+3
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Btrfs: Move device information into the super block so it can be scannedChris Mason2008-09-251-12/+1
| | | | Signed-off-by: Chris Mason <chris.mason@oracle.com>