path: root/fs/xfs/linux-2.6
Commit message (Author, Age, Files, Lines)
* getting newer filesystem code working (Wolfgang Wiedmeyer, 2015-10-23, 38 files, -16268/+0)
* tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking (Hugh Dickins, 2012-10-21, 1 file, -0/+3)
    commit 35c2a7f4908d404c9124c2efc6ada4640ca4d5d5 upstream.

    Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),

        u64 inum = fid->raw[2];

    which is unhelpfully reported as at the end of shmem_alloc_inode():

        BUG: unable to handle kernel paging request at ffff880061cd3000
        IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Call Trace:
         [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
         [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
         [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
         [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and could oops in the same way: add the missing fh_len checks to those.

    Reported-by: Sasha Levin <levinsasha928@gmail.com>
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Sage Weil <sage@inktank.com>
    Cc: Steven Whitehouse <swhiteho@redhat.com>
    Cc: Christoph Hellwig <hch@infradead.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
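The fh_len fix described above boils down to checking the handle length before reading any word of it. The following is a minimal userspace sketch of that pattern, not the kernel code: the function name `demo_decode_inum` and the word layout (low half in `raw[1]`, high half in `raw[2]`) are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical decoder for a file handle made of 32-bit words.
 * Assumed layout (illustration only): raw[1] holds the low half and
 * raw[2] the high half of a 64-bit inode number.
 *
 * Returns 0 and stores the inode number on success, or -1 if the
 * handle is too short to contain raw[2].
 */
int demo_decode_inum(const uint32_t *raw, int fh_len, uint64_t *inum)
{
    /* Validate the length first, so raw[2] is never read out of bounds. */
    if (fh_len < 3)
        return -1;

    *inum = ((uint64_t)raw[2] << 32) | raw[1];
    return 0;
}
```

The point of the ordering is that a short buffer ending at a page boundary is rejected before any access could touch the next, possibly unmapped, page.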
* xfs: fix endian conversion issue in discard code (Dave Chinner, 2012-02-03, 1 file, -2/+2)
    commit b1c770c273a4787069306fc82aab245e9ac72e9d upstream

    When finding the longest extent in an AG, we read the value directly out of the AGF buffer without endian conversion. This will give an incorrect length, resulting in FITRIM operations potentially not trimming everything that it should.

    Note, for 3.0-stable this has been modified to apply to fs/xfs/linux-2.6/xfs_discard.c instead of fs/xfs/xfs_discard.c. -bpm

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ben Myers <bpm@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
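To illustrate why the missing conversion matters: a big-endian on-disk field has to be decoded byte by byte (in the kernel, via helpers such as be32_to_cpu()) before it is used as a length. The sketch below is a portable userspace stand-in; `demo_be32_to_cpu` is a hypothetical name, not the kernel helper.

```c
#include <stdint.h>

/*
 * Decode a 32-bit big-endian value from its on-disk byte representation.
 * This gives the same result on little- and big-endian hosts, unlike
 * reading the four bytes directly as a native integer.
 */
uint32_t demo_be32_to_cpu(const unsigned char bytes[4])
{
    return ((uint32_t)bytes[0] << 24) |
           ((uint32_t)bytes[1] << 16) |
           ((uint32_t)bytes[2] << 8)  |
            (uint32_t)bytes[3];
}
```

For the on-disk bytes 00 00 01 00 this yields 256; a raw native read on a little-endian machine would see 65536 instead, which is the kind of incorrect length the commit describes.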
* xfs: fix acl count validation in xfs_acl_from_disk() (Xi Wang, 2012-01-12, 1 file, -1/+1)
    commit 093019cf1b18dd31b2c3b77acce4e000e2cbc9ce upstream.

    Commit fa8b18ed didn't prevent the integer overflow and possible memory corruption. "count" can go negative and bypass the check.

    Signed-off-by: Xi Wang <xi.wang@gmail.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ben Myers <bpm@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
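The bypass described above can be reproduced in a few lines: an upper-bound check on a signed count lets negative values through, while the same check on an unsigned count rejects them. This is a sketch under assumptions; the limit `DEMO_ACL_MAX_ENTRIES` and both function names are hypothetical and not taken from the XFS source.

```c
/* Hypothetical upper bound on ACL entries, for illustration only. */
#define DEMO_ACL_MAX_ENTRIES 25

/* Signed check: a negative count is not greater than the limit, so it passes. */
int demo_count_ok_signed(int count)
{
    return !(count > DEMO_ACL_MAX_ENTRIES);
}

/* Unsigned check: a negative input wraps to a huge value and is rejected. */
int demo_count_ok_unsigned(unsigned int count)
{
    return count <= DEMO_ACL_MAX_ENTRIES;
}
```

With a count of -1, the signed variant accepts the value while the unsigned variant sees UINT_MAX and refuses it.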
* xfs: log all dirty inodes in xfs_fs_sync_fs (Christoph Hellwig, 2012-01-06, 2 files, -0/+39)
    Commit be4f1ac828776bbc7868a68b465cd8eedb733cfd upstream.

    Since Linux 2.6.36 the writeback code has introduced various measures for live lock prevention during sync(). Unfortunately some of these are actively harmful for the XFS model, where the inode gets marked dirty for metadata from the data I/O handler.

    The older_than_this checks are now more strictly enforced since "writeback: avoid livelocking WB_SYNC_ALL writeback", which only calls into __writeback_inodes_sb and thus samples the current cut off time only once. But on a slow enough device the previous asynchronous sync pass might not have fully completed yet, and thus XFS might mark metadata dirty only after that sampling of the cut off time for the blocking pass already happened.

    I have not reproduced this myself on a real system, but by introducing artificial delay into the XFS I/O completion workqueues it can be reproduced easily.

    Fix this by iterating over all XFS inodes in ->sync_fs and logging all that are dirty. This might log inodes that only got redirtied after the previous pass, but given how cheap delayed logging of inodes is it isn't a major concern for performance.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Tested-by: Mark Tinguely <tinguely@sgi.com>
    Reviewed-by: Mark Tinguely <tinguely@sgi.com>
    Signed-off-by: Ben Myers <bpm@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: log the inode in ->write_inode calls for kupdate (Christoph Hellwig, 2012-01-06, 1 file, -25/+5)
    Commit 0b8fd3033c308e4088760aa1d38ce77197b4e074 upstream.

    If the writeback code writes back an inode because it has expired we currently use the non-blocking ->write_inode path. This means any inode that is pinned is skipped. With delayed logging and a workload that has very little log traffic otherwise it is very likely that an inode that gets constantly written to is always pinned, and thus we keep refusing to write it. The VM writeback code at that point redirties it and doesn't try to write it again for another 30 seconds. This means under certain scenarios time based metadata writeback never happens.

    Fix this by calling into xfs_log_inode for kupdate in addition to data integrity syncs, and thus transfer the inode to the log ASAP.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Tested-by: Mark Tinguely <tinguely@sgi.com>
    Reviewed-by: Mark Tinguely <tinguely@sgi.com>
    Signed-off-by: Ben Myers <bpm@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: fix nfs export of 64-bit inodes numbers on 32-bit kernels (Christoph Hellwig, 2011-12-21, 1 file, -4/+4)
    commit c29f7d457ac63311feb11928a866efd2fe153d74 upstream.

    The i_ino field in the VFS inode is of type unsigned long and thus can't hold the full 64-bit inode number on 32-bit kernels. We have the full inode number in the XFS inode, so use that one for nfs exports. Note that I've also switched the 32-bit file handle types to it, just to make the code more consistent and copy & paste errors less likely to happen.

    Reported-by: Guoquan Yang <ygq51@hotmail.com>
    Reported-by: Hank Peng <pengxihan@gmail.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ben Myers <bpm@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: force buffer writeback before blocking on the ilock in inode reclaim (Christoph Hellwig, 2011-12-09, 1 file, -0/+11)
    commit 4dd2cb4a28b7ab1f37163a4eba280926a13a8749 upstream.

    If we are doing synchronous inode reclaim we block the VM from making progress in memory reclaim. So if we encounter a flush locked inode, promote it in the delwri list and wake up xfsbufd to write it out now. Without this we can get hangs of up to 30 seconds during workloads hitting synchronous inode reclaim.

    The scheme is copied from what we do for dquot reclaims.

    Reported-by: Simon Kirby <sim@hostway.ca>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Tested-by: Simon Kirby <sim@hostway.ca>
    Signed-off-by: Ben Myers <bpm@sgi.com>
    Acked-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: validate acl count (Christoph Hellwig, 2011-12-09, 1 file, -0/+2)
    commit fa8b18edd752a8b4e9d1ee2cd615b82c93cf8bba upstream.

    This prevents in-memory corruption and possible panics if the on-disk ACL is badly corrupted.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ben Myers <bpm@sgi.com>
    Acked-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: fix ->write_inode return values (Christoph Hellwig, 2011-11-26, 1 file, -25/+9)
    patch 58d84c4ee0389ddeb86238d5d8359a982c9f7a5b upstream.

    Currently we always redirty an inode that was attempted to be written out synchronously but has been cleaned by an AIL push internally, which is rather bogus. Fix that by doing the i_update_core check early on and return 0 for it. Also include async calls for it, as doing any work for those is just as pointless. While we're at it also fix the sign for the EIO return in case of a filesystem shutdown, and fix the completely nonsensical locking around xfs_log_inode.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: fix buffer flushing during unmount (Christoph Hellwig, 2011-11-26, 1 file, -1/+0)
    commit 87c7bec7fc3377b3873eb3a0f4b603981ea16ebb upstream.

    The code to flush buffers in the umount code is a bit iffy: we first flush all delwri buffers out, but then might be able to queue up a new one when logging the sb counts. On a normal shutdown that one would get flushed out when doing the synchronous superblock write in xfs_unmountfs_writesb, but we skip that one if the filesystem has been shut down.

    Fix this by moving the delwri list flushing until just before unmounting the log, and while we're at it also remove the superfluous delwri list and buffer lru flushing for the rt and log device that can never have cached or delwri buffers.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reported-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
    Tested-by: Amit Sahrawat <amit.sahrawat83@gmail.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: Return -EIO when xfs_vn_getattr() failed (Mitsuo Hayasaka, 2011-11-26, 1 file, -1/+1)
    commit ed32201e65e15f3e6955cb84cbb544b08f81e5a5 upstream.

    An attribute of an inode can be fetched via xfs_vn_getattr() in XFS. Currently it returns EIO, not a negative value, when it fails. As a result, the system call returns a non-negative value even though an error occurred. stat(2) and the ls and mv commands cannot handle this error and do not work correctly.

    This patch fixes this bug and returns -EIO, not EIO, when an error is detected in xfs_vn_getattr().

    Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
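The sign convention behind this fix can be shown in isolation: callers that treat only negative return values as errors will miss a positive EIO. The sketch below uses hypothetical function names and a plain `failed` flag in place of the real failure condition.

```c
#include <errno.h>

/* Buggy variant: returns a positive errno value on failure. */
int demo_getattr_buggy(int failed)
{
    return failed ? EIO : 0;
}

/* Fixed variant: returns a negative errno value on failure. */
int demo_getattr_fixed(int failed)
{
    return failed ? -EIO : 0;
}

/* A caller that follows the "negative means error" convention. */
int demo_caller_sees_error(int ret)
{
    return ret < 0;
}
```

With the buggy variant the caller does not notice the failure at all, which matches the misbehaviour of stat(2), ls and mv described in the commit message.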
* xfs: avoid direct I/O write vs buffered I/O race (Christoph Hellwig, 2011-11-26, 1 file, -3/+14)
    commit c58cb165bd44de8aaee9755a144136ae743be116 upstream.

    Currently a buffered reader or writer can add pages to the pagecache while we are waiting for the iolock in xfs_file_dio_aio_write. Prevent this by re-checking mapping->nrpages after we got the iolock, and if necessary upgrade the lock to exclusive mode. To simplify this a bit only take the ilock inside of xfs_file_aio_write_checks.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: dont serialise direct IO reads on page cache (Dave Chinner, 2011-11-26, 1 file, -3/+14)
    commit 0c38a2512df272b14ef4238b476a2e4f70da1479 upstream.

    There is no need to grab the i_mutex or the IO lock in exclusive mode if we don't need to invalidate the page cache. Taking these locks on every direct IO effectively serialises them, as taking the IO lock in exclusive mode has to wait for all shared holders to drop the lock. That only happens when IO is complete, so effectively it prevents dispatch of concurrent direct IO reads to the same inode.

    Fix this by taking the IO lock shared to check the page cache state, and only then drop it and take the IO lock exclusively if there is work to be done. Hence for the normal direct IO case, no exclusive locking will occur.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Tested-by: Joern Engel <joern@logfs.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: fix xfs_mark_inode_dirty during umount (Christoph Hellwig, 2011-11-26, 1 file, -3/+11)
    commit 866e4ed77448a0c311e1b055eb72ea05423fd799 upstream.

    During umount we do not add a dirty inode to the lru and wait for it to become clean first, but force writeback of data and metadata with I_WILL_FREE set. Currently there is no way for XFS to detect that the inode has been redirtied for metadata operations, as we skip the mark_inode_dirty call during teardown. Fix this by setting i_update_core manually in that case, so that the inode gets flushed during inode reclaim.

    Alternatively we could enable calling mark_inode_dirty for inodes in I_WILL_FREE state, and let the VFS dirty tracking handle this. I decided against this as we will get better I/O patterns from reclaim compared to the synchronous writeout in write_inode_now, and always marking the inode dirty in some way from xfs_mark_inode_dirty is a better safety net in either case.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: revert to using a kthread for AIL pushing (Christoph Hellwig, 2011-10-25, 2 files, -12/+3)
    commit 0030807c66f058230bcb20d2573bcaf28852e804 upstream

    Currently we have a few issues with the way the workqueue code is used to implement AIL pushing:

     - it accidentally uses the same workqueue as the syncer action, and thus can be prevented from running if there are enough sync actions active in the system.
     - it doesn't use the HIGHPRI flag to queue at the head of the queue of work items

    At this point I'm not confident enough in getting all the workqueue flags and tweaks right to provide a perfectly reliable execution context for AIL pushing, which is the most important piece in XFS to make forward progress when the log fills.

    Revert back to use a kthread per filesystem which fixes all the above issues at the cost of having a task struct and stack around for each mounted filesystem. In addition this also gives us much better ways to diagnose any issues involving hung AIL pushing and removes a small amount of code.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reported-by: Stefan Priebe <s.priebe@profihost.ag>
    Tested-by: Stefan Priebe <s.priebe@profihost.ag>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: start periodic workers later (Christoph Hellwig, 2011-10-25, 1 file, -21/+14)
    commit 2bcf6e970f5a88fa05dced5eeb0326e13d93c4a1 upstream

    Start the periodic sync workers only after we have finished xfs_mountfs and thus fully set up the filesystem structures. Without this we can call into xfs_qm_sync before the quotainfo structure is set up if the mount takes unusually long, and probably hit other incomplete states as well.

    Also clean up the xfs_fs_fill_super error path by using consistent label names, and removing an impossible to reach case.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
    Reviewed-by: Alex Elder <aelder@sgi.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* xfs: make log devices with write back caches work (Christoph Hellwig, 2011-06-16, 2 files, -94/+31)
    There's no reason not to support cache flushing on external log devices. The only thing this really requires is flushing the data device first both in fsync and log commits. A side effect is that we also have to remove the barrier write test during mount, which has been superfluous since the new FLUSH+FUA code anyway. Also use the chance to flush the RT subvolume write cache before the fsync commit, which is required for correct semantics.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: fix ->mknod() return value on xfs_get_acl() failure (Al Viro, 2011-06-14, 1 file, -1/+1)
    ->mknod() should return a negative value on errors, and PTR_ERR() already gives a negative value...

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* fs: pass exact type of data dirties to ->dirty_inode (Christoph Hellwig, 2011-05-27, 1 file, -1/+2)
    Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or anything else, so that the filesystem can track internally if it needs to push out a transaction for fdatasync or not.

    This is just the prototype change with no user for it yet. I plan to push large XFS changes for the next merge window, and getting this trivial infrastructure in this window would help a lot to avoid tree interdependencies.

    Also remove incorrect comments that ->dirty_inode can't block. That has been changed a long time ago, and many implementations rely on it.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs (Linus Torvalds, 2011-05-26, 3 files, -2/+47)
|\
    * 'for-linus' of git://oss.sgi.com/xfs/xfs:
      xfs: correctly decrement the extent buffer index in xfs_bmap_del_extent
      xfs: check for valid indices in xfs_iext_get_ext and xfs_iext_idx_to_irec
      xfs: fix up asserts in xfs_iflush_fork
      xfs: do not do pointer arithmetic on extent records
      xfs: do not use unchecked extent indices in xfs_bunmapi
      xfs: do not use unchecked extent indices in xfs_bmapi
      xfs: do not use unchecked extent indices in xfs_bmap_add_extent_*
      xfs: remove if_lastex
      xfs: remove the unused XFS_BMAPI_RSVBLOCKS flag
      xfs: do not discard alloc btree blocks
      xfs: add online discard support
| * xfs: add online discard support (Christoph Hellwig, 2011-05-24, 3 files, -2/+47)
      Now that we have reliable tracking of deleted extents in a transaction we can easily implement "online" discard support which calls blkdev_issue_discard once a transaction commits.

      The actual discard is a two stage operation as we first have to mark the busy extent as not available for reuse before we can start the actual discard. Note that we don't bother supporting discard for the non-delaylog mode.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
* | vmscan: change shrinker API by passing shrink_control struct (Ying Han, 2011-05-25, 2 files, -4/+5)
|/
    Change each shrinker's API by consolidating the existing parameters into shrink_control struct. This will simplify any further features added w/o touching each file of shrinker.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: fix warning]
    [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
    [akpm@linux-foundation.org: fix xfs warning]
    [akpm@linux-foundation.org: update gfs2]
    Signed-off-by: Ying Han <yinghan@google.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Acked-by: Pavel Emelyanov <xemul@openvz.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Mel Gorman <mel@csn.ul.ie>
    Acked-by: Rik van Riel <riel@redhat.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Dave Hansen <dave@linux.vnet.ibm.com>
    Cc: Steven Whitehouse <swhiteho@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* xfs: reset buffer pointers before freeing them (Dave Chinner, 2011-05-19, 2 files, -0/+22)
    When we free a vmapped buffer, we need to ensure the vmap address and length we free is the same as when it was allocated. In various places in the log code we change the memory the buffer is pointing to before issuing IO, but we never reset the buffer to point back to its original memory (or no memory, if that is the case for the buffer).

    As a result, when we free the buffer it points to memory that is owned by something else and attempts to unmap and free it. Because the range does not match any known mapped range, it can trigger BUG_ON() traps in the vmap code, and potentially corrupt the vmap area tracking.

    Fix this by always resetting these buffers to their original state before freeing them.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: avoid getting stuck during async inode flushes (Dave Chinner, 2011-05-19, 1 file, -0/+10)
    When the underlying inode buffer is locked and xfs_sync_inode_attr() is doing a non-blocking flush, xfs_iflush() can return EAGAIN. When this happens, clear the error rather than returning it to xfs_inode_ag_walk(), as returning EAGAIN will result in the AG walk delaying for a short while and trying again. This can result in background walks getting stuck on the one AG until inode buffer is unlocked by some other means.

    This behaviour was noticed when analysing event traces followed by code inspection and verification of the fix via further traces.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: fix duplicate workqueue initialisation (Dave Chinner, 2011-05-19, 1 file, -4/+0)
    The workqueue initialisation function is called twice when initialising the XFS subsystem. Remove the second initialisation call.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: kill off xfs_printk() (Joe Perches, 2011-05-19, 2 files, -23/+4)
    xfs_alert_tag() can be defined using xfs_alert(), and thereby avoid using xfs_printk() altogether. This is the only remaining use of xfs_printk(), so changing it this way means xfs_printk() can simply be eliminated.

    Also add format checking to the non-debug inline function xfs_debug.

    Miscellaneous function prototype argument alignment.

    (Updated to delete the definition of xfs_printk(), which is no longer used or needed.)

    Signed-off-by: Joe Perches <joe@perches.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: ensure reclaim cursor is reset correctly at end of AG (Dave Chinner, 2011-05-09, 1 file, -0/+1)
    On a 32 bit highmem PowerPC machine, the XFS inode cache was growing without bound and exhausting low memory causing the OOM killer to be triggered. After some effort, the problem was reproduced on a 32 bit x86 highmem machine.

    The problem is that the per-ag inode reclaim index cursor was not getting reset to the start of the AG if the radix tree tag lookup found no more reclaimable inodes. Hence every further reclaim attempt started at the same index beyond where any reclaimable inodes lay, and no further background reclaim ever occurred from the AG.

    Without background inode reclaim the VM driven cache shrinker simply cannot keep up with cache growth, and OOM is the result.

    While the change that exposed the problem was the conversion of the inode reclaim to use work queues for background reclaim, it was not the cause of the bug. The bug was introduced when the cursor code was added, just waiting for some weird configuration to strike....

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Tested-By: Christian Kujau <lists@nerdbynature.de>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Alex Elder <aelder@sgi.com>
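The stuck-cursor behaviour can be modelled with a plain array scan. This is a toy model, not the XFS radix tree code: `struct demo_ag`, `demo_reclaim_next` and the `reset_on_empty` switch are all hypothetical, and the switch exists only so the buggy and fixed behaviour can be compared side by side.

```c
#include <stddef.h>

/* Toy per-AG state: a flag per inode slot plus a scan cursor. */
struct demo_ag {
    const int *reclaimable; /* non-zero means the slot can be reclaimed */
    size_t n;               /* number of slots */
    size_t cursor;          /* where the next scan starts */
};

/*
 * Find the next reclaimable slot at or after the cursor.
 * Returns its index, or -1 if the scan reached the end of the AG.
 * With reset_on_empty set, an empty scan rewinds the cursor to 0,
 * which is the behaviour the fix restores.
 */
long demo_reclaim_next(struct demo_ag *ag, int reset_on_empty)
{
    for (size_t i = ag->cursor; i < ag->n; i++) {
        if (ag->reclaimable[i]) {
            ag->cursor = i + 1;
            return (long)i;
        }
    }
    if (reset_on_empty)
        ag->cursor = 0;
    return -1;
}
```

Without the reset, a cursor that has moved past the last reclaimable slot keeps every later scan empty; with it, the next scan starts from the beginning of the AG and finds the work again.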
* xfs: add an x86 compat handler for XFS_IOC_ZERO_RANGE (Christoph Hellwig, 2011-04-28, 2 files, -1/+3)
    XFS_IOC_ZERO_RANGE uses struct xfs_flock64, and thus requires argument translation for 32-bit binaries on x86. Add the required XFS_IOC_ZERO_RANGE_32 define and add it to the list of commands that require xfs_flock64 translation.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: fix compiler warning in xfs_trace.h (Christoph Hellwig, 2011-04-28, 1 file, -1/+1)
    xfs_fsblock_t may be a 32-bit type if XFS_BIG_BLKNOS is not set, so make sure to cast a value of this type to an unsigned long long before using the ll printk qualifier.

    Reported-by: Randy Dunlap <randy.dunlap@oracle.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Alex Elder <aelder@sgi.com>
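The cast matters because the width of the argument must match the format specifier regardless of how the typedef is configured. A small userspace sketch of the same idea, using snprintf instead of printk; `demo_fsblock_t` and `demo_format_block` are hypothetical names, with the typedef deliberately set to 32 bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical block-number type that happens to be 32 bits wide here. */
typedef uint32_t demo_fsblock_t;

/*
 * Format a block number with the "ll" length modifier. The explicit cast
 * makes the argument an unsigned long long whatever the typedef's width,
 * so the format string and the argument always agree.
 */
int demo_format_block(char *buf, size_t len, demo_fsblock_t block)
{
    return snprintf(buf, len, "%llu", (unsigned long long)block);
}
```

If the typedef were later widened to 64 bits, the same call would stay correct without touching the format string.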
* xfs: reduce the number of pagb_lock roundtrips in xfs_alloc_clear_busy (Christoph Hellwig, 2011-04-28, 2 files, -1/+1)
    Instead of finding the per-ag and then taking and releasing the pagb_lock for every single busy extent completed, sort the list of busy extents and only switch between AGs where necessary. This becomes especially important with the online discard support which will hit this lock more often.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: exact busy extent tracking (Christoph Hellwig, 2011-04-28, 1 file, -68/+11)
    Update the extent tree in case we have to reuse a busy extent, so that it is always kept up to date. This is done by replacing the busy list searches with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree in case of a reuse. This allows us to reuse metadata extents unconditionally, and thus avoid log forces especially for allocation btree blocks.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* xfs: do not immediately reuse busy extent ranges (Christoph Hellwig, 2011-04-28, 1 file, -0/+33)
    Every time we reallocate a busy extent, we cause a synchronous log force to occur to ensure the freeing transaction is on disk before we continue and use the newly allocated extent. This is extremely sub-optimal as we have to mark every transaction with blocks that get reused as synchronous.

    Instead of searching the busy extent list after deciding on the extent to allocate, check each candidate extent during the allocation decisions as to whether they are in the busy list. If they are in the busy list, we trim the busy range out of the extent we have found and determine if that trimmed range is still OK for allocation. In many cases, this check can be incorporated into the allocation extent alignment code which already does trimming of the found extent before determining if it is a valid candidate for allocation.

    Based on earlier patches from Dave Chinner.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
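Trimming a busy range out of a candidate extent is interval arithmetic. The sketch below is a simplification under stated assumptions and not the XFS allocator logic: it handles a single busy range, keeps whichever remainder (left or right of the busy range) is larger, and uses hypothetical names throughout.

```c
/* Half-open block range [start, start + len). */
struct demo_extent {
    unsigned long start;
    unsigned long len;
};

/*
 * Remove the part of `cand` that overlaps `busy` and return the larger
 * of the two remaining pieces. A result with len == 0 means nothing
 * usable is left.
 */
struct demo_extent demo_trim_busy(struct demo_extent cand, struct demo_extent busy)
{
    unsigned long cend = cand.start + cand.len;
    unsigned long bend = busy.start + busy.len;
    struct demo_extent left = { cand.start, 0 };
    struct demo_extent right = { cend, 0 };

    /* No overlap: the candidate can be used unchanged. */
    if (busy.len == 0 || bend <= cand.start || busy.start >= cend)
        return cand;

    if (busy.start > cand.start)
        left.len = busy.start - cand.start;
    if (bend < cend) {
        right.start = bend;
        right.len = cend - bend;
    }
    return left.len >= right.len ? left : right;
}

/* Decide whether the trimmed range still satisfies a minimum length. */
int demo_extent_ok(struct demo_extent e, unsigned long minlen)
{
    return e.len >= minlen;
}
```

For a candidate covering blocks 100..149 and a busy range covering 110..119, the pieces are 100..109 and 120..149, and the sketch returns the second one; the caller then checks it against the minimum allocation length.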
* xfs: fix duplicate message output (Dave Chinner, 2011-04-20, 1 file, -1/+3)
    Commit 957935dc ("xfs: fix xfs_debug warnings") broke the logic in __xfs_printk(). Instead of only printing one of two possible output strings based on whether the fs has a name or not, it outputs both.

    Fix it to only output one message again.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
* Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs (Linus Torvalds, 2011-04-11, 6 files, -240/+194)
|\
    * 'for-linus' of git://oss.sgi.com/xfs/xfs:
      xfs: use proper interfaces for on-stack plugging
      xfs: fix xfs_debug warnings
      xfs: fix variable set but not used warnings
      xfs: convert log tail checking to a warning
      xfs: catch bad block numbers freeing extents.
      xfs: push the AIL from memory reclaim and periodic sync
      xfs: clean up code layout in xfs_trans_ail.c
      xfs: convert the xfsaild threads to a workqueue
      xfs: introduce background inode reclaim work
      xfs: convert ENOSPC inode flushing to use new syncd workqueue
      xfs: introduce a xfssyncd workqueue
      xfs: fix extent format buffer allocation size
      xfs: fix unreferenced var error in xfs_buf.c

    Also, applied patch from Tony Luck that fixes ia64: xfs_destroy_workqueues() should not be tagged with __exit in the branch before merging.
| * xfs_destroy_workqueues() should not be tagged with __exit (Luck, Tony, 2011-04-11, 1 file, -1/+1)
      ia64 throws away .exit sections for the built-in CONFIG case, so routines that are used in other circumstances should not be tagged as __exit.

      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * xfs: use proper interfaces for on-stack plugging (Christoph Hellwig, 2011-04-08, 1 file, -11/+9)
      Add proper blk_start_plug/blk_finish_plug pairs for the two places where we issue buffer I/O, and remove the blk_flush_plug in xfs_buf_lock and xfs_buf_iowait, given that context switches already flush the per-process plugging lists.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
| * xfs: fix xfs_debug warnings (Christoph Hellwig, 2011-04-08, 2 files, -29/+22)
      For a CONFIG_XFS_DEBUG=n build gcc complains about statements with no effect in xfs_debug:

          fs/xfs/quota/xfs_qm_syscalls.c: In function 'xfs_qm_scall_trunc_qfiles':
          fs/xfs/quota/xfs_qm_syscalls.c:291:3: warning: statement with no effect

      The reason for that is that the various new xfs message functions have a return value which is never used, and in case of the non-debug build xfs_debug the macro evaluates to a plain 0 which produces the above warnings. This can be fixed by turning xfs_debug into an inline function instead of a macro, but in addition to that I've also changed all the message helpers to return void as we never use their return values.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
| * xfs: fix variable set but not used warnings (Christoph Hellwig, 2011-04-08, 1 file, -2/+0)
      GCC 4.6 now warns about variables set but not used. Fix the trivially fixable warnings of this sort.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
| * xfs: push the AIL from memory reclaim and periodic sync (Dave Chinner, 2011-04-08, 1 file, -1/+6)
      When we are short on memory, we want to expedite the cleaning of dirty objects. Hence when we run short on memory, we need to kick the AIL flushing into action to clean as many dirty objects as quickly as possible. To implement this, sample the lsn of the log item at the head of the AIL and use that as the push target for the AIL flush.

      Further, we keep items in the AIL that are dirty that are not tracked any other way, so we can get objects sitting in the AIL that don't get written back until the AIL is pushed. Hence to get the filesystem to the idle state, we might need to push the AIL to flush out any remaining dirty objects sitting in the AIL. This requires the same push mechanism as the reclaim push.

      This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to match the new xfs_ail_max_lsn() function introduced in this patch. Similarly for xfs_trans_ail_push -> xfs_ail_push.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
| * xfs: convert the xfsaild threads to a workqueue (Dave Chinner, 2011-04-08, 1 file, -84/+43)
      Similar to the xfssyncd, the per-filesystem xfsaild threads can be converted to a global workqueue and run periodically by delayed works. This makes sense for the AIL pushing because it uses variable timeouts depending on the work that needs to be done.

      By removing the xfsaild, we simplify the AIL pushing code and remove the need to spread the code to implement the threading and pushing across multiple files.

      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
| * xfs: introduce background inode reclaim workDave Chinner2011-04-081-3/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Background inode reclaim needs to run more frequently than the XFS syncd work is run, as 30s is too long between optimal reclaim runs. Add a new periodic work item to the xfs syncd workqueue to run a fast, non-blocking inode reclaim scan. Background inode reclaim is kicked by the act of marking inodes for reclaim. When an AG is first marked as having reclaimable inodes, the background reclaim work is kicked. It will continue to run periodically until it detects that there are no more reclaimable inodes. It will be kicked again when the first inode is queued for reclaim. To ensure shrinker based inode reclaim throttles to the inode cleaning and reclaim rate but still reclaims inodes efficiently, make it kick the background inode reclaim so that when we are low on memory we are trying to reclaim inodes as efficiently as possible. This kick should not be necessary, but it will protect against failures to kick the background reclaim when inodes are first dirtied. To provide the rate throttling, make the shrinker pass do synchronous inode reclaim so that it blocks on inodes under IO. This means that the shrinker will reclaim inodes rather than just skipping over them, but it does not adversely affect the rate of reclaim because most dirty inodes are already under IO due to the background reclaim work the shrinker kicked. These two modifications solve one of the two OOM killer invocations Chris Mason reported recently when running a stress testing script. The particular workload trigger for the OOM killer invocation is where there are more threads than CPUs all unlinking files in an extremely memory constrained environment. Unlike other solutions, this one does not have a performance impact when memory is not constrained or the number of concurrent threads operating is <= the number of CPUs.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
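The kick-on-first-mark behaviour described in this message can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct ag_model`, `mark_reclaimable()` and `background_reclaim_pass()` are hypothetical names invented for illustration and are not the XFS functions touched by the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one allocation group (AG). */
struct ag_model {
    unsigned int reclaimable;   /* inodes currently marked for reclaim */
};

/*
 * Mark one inode reclaimable. Returns true only when the AG goes from
 * "nothing to reclaim" to "something to reclaim", which is the moment
 * the background work would need to be kicked.
 */
static bool mark_reclaimable(struct ag_model *ag)
{
    return ag->reclaimable++ == 0;
}

/*
 * One background pass that reclaims up to 'batch' inodes. Returns true
 * if reclaimable inodes remain, i.e. the work should be rescheduled.
 */
static bool background_reclaim_pass(struct ag_model *ag, unsigned int batch)
{
    unsigned int done = batch < ag->reclaimable ? batch : ag->reclaimable;

    ag->reclaimable -= done;
    return ag->reclaimable > 0;
}
```

In this model the periodic work stops rescheduling itself once a pass returns false, and the next call to `mark_reclaimable()` that returns true starts it again.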
| * xfs: convert ENOSPC inode flushing to use new syncd workqueueDave Chinner2011-04-082-99/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One of the problems with the current inode flush at ENOSPC is that we queue a flush per ENOSPC event, regardless of how many are already queued. This can result in hundreds of queued flushes, most of which simply burn CPU scanning and do no real work. This simply slows down allocation at ENOSPC. We really only need one active flush at a time, and we can easily implement that via the new xfs_syncd_wq. All we need to do is queue a flush if one is not already active, then block waiting for the currently active flush to complete. The result is that we only ever have a single ENOSPC inode flush active at a time and this greatly reduces the overhead of ENOSPC processing. On my 2p test machine, this results in tests exercising ENOSPC conditions running significantly faster - 042 halves execution time, 083 drops from 60s to 5s, etc - while not introducing test regressions. This allows us to remove the old xfssyncd threads and infrastructure as they are no longer used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
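The "queue a flush only if one is not already active" rule can be sketched in user space with a single atomic flag. This is an illustrative model, not the patch itself: `struct flush_state`, `flush_request()` and `flush_run()` are hypothetical names, and the blocking wait for the active flush to complete is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-filesystem flush bookkeeping. */
struct flush_state {
    atomic_bool pending;    /* true while a flush is queued or running */
    int flushes_run;        /* how many flushes actually executed */
};

static void flush_state_init(struct flush_state *fs)
{
    atomic_init(&fs->pending, false);
    fs->flushes_run = 0;
}

/*
 * Called on every ENOSPC event. Returns true if this call queued a new
 * flush, false if one was already pending and nothing more was queued.
 */
static bool flush_request(struct flush_state *fs)
{
    /* atomic_exchange returns the previous value of the flag. */
    return !atomic_exchange(&fs->pending, true);
}

/* Worker side: do the flush, then allow the next one to be queued. */
static void flush_run(struct flush_state *fs)
{
    fs->flushes_run++;
    atomic_store(&fs->pending, false);
}
```

However many ENOSPC events arrive while a flush is pending, only one flush runs in this model, which is the property the message credits for the reduced overhead.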
| * xfs: introduce a xfssyncd workqueueDave Chinner2011-04-083-58/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All of the work xfssyncd does is background functionality. There is no need for a thread per filesystem to do this work - it can all be managed by a global workqueue now that workqueues manage concurrency effectively. Introduce a new global xfssyncd workqueue, and convert the periodic work to use this new functionality. To do this, use a delayed work construct to schedule the next running of the periodic sync work for the filesystem. When the sync work is complete, queue a new delayed work for the next running of the sync work. For laptop mode, we wait on completion for the sync works, so ensure that the sync work queuing interface can flush and wait for work to complete to enable the work queue infrastructure to replace the current sequence number and wakeup that is used. Because the sync work does non-trivial amounts of work, mark the new work queue as CPU intensive. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
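The self-rearming pattern described here (run the sync work, then queue a new delayed work for the next run) can be modelled in user space with a simulated clock. This is a sketch under assumptions: `struct delayed_work_model` and the three functions are hypothetical and only imitate the shape of a delayed work item. They are not the kernel workqueue API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one delayed work item, driven by a fake clock. */
struct delayed_work_model {
    bool queued;            /* is a run currently scheduled? */
    unsigned long due;      /* simulated time of the next run */
    unsigned long interval; /* delay between runs */
    int runs;               /* how many times the worker has run */
};

/* Schedule the work 'delay' ticks from 'now' unless already queued. */
static void queue_delayed(struct delayed_work_model *w,
                          unsigned long now, unsigned long delay)
{
    if (!w->queued) {
        w->queued = true;
        w->due = now + delay;
    }
}

/* The periodic worker: do the sync, then re-arm for the next interval. */
static void sync_worker(struct delayed_work_model *w, unsigned long now)
{
    w->runs++;
    queue_delayed(w, now, w->interval);
}

/* Advance the simulated clock; run the worker if it has come due. */
static void tick(struct delayed_work_model *w, unsigned long now)
{
    if (w->queued && now >= w->due) {
        w->queued = false;
        sync_worker(w, now);
    }
}
```

Because the worker re-arms itself only after it has finished, two runs for the same filesystem never overlap in this model, and no dedicated thread is needed between runs.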
| * xfs: fix unreferenced var error in xfs_buf.cDave Chinner2011-03-301-2/+0
| | | | | | | | | | Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
* | Fix common misspellingsLucas De Marchi2011-03-315-6/+6
|/ | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2011-03-286-309/+151
|\ | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: stop using the page cache to back the buffer cache xfs: register the inode cache shrinker before quotachecks xfs: xfs_trans_read_buf() should return an error on failure xfs: introduce inode cluster buffer trylocks for xfs_iflush vmap: flush vmap aliases when mapping fails xfs: preallocation transactions do not need to be synchronous Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_buf.c due to plug removal.
| * xfs: stop using the page cache to back the buffer cacheDave Chinner2011-03-262-297/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that the buffer cache has its own LRU, we do not need to use the page cache to provide persistent caching and reclaim infrastructure. Convert the buffer cache to use alloc_pages() instead of the page cache. This will remove all the overhead of page cache management from setup and teardown of the buffers, as well as the need to mark pages accessed as we find buffers in the buffer cache. By avoiding the page cache, we also remove the need to keep state in the page_private(page) field for persistent storage across buffer free/buffer rebuild and so all that code can be removed. This also fixes the long-standing problem of not having enough bits in the page_private field to track all the state needed for a 512 sector/64k page setup. It also removes the need for page locking during reads as the pages are unique to the buffer and nobody else will be attempting to access them. Finally, it removes the buftarg address space lock as a point of global contention on workloads that allocate and free buffers quickly such as when creating or removing large numbers of inodes in parallel. This removes the 16TB limit on filesystem size on 32 bit machines as the page index (32 bit) is no longer used for lookups of metadata buffers - the buffer cache is now solely indexed by disk address which is stored in a 64 bit field in the buffer. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
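The 16TB figure follows from the width of the page index. As a worked example, assuming a 4 KiB page size (the message states only the 32 bit index and the 16TB limit), a 32 bit index can address 2^32 pages of 2^12 bytes, which is 2^44 bytes. The helper name below is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes addressable with an index of 'index_bits' bits over units of
 * 'unit_size' bytes. Assumes index_bits < 64 and that the product
 * fits in 64 bits.
 */
static uint64_t max_indexable_bytes(unsigned int index_bits,
                                    uint64_t unit_size)
{
    return ((uint64_t)1 << index_bits) * unit_size;
}
```

`max_indexable_bytes(32, 4096)` evaluates to 2^44 bytes, i.e. 16 TiB. Indexing by a 64 bit disk address removes this ceiling.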
| * xfs: register the inode cache shrinker before quotachecksDave Chinner2011-03-261-10/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During mount, we can do a quotacheck that involves a bulkstat pass on all inodes. If there are more inodes in the filesystem than can be held in memory, we require the inode cache shrinker to run to ensure that we don't run out of memory. Unfortunately, the inode cache shrinker is not registered until we get to the end of the superblock setup process, which is after a quotacheck is run if it is needed. Hence we need to register the inode cache shrinker earlier in the mount process so that we don't OOM during mount. This requires that we also initialise the syncd work before we register the shrinker, so we need to juggle that around as well. While there, make sure that we have set up the block sizes in the VFS superblock correctly before the quotacheck is run so that any inodes that are cached as a result of the quotacheck have their block size fields set up correctly. Cc: stable@kernel.org Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
| * xfs: introduce inode cluster buffer trylocks for xfs_iflushDave Chinner2011-03-262-4/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an ABBA deadlock between synchronous inode flushing in xfs_reclaim_inode and xfs_ifree_cluster. xfs_ifree_cluster locks the buffer, then takes inode ilocks, whilst synchronous reclaim takes the ilock followed by the buffer lock in xfs_iflush(). To avoid this deadlock, separate the inode cluster buffer locking semantics from the synchronous inode flush semantics, allowing callers to attempt to lock the buffer but still issue synchronous IO if it can get the buffer. This requires xfs_iflush() calls that currently use non-blocking semantics to pass SYNC_TRYLOCK rather than 0 as the flags parameter. This allows xfs_reclaim_inode to avoid the deadlock on the buffer lock and detect the failure so that it can drop the inode ilock and restart the reclaim attempt on the inode. This allows xfs_ifree_cluster to obtain the inode lock, mark the inode stale and release it and hence defuse the deadlock situation. It also has the pleasant side effect of avoiding IO in xfs_reclaim_inode when it next tries to reclaim the inode as it is now marked stale. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
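The way a trylock breaks the ABBA ordering can be sketched with a tiny user-space lock model. Everything here is an assumption made for illustration: `struct trylock`, `reclaim_inode_model()` and the result enum are hypothetical, the locks are plain atomic flags, and for simplicity the inode lock is also taken with a try rather than a blocking acquire.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal try-only lock built on an atomic flag. */
struct trylock {
    atomic_bool held;
};

static void lock_init(struct trylock *l)    { atomic_init(&l->held, false); }
static bool lock_try(struct trylock *l)     { return !atomic_exchange(&l->held, true); }
static void lock_release(struct trylock *l) { atomic_store(&l->held, false); }

enum reclaim_result { RECLAIM_FLUSHED, RECLAIM_RETRY };

/*
 * Reclaim side: take the inode lock first, then only *try* the buffer
 * lock. If the buffer is held by the other path (which takes the
 * buffer first and the inode lock second), back off and drop the inode
 * lock so that path can make progress, then let the caller retry.
 */
static enum reclaim_result reclaim_inode_model(struct trylock *ilock,
                                               struct trylock *buf_lock)
{
    enum reclaim_result res = RECLAIM_RETRY;

    if (!lock_try(ilock))
        return RECLAIM_RETRY;

    if (lock_try(buf_lock)) {
        /* Synchronous IO would be issued here. */
        res = RECLAIM_FLUSHED;
        lock_release(buf_lock);
    }

    lock_release(ilock);    /* always dropped, so the other side can take it */
    return res;
}
```

The key property is that the reclaim side never waits for the buffer while holding the inode lock, so the lock-order inversion can no longer turn into a deadlock.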