aboutsummaryrefslogtreecommitdiffstats
path: root/fs/ext4
Commit message (Collapse)AuthorAgeFilesLines
...
* ext4/jbd2: don't wait (forever) for stale tid caused by wraparoundTheodore Ts'o2013-05-132-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d76a3a77113db020d9bb1e894822869410450bd9 upstream. In the case where an inode has a very stale transaction id (tid) in i_datasync_tid or i_sync_tid, it's possible that after a very large (2**31) number of transactions, that the tid number space might wrap, causing tid_geq()'s calculations to fail. Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily", attempted to fix this problem, but it only avoided kjournald spinning forever by fixing the logic in jbd2_log_start_commit(). Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c that might call jbd2_log_start_commit() with a stale tid, those functions will subsequently call jbd2_log_wait_commit() with the same stale tid, and then wait for a very long time. To fix this, we replace the calls to jbd2_log_start_commit() and jbd2_log_wait_commit() with a call to a new function, jbd2_complete_transaction(), which will correctly handle stale tid's. As a bonus, jbd2_complete_transaction() will avoid locking j_state_lock for writing unless a commit needs to be started. This should have a small (but probably not measurable) improvement for ext4's scalability. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: George Barnett <gbarnett@atlassian.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix data=journal fast mount/umount hangTheodore Ts'o2013-03-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2b405bfa84063bfa35621d2d6879f52693c614b0 upstream. In data=journal mode, if we unmount the file system before a transaction has a chance to complete, when the journal inode is being evicted, we can end up calling into jbd2_log_wait_commit() for the last transaction, after the journalling machinery has been shut down. Arguably we should adjust ext4_should_journal_data() to return FALSE for the journal inode, but the only place it matters is ext4_evict_inode(), and so to save a bit of CPU time, and to make the patch much more obviously correct by inspection(tm), we'll fix it by explicitly not trying to waiting for a journal commit when we are evicting the journal inode, since it's guaranteed to never succeed in this case. This can be easily replicated via: mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb ------------[ cut here ]------------ WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd() Hardware name: Bochs JBD2: bad log_start_commit: 3005630206 3005630206 0 0 Modules linked in: Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020 Call Trace: [<c015c0ef>] warn_slowpath_common+0x68/0x7d [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd [<c015c177>] warn_slowpath_fmt+0x2b/0x2f [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd [<c02b8075>] jbd2_log_start_commit+0x24/0x34 [<c0279ed5>] ext4_evict_inode+0x71/0x2e3 [<c021f0ec>] evict+0x94/0x135 [<c021f9aa>] iput+0x10a/0x110 [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce [<c0175284>] ? bit_waitqueue+0x50/0x50 [<c028d23f>] ext4_put_super+0x52/0x294 [<c020efe3>] generic_shutdown_super+0x48/0xb4 [<c020f071>] kill_block_super+0x22/0x60 [<c020f3e0>] deactivate_locked_super+0x22/0x49 [<c020f5d6>] deactivate_super+0x30/0x33 [<c0222795>] mntput_no_expire+0x107/0x10c [<c02233a7>] sys_umount+0x2cf/0x2e0 [<c02233ca>] sys_oldumount+0x12/0x14 [<c08096b8>] syscall_call+0x7/0xb ---[ end trace 6a954cc790501c1f ]--- jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: use atomic64_t for the per-flexbg free_clusters countTheodore Ts'o2013-03-275-15/+15
| | | | | | | | | | | | | | | | | | | | | commit 90ba983f6889e65a3b506b30dc606aa9d1d46cd2 upstream. A user who was using a 8TB+ file system and with a very large flexbg size (> 65536) could cause the atomic_t used in the struct flex_groups to overflow. This was detected by PaX security patchset: http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551 This bug was introduced in commit 9f24e4208f7e, so it's been around since 2.6.30. :-( Fix this by using an atomic64_t for struct orlav_stats's free_clusters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: convert number of blocks to clusters properlyLukas Czerner2013-03-273-7/+7
| | | | | | | | | | | | | | | commit 810da240f221d64bf90020f25941b05b378186fe upstream. We're using macro EXT4_B2C() to convert number of blocks to number of clusters for bigalloc file systems. However, we should be using EXT4_NUM_B2C(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: - Adjust context - Drop changes in ext4_setup_new_descs() and ext4_calculate_overhead()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix the wrong number of the allocated blocks in ext4_split_extent()Zheng Liu2013-03-271-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3a2256702e47f68f921dfad41b1764d05c572329 upstream. This commit fixes a wrong return value of the number of the allocated blocks in ext4_split_extent. When the length of blocks we want to allocate is greater than the length of the current extent, we return a wrong number. Let's see what happens in the following case when we call ext4_split_extent(). map: [48, 72] ex: [32, 64, u] 'ex' will be split into two parts: ex1: [32, 47, u] ex2: [48, 64, w] 'map->m_len' is returned from this function, and the value is 24. But the real length is 16. So it should be fixed. Meanwhile in this commit we use right length of the allocated blocks when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents is called. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix kernel BUG on large-scale rm -rf commandsTheodore Ts'o2013-03-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 89a4e48f8479f8145eca9698f39fe188c982212f upstream. Commit 968dee7722: "ext4: fix hole punch failure when depth is greater than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel crashes when users ran run "rm -rf" on large directory hierarchy on ext4 filesystems on RAID devices: BUG: unable to handle kernel NULL pointer dereference at 0000000000000028 Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710) Call Trace: [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110 [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0 [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0 [<ffffffff81207e05>] ext4_truncate+0xf5/0x100 [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490 [<ffffffff811a1312>] evict+0xa2/0x1a0 [<ffffffff811a1513>] iput+0x103/0x1f0 [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0 [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40 [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50 [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00 RIP [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0 This could be reproduced as follows: The problem in commit 968dee7722 was that caused the variable 'i' to be left uninitialized if the truncate required more space than was available in the journal. This resulted in the function ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused ext4_ext_remove_space() to restart the truncate operation after starting a new jbd2 handle. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: Marti Raudsepp <marti@juffo.org> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix hole punch failure when depth is greater than 0Ashish Sangwan2013-03-061-17/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 968dee77220768a5f52cf8b21d0bdb73486febef upstream. Whether to continue removing extents or not is decided by the return value of function ext4_ext_more_to_rm() which checks 2 conditions: a) if there are no more indexes to process. b) if the number of entries are decreased in the header of "depth -1". In case of hole punch, if the last block to be removed is not part of the last extent index than this index will not be deleted, hence the number of valid entries in the extent header of "depth - 1" will remain as it is and ext4_ext_more_to_rm will return 0 although the required blocks are not yet removed. This patch fixes the above mentioned problem as instead of removing the extents from the end of file, it starts removing the blocks from the particular extent from which removing blocks is actually required and continue backward until done. Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: rewrite punch hole to use ext4_ext_remove_space()Lukas Czerner2013-03-061-85/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 5f95d21fb6f2aaa52830e5b7fb405f6c71d3ab85 upstream. This commit rewrites ext4 punch hole implementation to use ext4_ext_remove_space() instead of its home gown way of doing this via ext4_ext_map_blocks(). There are several reasons for changing this. Firstly it is quite non obvious that punching hole needs to ext4_ext_map_blocks() to punch a hole, especially given that this function should map blocks, not unmap it. It also required a lot of new code in ext4_ext_map_blocks(). Secondly the design of it is not very effective. The reason is that we are trying to punch out blocks in ext4_ext_punch_hole() in opposite direction than in ext4_ext_rm_leaf() which causes the ext4_ext_rm_leaf() to iterate through the whole tree from the end to the start to find the requested extent for every extent we are going to punch out. And finally the current implementation does not use the existing code, but bring a lot of new code, which is IMO unnecessary since there already is some infrastructure we can use. Specifically ext4_ext_remove_space(). This commit changes ext4_ext_remove_space() to accept 'end' parameter so we can not only truncate to the end of file, but also remove the space in the middle of the file (punch a hole). Moreover, because the last block to punch out, might be in the middle of the extent, we have to split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either remove the whole fist part of split extent, or change its size. ext4_ext_remove_space() is then used to actually remove the space (extents) from within the hole, instead of ext4_ext_map_blocks(). Note that this also fix the issue with punch hole, where we would forget to remove empty index blocks from the extent tree, resulting in double free block error and file system corruption. This is simply because we now use different code path, where this problem does not exist. This has been tested with fsx running for several days and xfstests, plus xfstest #251 with '-o discard' run on the loop image (which converts discard requestes into punch hole to the backing file). All of it on 1K and 4K file system block size. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2.y: move EXT4_EXT_DATA_VALID{1,2} along with the other extent splitting flags] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix free clusters calculation in bigalloc filesystemLukas Czerner2013-03-061-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 304e220f0879198b1f5309ad6f0be862b4009491 upstream. ext4_has_free_clusters() should tell us whether there is enough free clusters to allocate, however number of free clusters in the file system is converted to blocks using EXT4_C2B() which is not only wrong use of the macro (we should have used EXT4_NUM_B2C) but it's also completely wrong concept since everything else is in cluster units. Moreover when calculating number of root clusters we should be using macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be off by one. However r_blocks_count should always be a multiple of the cluster ratio so doing a plain bit shift should be enough here. We avoid using EXT4_B2C() because it's confusing. As a result of the first problem number of free clusters is much bigger than it should have been and ext4_has_free_clusters() would return 1 even if there is really not enough free clusters available. Fix this by removing the EXT4_C2B() conversion of free clusters and using bit shift when calculating number of root clusters. This bug affects number of xfstests tests covering file system ENOSPC situation handling. With this patch most of the ENOSPC problems with bigalloc file system disappear, especially the errors caused by delayed allocation not having enough space when the actual allocation is finally requested. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix xattr block allocation/release with bigallocLukas Czerner2013-03-061-3/+4
| | | | | | | | | | | | | | | | | | | | | | | commit 1231b3a1eb5740192aeebf5344dd6d6da000febf upstream. Currently when new xattr block is created or released we we would call dquot_free_block() or dquot_alloc_block() respectively, among the else decrementing or incrementing the number of blocks assigned to the inode by one block. This however does not work for bigalloc file system because we always allocate/free the whole cluster so we have to count with that in dquot_free_block() and dquot_alloc_block() as well. Use the clusters-to-blocks conversion EXT4_C2B() when passing number of blocks to the dquot_alloc/free functions to fix the problem. The problem has been revealed by xfstests #117 (and possibly others). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix race in ext4_mb_add_n_trim()Niu Yawei2013-03-061-3/+3
| | | | | | | | | | | commit f1167009711032b0d747ec89a632a626c901a1ad upstream. In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when changing the lg_prealloc_list. Signed-off-by: Niu Yawei <yawei.niu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix possible use-after-free with AIOJan Kara2013-03-062-6/+5
| | | | | | | | | | | | | | | | commit 091e26dfc156aeb3b73bc5c5f277e433ad39331c upstream. Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: return ENOMEM if sb_getblk() failsTheodore Ts'o2013-03-066-24/+30
| | | | | | | | | | | | | | | | commit 860d21e2c585f7ee8a4ecc06f474fdc33c9474f4 upstream. The only reason for sb_getblk() failing is if it can't allocate the buffer_head. So ENOMEM is more appropriate than EIO. In addition, make sure that the file system is marked as being inconsistent if sb_getblk() fails. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: - Adjust context - Drop change to inline.c - Call to ext4_ext_check() from ext4_ext_find_extent() is conditional] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: do not try to write superblock on ro remount w/o journalMichael Tokarev2013-01-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | commit d096ad0f79a782935d2e06ae8fb235e8c5397775 upstream. When a journal-less ext4 filesystem is mounted on a read-only block device (blockdev --setro will do), each remount (for other, unrelated, flags, like suid=>nosuid etc) results in a series of scary messages from kernel telling about I/O errors on the device. This is becauese of the following code ext4_remount(): if (sbi->s_journal == NULL) ext4_commit_super(sb, 1); at the end of remount procedure, which forces writing (flushing) of a superblock regardless whenever it is dirty or not, if the filesystem is readonly or not, and whenever the device itself is readonly or not. We only need call ext4_commit_super when the file system had been previously mounted read/write. Thanks to Eric Sandeen for help in diagnosing this issue. Signed-off-By: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: check dioread_nolock on remountJan Kara2013-01-161-9/+11
| | | | | | | | | | | | | | | | commit 261cb20cb2f0737a247aaf08dff7eb065e3e5b66 upstream. Currently we allow enabling dioread_nolock mount option on remount for filesystems where blocksize < PAGE_CACHE_SIZE. This isn't really supported so fix the bug by moving the check for blocksize != PAGE_CACHE_SIZE into parse_options(). Change the original PAGE_SIZE to PAGE_CACHE_SIZE along the way because that's what we are really interested in. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix extent tree corruption caused by hole punchForrest Liu2013-01-161-4/+18
| | | | | | | | | | | | commit c36575e663e302dbaa4d16b9c72d2c9a913a9aef upstream. When depth of extent tree is greater than 1, logical start value of interior node is not correctly updated in ext4_ext_rm_idx. Signed-off-by: Forrest Liu <forrestl@synology.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Ashish Sangwan <ashishsangwan2@gmail.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: init pagevec in ext4_da_block_invalidatepagesEric Sandeen2013-01-031-0/+1
| | | | | | | | | | | | | | | | | | commit 66bea92c69477a75a5d37b9bfed5773c92a3c4b4 upstream. ext4_da_block_invalidatepages is missing a pagevec_init(), which means that pvec->cold contains random garbage. This affects whether the page goes to the front or back of the LRU when ->cold makes it to free_hot_cold_page() Reviewed-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix memory leak in ext4_xattr_set_acl()'s error pathEugene Shatokhin2013-01-031-2/+4
| | | | | | | | | | | | | | | commit 24ec19b0ae83a385ad9c55520716da671274b96c upstream. In ext4_xattr_set_acl(), if ext4_journal_start() returns an error, posix_acl_release() will not be called for 'acl' which may result in a memory leak. This patch fixes that. Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: race-condition protection for ext4_convert_unwritten_extents_endioDmitry Monakhov2012-10-301-11/+46
| | | | | | | | | | | | | | | | | | commit dee1f973ca341c266229faa5a1a5bb268bed3531 upstream. We assumed that at the time we call ext4_convert_unwritten_extents_endio() extent in question is fully inside [map.m_lblk, map->m_len] because it was already split during submission. But this may not be true due to a race between writeback vs fallocate. If extent in question is larger than requested we will split it again. Special precautions should being done if zeroout required because [map.m_lblk, map->m_len] already contains valid data. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix fdatasync() for files with only i_size changesJan Kara2012-10-171-2/+6
| | | | | | | | | | | | | | | | | commit b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5 upstream. Code tracking when transaction needs to be committed on fdatasync(2) forgets to handle a situation when only inode's i_size is changed. Thus in such situations fdatasync(2) doesn't force transaction with new i_size to disk and that can result in wrong i_size after a crash. Fix the issue by updating inode's i_datasync_tid whenever its size is updated. Reported-by: Kristian Nielsen <knielsen@knielsen-hq.org> Signed-off-by: Jan Kara <jack@suse.cz> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: always set i_op in ext4_mknod()Bernd Schubert2012-10-171-2/+0
| | | | | | | | | | | | commit 6a08f447facb4f9e29fcc30fb68060bb5a0d21c2 upstream. ext4_special_inode_operations have their own ifdef CONFIG_EXT4_FS_XATTR to mask those methods. And ext4_iget also always sets it, so there is an inconsistency. Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: online defrag is not supported for journaled filesDmitry Monakhov2012-10-171-1/+6
| | | | | | | | | | | | commit f066055a3449f0e5b0ae4f3ceab4445bead47638 upstream. Proper block swap for inodes with full journaling enabled is truly non obvious task. In order to be on a safe side let's explicitly disable it for now. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: move_extent code cleanupDmitry Monakhov2012-10-171-120/+47
| | | | | | | | | | | | | | | commit 03bd8b9b896c8e940f282f540e6b4de90d666b7c upstream. - Remove usless checks, because it is too late to check that inode != NULL at the moment it was referenced several times. - Double lock routines looks very ugly and locking ordering relays on order of i_ino, but other kernel code rely on order of pointers. Let's make them simple and clean. - check that inodes belongs to the same SB as soon as possible. Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix potential deadlock in ext4_nonda_switch()Theodore Ts'o2012-10-171-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 00d4e7362ed01987183e9528295de3213031309c upstream. In ext4_nonda_switch(), if the file system is getting full we used to call writeback_inodes_sb_if_idle(). The problem is that we can be holding i_mutex already, and this causes a potential deadlock when writeback_inodes_sb_if_idle() when it tries to take s_umount. (See lockdep output below). As it turns out we don't need need to hold s_umount; the fact that we are in the middle of the write(2) system call will keep the superblock pinned. Unfortunately writeback_inodes_sb() checks to make sure s_umount is taken, and the VFS uses a different mechanism for making sure the file system doesn't get unmounted out from under us. The simplest way of dealing with this is to just simply grab s_umount using a trylock, and skip kicking the writeback flusher thread in the very unlikely case that we can't take a read lock on s_umount without blocking. Also, we now check the cirteria for kicking the writeback thread before we decide to whether to fall back to non-delayed writeback, so if there are any outstanding delayed allocation writes, we try to get them resolved as soon as possible. [ INFO: possible circular locking dependency detected ] 3.6.0-rc1-00042-gce894ca #367 Not tainted ------------------------------------------------------- dd/8298 is trying to acquire lock: (&type->s_umount_key#18){++++..}, at: [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 but task is already holding lock: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 which lock already depends on the new lock. 2 locks held by dd/8298: #0: (sb_writers#2){.+.+.+}, at: [<c01ddcc5>] generic_file_aio_write+0x56/0xd3 #1: (&sb->s_type->i_mutex_key#8){+.+...}, at: [<c01ddcce>] generic_file_aio_write+0x5f/0xd3 stack backtrace: Pid: 8298, comm: dd Not tainted 3.6.0-rc1-00042-gce894ca #367 Call Trace: [<c015b79c>] ? console_unlock+0x345/0x372 [<c06d62a1>] print_circular_bug+0x190/0x19d [<c019906c>] __lock_acquire+0x86d/0xb6c [<c01999db>] ? mark_held_locks+0x5c/0x7b [<c0199724>] lock_acquire+0x66/0xb9 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c06db935>] down_read+0x28/0x58 [<c02277d4>] ? writeback_inodes_sb_if_idle+0x28/0x46 [<c02277d4>] writeback_inodes_sb_if_idle+0x28/0x46 [<c026f3b2>] ext4_nonda_switch+0xe1/0xf4 [<c0271ece>] ext4_da_write_begin+0x27/0x193 [<c01dcdb0>] generic_file_buffered_write+0xc8/0x1bb [<c01ddc47>] __generic_file_aio_write+0x1dd/0x205 [<c01ddce7>] generic_file_aio_write+0x78/0xd3 [<c026d336>] ext4_file_write+0x480/0x4a6 [<c0198c1d>] ? __lock_acquire+0x41e/0xb6c [<c0180944>] ? sched_clock_cpu+0x11a/0x13e [<c01967e9>] ? trace_hardirqs_off+0xb/0xd [<c018099f>] ? local_clock+0x37/0x4e [<c0209f2c>] do_sync_write+0x67/0x9d [<c0209ec5>] ? wait_on_retry_sync_kiocb+0x44/0x44 [<c020a7b9>] vfs_write+0x7b/0xe6 [<c020a9a6>] sys_write+0x3b/0x64 [<c06dd4bd>] syscall_call+0x7/0xb Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix long mount times on very big file systemsTheodore Ts'o2012-09-121-0/+4
| | | | | | | | | | | | | commit 0548bbb85337e532ca2ed697c3e9b227ff2ed4b4 upstream. Commit 8aeb00ff85a: "ext4: fix overhead calculation used by ext4_statfs()" introduced a O(n**2) calculation which makes very large file systems take forever to mount. Fix this with an optimization for non-bigalloc file systems. (For bigalloc file systems the overhead needs to be set in the the superblock.) Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: avoid kmemcheck complaint from reading uninitialized memoryTheodore Ts'o2012-09-121-0/+1
| | | | | | | | | | | | | | | | | | commit 7e731bc9a12339f344cddf82166b82633d99dd86 upstream. Commit 03179fe923 introduced a kmemcheck complaint in ext4_da_get_block_prep() because we save and restore ei->i_da_metadata_calc_last_lblock even though it is left uninitialized in the case where i_da_metadata_calc_len is zero. This doesn't hurt anything, but silencing the kmemcheck complaint makes it easier for people to find real bugs. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=45631 (which is marked as a regression). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: undo ext4_calc_metadata_amount if we fail to claim spaceTheodore Ts'o2012-08-021-11/+21
| | | | | | | | | | | | | | | | | | commit 03179fe92318e7934c180d96f12eff2cb36ef7b6 upstream. The function ext4_calc_metadata_amount() has side effects, although it's not obvious from its function name. So if we fail to claim space, regardless of whether we retry to claim the space again, or return an error, we need to undo these side effects. Otherwise we can end up incorrectly calculating the number of metadata blocks needed for the operation, which was responsible for an xfstests failure for test #271 when using an ext2 file system with delalloc enabled. Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: don't let i_reserved_meta_blocks go negativeBrian Foster2012-08-021-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | commit 97795d2a5b8d3c8dc4365d4bd3404191840453ba upstream. If we hit a condition where we have allocated metadata blocks that were not appropriately reserved, we risk underflow of ei->i_reserved_meta_blocks. In turn, this can throw sbi->s_dirtyclusters_counter significantly out of whack and undermine the nondelalloc fallback logic in ext4_nonda_switch(). Warn if this occurs and set i_allocated_meta_blocks to avoid this problem. This condition is reproduced by xfstests 270 against ext2 with delalloc enabled: Mar 28 08:58:02 localhost kernel: [ 171.526344] EXT4-fs (loop1): delayed block allocation failed for inode 14 at logical offset 64486 with max blocks 64 with error -28 Mar 28 08:58:02 localhost kernel: [ 171.526346] EXT4-fs (loop1): This should not happen!! Data will be lost 270 ultimately fails with an inconsistent filesystem and requires an fsck to repair. The cause of the error is an underflow in ext4_da_update_reserve_space() due to an unreserved meta block allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix overhead calculation used by ext4_statfs()Theodore Ts'o2012-08-024-56/+131
| | | | | | | | | | | | | | | commit 952fc18ef9ec707ebdc16c0786ec360295e5ff15 upstream. Commit f975d6bcc7a introduced bug which caused ext4_statfs() to miscalculate the number of file system overhead blocks. This causes the f_blocks field in the statfs structure to be larger than it should be. This would in turn cause the "df" output to show the number of data blocks in the file system and the number of data blocks used to be larger than they should be. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: pass a char * to ext4_count_free() instead of a buffer_head ptrTheodore Ts'o2012-08-024-8/+8
| | | | | | | | | | | commit f6fb99cadcd44660c68e13f6eab28333653621e6 upstream. Make it possible for ext4_count_free to operate on buffers and not just data in buffer_heads. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: Report max_batch_time option correctlyBen Hutchings2012-07-251-1/+1
| | | | | | | | | | commit 1d526fc91bea04ee35b7599bf8b82f86c0aaf46c upstream. Currently the value reported for max_batch_time is really the value of min_batch_time. Reported-by: Russell Coker <russell@coker.com.au> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bgTheodore Ts'o2012-06-101-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b0dd6b70f0fda17ae9762fbb72d98e40a4f66556 upstream. Ext3 filesystems that are converted to use as many ext4 file system features as possible will enable uninit_bg to speed up e2fsck times. These file systems will have a native ext3 layout of inode tables and block allocation bitmaps (as opposed to ext4's flex_bg layout). Unfortunately, in these cases, when first allocating a block in an uninitialized block group, ext4 would incorrectly calculate the number of free blocks in that block group, and then errorneously report that the file system was corrupt: EXT4-fs error (device vdd): ext4_mb_generate_buddy:741: group 30, 32254 clusters in bitmap, 32258 in gd This problem can be reproduced via: mke2fs -q -t ext4 -O ^flex_bg /dev/vdd 5g mount -t ext4 /dev/vdd /mnt fallocate -l 4600m /mnt/test The problem was caused by a bone headed mistake in the check to see if a particular metadata block was part of the block group. Many thanks to Kees Cook for finding and bisecting the buggy commit which introduced this bug (commit fd034a84e1, present since v3.2). Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Tested-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: don't set i_flags in EXT4_IOC_SETFLAGSTao Ma2012-06-101-1/+0
| | | | | | | | | | | | | commit b22b1f178f6799278d3178d894f37facb2085765 upstream. Commit 7990696 uses the ext4_{set,clear}_inode_flags() functions to change the i_flags automatically but fails to remove the error setting of i_flags. So we still have the problem of trashing state flags. Fix this by removing the assignment. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: don't trash state flags in EXT4_IOC_SETFLAGSTheodore Ts'o2012-06-101-3/+9
| | | | | | | | | | | | | | | commit 79906964a187c405db72a3abc60eb9b50d804fbc upstream. In commit 353eb83c we removed i_state_flags with 64-bit longs, But when handling the EXT4_IOC_SETFLAGS ioctl, we replace i_flags directly, which trashes the state flags which are stored in the high 32-bits of i_flags on 64-bit platforms. So use the the ext4_{set,clear}_inode_flags() functions which use atomic bit manipulation functions instead. Reported-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: remove mb_groups before tearing down the buddy_cacheSalman Qazi2012-06-101-2/+3
| | | | | | | | | | | | | | | | | | | | commit 95599968d19db175829fb580baa6b68939b320fb upstream. We can't have references held on pages in the s_buddy_cache while we are trying to truncate its pages and put the inode. All the pages must be gone before we reach clear_inode. This can only be gauranteed if we can prevent new users from grabbing references to s_buddy_cache's pages. The original bug can be reproduced and the bug fix can be verified by: while true; do mount -t ext4 /dev/ram0 /export/hda3/ram0; \ umount /export/hda3/ram0; done & while true; do cat /proc/fs/ext4/ram0/mb_groups; done Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: add ext4_mb_unload_buddy in the error pathSalman Qazi2012-06-101-0/+1
| | | | | | | | | | | commit 02b7831019ea4e7994968c84b5826fa8b248ffc8 upstream. ext4_free_blocks fails to pair an ext4_mb_load_buddy with a matching ext4_mb_unload_buddy when it fails a memory allocation. Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: add missing save_error_info() to ext4_error()Theodore Ts'o2012-06-101-0/+1
| | | | | | | | | | | | | | | commit f3fc0210c0fc91900766c995f089c39170e68305 upstream. The ext4_error() function is missing a call to save_error_info(). Since this is the function which marks the file system as containing an error, this oversight (which was introduced in 2.6.36) is quite significant, and should be backported to older stable kernels with high urgency. Reported-by: Ken Sumrall <ksumrall@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: ksumrall@google.com Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: disallow hard-linked directory in ext4_lookupAndreas Dilger2012-06-101-0/+6
| | | | | | | | | | | | commit 7e936b737211e6b54e34b71a827e56b872e958d8 upstream. A hard-linked directory to its parent can cause the VFS to deadlock, and is a sign of a corrupted file system. So detect this case in ext4_lookup(), before the rmdir() lockup scenario can take place. Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: force ro mount if ext4_setup_super() failsEric Sandeen2012-06-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | commit 7e84b6216467b84cd332c8e567bf5aa113fd2f38 upstream. If ext4_setup_super() fails i.e. due to a too-high revision, the error is logged in dmesg but the fs is not mounted RO as indicated. Tested by: # mkfs.ext4 -r 4 /dev/sdb6 # mount /dev/sdb6 /mnt/test # dmesg | grep "too high" [164919.759248] EXT4-fs (sdb6): revision level too high, forcing read-only mode # grep sdb6 /proc/mounts /dev/sdb6 /mnt/test2 ext4 rw,seclabel,relatime,data=ordered 0 0 Reviewed-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix error handling on inode bitmap corruptionJan Kara2012-05-311-2/+6
| | | | | | | | | | | | | | commit acd6ad83517639e8f09a8c5525b1dccd81cd2a10 upstream. When insert_inode_locked() fails in ext4_new_inode() it most likely means inode bitmap got corrupted and we allocated again inode which is already in use. Also doing unlock_new_inode() during error recovery is wrong since the inode does not have I_NEW set. Fix the problem by jumping to fail: (instead of fail_drop:) which declares filesystem error and does not call unlock_new_inode(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: avoid deadlock on sync-mounted FS w/o journalEric Sandeen2012-05-201-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c1bb05a657fb3d8c6179a4ef7980261fae4521d7 upstream. Processes hang forever on a sync-mounted ext2 file system that is mounted with the ext4 module (default in Fedora 16). I can reproduce this reliably by mounting an ext2 partition with "-o sync" and opening a new file an that partition with vim. vim will hang in "D" state forever. The same happens on ext4 without a journal. I am attaching a small patch here that solves this issue for me. In the sync mounted case without a journal, ext4_handle_dirty_metadata() may call sync_dirty_buffer(), which can't be called with buffer lock held. Also move mb_cache_entry_release inside lock to avoid race fixed previously by 8a2bfdcb ext[34]: EA block reference count racing fix Note too that ext2 fixed this same problem in 2006 with b2f49033 [PATCH] fix deadlock in ext2 Signed-off-by: Martin.Wilck@ts.fujitsu.com [sandeen@redhat.com: move mb_cache_entry_release before unlock, edit commit msg] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: fix endianness breakage in ext4_split_extent_at()Al Viro2012-05-111-1/+1
| | | | | | | | | | commit af1584f570b19b0285e4402a0b54731495d31784 upstream. ->ee_len is __le16, so assigning cpu_to_le32() to it is going to do Bad Things(tm) on big-endian hosts... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
* ext4: address scalability issue by removing extent cache statisticsTheodore Ts'o2012-04-223-23/+0
| | | | | | | | | | | | commit 9cd70b347e9761ea2d2ac3d758c529a48a8193e6 upstream. Andi Kleen and Tim Chen have reported that under certain circumstances the extent cache statistics are causing scalability problems due to cache line bounces. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: check for zero length extentTheodore Ts'o2012-04-021-0/+2
| | | | | | | | | | | | | | | | | | | | commit 31d4f3a2f3c73f279ff96a7135d7202ef6833f12 upstream. Explicitly test for an extent whose length is zero, and flag that as a corrupted extent. This avoids a kernel BUG_ON assertion failure. Tested: Without this patch, the file system image found in tests/f_ext_zero_len/image.gz in the latest e2fsprogs sources causes a kernel panic. With this patch, an ext4 file system error is noted instead, and the file system is marked as being corrupted. https://bugzilla.kernel.org/show_bug.cgi?id=42859 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix race between sync and completed io workJeff Moyer2012-04-023-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 491caa43639abcffaa645fbab372a7ef4ce2975c upstream. The following command line will leave the aio-stress process unkillable on an ext4 file system (in my case, mounted on /mnt/test): aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2 This is using the aio-stress program from the xfstests test suite. That particular command line tells aio-stress to do random writes to 20 files from 20 threads (one thread per file). The files are NOT preallocated, so you will get writes to random offsets within the file, thus creating holes and extending i_size. It also opens the file with O_DIRECT and O_SYNC. On to the problem. When an I/O requires unwritten extent conversion, it is queued onto the completed_io_list for the ext4 inode. Two code paths will pull work items from this list. The first is the ext4_end_io_work routine, and the second is ext4_flush_completed_IO, which is called via the fsync path (and O_SYNC handling, as well). There are two issues I've found in these code paths. First, if the fsync path beats the work routine to a particular I/O, the work routine will free the io_end structure! It does not take into account the fact that the io_end may still be in use by the fsync path. I've fixed this issue by adding yet another IO_END flag, indicating that the io_end is being processed by the fsync path. The second problem is that the work routine will make an assignment to io->flag outside of the lock. I have witnessed this result in a hang at umount. Moving the flag setting inside the lock resolved that problem. The problem was introduced by commit b82e384c7b ("ext4: optimize locking for end_io extent conversion"), which first appeared in 3.2. As such, the fix should be backported to that release (probably along with the unwritten extent conversion race fix). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix race between unwritten extent conversion and truncateJeff Moyer2012-04-023-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | commit 266991b13890049ee1a6bb95b9817f06339ee3d7 upstream. The following comment in ext4_end_io_dio caught my attention: /* XXX: probably should move into the real I/O completion handler */ inode_dio_done(inode); The truncate code takes i_mutex, then calls inode_dio_wait. Because the ext4 code path above will end up dropping the mutex before it is reacquired by the worker thread that does the extent conversion, it seems to me that the truncate can happen out of order. Jan Kara mentioned that this might result in error messages in the system logs, but that should be the extent of the "damage." The fix is pretty straight-forward: don't call inode_dio_done until the extent conversion is complete. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: ignore EXT4_INODE_JOURNAL_DATA flag with delallocLukas Czerner2012-04-022-41/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 3d2b158262826e8b75bbbfb7b97010838dd92ac7 upstream. Ext4 does not support data journalling with delayed allocation enabled. We even do not allow to mount the file system with delayed allocation and data journalling enabled, however it can be set via FS_IOC_SETFLAGS so we can hit the inode with EXT4_INODE_JOURNAL_DATA set even on file system mounted with delayed allocation (default) and that's where problem arises. The easies way to reproduce this problem is with the following set of commands: mkfs.ext4 /dev/sdd mount /dev/sdd /mnt/test1 dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 chattr +j /mnt/test1/file dd if=/dev/zero of=/mnt/test1/file bs=1M count=4 conv=notrunc chattr -j /mnt/test1/file Additionally it can be reproduced quite reliably with xfstests 272 and 269. In fact the above reproducer is a part of test 272. To fix this we should ignore the EXT4_INODE_JOURNAL_DATA inode flag if the file system is mounted with delayed allocation. This can be easily done by fixing ext4_should_*_data() functions do ignore data journal flag when delalloc is set (suggested by Ted). We also have to set the appropriate address space operations for the inode (again, ignoring data journal flag if delalloc enabled). Additionally this commit introduces ext4_inode_journal_mode() function because ext4_should_*_data() has already had a lot of common code and this change is putting it all into one function so it is easier to read. Successfully tested with xfstests in following configurations: delalloc + data=ordered delalloc + data=writeback data=journal nodelalloc + data=ordered nodelalloc + data=writeback nodelalloc + data=journal Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ext4: fix undefined behavior in ext4_fill_flex_info()Xi Wang2012-01-251-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit d50f2ab6f050311dbf7b8f5501b25f0bf64a439b upstream. Commit 503358ae01b70ce6909d19dd01287093f6b6271c ("ext4: avoid divide by zero when trying to mount a corrupted file system") fixes CVE-2009-4307 by performing a sanity check on s_log_groups_per_flex, since it can be set to a bogus value by an attacker. sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex; groups_per_flex = 1 << sbi->s_log_groups_per_flex; if (groups_per_flex < 2) { ... } This patch fixes two potential issues in the previous commit. 1) The sanity check might only work on architectures like PowerPC. On x86, 5 bits are used for the shifting amount. That means, given a large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36 is essentially 1 << 4 = 16, rather than 0. This will bypass the check, leaving s_log_groups_per_flex and groups_per_flex inconsistent. 2) The sanity check relies on undefined behavior, i.e., oversized shift. A standard-confirming C compiler could rewrite the check in unexpected ways. Consider the following equivalent form, assuming groups_per_flex is unsigned for simplicity. groups_per_flex = 1 << sbi->s_log_groups_per_flex; if (groups_per_flex == 0 || groups_per_flex == 1) { We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will completely optimize away the check groups_per_flex == 0, leaving the patched code as vulnerable as the original. GCC keeps the check, but there is no guarantee that future versions will do the same. Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ext4: add missing ext4_resize_end on error pathsDjalal Harouni2012-01-251-10/+18
| | | | | | | | | | | | | | | | | | | commit 014a1770371a028d22f364718c805f4216911ecd upstream. Online resize ioctls 'EXT4_IOC_GROUP_EXTEND' and 'EXT4_IOC_GROUP_ADD' call ext4_resize_begin() to check permissions and to set the EXT4_RESIZING bit lock, they do their work and they must finish with ext4_resize_end() which calls clear_bit_unlock() to unlock and to avoid -EBUSY errors for the next resize operations. This patch adds the missing ext4_resize_end() calls on error paths. Patch tested. Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* ext4: handle EOF correctly in ext4_bio_write_page()Yongqiang Yang2011-12-131-0/+12
| | | | | | | | | We need to zero out part of a page which beyond EOF before setting uptodate, otherwise, mapread or write will see non-zero data beyond EOF. Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org