aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md/md.c
Commit message (Collapse)AuthorAgeFilesLines
* md: use kzalloc() when bitmap is disabledBenjamin Randazzo2016-03-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit b6878d9e03043695dbf3fa1caa6dfc09db225b16 upstream. In drivers/md/md.c get_bitmap_file() uses kmalloc() for creating a mdu_bitmap_file_t called "file". 5769 file = kmalloc(sizeof(*file), GFP_NOIO); 5770 if (!file) 5771 return -ENOMEM; This structure is copied to user space at the end of the function. 5786 if (err == 0 && 5787 copy_to_user(arg, file, sizeof(*file))) 5788 err = -EFAULT But if bitmap is disabled only the first byte of "file" is initialized with zero, so it's possible to read some bytes (up to 4095) of kernel space memory from user space. This is an information leak. 5775 /* bitmap disabled, zero the first byte and copy out */ 5776 if (!mddev->bitmap_info.file) 5777 file->pathname[0] = '\0'; Change-Id: I7cd2a3c7fad2e2cb9edb8b4eff2af8a3a8f40149 Signed-off-by: Benjamin Randazzo <benjamin@randazzo.fr> Signed-off-by: NeilBrown <neilb@suse.com> [lizf: Backported to 3.4: fix both branches] Signed-off-by: Zefan Li <lizefan@huawei.com>
* md: protect against crash upon fsync on ro arraySebastian Riemer2013-03-201-0/+4
| | | | | | | | | | | | | | | | | | | | commit bbfa57c0f2243a7c31fd248d22e9861a2802cad5 upstream. If an fsync occurs on a read-only array, we need to send a completion for the IO and may not increment the active IO count. Otherwise, we hit a bug trace and can't stop the MD array anymore. By advice of Christoph Hellwig we return success upon a flush request but we return -EROFS for other writes. We detect flush requests by checking if the bio has zero sectors. Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: NeilBrown <neilb@suse.de> Reported-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Paul Menzel <paulepanter@users.sourceforge.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: Don't truncate size at 4TB for RAID0 and LinearNeilBrown2012-10-021-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | commit 667a5313ecd7308d79629c0738b0db588b0b0a4e upstream. commit 27a7b260f71439c40546b43588448faac01adb93 md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. changed 0.90 metadata handling to truncated size to 4TB as that is all that 0.90 can record. However for RAID0 and Linear, 0.90 doesn't need to record the size, so this truncation is not needed and causes working arrays to become too small. So avoid the truncation for RAID0 and Linear This bug was introduced in 3.1 and is suitable for any stable kernels from then onwards. As the offending commit was tagged for 'stable', any stable kernel that it was applied to should also get this patch. That includes at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for providing that list). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: using GFP_NOIO to allocate bio for flush requestShaohua Li2012-06-011-1/+1
| | | | | | | | | | | | | | | | commit b5e1b8cee7ad58a15d2fa79bcd7946acb592602d upstream. A flush request is usually issued in transaction commit code path, so using GFP_KERNEL to allocate memory for flush request bio falls into the classic deadlock issue. This is suitable for any -stable kernel to which it applies as it avoids a possible deadlock. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* MD: Add del_timer_sync to mddev_suspend (fix nasty panic)Jonathan Brassow2012-05-211-0/+2
| | | | | | | | | | | | | | | | commit 0d9f4f135eb6dea06bdcb7065b1e4ff78274a5e9 upstream. Use del_timer_sync to remove timer before mddev_suspend finishes. We don't want a timer going off after an mddev_suspend is called. This is especially true with device-mapper, since it can call the destructor function immediately following a suspend. This results in the removal (kfree) of the structures upon which the timer depends - resulting in a very ugly panic. Therefore, we add a del_timer_sync to mddev_suspend to prevent this. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* md: Avoid waking up a thread after it has been freed.NeilBrown2011-10-161-3/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 01f96c0a9922cd9919baf9d16febdf7016177a12 upstream. Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappeared either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protections. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.NeilBrown2011-10-031-2/+10
| | | | | | | | | | | | | | | | | | | | | | | commit 27a7b260f71439c40546b43588448faac01adb93 upstream. 0.90 metadata uses an unsigned 32bit number to count the number of kilobytes used from each device. This should allow up to 4TB per device. However we multiply this by 2 (to get sectors) before casting to a larger type, so sizes above 2TB get truncated. Also we allow rdev->sectors to be larger than 4TB, so it is possible for the array to be resized larger than the metadata can handle. So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in used. Also the sanity check at the end of super_90_load should include level 1 as it used ->size too. (RAID0 and Linear don't use ->size at all). Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* md: avoid endless recovery loop when waiting for fail device to complete.NeilBrown2011-06-281-0/+1
| | | | | | | | | | | | | | | | | | | | | If a device fails in a way that causes pending request to take a while to complete, md will not be able to immediately remove it from the array in remove_and_add_spares. It will then incorrectly look like a spare device and md will try to recover it even though it is failed. This leads to a recovery process starting and instantly aborting over and over again. We should check if the device is faulty before considering it to be a spare. This will avoid trying to start a recovery that cannot proceed. This bug was introduced in 2.6.26 so that patch is suitable for any kernel since then. Cc: stable@kernel.org Reported-by: Jim Paradis <james.paradis@stratus.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: check ->hot_remove_disk when removing diskNamhyung Kim2011-06-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store() during disk removal. The linear personality only has ->hot_add_disk and no ->hot_remove_disk, so that removing disk in the array resulted to following kernel bug: $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3] $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot BUG: unable to handle kernel NULL pointer dereference at (null) IP: [< (null)>] (null) PGD c9f5d067 PUD 8575a067 PMD 0 Oops: 0010 [#1] SMP CPU 2 Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO RIP: 0010:[<0000000000000000>] [< (null)>] (null) RSP: 0018:ffff880085757df0 EFLAGS: 00010282 RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000 RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000 FS: 00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000) Stack: ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000 ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90 ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98 Call Trace: [<ffffffff8138496a>] ? slot_store+0xaa/0x265 [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8 [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144 [<ffffffff81106b87>] vfs_write+0xb1/0x10d [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135 [<ffffffff81106cac>] sys_write+0x4d/0x77 [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [< (null)>] (null) RSP <ffff880085757df0> CR2: 0000000000000000 ---[ end trace ba5fc64319a826fb ]--- Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
* md: Using poll /proc/mdstat can monitor the events of adding a spare disks马建朋2011-06-091-0/+2
| | | | | Signed-off-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: add sync_super to mddev_t structJonathan Brassow2011-06-081-2/+13
| | | | | | | | | | | | | | Add the 'sync_super' function pointer to MD array structure (struct mddev_s) If device-mapper (dm-raid.c) is to define its own on-disk superblock and be able to load it, there must still be a way for MD to initiate superblock updates. The simplest way to make this happen is to provide a pointer in the MD array structure that can be set by device-mapper (or other module) with a function to do this. If the function has been set, it will be used; otherwise, the method with be looked up via 'super_types' as usual. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: move thread wakeups into resumeJonathan Brassow2011-06-081-3/+7
| | | | | | | | | | | | | | | Move personality and sync/recovery thread starting outside md_run. Moving the wakeup's of the personality and sync/recovery threads out of md_run and into do_md_run and mddev_resume solves two issues: 1) It allows bitmap_load to be called before the sync_thread is run and 2) when MD personalities are used by device-mapper (dm-raid.c), the start-up of the array is better alligned with device-mapper primatives (CTR/resume/suspend/DTR). I/O - in this case, recovery operations - should not happen until after a resume has taken place. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: possible typoJonathan Brassow2011-06-081-2/+2
| | | | | | | | | | | Make message a bit clearer by s/blocks/k/ I chose 'k' vs 'kiB' or 'kB' because it is what is used earlier in the message. 'k' may be a bit ambigous, but I think it's better than "blocks" which normally means 512, but means 1024 in MD. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: no sync IO while suspendedJonathan Brassow2011-06-081-1/+3
| | | | | | | | | | Disallow resync I/O while the RAID array is suspended. Recovery, resync, and metadata I/O should not be allowed while a device is suspended. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* MD: no integrity register if no gendiskJonathan Brassow2011-06-081-2/+2
| | | | | | | | | | Don't attempt md_integrity_register if there is no gendisk struct available. When MD arrays are built via device-mapper, the gendisk structure is not available via mddev. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: allow resync_start to be set while an array is active.NeilBrown2011-05-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sysfs attribute 'resync_start' (known internally as recovery_cp), records where a resync is up to. A value of 0 means the array is not known to be in-sync at all. A value of MaxSector means the array is believed to be fully in-sync. When the size of member devices of an array (RAID1,RAID4/5/6) is increased, the array can be increased to match. This process sets resync_start to the old end-of-device offset so that the new part of the array gets resynced. However with RAID1 (and RAID6) a resync is not technically necessary and may be undesirable. So it would be good if the implied resync after the array is resized could be avoided. So: change 'resync_start' so the value can be changed while the array is active, and as a precaution only allow it to be changed while resync/recovery is 'frozen'. Changing it once resync has started is not going to be useful anyway. This allows the array to be resized without a resync by: write 'frozen' to 'sync_action' write new size to 'component_size' (this will set resync_start) write 'none' to 'resync_start' write 'idle' to 'sync_action'. Also slightly improve some tests on recovery_cp when resizing raid1/raid5. Now that an arbitrary value could be set we should be more careful in our tests. Signed-off-by: NeilBrown <neilb@suse.de>
* md: reject a re-add request that cannot be honoured.NeilBrown2011-05-111-0/+10
| | | | | | | | | | | | | | | The 'add_new_disk' ioctl can be used to add a device either as a spare, or as an active disk that just needs to be resynced based on write-intent-bitmap information (re-add) Currently if a re-add is requested but fails we add as a spare instead. This makes it impossible for user-space to check for failure. So change to require that a re-add attempt will either succeed or completely fail. User-space can then decide what to do next. Signed-off-by: NeilBrown <neilb@suse.de>
* md: Fix race when creating a new md device.NeilBrown2011-05-111-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a race when creating an md device by opening /dev/mdXX. If two processes do this at much the same time they will follow the call path __blkdev_get -> get_gendisk -> kobj_lookup The first will call -> md_probe -> md_alloc -> add_disk -> blk_register_region and the race happens when the second gets to kobj_lookup after add_disk has called blk_register_region but before it returns to md_alloc. In the case the second will not call md_probe (as the probe is already done) but will get a handle on the gendisk, return to __blkdev_get which will then call md_open (via the ->open) pointer. As mddev->gendisk hasn't been set yet, md_open will think something is wrong an return with ERESTARTSYS. This can loop endlessly while the first thread makes no progress through add_disk. Nothing is blocking it, but due to scheduler behaviour it doesn't get a turn. So this is essentially a live-lock. We fix this by simply moving the assignment to mddev->gendisk before the call the add_disk() so md_open doesn't get confused. Also move blk_queue_flush earlier because add_disk should be as late as possible. To make sure that md_open doesn't complete until md_alloc has done all that is needed, we take mddev->open_mutex during the last part of md_alloc. md_open will wait for this. This can cause a lock-up on boot so Cc:ing for stable. For 2.6.36 and earlier a different patch will be needed as the 'blk_queue_flush' call isn't there. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com> Cc: stable@kernel.org
* md: Cleanup after raid45->raid0 takeoverKrzysztof Wojcik2011-04-201-0/+1
| | | | | | | | | | | | | | Problem: After raid4->raid0 takeover operation, another takeover operation (e.g raid0->raid10) results "kernel oops". Root cause: Variables 'degraded' in mddev structure is not cleared on raid45->raid0 takeover. This patch reset this variable. Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
* md: provide generic support for handling unplug callbacks.NeilBrown2011-04-181-0/+56
| | | | | | | | | | | | | When an md device adds a request to a queue, it can call mddev_check_plugged. If this succeeds then we know that the md thread will be woken up shortly, and ->plug_cnt will be non-zero until then, so some processing can be delayed. If it fails, then no unplug callback is expected and the make_request function needs to do whatever is required to make the request happen. Signed-off-by: NeilBrown <neilb@suse.de>
* md - remove old plugging code.NeilBrown2011-04-181-51/+0
| | | | | | | | | | | | | md has some plugging infrastructure for RAID5 to use because the normal plugging infrastructure required a 'request_queue', and when called from dm, RAID5 doesn't have one of those available. This relied on the ->unplug_fn callback which doesn't exist any more. So remove all of that code, both in md and raid5. Subsequent patches with restore the plugging functionality. Signed-off-by: NeilBrown <neilb@suse.de>
* Fix common misspellingsLucas De Marchi2011-03-311-1/+1
| | | | | | Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
* md: Fix integrity registration error when no devices are capableMartin K. Petersen2011-03-281-6/+2
| | | | | | | | | | | | We incorrectly returned -EINVAL when none of the devices in the array had an integrity profile. This in turn prevented mdadm from starting the metadevice. Fix this so we only return errors on mismatched profiles and memory allocation failures. Reported-by: Giacomo Catenazzi <cate@cateee.net> Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2011-03-241-12/+8
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
| * block: Require subsystems to explicitly allocate bio_set integrity mempoolMartin K. Petersen2011-03-171-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MD and DM create a new bio_set for every metadevice. Each bio_set has an integrity mempool attached regardless of whether the metadevice is capable of passing integrity metadata. This is a waste of memory. Instead we defer the allocation decision to MD and DM since we know at metadevice creation time whether integrity passthrough is needed or not. Automatic integrity mempool allocation can then be removed from bioset_create() and we make an explicit integrity allocation for the fs_bio_set. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Acked-by: Mike Snitzer <snizer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * Merge branch 'for-2.6.39/stack-plug' into for-2.6.39/coreJens Axboe2011-03-101-10/+2
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: block/blk-core.c block/blk-flush.c drivers/md/raid1.c drivers/md/raid10.c drivers/md/raid5.c fs/nilfs2/btnode.c fs/nilfs2/mdt.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * block: kill off REQ_UNPLUGJens Axboe2011-03-101-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| | * block: remove per-queue pluggingJens Axboe2011-03-101-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | | Merge branch 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds2011-03-161-1/+1
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix build failure introduced by s/freezeable/freezable/ workqueue: add system_freezeable_wq rds/ib: use system_wq instead of rds_ib_fmr_wq net/9p: replace p9_poll_task with a work net/9p: use system_wq instead of p9_mux_wq xfs: convert to alloc_workqueue() reiserfs: make commit_wq use the default concurrency level ocfs2: use system_wq instead of ocfs2_quota_wq ext4: convert to alloc_workqueue() scsi/scsi_tgt_lib: scsi_tgtd isn't used in memory reclaim path scsi/be2iscsi,qla2xxx: convert to alloc_workqueue() misc/iwmc3200top: use system_wq instead of dedicated workqueues i2o: use alloc_workqueue() instead of create_workqueue() acpi: kacpi*_wq don't need WQ_MEM_RECLAIM fs/aio: aio_wq isn't used in memory reclaim path input/tps6507x-ts: use system_wq instead of dedicated workqueue cpufreq: use system_wq instead of dedicated workqueues wireless/ipw2x00: use system_wq instead of dedicated workqueues arm/omap: use system_wq in mailbox workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUER
| * | Merge branch 'master' into for-2.6.39Tejun Heo2011-02-211-15/+32
| |\ \ | | |/
| * | workqueue: use WQ_MEM_RECLAIM instead of WQ_RESCUERTejun Heo2011-01-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | WQ_RESCUER is now an internal flag and should only be used in the workqueue implementation proper. Use WQ_MEM_RECLAIM instead. This doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de>
* | | md: Fix - again - partition detection when array becomes activeNeilBrown2011-02-241-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Revert b821eaa572fd737faaf6928ba046e571526c36c6 and f3b99be19ded511a1bf05a148276239d9f13eefa When I wrote the first of these I had a wrong idea about the lifetime of 'struct block_device'. It can disappear at any time that the block device is not open if it falls out of the inode cache. So relying on the 'size' recorded with it to detect when the device size has changed and so we need to revalidate, is wrong. Rather, we really do need the 'changed' attribute stored directly in the mddev and set/tested as appropriate. Without this patch, a sequence of: mknod / open / close / unlink (which can cause a block_device to be created and then destroyed) will result in a rescan of the partition table and consequence removal and addition of partitions. Several of these in a row can get udev racing to create and unlink and other code can get confused. With the patch, the rescan is only performed when needed and so there are no races. This is suitable for any stable kernel from 2.6.35. Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
* | | md: correctly handle probe of an 'mdp' device.NeilBrown2011-02-161-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'mdp' devices are md devices with preallocated device numbers for partitions. As such it is possible to mknod and open a partition before opening the whole device. this causes md_probe() to be called with a device number of a partition, which in-turn calls mddev_find with such a number. However mddev_find expects the number of a 'whole device' and does the wrong thing with partition numbers. So add code to mddev_find to remove the 'partition' part of a device number and just work with the 'whole device'. This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652 Reported-by: hkmaly@bigfoot.com Signed-off-by: NeilBrown <neilb@suse.de> Cc: <stable@kernel.org>
* | | md: don't set_capacity before array is active.NeilBrown2011-02-161-3/+3
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the desired size of an array is set (via sysfs) before the array is active (which is the normal sequence), we currrently call set_capacity immediately. This means that a subsequent 'open' (as can be caused by some udev-triggers program) will notice the new size and try to probe for partitions. However as the array isn't quite ready yet the read will fail. Then when the array is read, as the size doesn't change again we don't try to re-probe. So when setting array size via sysfs, only call set_capacity if the array is already active. Signed-off-by: NeilBrown <neilb@suse.de>
* | md_make_request: don't touch the bio after calling make_requestChris Mason2011-02-081-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | md_make_request was calling bio_sectors() for part_stat_add after it was calling the make_request function. This is bad because the make_request function can free the bio and because the bi_size field can change around. The fix here was suggested by Jens Axboe. It saves the sector count before the make_request call. I hit this with CONFIG_DEBUG_PAGEALLOC turned on while trying to break his pretty fusionio card. Cc: <stable@kernel.org> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: NeilBrown <neilb@suse.de>
* | md: Don't allow slot_store while resync/recovery is happening.NeilBrown2011-02-021-0/+3
| | | | | | | | | | | | | | | | | | | | Activating a spare in an array while resync/recovery is already happening can lead the that spare being marked in-sync when it isn't really. So don't allow the 'slot' to be set (this activating the device) while resync/recovery is happening. Signed-off-by: NeilBrown <neilb@suse.de>
* | md: don't clear curr_resync_completed at end of resync.NeilBrown2011-01-311-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no need to set this to zero at this point. It will be set to zero by remove_and_add_spares or at the start of md_do_sync at the latest. And setting it to zero before MD_RECOVERY_RUNNING is cleared can make a 'zero' appear briefly in the 'sync_completed' sysfs attribute just as resync is finishing. So simply remove this setting to zero. Signed-off-by: NeilBrown <neilb@suse.de>
* | md: Don't use remove_and_add_spares to remove failed devices from a ↵NeilBrown2011-01-311-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | read-only array remove_and_add_spares is called in two places where the needs really are very different. remove_and_add_spares should not be called on an array which is about to be reshaped as some extra devices might have been manually added and that would remove them. However if the array is 'read-auto', that will currently happen, which is bad. So in the 'ro != 0' case don't call remove_and_add_spares but simply remove the failed devices as the comment suggests is needed. Signed-off-by: NeilBrown <neilb@suse.de>
* | md: Remove the AllReserved flag for component devices.NeilBrown2011-01-311-8/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This flag is not needed and is used badly. Devices that are included in a native-metadata array are reserved exclusively for that array - and currently have AllReserved set. They all are bd_claimed for the rdev and so cannot be shared. Devices that are included in external-metadata arrays can be shared among multiple arrays - providing there is no overlap. These are bd_claimed for md in general - not for a particular rdev. When changing the amount of a device that is used in an array we need to check for overlap. This currently includes a check on AllReserved So even without overlap, sharing with an AllReserved device is not allowed. However the bd_claim usage already precludes sharing with these devices, so the test on AllReserved is not needed. And in fact it is wrong. As this is the only use of AllReserved, simply remove all usage and definition of AllReserved. Signed-off-by: NeilBrown <neilb@suse.de>
* | md: revert change to raid_disks on failure.NeilBrown2011-01-311-0/+2
|/ | | | | | | | If we try to update_raid_disks and it fails, we should put 'delta_disks' back to zero. This is important because some code, such as slot_store, assumes that delta_disks has been validated. Signed-off-by: NeilBrown <neilb@suse.de>
* block: restore multiple bd_link_disk_holder() supportTejun Heo2011-01-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Commit e09b457b (block: simplify holder symlink handling) incorrectly assumed that there is only one link at maximum. dm may use multiple links and expects block layer to track reference count for each link, which is different from and unrelated to the exclusive device holder identified by @holder when the device is opened. Remove the single holder assumption and automatic removal of the link and revive the per-link reference count tracking. The code essentially behaves the same as before commit e09b457b sans the unnecessary kobject reference count dancing. While at it, note that this facility should not be used by anyone else than the current ones. Sysfs symlinks shouldn't be abused like this and the whole thing doesn't belong in the block layer at all. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Milan Broz <mbroz@redhat.com> Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2011-01-131-77/+120
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: md: Fix removal of extra drives when converting RAID6 to RAID5 md: range check slot number when manually adding a spare. md/raid5: handle manually-added spares in start_reshape. md: fix sync_completed reporting for very large drives (>2TB) md: allow suspend_lo and suspend_hi to decrease as well as increase. md: Don't let implementation detail of curr_resync leak out through sysfs. md: separate meta and data devs md-new-param-to_sync_page_io md-new-param-to-calc_dev_sboffset md: Be more careful about clearing flags bit in ->recovery md: md_stop_writes requires mddev_lock. md/raid5: use sysfs_notify_dirent_safe to avoid NULL pointer md: Ensure no IO request to get md device before it is properly initialised. md: Fix single printks with multiple KERN_<level>s md: fix regression resulting in delays in clearing bits in a bitmap md: fix regression with re-adding devices to arrays with no metadata
| * md: Fix removal of extra drives when converting RAID6 to RAID5NeilBrown2011-01-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When a RAID6 is converted to a RAID5, the extra drive should be discarded. However it isn't due to a typo in a comparison. This bug was introduced in commit e93f68a1fc6 in 2.6.35-rc4 and is suitable for any -stable since than. As the extra drive is not removed, the 'degraded' counter is wrong and so the RAID5 will not respond correctly to a subsequent failure. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
| * md: range check slot number when manually adding a spare.NeilBrown2011-01-141-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | When adding a spare to an active array, we should check the slot number, but allow it to be larger than raid_disks if a reshape is being prepared. Apply the same test when adding a device to an array-under-construction. It already had most of the test in place, but not quite all. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: fix sync_completed reporting for very large drives (>2TB)Rémi Rérolle2011-01-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | The values exported in the sync_completed file are unsigned long, which overflows with very large drives, resulting in wrong values reported. Since sync_completed uses sectors as unit, we'll start getting wrong values with components larger than 2TB. This patch simply replaces the use of unsigned long by unsigned long long. Signed-off-by: Rémi Rérolle <rrerolle@lacie.com> Signed-off-by: NeilBrown <neilb@suse.de>
| * md: allow suspend_lo and suspend_hi to decrease as well as increase.NeilBrown2011-01-141-12/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region to which read/writes are suspended so that the under lying data can be manipulated without user-space noticing. Currently the window they describe can only move forwards along the device. However this is an unnecessary restriction which will cause problems with planned developments. So relax this restriction and allow these endpoints to move arbitrarily. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: Don't let implementation detail of curr_resync leak out through sysfs.NeilBrown2011-01-141-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mddev->curr_resync has artificial values of '1' and '2' which are used by the code which ensures only one resync is happening at a time on any given device. These values are internal and should never be exposed to user-space (except when translated appropriately as in the 'pending' status in /proc/mdstat). Unfortunately they are as ->curr_resync is assigned to ->curr_resync_completed and that value is directly visible through sysfs. So change the assignments to ->curr_resync_completed to get the same valued from elsewhere in a form that doesn't have the magic '1' or '2' values. Signed-off-by: NeilBrown <neilb@suse.de>
| * md: separate meta and data devsJonathan Brassow2011-01-141-4/+6
| | | | | | | | | | | | | | | | | | | | | | Allow the metadata to be on a separate device from the data. This doesn't mean the data and metadata will by on separate physical devices - it simply gives device-mapper and userspace tools more flexibility. Signed-off-by: NeilBrown <neilb@suse.de>
| * md-new-param-to_sync_page_ioJonathan Brassow2011-01-141-3/+6
| | | | | | | | | | | | | | | | | | | | Add new parameter to 'sync_page_io'. The new parameter allows us to distinguish between metadata and data operations. This becomes important later when we add the ability to use separate devices for data and metadata. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
| * md-new-param-to-calc_dev_sboffsetJonathan Brassow2011-01-141-6/+6
| | | | | | | | | | | | | | | | When we allow for separate devices for data and metadata in a later patch, we will need to be able to calculate the superblock offset based on more than the bdev. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>