Discussion:
Linux MD Raid Bug(?) w/Kernel sync_speed_min Option
Justin Piszcz
2007-05-08 12:27:35 UTC
Kernel: 2.6.21.1

Here is the bug:

md2: RAID1 (works fine)
md3: RAID5 (only syncs at the sync_speed_min set by the kernel)

If I do not run this command:
echo 55000 > /sys/block/md3/md/sync_speed_min

I will get 2 megabytes per second check speed for RAID 5.

However, the odd part is that I can leave RAID1 at its default and it
will use the maximum I/O available across both drives to run the check.

I think there is some kind of bug specific to RAID5 checks -- the check
only runs at the minimum value that is set (with the kernel defaults for
raid5 that works out to ~2 MB/s here).
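
For reference, this is roughly how the check itself is driven and
watched; a minimal sketch, assuming the usual md sysfs layout
(sync_action and the sync_speed_min/max files under /sys/block/mdX/md):

echo check > /sys/block/md3/md/sync_action   # kick off a parity check
cat /sys/block/md3/md/sync_speed_min         # per-array floor for the check speed
cat /sys/block/md3/md/sync_speed_max         # per-array ceiling
watch -n 5 cat /proc/mdstat                  # the speed= fields below come from here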

md2 : active raid1 sdb3[1] sda3[0]
      55681216 blocks [2/2] [UU]
      [===========>.........]  check = 59.1% (32937536/55681216) finish=7.4min speed=50947K/sec

md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      1318686336 blocks level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      [====>................]  check = 24.2% (35578816/146520704) finish=33.3min speed=55464K/sec

Set back to the default kernel setting, either 2000 or 2100:

echo 2000 > /sys/block/md3/md/sync_speed_min

Then,

md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      1318686336 blocks level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      [======>..............]  check = 31.5% (46191744/146520704) finish=715.7min speed=2335K/sec

There is some kind of nasty bug going on here with RAID 5 devices in
the kernel. Also, in case you were wondering, there is little to no I/O
on the RAID 5 device while this check is running, and the same goes for
the root volume.

Justin.
Neil Brown
2007-05-08 13:03:18 UTC
Post by Justin Piszcz
Kernel: 2.6.21.1
md2: RAID1 (works fine)
md3: RAID5 (only syncs at the sync_speed_min set by the kernel)
If I do not run this command:
echo 55000 > /sys/block/md3/md/sync_speed_min
I will get 2 megabytes per second check speed for RAID 5.
I can only reproduce this if I set the stripe_cache_size somewhat
larger than the default of 256 - did you do this?

This code (is_mddev_idle) has always been a bit fragile, particularly
so since the block layer started accounting IO when it finished rather
than when it started.

This patch might help though. Let me know if it does what you expect.

Thanks,
NeilBrown


Signed-off-by: Neil Brown <***@suse.de>

### Diffstat output
./drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2007-05-07 17:47:15.000000000 +1000
+++ ./drivers/md/md.c 2007-05-08 22:57:51.000000000 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
 		 *
 		 * Note: the following is an unsigned comparison.
 		 */
-		if ((curr_events - rdev->last_events + 4096) > 8192) {
+		if ((long)curr_events - (long)rdev->last_events > 8192) {
 			rdev->last_events = curr_events;
 			idle = 0;
 		}
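
If you want to see what is_mddev_idle is actually looking at while a
check runs: the event count is (roughly) the per-disk read+write sector
counters, minus the I/O the resync itself issues. A rough way to watch
the raw counters from userspace (a sketch; the field positions assume
the usual /sys/block/<dev>/stat layout):

# sectors read is field 3, sectors written is field 7 of each stat file
for d in sdc sdd sde; do
    awk -v dev=$d '{print dev, $3 + $7, "sectors"}' /sys/block/$d/stat
done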
Justin Piszcz
2007-05-08 13:13:32 UTC
Post by Neil Brown
Post by Justin Piszcz
Kernel: 2.6.21.1
md2: RAID1 (works fine)
md3: RAID5 (only syncs at the sync_speed_min set by the kernel)
If I do not run this command:
echo 55000 > /sys/block/md3/md/sync_speed_min
I will get 2 megabytes per second check speed for RAID 5.
I can only reproduce this if I set the stripe_cache_size somewhat
larger than the default of 256 - did you do this?
This code (is_mddev_idle) has always been a bit fragile, particularly
so since the block layer started accounting IO when it finished rather
than when it started.
This patch might help though. Let me know if it does what you expect.
Thanks,
NeilBrown
### Diffstat output
./drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2007-05-07 17:47:15.000000000 +1000
+++ ./drivers/md/md.c 2007-05-08 22:57:51.000000000 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
*
* Note: the following is an unsigned comparison.
*/
- if ((curr_events - rdev->last_events + 4096) > 8192) {
+ if ((long)curr_events - (long)rdev->last_events > 8192) {
rdev->last_events = curr_events;
idle = 0;
}
I can only reproduce this if I set the stripe_cache_size somewhat
larger than the default of 256 - did you do this?
Yes, upon bootup I use:
echo 16384 > /sys/block/md3/md/stripe_cache_size
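
For scale, a rough and commonly quoted estimate is that the stripe
cache costs about stripe_cache_size * PAGE_SIZE * nr_disks of RAM, so
on a 10-disk array with 4K pages:

echo $(( 16384 * 4096 * 10 / 1024 / 1024 )) MB   # roughly 640 MB pinned for the stripe cache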

I have applied this patch and will test it now.

Justin.

Justin Piszcz
2007-05-08 13:24:44 UTC
Post by Neil Brown
This patch might help though. Let me know if it does what you expect.
Thanks,
NeilBrown
### Diffstat output
./drivers/md/md.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c 2007-05-07 17:47:15.000000000 +1000
+++ ./drivers/md/md.c 2007-05-08 22:57:51.000000000 +1000
@@ -5095,7 +5095,7 @@ static int is_mddev_idle(mddev_t *mddev)
*
* Note: the following is an unsigned comparison.
*/
- if ((curr_events - rdev->last_events + 4096) > 8192) {
+ if ((long)curr_events - (long)rdev->last_events > 8192) {
rdev->last_events = curr_events;
idle = 0;
}
Neil, awesome patch-- what are the chances of it getting merged into
2.6.22?

md3 : active raid5 sdl1[9] sdk1[8] sdj1[7] sdi1[6] sdh1[5] sdg1[4] sdf1[3] sde1[2] sdd1[1] sdc1[0]
      1318686336 blocks level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      [>....................]  check =  0.5% (854084/146520704) finish=42.6min speed=56938K/sec

md0 : active raid1 sdb1[1] sda1[0]
      16787776 blocks [2/2] [UU]
      [=>...................]  check =  7.5% (1265984/16787776) finish=3.6min speed=70332K/sec

$ cat /sys/block/md2/md/sync_speed_min
1000 (system)

$ cat /sys/block/md3/md/sync_speed_min
1000 (system)
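
(The "(system)" there just means the array has no per-array override
and is inheriting the system-wide defaults, which can be seen via the
raid sysctls:)

cat /proc/sys/dev/raid/speed_limit_min   # system-wide floor, 1000 KB/sec by default
cat /proc/sys/dev/raid/speed_limit_max   # system-wide ceiling, 200000 KB/sec by default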

Working as advertised (utilizing all idle I/O)!

Justin.


Neil Brown
2007-05-09 09:13:10 UTC
Post by Justin Piszcz
Neil, awesome patch-- what are the chances of it getting merged into
2.6.22?
Probably. I want to think it through a bit more - to make sure I can
write a coherent and correct changelog entry.

NeilBrown
Mark A. O'Neil
2007-05-08 17:24:47 UTC
Hello,

I hope this is the appropriate forum for this request; if not, please
direct me to the correct one.

I have a system running FC6, 2.6.20-1.2925, with software RAID5, and a
power outage seems to have borked the file structure on the RAID.

Boot shows the following disks:
sda #first disk in raid5: 250GB
sdb #the boot disk: 80GB
sdc #second disk in raid5: 250GB
sdd #third disk in raid5: 250GB
sde #fourth disk in raid5: 250GB

When I boot the system, the kernel panics with the following info displayed:
...
ata1.00: cmd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x25)
ata1.00: cmd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
EXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode block - inode=8, block=1027
EXT3-fs: invalid journal inode
mount: error mounting /dev/root on /sysroot as ext3: invalid argument
setuproot: moving /dev failed: no such file or directory
setuproot: error mounting /proc: no such file or directory
setuproot: error mounting /sys: no such file or directory
switchroot: mount failed: no such file or directory
Kernel panic - not syncing: Attempted to kill init!

At which point the system locks as expected.

Another, perhaps unrelated, tidbit: when viewing sda1 using (I think;
I did not write down the exact command) mdadm --misc --examine on the
device, I see (in part) data describing the device in the array:

sda1 raid 4, total 4, active 4, working 4
and then a listing of disks sdc1, sdd1, sde1 all of which show

Viewing the remaining disks in the list shows:
sdX1 raid 4, total 3, active 3, working 3

and then a listing of the disks, with the first disk being shown as
removed. It seems that the other disks do not have a reference to
sda1? That in itself is perplexing to me, but I vaguely recall seeing
that before - it has been a while since I set the system up.

Anyway, I think the ext3-fs error is less an issue with the software
raid and more an issue that fsck could fix. My problem is how to
non-destructively mount the raid from the rescue disk so that I can run
fsck on the raid. I do not think mounting and running fsck on the
individual disks is the correct solution.

Some straightforward instructions (or a pointer to some) on doing this
from the rescue prompt would be most useful. I have been searching the
last couple of evenings and have yet to find something I completely
understand. I have little experience with software raid and mdadm, and
while this is an excellent opportunity to learn a bit (and I am), I
would like to recover my data in a timely fashion rather than mess it
up beyond recovery as the result of a dolt's interpretation of a man
page. The applications and the data itself are replaceable - just
time-consuming, as in days, rather than what I hope, with proper
instruction, will amount to an evening or two of work to mount the
RAID and run fsck.

I appreciate your time and any assistance you may be able to provide.
If the above is not sufficient, let me know and I will try to get more
info.

regards and thank you,

-m
Michael Tokarev
2007-05-08 20:04:48 UTC
Post by Mark A. O'Neil
Hello,
I hope this is the appropriate forum for this request if not please
direct me to the correct one.
I have a system running FC6, 2.6.20-1.2925, software RAID5 and a power
outage seems to have borked the file structure on the RAID.
sda #first disk in raid5: 250GB
sdb #the boot disk: 80GB
sdc #second disk in raid5: 250GB
sdd #third disk in raid5: 250GB
sde #fourth disk in raid5: 250GB
...
ata1.00: cmd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: (BMDMA stat 0x25)
ata1.00: cmd c8/00:08:e6:3e:13/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 in
EXT3-fs error (device sda3): ext3_get_inode_loc: unable to read inode block - inode=8, block=1027
EXT3-fs: invalid journal inode
mount: error mounting /dev/root on /sysroot as ext3: invalid argument
setuproot: moving /dev failed: no such file or directory
setuproot: error mounting /proc: no such file or directory
setuproot: error mounting /sys: no such file or directory
switchroot: mount failed: no such file or directory
Kernel panic - not syncing: Attempted to kill init!
Wug.
Post by Mark A. O'Neil
At which point the system locks as expected.
Another perhaps not related tidbit is when viewing sda1 using (I think
I did not write down the command) mdadm --misc --examine device I see
sda1 raid 4, total 4, active 4, working 4
and then a listing of disks sdc1, sdd1, sde1 all of which show
sdX1 raid 4, total 3, active 3, working 3
Are you sure it's raid4, not raid5? Because if it really is raid4 now,
but you had a raid5 array before, you're screwed, and the only way to
recover is to re-create the array (without losing data) by re-writing
the superblocks (see below).

BTW, --misc can be omitted - you only need

mdadm -E /dev/sda1
Post by Mark A. O'Neil
and then a listing of the disks with the first disk being shown as removed.
It seems that the other disks do not have a reference to sda1? That in
itself is perplexing to me but I vaguely recall seeing that before - it
has been awhile since I set the system up.
Check the UUID values on all drives (also from mdadm -E output) - they
should be the same. And compare the "Events" field in there too. Maybe
you had a 4-disk array before, but later re-created it with 3 disks?
Another possible cause is disk failures resulting in bad superblock
reads, but that's highly unlikely.
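
A quick way to eyeball that across all the drives (just a sketch;
adjust the device list, the field names are what mdadm -E prints for
0.90 superblocks):

for d in /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1; do
    echo "== $d"
    mdadm -E $d | egrep 'UUID|Events|Raid Level|Raid Devices'
done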
Post by Mark A. O'Neil
Anyway, I think the ext3-fs error is less an issue with the software
raid and more an issue that fsck could fix. My problem is how to
non-destructively mount the raid from the rescue disk so that I can run
fsck on the raid. I do not think mounting and running fsck on the
individual disks is the correct solution.
Some straight forward instructions (or a pointer to some) on doing this
from the rescue prompt would be most useful. I have been searching the
last couple evenings and have yet to find something I completely
understand. I have little experience with software raid and mdadm and
while this is an excellent opportunity to learn a bit (and I am) I would
like to successfully recover my data in a more timely fashion rather
than mess it up beyond recovery as the result of a dolt interpretation
of a man page. The applications and data itself is replaceable - just
time consuming as in days rather than what I hope, with proper
instruction, will amount to an evening or two worth of work to mount the
RAID and run fsck.
Not sure about pointers. But here are some points.

Figure out which arrays/disks you really had. The raid level and number
of drives are really important.

Now two "mantras":

mdadm --assemble /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1

This will try to bring the array up. It will either come up ok, or
will fail due to event-count mismatches (a difference of more than 1).

In case you have more than 1 mismatch, you can try adding the --force
option to tell mdadm to ignore mismatches and do the best it can. The
array won't resync; it will be started from the "best" (n-1) drives.

If there's a drive error, you can omit the bad drive from the command
and assemble a degraded array, but before doing so, see which drives
are fresher (by examining the Events counts in mdadm -E output). If
one of the remaining drives has a (much) lower event count than the
rest, while the bad one is (more or less) good, there's a good chance
you have a bad (unrecoverable) filesystem. This happens if the
lower-events drive was kicked out of the array (for whatever reason)
long before your latest disaster, and hence contains very old data;
you have very little chance of recovering without the bad drive.

And another mantra, which can be helpful if assemble doesn't work for
some reason:

mdadm --create /dev/md0 --level=5 --raid-devices=4 --layout=x --chunk=c \
    --assume-clean \
    /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1

This will re-create the superblocks, but not touch any data inside.
The magic word is --assume-clean - it stops the md subsystem from
starting any resync, assuming the array is already all ok.

For this to work, you have to have all the parameters correct,
including the order of the component devices. You can collect that
information from your existing superblocks, and you can experiment
with different options till you see something that looks like a
filesystem.
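
A non-destructive way to test each guess before trusting it is to
re-create, then only read from the result; a sketch (file and dumpe2fs
don't write anything):

mdadm --stop /dev/md0          # stop the previous attempt first
mdadm --create /dev/md0 --level=5 --raid-devices=4 --layout=x --chunk=c \
    --assume-clean /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1
file -s /dev/md0               # should mention an ext3 filesystem once the guess is right
dumpe2fs -h /dev/md0 | head    # read-only peek at the ext3 superblock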

Instead of giving all 4 devices, you can use the literal word "missing"
in place of any of them, like this:

mdadm --create /dev/md0 --level=5 --raid-devices=4 --layout=x --chunk=c \
    /dev/sda1 missing /dev/sdd1 /dev/sde1

(no need to specify --assume-clean as there's nothing to resync on a
degraded array). With the same note: you still have to specify all the
correct parameters (if you didn't specify chunk-size and layout when
initially creating the array, you can omit them here as well, since
mdadm will pick the same defaults).

And finally, when everything looks ok, you can add the missing drive
back using

mdadm --add /dev/md0 /dev/sdX1

(where sdX1 is the missing drive). Or, in case you re-created the
superblocks with --create --assume-clean, you should probably start a
repair on the array (echo repair > /sys/block/md0/md/sync_action) --
but I bet it will not work out this way, i.e. such a build will not be
satisfactory.

And oh, in case you need to re-create the array (the 2nd "mantra"),
you will probably have to rebuild your initial ramdisk too. Depending
on the way your initrd is built, it may use the UUID to find the parts
of the array, and that UUID will be rewritten.

One additional note. You may have a hard time with ext3fs trying to
forcibly replay the journal while experimenting with different
options. It's a sad thing, but if ext3 isn't unmounted correctly, it
insists on replaying the journal and refuses to work (even fsck)
without that. But while trying different combinations to find the
best set to work with, writing to the array is a no-no. To ensure
that doesn't happen, you can start the array read-only; echo 1 >
/sys/module/md_mod/parameters/start_ro will help here. But I'm not
sure if ext3's fsck will be able to do anything with a read-only
device...
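
So a cautious first pass could look something like this (a sketch
only; fsck stays in "report, change nothing" mode while you're still
unsure of the array):

echo 1 > /sys/module/md_mod/parameters/start_ro   # newly assembled arrays start read-only
mdadm --assemble /dev/md0 /dev/sda1 /dev/sdc1 /dev/sdd1 /dev/sde1
cat /proc/mdstat                                  # should show the array as (auto-read-only)
fsck.ext3 -n /dev/md0                             # -n: report problems, change nothing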

BTW, for such recovery purposes, I use an initrd (initramfs really,
but it does not matter) with a normal (but tiny) set of commands
inside, thanks to busybox. So everything can be done without any help
from an external "recovery CD". Very handy at times, especially since
all the network drivers are there on the initramfs too, so I can even
start a netcat server while in the initramfs and perform recovery from
a remote system... ;)

Good luck!

/mjt
Nix
2007-05-09 06:29:00 UTC
Post by Michael Tokarev
BTW, for such recovery purposes, I use initrd (initramfs really, but
does not matter) with a normal (but tiny) set of commands inside,
thanks to busybox. So everything can be done without any help from
external "recovery CD". Very handy at times, especially since all
the network drivers are here on the initramfs too, so I can even
start a netcat server while in initramfs, and perform recovery from
remote system... ;)
What you should probably do is drop into the shell that's being used to
run init if mount fails (or, more generally, if after mount runs it
hasn't ended up mounting anything: there's no need to rely on mount's
success/failure status). E.g., from my initramfs's init script
(obviously this is not runnable as-is due to all the variables, but it
should get the idea across):

if [ -n "$root" ]; then
    /bin/mount -o "$OPTS" -t "$TYPE" "$ROOT" /new-root
fi

if /bin/mountpoint /new-root >/dev/null; then :; else
    echo "No root filesystem given to the kernel or found on the root RAID array."
    echo "Append the correct 'root=', 'root-type=', and/or 'root-options='"
    echo "boot options."
    echo
    echo "Dropping to a minimal shell. Reboot with Ctrl-Alt-Delete."

    exec /bin/sh
fi
Michael Tokarev
2007-05-09 11:34:50 UTC
Post by Nix
Post by Michael Tokarev
BTW, for such recovery purposes, I use initrd (initramfs really, but
does not matter) with a normal (but tiny) set of commands inside,
thanks to busybox. So everything can be done without any help from
external "recovery CD". Very handy at times, especially since all
the network drivers are here on the initramfs too, so I can even
start a netcat server while in initramfs, and perform recovery from
remote system... ;)
What you should probably do is drop into the shell that's being used to
run init if mount fails (or, more generally, if after mount runs it
That's exactly what my initscript does ;)

chk() {
    while ! "$@"; do
        warn "the following command failed:"
        warn "$*"
        p="** Continue(Ignore)/Shell/Retry (C/s/r)? "
        while : ; do
            if ! read -t 10 -p "$p" x 2>&1; then
                echo "(timeout, continuing)"
                return 1
            fi
            case "$x" in
                [Ss!]*) /bin/sh 2>&1 ;;
                [Rr]*) break;;
                [CcIi]*|"") return 1;;
                *) echo "(unrecognized response)";;
            esac
        done
    done
}

chk mount -n -t proc proc /proc
chk mount -n -t sysfs sysfs /sys
...
info "mounting $rootfstype fs on $root (options: $rootflags)"
chk mount -n -t $rootfstype -o $rootflags $root /root
if [ $? != 0 ] && ! grep -q "^[^ ]\\+ /root " /proc/mounts; then
    warn "root filesystem ($rootfstype on $root) is NOT mounted!"
fi
...
Post by Nix
hasn't ended up mounting anything: there's no need to rely on mount's
success/failure status). [...]
Well, so far the exit code has been reliable.

/mjt
Nix
2007-05-09 19:50:44 UTC
Post by Michael Tokarev
Post by Nix
Post by Michael Tokarev
BTW, for such recovery purposes, I use initrd (initramfs really, but
does not matter) with a normal (but tiny) set of commands inside,
thanks to busybox. So everything can be done without any help from
external "recovery CD". Very handy at times, especially since all
the network drivers are here on the initramfs too, so I can even
start a netcat server while in initramfs, and perform recovery from
remote system... ;)
What you should probably do is drop into the shell that's being used to
run init if mount fails (or, more generally, if after mount runs it
That's exactly what my initscript does ;)
I thought so. I was really talking to Mark, I suppose.
Post by Michael Tokarev
chk() {
warn "the following command failed:"
warn "$*"
p="** Continue(Ignore)/Shell/Retry (C/s/r)? "
Wow. Feature-rich :)) I may reuse this rather nifty stuff.
Post by Michael Tokarev
Post by Nix
hasn't ended up mounting anything: there's no need to rely on mount's
success/failure status). [...]
Well, so far exitcode has been reliable.
I guess I was being paranoid because I'm using busybox, and at various
times the exit codes of its internal commands have been...
unimplemented or unreliable.
--
`In the future, company names will be a 32-character hex string.'
--- Bruce Schneier on the shortage of company names
Mark A. O'Neil
2007-05-16 16:10:36 UTC
I want to thank everyone for their suggestions.

After much fiddling about I eventually pushed things beyond repair,
so I had to start from scratch after all - no big deal, I had a backup,
so that is good.

So I took the opportunity to play a bit with mdadm (adding, removing,
repairing, etc.), and I think a crisis will be averted should a
similar problem arise in the future.
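
(For anyone wanting to practice the same things without risking real
disks, a throwaway array on loopback files works well; a rough sketch:)

# four 100 MB backing files and a scratch raid5 to play with
for i in 0 1 2 3; do
    dd if=/dev/zero of=/tmp/d$i bs=1M count=100
    losetup /dev/loop$i /tmp/d$i
done
mdadm --create /dev/md9 --level=5 --raid-devices=4 \
    /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
mdadm /dev/md9 --fail /dev/loop2      # simulate a dead disk
mdadm /dev/md9 --remove /dev/loop2
mdadm /dev/md9 --add /dev/loop2       # then watch the rebuild in /proc/mdstat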

regards,
-m