Discussion:
Raid5 hang in 3.14.19
BillStuff
2014-09-26 22:33:58 UTC
Hi Neil,

I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.

It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.

on this array:
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk

I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.

I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
the raid5 array hung up with this in dmesg:

INFO: task kworker/7:2:15726 blocked for more than 120 seconds.
Tainted: P W O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/7:2 D 00000001 0 15726 2 0x00000000
Workqueue: xfs-log/md3 xfs_log_worker [xfs]
d1809b1c 00000046 e95e0a80 00000001 c15b56a0 00000008 cf92b77a 0001721a
c17d6700 c17d6700 d198bcc0 d3b6a880 c236dc00 c8290fb0 c2328468 00000352
000000e2 c209c000 00000001 e9606000 c209c000 d1809b1c c14b05f4 c12d0c4a
Call Trace:
[<c14b05f4>] ? sch_direct_xmit+0x54/0x180
[<c12d0c4a>] ? radix_tree_lookup+0xa/0x10
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e7f3>] schedule+0x23/0x60
[<c144fdf1>] md_flush_request+0x51/0xd0
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<f3dc0095>] make_request+0xb55/0xc30 [raid456]
[<c14be153>] ? ip_finish_output+0x1e3/0x440
[<c14bf5e7>] ? ip_output+0x77/0x80
[<c14beeb0>] ? ip_local_out+0x20/0x30
[<c12ae873>] ? generic_make_request_checks+0x343/0x400
[<c1049903>] ? internal_add_timer+0x13/0x40
[<c14501d1>] md_make_request+0xb1/0x1d0
[<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
[<c12af629>] generic_make_request+0x89/0xc0
[<c12af6ba>] submit_bio+0x5a/0x120
[<c1152cb8>] ? bio_add_page+0x38/0x40
[<f3c919c0>] _xfs_buf_ioapply+0x250/0x300 [xfs]
[<f3c91bb4>] xfs_buf_iorequest+0x44/0x80 [xfs]
[<f3ce7dbe>] ? xlog_bdstrat+0x1e/0x50 [xfs]
[<f3ce7dbe>] xlog_bdstrat+0x1e/0x50 [xfs]
[<f3ce9900>] xlog_sync+0x2a0/0x460 [xfs]
[<f3ce9b34>] xlog_state_release_iclog+0x74/0xa0 [xfs]
[<f3cea3ef>] xlog_write+0x4af/0x630 [xfs]
[<f3caa306>] ? kmem_zone_alloc+0x66/0xe0 [xfs]
[<f3ceb897>] xlog_cil_push+0x1e7/0x3c0 [xfs]
[<c1054c56>] ? flush_work+0x16/0x100
[<f3cec059>] xlog_cil_force_lsn+0x79/0x1c0 [xfs]
[<c106c864>] ? sched_clock_cpu+0x104/0x110
[<c106d8bc>] ? __dequeue_entity+0x3c/0x40
[<f3cea670>] _xfs_log_force+0x50/0x230 [xfs]
[<c153e254>] ? __schedule+0x224/0x7a0
[<f3cea876>] xfs_log_force+0x26/0x90 [xfs]
[<f3cebb87>] ? xlog_cil_empty+0x17/0x30 [xfs]
[<f3cea900>] xfs_log_worker+0x20/0x50 [xfs]
[<c105521c>] process_one_work+0x10c/0x340
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/u16:2:24280 blocked for more than 120 seconds.
Tainted: P W O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:2 D d56879e0 0 24280 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-9:3)
d5687a3c 00000046 00000000 d56879e0 f2fbd600 f3dbc81b 2e86d607 0001721c
c17d6700 c17d6700 c49cbcc0 e601c6e0 f3dbc81b 00000002 00000001 e9e0fc00
c2328468 0000003b d5687a28 d5687d74 00000001 d5687a3c c12b0db6 f3dba382
Call Trace:
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c12b0db6>] ? blk_flush_plug_list+0x86/0x1b0
[<f3dba382>] ? stripe_set_idx+0xa2/0xc0 [raid456]
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e7f3>] schedule+0x23/0x60
[<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
[<c145f80b>] ? bitmap_startwrite+0x5b/0x170
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
[<c10b7193>] ? delayacct_end+0x93/0xa0
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c14501d1>] md_make_request+0xb1/0x1d0
[<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
[<c12af629>] generic_make_request+0x89/0xc0
[<c12af6ba>] submit_bio+0x5a/0x120
[<c1152b09>] ? __bio_add_page+0xd9/0x220
[<c1153248>] ? bvec_alloc+0x78/0xe0
[<f3c8b612>] xfs_submit_ioend_bio.isra.9+0x32/0x40 [xfs]
[<f3c8b70c>] xfs_submit_ioend+0xec/0x120 [xfs]
[<f3c8b981>] xfs_vm_writepage+0x241/0x580 [xfs]
[<c10ec730>] __writepage+0x10/0x40
[<c10ec720>] ? mapping_tagged+0x20/0x20
[<c10ecc30>] write_cache_pages+0x170/0x3d0
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c10ec720>] ? mapping_tagged+0x20/0x20
[<c10ecec3>] generic_writepages+0x33/0x60
[<f3c8aa43>] xfs_vm_writepages+0x33/0x40 [xfs]
[<c10ee38a>] do_writepages+0x1a/0x40
[<c11486b2>] __writeback_single_inode+0x42/0x1f0
[<c107965c>] ? wake_up_bit+0x1c/0x20
[<c1149423>] writeback_sb_inodes+0x193/0x2e0
[<c11495ec>] __writeback_inodes_wb+0x7c/0xb0
[<c11497f2>] wb_writeback+0x1d2/0x270
[<c1149c11>] bdi_writeback_workfn+0x151/0x380
[<c153e254>] ? __schedule+0x224/0x7a0
[<c105521c>] process_one_work+0x10c/0x340
[<c1069cb0>] ? default_wake_function+0x10/0x20
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/3:3:26958 blocked for more than 120 seconds.
Tainted: P W O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/3:3 D 00000000 0 26958 2 0x00000000
Workqueue: md md_submit_flush_data
c6d41dd8 00000046 d128391d 00000000 00000000 00000000 2d875421 0001721c
c17d6700 c17d6700 e9d3bcc0 d4bf5610 00000000 00000000 00000000 ea469748
df1132aa 0000c335 d4bf5654 d4bf5654 de5a17aa c6d41dec c106fe31 57d98c16
Call Trace:
[<c106fe31>] ? update_curr+0xf1/0x1a0
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e7f3>] schedule+0x23/0x60
[<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
[<c107030f>] ? dequeue_entity+0x11f/0x510
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
[<c153e254>] ? __schedule+0x224/0x7a0
[<c111bf45>] ? kmem_cache_free+0xe5/0x110
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c144fd60>] md_submit_flush_data+0x30/0x70
[<c105521c>] process_one_work+0x10c/0x340
[<c104a30f>] ? mod_timer+0xef/0x1d0
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110

There's more, mostly xfs threads complaining about being blocked and
some user processes.
Thinking it was an xfs problem, I did "echo w > /proc/sysrq-trigger" and
got this in syslog:

Sep 26 12:33:06 xplane kernel: SysRq : Show Blocked State
Sep 26 12:33:06 xplane kernel: task PC stack pid father
Sep 26 12:33:06 xplane kernel: kthreadd D c106ea4c 0 2 0
0x00000000
Sep 26 12:33:06 xplane kernel: kswapd0 D 00017219 0 711
2 0x00000000
Sep 26 12:33:06 xplane kernel: md3_raid5 D ea49d770 0
1698 2 0x00000000
Sep 26 12:33:06 xplane kernel: xfsaild/md3 D e7d72700 0
2234 2 0x00000000
Sep 26 12:33:06 xplane kernel: kworker/7:2 D 00000001 0
15726 2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: xfs-log/md3 xfs_log_worker [xfs]
Sep 26 12:33:06 xplane kernel: kworker/u16:2 D d56879e0 0
24280 2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: writeback bdi_writeback_workfn
(flush-9:3)
Sep 26 12:33:06 xplane kernel: kworker/3:3 D 00000000 0
26958 2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: md md_submit_flush_data
Sep 26 12:33:06 xplane kernel: cat D 00000000 0 11412
11410 0x00000000
Sep 26 12:33:06 xplane kernel: cc1 D d00c3bac 0 12102
12101 0x00000000
Sep 26 12:33:06 xplane kernel: ld D c16ef900 0 12132
12130 0x00000000
Sep 26 12:33:06 xplane kernel: make D 00000000 0 12147
12047 0x00000000
Sep 26 12:33:06 xplane kernel: make D d896ea80 0 12374
11263 0x00000000
Sep 26 12:33:06 xplane kernel: cat D f2e51d0c 0 23219
23217 0x00000000
Sep 26 12:33:06 xplane kernel: gcc D f2e51d0c 0 23564
23563 0x00000000
Sep 26 12:33:06 xplane kernel: sh D f2e51d0c 0 25753
23615 0x00000000

which looks similar to the problem from the above mentioned list-post.

I did a few test writes to the filesystem, and it looked like xfs created
the files and could read and delete them, but there was never any disk
activity, and access to /sys/block/md3/md/{anything} was hung up.

"top" showed consistent I/O wait time, first 25% and then 50%, with no
disk activity.

In the end I had to reboot, and I'm running the test again.

Any thoughts on what's happening here or ideas to help debug it?

Thanks,
Bill

NeilBrown
2014-09-29 02:25:33 UTC
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.

From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(

The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then

cat /proc/1698/stack

might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)

NeilBrown
Post by BillStuff
INFO: task kworker/7:2:15726 blocked for more than 120 seconds.
Tainted: P W O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/7:2 D 00000001 0 15726 2 0x00000000
Workqueue: xfs-log/md3 xfs_log_worker [xfs]
d1809b1c 00000046 e95e0a80 00000001 c15b56a0 00000008 cf92b77a 0001721a
c17d6700 c17d6700 d198bcc0 d3b6a880 c236dc00 c8290fb0 c2328468 00000352
000000e2 c209c000 00000001 e9606000 c209c000 d1809b1c c14b05f4 c12d0c4a
[<c14b05f4>] ? sch_direct_xmit+0x54/0x180
[<c12d0c4a>] ? radix_tree_lookup+0xa/0x10
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e7f3>] schedule+0x23/0x60
[<c144fdf1>] md_flush_request+0x51/0xd0
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<f3dc0095>] make_request+0xb55/0xc30 [raid456]
[<c14be153>] ? ip_finish_output+0x1e3/0x440
[<c14bf5e7>] ? ip_output+0x77/0x80
[<c14beeb0>] ? ip_local_out+0x20/0x30
[<c12ae873>] ? generic_make_request_checks+0x343/0x400
[<c1049903>] ? internal_add_timer+0x13/0x40
[<c14501d1>] md_make_request+0xb1/0x1d0
[<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
[<c12af629>] generic_make_request+0x89/0xc0
[<c12af6ba>] submit_bio+0x5a/0x120
[<c1152cb8>] ? bio_add_page+0x38/0x40
[<f3c919c0>] _xfs_buf_ioapply+0x250/0x300 [xfs]
[<f3c91bb4>] xfs_buf_iorequest+0x44/0x80 [xfs]
[<f3ce7dbe>] ? xlog_bdstrat+0x1e/0x50 [xfs]
[<f3ce7dbe>] xlog_bdstrat+0x1e/0x50 [xfs]
[<f3ce9900>] xlog_sync+0x2a0/0x460 [xfs]
[<f3ce9b34>] xlog_state_release_iclog+0x74/0xa0 [xfs]
[<f3cea3ef>] xlog_write+0x4af/0x630 [xfs]
[<f3caa306>] ? kmem_zone_alloc+0x66/0xe0 [xfs]
[<f3ceb897>] xlog_cil_push+0x1e7/0x3c0 [xfs]
[<c1054c56>] ? flush_work+0x16/0x100
[<f3cec059>] xlog_cil_force_lsn+0x79/0x1c0 [xfs]
[<c106c864>] ? sched_clock_cpu+0x104/0x110
[<c106d8bc>] ? __dequeue_entity+0x3c/0x40
[<f3cea670>] _xfs_log_force+0x50/0x230 [xfs]
[<c153e254>] ? __schedule+0x224/0x7a0
[<f3cea876>] xfs_log_force+0x26/0x90 [xfs]
[<f3cebb87>] ? xlog_cil_empty+0x17/0x30 [xfs]
[<f3cea900>] xfs_log_worker+0x20/0x50 [xfs]
[<c105521c>] process_one_work+0x10c/0x340
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/u16:2:24280 blocked for more than 120 seconds.
Tainted: P W O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u16:2 D d56879e0 0 24280 2 0x00000000
Workqueue: writeback bdi_writeback_workfn (flush-9:3)
d5687a3c 00000046 00000000 d56879e0 f2fbd600 f3dbc81b 2e86d607 0001721c
c17d6700 c17d6700 c49cbcc0 e601c6e0 f3dbc81b 00000002 00000001 e9e0fc00
c2328468 0000003b d5687a28 d5687d74 00000001 d5687a3c c12b0db6 f3dba382
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c12b0db6>] ? blk_flush_plug_list+0x86/0x1b0
[<f3dba382>] ? stripe_set_idx+0xa2/0xc0 [raid456]
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e7f3>] schedule+0x23/0x60
[<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
[<c145f80b>] ? bitmap_startwrite+0x5b/0x170
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
[<c10b7193>] ? delayacct_end+0x93/0xa0
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c14501d1>] md_make_request+0xb1/0x1d0
[<c10e6ee3>] ? mempool_alloc_slab+0x13/0x20
[<c12af629>] generic_make_request+0x89/0xc0
[<c12af6ba>] submit_bio+0x5a/0x120
[<c1152b09>] ? __bio_add_page+0xd9/0x220
[<c1153248>] ? bvec_alloc+0x78/0xe0
[<f3c8b612>] xfs_submit_ioend_bio.isra.9+0x32/0x40 [xfs]
[<f3c8b70c>] xfs_submit_ioend+0xec/0x120 [xfs]
[<f3c8b981>] xfs_vm_writepage+0x241/0x580 [xfs]
[<c10ec730>] __writepage+0x10/0x40
[<c10ec720>] ? mapping_tagged+0x20/0x20
[<c10ecc30>] write_cache_pages+0x170/0x3d0
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c10ec720>] ? mapping_tagged+0x20/0x20
[<c10ecec3>] generic_writepages+0x33/0x60
[<f3c8aa43>] xfs_vm_writepages+0x33/0x40 [xfs]
[<c10ee38a>] do_writepages+0x1a/0x40
[<c11486b2>] __writeback_single_inode+0x42/0x1f0
[<c107965c>] ? wake_up_bit+0x1c/0x20
[<c1149423>] writeback_sb_inodes+0x193/0x2e0
[<c11495ec>] __writeback_inodes_wb+0x7c/0xb0
[<c11497f2>] wb_writeback+0x1d2/0x270
[<c1149c11>] bdi_writeback_workfn+0x151/0x380
[<c153e254>] ? __schedule+0x224/0x7a0
[<c105521c>] process_one_work+0x10c/0x340
[<c1069cb0>] ? default_wake_function+0x10/0x20
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/3:3:26958 blocked for more than 120 seconds.
Tainted: P W O 3.14.19fe-dirty #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/3:3 D 00000000 0 26958 2 0x00000000
Workqueue: md md_submit_flush_data
c6d41dd8 00000046 d128391d 00000000 00000000 00000000 2d875421 0001721c
c17d6700 c17d6700 e9d3bcc0 d4bf5610 00000000 00000000 00000000 ea469748
df1132aa 0000c335 d4bf5654 d4bf5654 de5a17aa c6d41dec c106fe31 57d98c16
[<c106fe31>] ? update_curr+0xf1/0x1a0
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e7f3>] schedule+0x23/0x60
[<f3dbe38b>] get_active_stripe+0x19b/0x5a0 [raid456]
[<c107030f>] ? dequeue_entity+0x11f/0x510
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<f3dbf6b0>] make_request+0x170/0xc30 [raid456]
[<c153e254>] ? __schedule+0x224/0x7a0
[<c111bf45>] ? kmem_cache_free+0xe5/0x110
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c144fd60>] md_submit_flush_data+0x30/0x70
[<c105521c>] process_one_work+0x10c/0x340
[<c104a30f>] ? mod_timer+0xef/0x1d0
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
There's more, mostly xfs threads complaining about being blocked and
some user processes.
Thinking it was an xfs problem, I did "echo w > /proc/sysrq-trigger" and
Sep 26 12:33:06 xplane kernel: SysRq : Show Blocked State
Sep 26 12:33:06 xplane kernel: task PC stack pid father
Sep 26 12:33:06 xplane kernel: kthreadd D c106ea4c 0 2 0
0x00000000
Sep 26 12:33:06 xplane kernel: kswapd0 D 00017219 0 711
2 0x00000000
Sep 26 12:33:06 xplane kernel: md3_raid5 D ea49d770 0
1698 2 0x00000000
Sep 26 12:33:06 xplane kernel: xfsaild/md3 D e7d72700 0
2234 2 0x00000000
Sep 26 12:33:06 xplane kernel: kworker/7:2 D 00000001 0
15726 2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: xfs-log/md3 xfs_log_worker [xfs]
Sep 26 12:33:06 xplane kernel: kworker/u16:2 D d56879e0 0
24280 2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: writeback bdi_writeback_workfn
(flush-9:3)
Sep 26 12:33:06 xplane kernel: kworker/3:3 D 00000000 0
26958 2 0x00000000
Sep 26 12:33:06 xplane kernel: Workqueue: md md_submit_flush_data
Sep 26 12:33:06 xplane kernel: cat D 00000000 0 11412
11410 0x00000000
Sep 26 12:33:06 xplane kernel: cc1 D d00c3bac 0 12102
12101 0x00000000
Sep 26 12:33:06 xplane kernel: ld D c16ef900 0 12132
12130 0x00000000
Sep 26 12:33:06 xplane kernel: make D 00000000 0 12147
12047 0x00000000
Sep 26 12:33:06 xplane kernel: make D d896ea80 0 12374
11263 0x00000000
Sep 26 12:33:06 xplane kernel: cat D f2e51d0c 0 23219
23217 0x00000000
Sep 26 12:33:06 xplane kernel: gcc D f2e51d0c 0 23564
23563 0x00000000
Sep 26 12:33:06 xplane kernel: sh D f2e51d0c 0 25753
23615 0x00000000
which looks similar to the problem from the above mentioned list-post.
I did a few test writes to the filesystem, and it looked like xfs
created the files, and could read them
and delete them, but there was never any disk activity and access to
/sys/block/md3/md/{anything} was hung up
"top" showed consistent I/O wait time, first 25%, then 50%, with no disc
activity
In the end I had to reboot, and I'm running the test again.
Any thoughts on what's happening here or ideas to help debug it?
Thanks,
Bill
BillStuff
2014-09-29 03:56:19 UTC
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.

md3_raid5 D ea49d770 0 1698 2 0x00000000
e833dda8 00000046 c106d92d ea49d770 e9d38554 1cc20b58 1e79a404 0001721a
c17d6700 c17d6700 e956d610 c2217470 c13af054 e9e8f000 00000000 00000000
e833dd78 00000000 00000000 00000271 00000000 00000005 00000000 0000a193
Call Trace:
[<c106d92d>] ? __enqueue_entity+0x6d/0x80
[<c13af054>] ? scsi_init_io+0x24/0xb0
[<c1072683>] ? enqueue_task_fair+0x2d3/0x660
[<c153e7f3>] schedule+0x23/0x60
[<c153db85>] schedule_timeout+0x145/0x1c0
[<c1065698>] ? update_rq_clock.part.92+0x18/0x50
[<c1067a65>] ? check_preempt_curr+0x65/0x90
[<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
[<c153ef5b>] wait_for_common+0x9b/0x110
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<c153f077>] wait_for_completion_killable+0x17/0x30
[<c105ad0a>] kthread_create_on_node+0x9a/0x110
[<c1453ecc>] md_register_thread+0x8c/0xc0
[<c1453f00>] ? md_register_thread+0xc0/0xc0
[<c145ad14>] md_check_recovery+0x304/0x490
[<c12b1192>] ? blk_finish_plug+0x12/0x40
[<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
[<c104a022>] ? try_to_del_timer_sync+0x42/0x60
[<c153db3d>] ? schedule_timeout+0xfd/0x1c0
[<c1453fe8>] md_thread+0xe8/0x100
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c1453f00>] ? md_register_thread+0xc0/0xc0
[<c105ae21>] kthread+0xa1/0xc0
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110

I've already rebooted the system, but I did get a snapshot of all the
blocked processes.
It's kind of long but I can post it if it's useful.

Thanks,
Bill
NeilBrown
2014-09-29 04:08:18 UTC
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.
It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.
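
Spelled out (just a sketch of my guess, not confirmed yet):

    md3_raid5: raid5d() -> md_check_recovery() -> md_register_thread()
               -> kthread_create_on_node(), waiting on kthreadd

    kthreadd:  forking the new thread, stuck in try_to_free_pages(),
               waiting for writeout of an xfs page to md3

    That writeout needs md3_raid5 to process stripes, so the two tasks
    wait on each other and everything queued behind md3 backs up.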

That would suggest that we need an async way to start threads...

Thanks,
NeilBrown
BillStuff
2014-09-29 04:28:17 UTC
Post by NeilBrown
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.
It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.
That would suggest that we need an async way to start threads...
Thanks,
NeilBrown
I suspect your guess is correct:

kthreadd D c106ea4c 0 2 0 0x00000000
e9d6db58 00000046 e9d6db4c c106ea4c ce493c00 00000001 1e9bb7bd 0001721a
c17d6700 c17d6700 d3b6a880 e9d38510 f2cf4c00 00000000 f2e51c00 e9d6db60
f3cec0b6 e9d38510 f2e51d14 f2e51d00 f2e51c00 00043132 00000964 0000a4b0
Call Trace:
[<c106ea4c>] ? update_blocked_averages+0x1ec/0x700
[<f3cec0b6>] ? xlog_cil_force_lsn+0xd6/0x1c0 [xfs]
[<f3cc077b>] ? xfs_bmbt_get_all+0x2b/0x40 [xfs]
[<c153e7f3>] schedule+0x23/0x60
[<f3ceaa71>] _xfs_log_force_lsn+0x141/0x270 [xfs]
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<f3ceabd8>] xfs_log_force_lsn+0x38/0x90 [xfs]
[<f3cd7ee0>] __xfs_iunpin_wait+0x80/0x100 [xfs]
[<f3cdb02d>] ? xfs_iunpin_wait+0x1d/0x30 [xfs]
[<c10799d0>] ? autoremove_wake_function+0x40/0x40
[<f3cdb02d>] xfs_iunpin_wait+0x1d/0x30 [xfs]
[<f3c99938>] xfs_reclaim_inode+0x58/0x2f0 [xfs]
[<f3c99e04>] xfs_reclaim_inodes_ag+0x234/0x330 [xfs]
[<f3c9a6a1>] ? xfs_inode_set_reclaim_tag+0x91/0x150 [xfs]
[<c115cc41>] ? fsnotify_clear_marks_by_inode+0x21/0xe0
[<f3ca5ac5>] ? xfs_fs_destroy_inode+0xa5/0xd0 [xfs]
[<c113b061>] ? destroy_inode+0x31/0x50
[<c113b160>] ? evict+0xe0/0x160
[<f3c9a7ad>] xfs_reclaim_inodes_nr+0x2d/0x40 [xfs]
[<f3ca5103>] xfs_fs_free_cached_objects+0x13/0x20 [xfs]
[<c11278ce>] super_cache_scan+0x12e/0x140
[<c10f2bb5>] shrink_slab_node+0x125/0x280
[<c1101a1c>] ? compact_zone+0x2c/0x450
[<c10f3489>] shrink_slab+0xd9/0xf0
[<c10f557d>] try_to_free_pages+0x25d/0x4f0
[<c10eb96d>] __alloc_pages_nodemask+0x52d/0x820
[<c103c452>] copy_process.part.47+0xd2/0x14e0
[<c153e254>] ? __schedule+0x224/0x7a0
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
[<c103da01>] do_fork+0xc1/0x320
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
[<c103dc8d>] kernel_thread+0x2d/0x40
[<c105b542>] kthreadd+0x122/0x170
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105b420>] ? kthread_create_on_cpu+0x60/0x60


NeilBrown
2014-09-29 04:43:56 UTC
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.
It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.
That would suggest that we need an async way to start threads...
Thanks,
NeilBrown
Yes, looks like it is - thanks.

I'll probably get a workqueue to start the thread, so the md thread doesn't
block on it.
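
Something of this shape, perhaps (sketch only, not tested; the failure path
would need the same MD_RECOVERY_* cleanup the current inline code does):

static void md_start_sync(struct work_struct *ws)
{
	struct mddev *mddev = container_of(ws, struct mddev, del_work);

	/* runs from md_misc_wq, so raid5d never waits on kthreadd itself */
	mddev->sync_thread = md_register_thread(md_do_sync, mddev, "resync");
	if (mddev->sync_thread)
		md_wakeup_thread(mddev->sync_thread);
}

and md_check_recovery() would just queue it:

	INIT_WORK(&mddev->del_work, md_start_sync);
	queue_work(md_misc_wq, &mddev->del_work);

instead of calling md_register_thread() directly.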

thanks,
NeilBrown
Post by BillStuff
kthreadd D c106ea4c 0 2 0 0x00000000
e9d6db58 00000046 e9d6db4c c106ea4c ce493c00 00000001 1e9bb7bd 0001721a
c17d6700 c17d6700 d3b6a880 e9d38510 f2cf4c00 00000000 f2e51c00 e9d6db60
f3cec0b6 e9d38510 f2e51d14 f2e51d00 f2e51c00 00043132 00000964 0000a4b0
[<c106ea4c>] ? update_blocked_averages+0x1ec/0x700
[<f3cec0b6>] ? xlog_cil_force_lsn+0xd6/0x1c0 [xfs]
[<f3cc077b>] ? xfs_bmbt_get_all+0x2b/0x40 [xfs]
[<c153e7f3>] schedule+0x23/0x60
[<f3ceaa71>] _xfs_log_force_lsn+0x141/0x270 [xfs]
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<f3ceabd8>] xfs_log_force_lsn+0x38/0x90 [xfs]
[<f3cd7ee0>] __xfs_iunpin_wait+0x80/0x100 [xfs]
[<f3cdb02d>] ? xfs_iunpin_wait+0x1d/0x30 [xfs]
[<c10799d0>] ? autoremove_wake_function+0x40/0x40
[<f3cdb02d>] xfs_iunpin_wait+0x1d/0x30 [xfs]
[<f3c99938>] xfs_reclaim_inode+0x58/0x2f0 [xfs]
[<f3c99e04>] xfs_reclaim_inodes_ag+0x234/0x330 [xfs]
[<f3c9a6a1>] ? xfs_inode_set_reclaim_tag+0x91/0x150 [xfs]
[<c115cc41>] ? fsnotify_clear_marks_by_inode+0x21/0xe0
[<f3ca5ac5>] ? xfs_fs_destroy_inode+0xa5/0xd0 [xfs]
[<c113b061>] ? destroy_inode+0x31/0x50
[<c113b160>] ? evict+0xe0/0x160
[<f3c9a7ad>] xfs_reclaim_inodes_nr+0x2d/0x40 [xfs]
[<f3ca5103>] xfs_fs_free_cached_objects+0x13/0x20 [xfs]
[<c11278ce>] super_cache_scan+0x12e/0x140
[<c10f2bb5>] shrink_slab_node+0x125/0x280
[<c1101a1c>] ? compact_zone+0x2c/0x450
[<c10f3489>] shrink_slab+0xd9/0xf0
[<c10f557d>] try_to_free_pages+0x25d/0x4f0
[<c10eb96d>] __alloc_pages_nodemask+0x52d/0x820
[<c103c452>] copy_process.part.47+0xd2/0x14e0
[<c153e254>] ? __schedule+0x224/0x7a0
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
[<c103da01>] do_fork+0xc1/0x320
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
[<c103dc8d>] kernel_thread+0x2d/0x40
[<c105b542>] kthreadd+0x122/0x170
[<c1541837>] ret_from_kernel_thread+0x1b/0x28
[<c105b420>] ? kthread_create_on_cpu+0x60/0x60
NeilBrown
2014-09-29 21:59:50 UTC
Post by NeilBrown
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.
It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.
That would suggest that we need an async way to start threads...
Thanks,
NeilBrown
Thanks for the confirmation.

I'm thinking of something like that. Very basic testing suggests it doesn't
instantly crash.

If you were to apply this patch and run your test for a week or two, that
would increase my confidence (though of course testing doesn't prove the
absence of bugs....)

Thanks,
NeilBrown


diff --git a/drivers/md/md.c b/drivers/md/md.c
index a79e51d15c2b..580d4b97696c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7770,6 +7770,33 @@ no_add:
return spares;
}

+static void md_start_sync(struct work_struct *ws)
+{
+ struct mddev *mddev = container_of(ws, struct mddev, del_work);
+
+ mddev->sync_thread = md_register_thread(md_do_sync,
+ mddev,
+ "resync");
+ if (!mddev->sync_thread) {
+ printk(KERN_ERR "%s: could not start resync"
+ " thread...\n",
+ mdname(mddev));
+ /* leave the spares where they are, it shouldn't hurt */
+ clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+ clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+ clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
+ clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+ clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+ if (test_and_clear_bit(MD_RECOVERY_RECOVER,
+ &mddev->recovery))
+ if (mddev->sysfs_action)
+ sysfs_notify_dirent_safe(mddev->sysfs_action);
+ } else
+ md_wakeup_thread(mddev->sync_thread);
+ sysfs_notify_dirent_safe(mddev->sysfs_action);
+ md_new_event(mddev);
+}
+
/*
* This routine is regularly called by all per-raid-array threads to
* deal with generic issues like resync and super-block update.
@@ -7823,6 +7850,7 @@ void md_check_recovery(struct mddev *mddev)

if (mddev_trylock(mddev)) {
int spares = 0;
+ bool sync_starting = false;

if (mddev->ro) {
/* On a read-only array we can:
@@ -7921,28 +7949,14 @@ void md_check_recovery(struct mddev *mddev)
*/
bitmap_write_all(mddev->bitmap);
}
- mddev->sync_thread = md_register_thread(md_do_sync,
- mddev,
- "resync");
- if (!mddev->sync_thread) {
- printk(KERN_ERR "%s: could not start resync"
- " thread...\n",
- mdname(mddev));
- /* leave the spares where they are, it shouldn't hurt */
- clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
- clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
- clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
- } else
- md_wakeup_thread(mddev->sync_thread);
- sysfs_notify_dirent_safe(mddev->sysfs_action);
- md_new_event(mddev);
+ INIT_WORK(&mddev->del_work, md_start_sync);
+ queue_work(md_misc_wq, &mddev->del_work);
+ sync_starting = true;
}
unlock:
wake_up(&mddev->sb_wait);

- if (!mddev->sync_thread) {
+ if (!mddev->sync_thread && !sync_starting) {
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
if (test_and_clear_bit(MD_RECOVERY_RECOVER,
&mddev->recovery))
BillStuff
2014-09-30 04:19:51 UTC
Post by NeilBrown
Post by NeilBrown
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.
It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.
That would suggest that we need an async way to start threads...
Thanks,
NeilBrown
Thanks for the confirmation.
I'm thinking of something like that. Very basic testing suggests it doesn't
instantly crash.
If you were to apply this patch and run your test for a week or two, that
would increase my confidence (though of course testing doesn't prove the
absence of bugs....)
Thanks,
NeilBrown
Got it running. I'll let you know if anything interesting happens.

Thanks,
Bill
Post by NeilBrown
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a79e51d15c2b..580d4b97696c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
return spares;
}
+static void md_start_sync(struct work_struct *ws)
+{
+ struct mddev *mddev = container_of(ws, struct mddev, del_work);
+
+ mddev->sync_thread = md_register_thread(md_do_sync,
+ mddev,
+ "resync");
+ if (!mddev->sync_thread) {
+ printk(KERN_ERR "%s: could not start resync"
+ " thread...\n",
+ mdname(mddev));
+ /* leave the spares where they are, it shouldn't hurt */
+ clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+ clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
+ clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
+ clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
+ clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
+ if (test_and_clear_bit(MD_RECOVERY_RECOVER,
+ &mddev->recovery))
+ if (mddev->sysfs_action)
+ sysfs_notify_dirent_safe(mddev->sysfs_action);
+ } else
+ md_wakeup_thread(mddev->sync_thread);
+ sysfs_notify_dirent_safe(mddev->sysfs_action);
+ md_new_event(mddev);
+}
+
/*
* This routine is regularly called by all per-raid-array threads to
* deal with generic issues like resync and super-block update.
@@ -7823,6 +7850,7 @@ void md_check_recovery(struct mddev *mddev)
if (mddev_trylock(mddev)) {
int spares = 0;
+ bool sync_starting = false;
if (mddev->ro) {
@@ -7921,28 +7949,14 @@ void md_check_recovery(struct mddev *mddev)
*/
bitmap_write_all(mddev->bitmap);
}
- mddev->sync_thread = md_register_thread(md_do_sync,
- mddev,
- "resync");
- if (!mddev->sync_thread) {
- printk(KERN_ERR "%s: could not start resync"
- " thread...\n",
- mdname(mddev));
- /* leave the spares where they are, it shouldn't hurt */
- clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
- clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
- clear_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
- clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
- clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
- } else
- md_wakeup_thread(mddev->sync_thread);
- sysfs_notify_dirent_safe(mddev->sysfs_action);
- md_new_event(mddev);
+ INIT_WORK(&mddev->del_work, md_start_sync);
+ queue_work(md_misc_wq, &mddev->del_work);
+ sync_starting = true;
}
wake_up(&mddev->sb_wait);
- if (!mddev->sync_thread) {
+ if (!mddev->sync_thread && !sync_starting) {
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
if (test_and_clear_bit(MD_RECOVERY_RECOVER,
&mddev->recovery))
BillStuff
2014-09-30 21:21:32 UTC
Post by NeilBrown
Post by NeilBrown
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.
It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.
That would suggest that we need an async way to start threads...
Thanks,
NeilBrown
Thanks for the confirmation.
I'm thinking of something like that. Very basic testing suggests it doesn't
instantly crash.
If you were to apply this patch and run your test for a week or two, that
would increase my confidence (though of course testing doesn't prove the
absence of bugs....)
Thanks,
NeilBrown
Neil,

It locked up already, but in a different way.
In this case it was triggered by a re-add in a fail / remove / add loop.

It kinda seems like md3_raid5 tried to "collect" (reap) md3_resync
before it got fully started.


INFO: task md3_raid5:1698 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
e8371dc4 00000046 c106d92d ea483770 d18a9994 ffd2177f ffffffff e8371dac
c17d6700 c17d6700 e9d3bcc0 c231bcc0 00000420 e8371d98 00007400 00000088
00000000 00000000 c33ead90 000003be 00000000 00000005 00000000 0000528b
Call Trace:
[<c106d92d>] ? __enqueue_entity+0x6d/0x80
[<c1072683>] ? enqueue_task_fair+0x2d3/0x660
[<c153e893>] schedule+0x23/0x60
[<c153dc25>] schedule_timeout+0x145/0x1c0
[<c1069cb0>] ? default_wake_function+0x10/0x20
[<c1065698>] ? update_rq_clock.part.92+0x18/0x50
[<c1067a65>] ? check_preempt_curr+0x65/0x90
[<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
[<c153effb>] wait_for_common+0x9b/0x110
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<c153f087>] wait_for_completion+0x17/0x20
[<c105b241>] kthread_stop+0x41/0xb0
[<c14540e1>] md_unregister_thread+0x31/0x40
[<c145a799>] md_reap_sync_thread+0x19/0x140
[<c145aba8>] md_check_recovery+0xe8/0x480
[<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
[<c104a022>] ? try_to_del_timer_sync+0x42/0x60
[<c104a081>] ? del_timer_sync+0x41/0x50
[<c153dbdd>] ? schedule_timeout+0xfd/0x1c0
[<c10497c0>] ? detach_if_pending+0xa0/0xa0
[<c10797b4>] ? finish_wait+0x44/0x60
[<c1454098>] md_thread+0xe8/0x100
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c1453fb0>] ? md_start_sync+0xb0/0xb0
[<c105ae21>] kthread+0xa1/0xc0
[<c15418b7>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/0:0:12491 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
e6f2ba58 00000046 00000086 00000086 e6f2ba08 c10795ff f395c035 000030ea
c17d6700 c17d6700 d18a9440 e8973cc0 e6f2ba14 c144fa3d 00000000 e6f2ba2c
c2054600 f3dbc81b ea5f4c80 c9f64060 c2054600 00000001 c9f64060 e6f2ba5c
Call Trace:
[<c10795ff>] ? __wake_up+0x3f/0x50
[<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c153e893>] schedule+0x23/0x60
[<c153dc25>] schedule_timeout+0x145/0x1c0
[<c12b1192>] ? blk_finish_plug+0x12/0x40
[<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
[<c153effb>] wait_for_common+0x9b/0x110
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<c153f087>] wait_for_completion+0x17/0x20
[<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
[<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
[<f3cefa99>] xfs_trans_read_buf_map+0x1e9/0x3e0 [xfs]
[<f3caa977>] xfs_alloc_read_agfl+0x97/0xc0 [xfs]
[<f3cad8a3>] xfs_alloc_fix_freelist+0x193/0x430 [xfs]
[<f3ce83fe>] ? xlog_grant_sub_space.isra.4+0x1e/0x70 [xfs]
[<f3cae17e>] xfs_free_extent+0x8e/0x100 [xfs]
[<c111be1e>] ? kmem_cache_alloc+0xae/0xf0
[<f3c8e12c>] xfs_bmap_finish+0x13c/0x190 [xfs]
[<f3cda6f2>] xfs_itruncate_extents+0x1b2/0x2b0 [xfs]
[<f3c8f0c3>] xfs_free_eofblocks+0x243/0x2e0 [xfs]
[<f3c9abc6>] xfs_inode_free_eofblocks+0x96/0x140 [xfs]
[<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
[<f3c9956d>] xfs_inode_ag_walk.isra.7+0x1cd/0x2f0 [xfs]
[<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
[<c12d1631>] ? radix_tree_gang_lookup_tag+0x71/0xb0
[<f3ce55fb>] ? xfs_perag_get_tag+0x2b/0xb0 [xfs]
[<f3c9a56d>] ? xfs_inode_ag_iterator_tag+0x3d/0xa0 [xfs]
[<f3c9a59a>] xfs_inode_ag_iterator_tag+0x6a/0xa0 [xfs]
[<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
[<f3c9a82e>] xfs_icache_free_eofblocks+0x2e/0x40 [xfs]
[<f3c9a858>] xfs_eofblocks_worker+0x18/0x30 [xfs]
[<c105521c>] process_one_work+0x10c/0x340
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c15418b7>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task xfs_fsr:24262 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
e4d17aec 00000082 00000096 00000096 e4d17a9c c10795ff 00000000 00000000
c17d6700 c17d6700 e9d3b7b0 c70a8000 e4d17aa8 c144fa3d 00000000 e4d17ac0
c2054600 f3dbc81b ea7ace80 d7b74cc0 c2054600 00000002 d7b74cc0 e4d17af0
Call Trace:
[<c10795ff>] ? __wake_up+0x3f/0x50
[<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c153e893>] schedule+0x23/0x60
[<c153dc25>] schedule_timeout+0x145/0x1c0
[<c12b1192>] ? blk_finish_plug+0x12/0x40
[<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
[<c153effb>] wait_for_common+0x9b/0x110
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<c153f087>] wait_for_completion+0x17/0x20
[<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
[<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
[<f3cefa29>] xfs_trans_read_buf_map+0x179/0x3e0 [xfs]
[<f3cde727>] xfs_imap_to_bp+0x67/0xe0 [xfs]
[<f3cdec7d>] xfs_iread+0x7d/0x3e0 [xfs]
[<f3c996e8>] ? xfs_inode_alloc+0x58/0x1b0 [xfs]
[<f3c9a092>] xfs_iget+0x192/0x580 [xfs]
[<f3caa219>] ? kmem_free+0x19/0x50 [xfs]
[<f3ca0ce6>] xfs_bulkstat_one_int+0x86/0x2c0 [xfs]
[<f3ca0f54>] xfs_bulkstat_one+0x34/0x40 [xfs]
[<f3ca0c10>] ? xfs_internal_inum+0xa0/0xa0 [xfs]
[<f3ca1360>] xfs_bulkstat+0x400/0x870 [xfs]
[<f3c9ae86>] xfs_ioc_bulkstat+0xb6/0x160 [xfs]
[<f3ca0f20>] ? xfs_bulkstat_one_int+0x2c0/0x2c0 [xfs]
[<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
[<f3c9d433>] xfs_file_ioctl+0x783/0xa60 [xfs]
[<c106bf9b>] ? __update_cpu_load+0xab/0xd0
[<c105dc58>] ? hrtimer_forward+0xa8/0x1b0
[<c12d4050>] ? timerqueue_add+0x50/0xb0
[<c108d143>] ? ktime_get+0x53/0xe0
[<c1094035>] ? clockevents_program_event+0x95/0x130
[<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
[<c11349b2>] do_vfs_ioctl+0x2e2/0x4c0
[<c1095699>] ? tick_program_event+0x29/0x30
[<c105e42c>] ? hrtimer_interrupt+0x13c/0x2a0
[<c12d8dfb>] ? lockref_put_or_lock+0xb/0x30
[<c1134c08>] SyS_ioctl+0x78/0x80
[<c1540fa8>] syscall_call+0x7/0x7
INFO: task md3_resync:24817 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
c2d95e08 00000046 00000092 00000000 00000020 00000092 1befb1af 000030e0
c17d6700 c17d6700 e9d3c6e0 d18a9950 c2d95ddc c12adbc5 00000000 00000001
00000000 00000001 c2d95dec e95f0000 00000001 c2d95e08 00000292 00000292
Call Trace:
[<c12adbc5>] ? queue_unplugged+0x45/0x90
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e893>] schedule+0x23/0x60
[<c145782d>] md_do_sync+0x8dd/0x1010
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c1453fb0>] ? md_start_sync+0xb0/0xb0
[<c1454098>] md_thread+0xe8/0x100
[<c107945f>] ? __wake_up_locked+0x1f/0x30
[<c1453fb0>] ? md_start_sync+0xb0/0xb0
[<c105ae21>] kthread+0xa1/0xc0
[<c15418b7>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110

Again, the blocked task info is edited for space; hopefully these are
the important ones,
but I've got more if it helps.

Thanks,
Bill
NeilBrown
2014-09-30 22:54:09 UTC
Post by BillStuff
Post by NeilBrown
Post by NeilBrown
Post by BillStuff
Post by NeilBrown
Post by BillStuff
Hi Neil,
I found something that looks similar to the problem described in
"Re: seems like a deadlock in workqueue when md do a flush" from Sept 14th.
It's on 3.14.19 with 7 recent patches for fixing raid1 recovery hangs.
md3 : active raid5 sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb1[1] sda1[0]
104171200 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 1/5 pages [4KB], 2048KB chunk
I was running a test doing parallel kernel builds, read/write loops, and
disk add / remove / check loops,
on both this array and a raid1 array.
I was trying to stress test your recent raid1 fixes, which went well,
but then after 5 days,
I think this is different to the workqueue problem you mentioned, though as I
don't know exactly what caused either I cannot be certain.
From the data you provided it looks like everything is waiting on
get_active_stripe(), or on a process that is waiting on that.
That seems pretty common whenever anything goes wrong in raid5 :-(
The md3_raid5 task is listed as blocked, but no stack trace is given.
If the machine is still in the state, then
cat /proc/1698/stack
might be useful.
(echo t > /proc/sysrq-trigger is always a good idea)
Might this help? I believe the array was doing a "check" when things
hung up.
It looks like it was trying to start doing a 'check'.
The 'resync' thread hadn't been started yet.
What is 'kthreadd' doing?
My guess is that it is in try_to_free_pages() waiting for writeout
for some xfs file page onto the md array ... which won't progress until
the thread gets started.
That would suggest that we need an async way to start threads...
Thanks,
NeilBrown
Thanks for the confirmation.
I'm thinking of something like that. Very basic testing suggests it doesn't
instantly crash.
If you were to apply this patch and run your test for a week or two, that
would increase my confidence (though of course testing doesn't prove the
absence of bugs....)
Thanks,
NeilBrown
Neil,
It locked up already, but in a different way.
In this case it was triggered by a re-add in a fail / remove / add loop.
It kinda seems like md3_raid5 tried to "collect" (reap) md3_resync
before it got fully started.
INFO: task md3_raid5:1698 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
e8371dc4 00000046 c106d92d ea483770 d18a9994 ffd2177f ffffffff e8371dac
c17d6700 c17d6700 e9d3bcc0 c231bcc0 00000420 e8371d98 00007400 00000088
00000000 00000000 c33ead90 000003be 00000000 00000005 00000000 0000528b
[<c106d92d>] ? __enqueue_entity+0x6d/0x80
[<c1072683>] ? enqueue_task_fair+0x2d3/0x660
[<c153e893>] schedule+0x23/0x60
[<c153dc25>] schedule_timeout+0x145/0x1c0
[<c1069cb0>] ? default_wake_function+0x10/0x20
[<c1065698>] ? update_rq_clock.part.92+0x18/0x50
[<c1067a65>] ? check_preempt_curr+0x65/0x90
[<c1067aa8>] ? ttwu_do_wakeup+0x18/0x120
[<c153effb>] wait_for_common+0x9b/0x110
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<c153f087>] wait_for_completion+0x17/0x20
[<c105b241>] kthread_stop+0x41/0xb0
[<c14540e1>] md_unregister_thread+0x31/0x40
[<c145a799>] md_reap_sync_thread+0x19/0x140
[<c145aba8>] md_check_recovery+0xe8/0x480
[<f3dc3a10>] raid5d+0x20/0x4c0 [raid456]
[<c104a022>] ? try_to_del_timer_sync+0x42/0x60
[<c104a081>] ? del_timer_sync+0x41/0x50
[<c153dbdd>] ? schedule_timeout+0xfd/0x1c0
[<c10497c0>] ? detach_if_pending+0xa0/0xa0
[<c10797b4>] ? finish_wait+0x44/0x60
[<c1454098>] md_thread+0xe8/0x100
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c1453fb0>] ? md_start_sync+0xb0/0xb0
[<c105ae21>] kthread+0xa1/0xc0
[<c15418b7>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task kworker/0:0:12491 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
e6f2ba58 00000046 00000086 00000086 e6f2ba08 c10795ff f395c035 000030ea
c17d6700 c17d6700 d18a9440 e8973cc0 e6f2ba14 c144fa3d 00000000 e6f2ba2c
c2054600 f3dbc81b ea5f4c80 c9f64060 c2054600 00000001 c9f64060 e6f2ba5c
[<c10795ff>] ? __wake_up+0x3f/0x50
[<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c153e893>] schedule+0x23/0x60
[<c153dc25>] schedule_timeout+0x145/0x1c0
[<c12b1192>] ? blk_finish_plug+0x12/0x40
[<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
[<c153effb>] wait_for_common+0x9b/0x110
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<c153f087>] wait_for_completion+0x17/0x20
[<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
[<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
[<f3cefa99>] xfs_trans_read_buf_map+0x1e9/0x3e0 [xfs]
[<f3caa977>] xfs_alloc_read_agfl+0x97/0xc0 [xfs]
[<f3cad8a3>] xfs_alloc_fix_freelist+0x193/0x430 [xfs]
[<f3ce83fe>] ? xlog_grant_sub_space.isra.4+0x1e/0x70 [xfs]
[<f3cae17e>] xfs_free_extent+0x8e/0x100 [xfs]
[<c111be1e>] ? kmem_cache_alloc+0xae/0xf0
[<f3c8e12c>] xfs_bmap_finish+0x13c/0x190 [xfs]
[<f3cda6f2>] xfs_itruncate_extents+0x1b2/0x2b0 [xfs]
[<f3c8f0c3>] xfs_free_eofblocks+0x243/0x2e0 [xfs]
[<f3c9abc6>] xfs_inode_free_eofblocks+0x96/0x140 [xfs]
[<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
[<f3c9956d>] xfs_inode_ag_walk.isra.7+0x1cd/0x2f0 [xfs]
[<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
[<c12d1631>] ? radix_tree_gang_lookup_tag+0x71/0xb0
[<f3ce55fb>] ? xfs_perag_get_tag+0x2b/0xb0 [xfs]
[<f3c9a56d>] ? xfs_inode_ag_iterator_tag+0x3d/0xa0 [xfs]
[<f3c9a59a>] xfs_inode_ag_iterator_tag+0x6a/0xa0 [xfs]
[<f3c9ab30>] ? xfs_inode_clear_eofblocks_tag+0x160/0x160 [xfs]
[<f3c9a82e>] xfs_icache_free_eofblocks+0x2e/0x40 [xfs]
[<f3c9a858>] xfs_eofblocks_worker+0x18/0x30 [xfs]
[<c105521c>] process_one_work+0x10c/0x340
[<c1055d31>] worker_thread+0x101/0x330
[<c1055c30>] ? manage_workers.isra.27+0x250/0x250
[<c105ae21>] kthread+0xa1/0xc0
[<c15418b7>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
INFO: task xfs_fsr:24262 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
e4d17aec 00000082 00000096 00000096 e4d17a9c c10795ff 00000000 00000000
c17d6700 c17d6700 e9d3b7b0 c70a8000 e4d17aa8 c144fa3d 00000000 e4d17ac0
c2054600 f3dbc81b ea7ace80 d7b74cc0 c2054600 00000002 d7b74cc0 e4d17af0
[<c10795ff>] ? __wake_up+0x3f/0x50
[<c144fa3d>] ? md_wakeup_thread+0x2d/0x30
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<f3dbc81b>] ? raid5_unplug+0xbb/0x120 [raid456]
[<c153e893>] schedule+0x23/0x60
[<c153dc25>] schedule_timeout+0x145/0x1c0
[<c12b1192>] ? blk_finish_plug+0x12/0x40
[<f3c919f7>] ? _xfs_buf_ioapply+0x287/0x300 [xfs]
[<c153effb>] wait_for_common+0x9b/0x110
[<c1069ca0>] ? wake_up_process+0x40/0x40
[<c153f087>] wait_for_completion+0x17/0x20
[<f3c91e80>] xfs_buf_iowait+0x50/0xb0 [xfs]
[<f3c91f17>] ? _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91f17>] _xfs_buf_read+0x37/0x40 [xfs]
[<f3c91fa5>] xfs_buf_read_map+0x85/0xe0 [xfs]
[<f3cefa29>] xfs_trans_read_buf_map+0x179/0x3e0 [xfs]
[<f3cde727>] xfs_imap_to_bp+0x67/0xe0 [xfs]
[<f3cdec7d>] xfs_iread+0x7d/0x3e0 [xfs]
[<f3c996e8>] ? xfs_inode_alloc+0x58/0x1b0 [xfs]
[<f3c9a092>] xfs_iget+0x192/0x580 [xfs]
[<f3caa219>] ? kmem_free+0x19/0x50 [xfs]
[<f3ca0ce6>] xfs_bulkstat_one_int+0x86/0x2c0 [xfs]
[<f3ca0f54>] xfs_bulkstat_one+0x34/0x40 [xfs]
[<f3ca0c10>] ? xfs_internal_inum+0xa0/0xa0 [xfs]
[<f3ca1360>] xfs_bulkstat+0x400/0x870 [xfs]
[<f3c9ae86>] xfs_ioc_bulkstat+0xb6/0x160 [xfs]
[<f3ca0f20>] ? xfs_bulkstat_one_int+0x2c0/0x2c0 [xfs]
[<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
[<f3c9d433>] xfs_file_ioctl+0x783/0xa60 [xfs]
[<c106bf9b>] ? __update_cpu_load+0xab/0xd0
[<c105dc58>] ? hrtimer_forward+0xa8/0x1b0
[<c12d4050>] ? timerqueue_add+0x50/0xb0
[<c108d143>] ? ktime_get+0x53/0xe0
[<c1094035>] ? clockevents_program_event+0x95/0x130
[<f3c9ccb0>] ? xfs_ioc_swapext+0x160/0x160 [xfs]
[<c11349b2>] do_vfs_ioctl+0x2e2/0x4c0
[<c1095699>] ? tick_program_event+0x29/0x30
[<c105e42c>] ? hrtimer_interrupt+0x13c/0x2a0
[<c12d8dfb>] ? lockref_put_or_lock+0xb/0x30
[<c1134c08>] SyS_ioctl+0x78/0x80
[<c1540fa8>] syscall_call+0x7/0x7
INFO: task md3_resync:24817 blocked for more than 120 seconds.
Tainted: P O 3.14.19fe-dirty #3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
c2d95e08 00000046 00000092 00000000 00000020 00000092 1befb1af 000030e0
c17d6700 c17d6700 e9d3c6e0 d18a9950 c2d95ddc c12adbc5 00000000 00000001
00000000 00000001 c2d95dec e95f0000 00000001 c2d95e08 00000292 00000292
[<c12adbc5>] ? queue_unplugged+0x45/0x90
[<c10798bb>] ? prepare_to_wait_event+0x6b/0xd0
[<c153e893>] schedule+0x23/0x60
[<c145782d>] md_do_sync+0x8dd/0x1010
[<c1079990>] ? __wake_up_sync+0x20/0x20
[<c1453fb0>] ? md_start_sync+0xb0/0xb0
[<c1454098>] md_thread+0xe8/0x100
[<c107945f>] ? __wake_up_locked+0x1f/0x30
[<c1453fb0>] ? md_start_sync+0xb0/0xb0
[<c105ae21>] kthread+0xa1/0xc0
[<c15418b7>] ret_from_kernel_thread+0x1b/0x28
[<c105ad80>] ? kthread_create_on_node+0x110/0x110
Again, the blocked task info is edited for space; hopefully these are the
important ones, but I've got more if it helps.
Thanks for the testing! You have included enough information.
I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed to do the right thing. It doesn't.

If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find ->sync_thread
is NULL and so will clear MD_RECOVERY_RUNNING. The next time it runs, that
flag is still clear and ->sync_thread is not NULL so it will try to stop the
thread, which deadlocks.
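
To make the sequence concrete, here is a minimal userspace sketch of that
race (an illustration only, not kernel code; the struct and field names
below are stand-ins for struct mddev, MD_RECOVERY_RUNNING, ->sync_thread
and the md_start_sync work item):

#include <stdbool.h>
#include <stdio.h>

struct fake_mddev {
        bool running;      /* stands in for MD_RECOVERY_RUNNING */
        bool sync_thread;  /* stands in for ->sync_thread != NULL */
        bool work_queued;  /* md_start_sync scheduled but not yet run */
};

/* Buggy logic: 'sync_starting' is a local variable, so a second call
 * knows nothing about the work item the first call queued. */
static void check_recovery_buggy(struct fake_mddev *m, bool start_sync)
{
        bool sync_starting = false;

        if (start_sync) {
                m->work_queued = true;   /* queue_work(md_misc_wq, ...) */
                sync_starting = true;
        }
        if (!m->sync_thread && !sync_starting)
                m->running = false;      /* clears "RUNNING" too early */
}

static void worker_runs(struct fake_mddev *m)
{
        if (m->work_queued) {
                m->work_queued = false;
                m->sync_thread = true;   /* md_start_sync() starts the thread */
        }
}

int main(void)
{
        struct fake_mddev m = { .running = true };

        check_recovery_buggy(&m, true);   /* schedules the sync thread */
        check_recovery_buggy(&m, false);  /* runs again before the worker */
        worker_runs(&m);                  /* now the thread exists... */

        /* ...but "RUNNING" is already clear, so the next check tries to
         * stop a thread it believes should not be running, and deadlocks. */
        printf("RUNNING=%d sync_thread=%d\n", m.running, m.sync_thread);
        return 0;
}

The goto restructuring in the patch below closes that window: the "not
running" cleanup is skipped whenever the work item has actually been
queued, rather than relying on a flag that only the first call can see.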

This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.

Thanks,
NeilBrown

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 211a615ae415..3df888b80e76 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7850,7 +7850,6 @@ void md_check_recovery(struct mddev *mddev)

if (mddev_trylock(mddev)) {
int spares = 0;
- bool sync_starting = false;

if (mddev->ro) {
/* On a read-only array we can:
@@ -7914,7 +7913,7 @@ void md_check_recovery(struct mddev *mddev)

if (!test_and_clear_bit(MD_RECOVERY_NEEDED, &mddev->recovery) ||
test_bit(MD_RECOVERY_FROZEN, &mddev->recovery))
- goto unlock;
+ goto not_running;
/* no recovery is running.
* remove any failed drives, then
* add spares if possible.
@@ -7926,7 +7925,7 @@ void md_check_recovery(struct mddev *mddev)
if (mddev->pers->check_reshape == NULL ||
mddev->pers->check_reshape(mddev) != 0)
/* Cannot proceed */
- goto unlock;
+ goto not_running;
set_bit(MD_RECOVERY_RESHAPE, &mddev->recovery);
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
} else if ((spares = remove_and_add_spares(mddev, NULL))) {
@@ -7939,7 +7938,7 @@ void md_check_recovery(struct mddev *mddev)
clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
} else if (!test_bit(MD_RECOVERY_SYNC, &mddev->recovery))
/* nothing to be done ... */
- goto unlock;
+ goto not_running;

if (mddev->pers->sync_request) {
if (spares) {
@@ -7951,18 +7950,18 @@ void md_check_recovery(struct mddev *mddev)
}
INIT_WORK(&mddev->del_work, md_start_sync);
queue_work(md_misc_wq, &mddev->del_work);
- sync_starting = true;
+ goto unlock;
}
- unlock:
- wake_up(&mddev->sb_wait);
-
- if (!mddev->sync_thread && !sync_starting) {
+ not_running:
+ if (!mddev->sync_thread) {
clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
if (test_and_clear_bit(MD_RECOVERY_RECOVER,
&mddev->recovery))
if (mddev->sysfs_action)
sysfs_notify_dirent_safe(mddev->sysfs_action);
}
+ unlock:
+ wake_up(&mddev->sb_wait);
mddev_unlock(mddev);
}
}
BillStuff
2014-10-05 16:05:08 UTC
Permalink
[removed a lot of stuff about raid5 check hanging]
Thanks for the testing! You have included enough information.
I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed to do the right thing. It doesn't.
If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find ->sync_thread
is NULL and so will clear MD_RECOVERY_RUNNING. The next time it runs, that
flag is still clear and ->sync_thread is not NULL so it will try to stop the
thread, which deadlocks.
This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.
Neil,

With the second patch, my test has been running well for close to 5 days
now, but something odd happened yesterday.

It's on a raid1 array (md1) with both ext3 and xfs on LVM, but I suspect
it happened silently on my raid5 test also.

md1 : active raid1 sdh3[0] sdg3[1]
76959296 blocks [2/2] [UU]
[=>...................] check = 9.7% (7532800/76959296)
finish=108.8min speed=10628K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk



Again this is running kernel builds, read / write loops, and
remove / (re)add / check loops.

At some point while removing a disk from the array, something bad happened,
the errors below appeared in the log, and the ext3 filesystem remounted
read-only. xfs plowed right on through. There's no evidence of any read or
write errors to the member disks, either in logs or on the disks themselves,
and the raid1 checks came back with zero mismatches. fsck for the ext3 fs
complained a lot but the filesystem came back into a usable state and I've
restarted my tests.

Below, dm-3 is xfs, dm-1 is ext3. The interesting stuff happens around
12:27:04. This was immediately after (or during?) removing sdh3 from md1.

Oct 4 12:26:01 xplane kernel: md: unbind<sdg3>
Oct 4 12:26:01 xplane kernel: md: export_rdev(sdg3)
Oct 4 12:26:11 xplane kernel: md: bind<sdg3>
Oct 4 12:26:11 xplane kernel: RAID1 conf printout:
Oct 4 12:26:11 xplane kernel: --- wd:1 rd:2
Oct 4 12:26:11 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:26:11 xplane kernel: disk 1, wo:1, o:1, dev:sdg3
Oct 4 12:26:12 xplane kernel: md: recovery of RAID array md1
Oct 4 12:26:12 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 4 12:26:12 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 4 12:26:12 xplane kernel: md: using 128k window, over a total of
76959296k.
Oct 4 12:27:03 xplane kernel: md: md1: recovery done.
Oct 4 12:27:03 xplane kernel: RAID1 conf printout:
Oct 4 12:27:03 xplane kernel: --- wd:2 rd:2
Oct 4 12:27:03 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:27:03 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3,
disabling device.
Oct 4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1
devices.
Oct 4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331468
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331469
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331470
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331471
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331472
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331473
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331474
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331475
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331476
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331477
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Aborting journal on device dm-1.
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_blocks:
Journal has aborted
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error:
ext3_journal_start_sb: Detected aborted journal
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: remounting
filesystem read-only
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error:
ext3_journal_start_sb: Detected aborted journal
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in
ext3_writeback_write_end: IO failure
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
Journal has aborted
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
Journal has aborted
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_orphan_add:
Journal has aborted
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
Journal has aborted
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in ext3_new_inode:
Journal has aborted
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in
ext3_dirty_inode: IO failure
Oct 4 12:27:05 xplane kernel: RAID1 conf printout:
Oct 4 12:27:05 xplane kernel: --- wd:1 rd:2
Oct 4 12:27:05 xplane kernel: disk 0, wo:1, o:0, dev:sdh3
Oct 4 12:27:05 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 4 12:27:05 xplane kernel: RAID1 conf printout:
Oct 4 12:27:05 xplane kernel: --- wd:1 rd:2
Oct 4 12:27:05 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 4 12:27:14 xplane kernel: md: unbind<sdh3>
Oct 4 12:27:14 xplane kernel: md: export_rdev(sdh3)
Oct 4 12:27:24 xplane kernel: md: bind<sdh3>
Oct 4 12:27:24 xplane kernel: RAID1 conf printout:
Oct 4 12:27:24 xplane kernel: --- wd:1 rd:2
Oct 4 12:27:24 xplane kernel: disk 0, wo:1, o:1, dev:sdh3
Oct 4 12:27:24 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 4 12:27:24 xplane kernel: md: recovery of RAID array md1
Oct 4 12:27:24 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 4 12:27:24 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 4 12:27:24 xplane kernel: md: using 128k window, over a total of
76959296k.
Oct 4 12:27:33 xplane kernel: md: md1: recovery done.
Oct 4 12:27:33 xplane kernel: RAID1 conf printout:
Oct 4 12:27:33 xplane kernel: --- wd:2 rd:2
Oct 4 12:27:33 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:27:33 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 4 12:27:35 xplane kernel: md: data-check of RAID array md1
Oct 4 12:27:35 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 4 12:27:35 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for data-check.
Oct 4 12:27:35 xplane kernel: md: using 128k window, over a total of
76959296k.
Oct 4 12:29:59 xplane kernel: __journal_remove_journal_head: freeing
b_committed_data

It seems like something got out of sync as the disk was being removed,
but before the remove completed.

Again, this is 3.14.19 with these 8 patches:
md/raid1: intialise start_next_window for READ case to avoid hang
md/raid1: be more cautious where we read-balance during resync.
md/raid1: clean up request counts properly in close_sync()
md/raid1: make sure resync waits for conflicting writes to complete.
md/raid1: Don't use next_resync to determine how far resync has
progressed
md/raid1: update next_resync under resync_lock.
md/raid1: count resync requests in nr_pending.
md/raid1: fix_read_error should act on all non-faulty devices.

and the two patches for the check start hang.

Any ideas on what happened here?

Thanks,
Bill


NeilBrown
2014-10-14 01:42:45 UTC
Permalink
Post by BillStuff
[removed alot of stuff about raid5 check hanging]
Thanks for the testing! You have included enough information.
I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed do the right thing. It doesn't.
If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find ->sync_thread
is NULL and so will clear MD_RECOVERY_RUNNING. The next time it runs, that
flag is still clear and ->sync_thread is not NULL so it will try to stop the
thread, which deadlocks.
This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.
Neil,
With the second patch, my test has been running well for close to 5 days
now,
That is good news - thanks.
Post by BillStuff
but something odd happened yesterday.
It's on a raid1 array (md1) with both ext3 and xfs on LVM, but I suspect
it happened silently on my
raid5 test also.
md1 : active raid1 sdh3[0] sdg3[1]
76959296 blocks [2/2] [UU]
[=>...................] check = 9.7% (7532800/76959296)
finish=108.8min speed=10628K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk
Again this is running kernel builds, read / write loops, and remove/
(re)add / check loops.
At some point while removing a disk from the array, something bad
happened, the errors below
appeared in the log, and the ext3 filesystem remounted readonly. xfs
plowed right on through.
There's no evidence of any read or write errors to the member disks,
either in logs or on the
disks themselves, and the raid1 checks came back with zero mismatches.
fsck for the ext3 fs
complained a lot but the filesystem came back into a useable state and
I've restarted my tests.
Below, dm-3 is xfs, dm-1 is ext3. The interesting stuff happens around
12:27:04. This was
immediately after (or during?) removing sdh3 from md1.
Oct 4 12:26:01 xplane kernel: md: unbind<sdg3>
Oct 4 12:26:01 xplane kernel: md: export_rdev(sdg3)
Oct 4 12:26:11 xplane kernel: md: bind<sdg3>
Removed and re-added sdg3
Post by BillStuff
Oct 4 12:26:11 xplane kernel: --- wd:1 rd:2
Oct 4 12:26:11 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:26:11 xplane kernel: disk 1, wo:1, o:1, dev:sdg3
->add_disk included sdg3 in the array not in-sync.
Post by BillStuff
Oct 4 12:26:12 xplane kernel: md: recovery of RAID array md1
Oct 4 12:26:12 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 4 12:26:12 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 4 12:26:12 xplane kernel: md: using 128k window, over a total of
76959296k.
Recovery starts
Post by BillStuff
Oct 4 12:27:03 xplane kernel: md: md1: recovery done.
Recovery finishes after 51 seconds.
Post by BillStuff
Oct 4 12:27:03 xplane kernel: --- wd:2 rd:2
Oct 4 12:27:03 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:27:03 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
->spare_active() marks sdg3 as fully in-sync (no longer "wo" - write only).
Post by BillStuff
Oct 4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3,
disabling device.
sdh3 fails a second later. Presumably this was requested by your testing
harness.
Post by BillStuff
Oct 4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1
devices.
raid1.c:error() confirms the device will not be ignored.
Post by BillStuff
Oct 4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331468
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Write errors start happening.

You should only get a write error if no writes successfully completed to
in_sync, non-faulty devices.
It is possible that the write to sdg3 completed before it was marked in-sync,
and the write to sdh3 completed after it was marked as faulty.
How long after recovery completes do you fail the next device?
The logs suggest it is the next second, which could be anywhere from 1msec
to 1998 msecs.
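
To make that window concrete, here is a simplified standalone model of the
accounting described above (an illustration only; these are not the
raid1.c structures or names):

#include <stdbool.h>
#include <stdio.h>

struct mirror {
        const char *name;
        bool write_ok;  /* the physical write completed without error */
        bool in_sync;   /* device counted as in-sync when the write completed */
        bool faulty;    /* device counted as faulty when the write completed */
};

/* The filesystem sees success only if at least one write landed on a
 * device that was still in-sync and non-faulty at completion time. */
static bool bio_succeeds(const struct mirror *m, int n)
{
        int good = 0;

        for (int i = 0; i < n; i++)
                if (m[i].write_ok && m[i].in_sync && !m[i].faulty)
                        good++;
        return good > 0;
}

int main(void)
{
        /* The hypothesised window from the Oct 4 log: sdg3 completes its
         * write just before ->spare_active() marks it in-sync, and sdh3
         * completes just after being marked faulty. */
        struct mirror m[] = {
                { "sdg3", true, false, false },
                { "sdh3", true, true,  true  },
        };

        printf("bio reported as %s\n",
               bio_succeeds(m, 2) ? "success" : "I/O error");
        return 0;
}

Both member writes physically succeed, yet neither one counts, so the bio
completes with an error even though no disk reported one - which would
match the "lost page write" messages with no underlying device errors.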


NeilBrown
Post by BillStuff
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331469
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331470
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331471
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331472
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331473
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331474
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331475
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331476
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331477
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Oct 4 12:27:05 xplane kernel: Aborting journal on device dm-1.
Journal has aborted
ext3_journal_start_sb: Detected aborted journal
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error: remounting
filesystem read-only
ext3_journal_start_sb: Detected aborted journal
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in
ext3_writeback_write_end: IO failure
Journal has aborted
Journal has aborted
Journal has aborted
Journal has aborted
Journal has aborted
Oct 4 12:27:05 xplane kernel: EXT3-fs (dm-1): error in
ext3_dirty_inode: IO failure
Oct 4 12:27:05 xplane kernel: --- wd:1 rd:2
Oct 4 12:27:05 xplane kernel: disk 0, wo:1, o:0, dev:sdh3
Oct 4 12:27:05 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
raid1_remove_disk is called to disconnect the device from the array.
Post by BillStuff
Oct 4 12:27:05 xplane kernel: --- wd:1 rd:2
Oct 4 12:27:05 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
which it has now done.
Post by BillStuff
Oct 4 12:27:14 xplane kernel: md: unbind<sdh3>
Oct 4 12:27:14 xplane kernel: md: export_rdev(sdh3)
Oct 4 12:27:24 xplane kernel: md: bind<sdh3>
Oct 4 12:27:24 xplane kernel: --- wd:1 rd:2
Oct 4 12:27:24 xplane kernel: disk 0, wo:1, o:1, dev:sdh3
Oct 4 12:27:24 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 4 12:27:24 xplane kernel: md: recovery of RAID array md1
Oct 4 12:27:24 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 4 12:27:24 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 4 12:27:24 xplane kernel: md: using 128k window, over a total of
76959296k.
Oct 4 12:27:33 xplane kernel: md: md1: recovery done.
Oct 4 12:27:33 xplane kernel: --- wd:2 rd:2
Oct 4 12:27:33 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:27:33 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 4 12:27:35 xplane kernel: md: data-check of RAID array md1
Oct 4 12:27:35 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 4 12:27:35 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for data-check.
Oct 4 12:27:35 xplane kernel: md: using 128k window, over a total of
76959296k.
Oct 4 12:29:59 xplane kernel: __journal_remove_journal_head: freeing
b_committed_data
It seems like something got out of sync as the disk was being removed,
but before the remove completed.
md/raid1: intialise start_next_window for READ case to avoid hang
md/raid1: be more cautious where we read-balance during resync.
md/raid1: clean up request counts properly in close_sync()
md/raid1: make sure resync waits for conflicting writes to complete.
md/raid1: Don't use next_resync to determine how far resync has
progressed
md/raid1: update next_resync under resync_lock.
md/raid1: count resync requests in nr_pending.
md/raid1: fix_read_error should act on all non-faulty devices.
and the two patches for the check start hang.
Any ideas on what happened here?
Thanks,
Bill
BillStuff
2014-10-14 14:56:17 UTC
Permalink
Post by NeilBrown
Post by BillStuff
[removed alot of stuff about raid5 check hanging]
Thanks for the testing! You have included enough information.
I didn't really like that 'sync_starting' variable when I wrote the patch,
but it seemed do the right thing. It doesn't.
If md_check_recovery() runs again immediately after scheduling the sync
thread to run, it will not have set sync_starting but will find ->sync_thread
is NULL and so will clear MD_RECOVERY_RUNNING. The next time it runs, that
flag is still clear and ->sync_thread is not NULL so it will try to stop the
thread, which deadlocks.
This patch on top of what you have should fix it... but I might end up
redoing the logic a bit.
Neil,
With the second patch, my test has been running well for close to 5 days
now,
That is good news - thanks.
Now two weeks in, this seems to be working OK, with the exception of this
write error and another "write didn't make it to both disks during
recovery" hiccup I'll describe below.
Post by NeilBrown
Post by BillStuff
but something odd happened yesterday.
It's on a raid1 array (md1) with both ext3 and xfs on LVM, but I suspect
it happened silently on my
raid5 test also.
md1 : active raid1 sdh3[0] sdg3[1]
76959296 blocks [2/2] [UU]
[=>...................] check = 9.7% (7532800/76959296)
finish=108.8min speed=10628K/sec
bitmap: 0/1 pages [0KB], 65536KB chunk
Again this is running kernel builds, read / write loops, and remove/
(re)add / check loops.
At some point while removing a disk from the array, something bad
happened, the errors below
appeared in the log, and the ext3 filesystem remounted readonly. xfs
plowed right on through.
There's no evidence of any read or write errors to the member disks,
either in logs or on the
disks themselves, and the raid1 checks came back with zero mismatches.
fsck for the ext3 fs
complained a lot but the filesystem came back into a useable state and
I've restarted my tests.
Below, dm-3 is xfs, dm-1 is ext3. The interesting stuff happens around
12:27:04. This was
immediately after (or during?) removing sdh3 from md1.
Oct 4 12:26:01 xplane kernel: md: unbind<sdg3>
Oct 4 12:26:01 xplane kernel: md: export_rdev(sdg3)
Oct 4 12:26:11 xplane kernel: md: bind<sdg3>
Removed and re-added sdg3
Post by BillStuff
Oct 4 12:26:11 xplane kernel: --- wd:1 rd:2
Oct 4 12:26:11 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:26:11 xplane kernel: disk 1, wo:1, o:1, dev:sdg3
->add_disk included sdg3 in the array not in-sync.
Post by BillStuff
Oct 4 12:26:12 xplane kernel: md: recovery of RAID array md1
Oct 4 12:26:12 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 4 12:26:12 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 4 12:26:12 xplane kernel: md: using 128k window, over a total of
76959296k.
Recovery starts
Post by BillStuff
Oct 4 12:27:03 xplane kernel: md: md1: recovery done.
Recovery finishes after 51 seconds.
Post by BillStuff
Oct 4 12:27:03 xplane kernel: --- wd:2 rd:2
Oct 4 12:27:03 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 4 12:27:03 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
->spare_active() marks sdg3 at fully in-sync (not long "wo" - write only).
Post by BillStuff
Oct 4 12:27:04 xplane kernel: md/raid1:md1: Disk failure on sdh3,
disabling device.
sdh3 fails a second later. Presumably this was requested by your testing
harness.
Post by BillStuff
Oct 4 12:27:04 xplane kernel: md/raid1:md1: Operation continuing on 1
devices.
raid.c:error() confirms the device will not be ignored.
Post by BillStuff
Oct 4 12:27:05 xplane kernel: quiet_error: 912 callbacks suppressed
Oct 4 12:27:05 xplane kernel: Buffer I/O error on device dm-3, logical
block 6331468
Oct 4 12:27:05 xplane kernel: lost page write due to I/O error on dm-3
Write errors start happening.
You should only get a write error if no writes successfully completed to
in_sync, non-faulty devices.
It is possible that the write to sdg3 completed before it was marked in-sync,
and the write to sdh3 completed after it was marked as faulty.
How long after recovery completes do you fail the next device?
The logs suggest it is the next second, which could be anywhere from 1msec
to 1998 msecs.
NeilBrown
The script sleeps for 1 second and then fails / removes / adds the
opposite disk. After doing all disks in the array it does a "check",
then starts over. I'm sure there's some script housekeeping time in there,
but it should be right around 1 second.
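
For reference, the loop looks roughly like this (the real harness is a
shell script; this sketch only mirrors the timing, the array and device
names are the ones from the logs above, and the wait for recovery to
finish is left as a comment):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void run(const char *cmd)
{
        printf("+ %s\n", cmd);
        if (system(cmd) != 0)
                fprintf(stderr, "command failed: %s\n", cmd);
}

int main(void)
{
        const char *disks[] = { "/dev/sdh3", "/dev/sdg3" };
        char cmd[128];

        for (;;) {
                for (int i = 0; i < 2; i++) {
                        /* the real script first waits for any running
                         * recovery to finish (e.g. by polling
                         * /proc/mdstat); that wait is omitted here */
                        sleep(1);    /* ~1 second after recovery completes */
                        snprintf(cmd, sizeof(cmd), "mdadm /dev/md1 --fail %s", disks[i]);
                        run(cmd);
                        sleep(10);
                        snprintf(cmd, sizeof(cmd), "mdadm /dev/md1 --remove %s", disks[i]);
                        run(cmd);
                        sleep(10);
                        snprintf(cmd, sizeof(cmd), "mdadm /dev/md1 --add %s", disks[i]);
                        run(cmd);
                }
                run("echo check > /sys/block/md1/md/sync_action");
        }
        return 0;
}

The only part that matters for the failure above is the one-second gap
between "recovery done" and failing the other disk.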

The other hiccup that happened may be related; hopefully at least
informative.

Half of my test is the fail / remove / add loop. The other half continually
writes files and saves a checksum. Later, when the file is deleted, it
re-checks the sum. A few days ago this second check failed:

sdg3 in sync, sdh3 (re)added at 10:48:48

Oct 8 10:48:48 xplane kernel: md: bind<sdh3>
Oct 8 10:48:49 xplane kernel: RAID1 conf printout:
Oct 8 10:48:49 xplane kernel: --- wd:1 rd:2
Oct 8 10:48:49 xplane kernel: disk 0, wo:1, o:1, dev:sdh3
Oct 8 10:48:49 xplane kernel: disk 1, wo:0, o:1, dev:sdg3
Oct 8 10:48:49 xplane kernel: md: recovery of RAID array md1
Oct 8 10:48:49 xplane kernel: md: minimum _guaranteed_ speed: 10000
KB/sec/disk.
Oct 8 10:48:49 xplane kernel: md: using maximum available idle IO
bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 8 10:48:49 xplane kernel: md: using 128k window, over a total of
76959296k.

testfile /mnt/lvtest/tmp/tmpfile02518.0xd6.tmp written at 10:51:47

Oct 8 10:51:54 xplane kernel: md: md1: recovery done.
Oct 8 10:51:54 xplane kernel: RAID1 conf printout:
Oct 8 10:51:54 xplane kernel: --- wd:2 rd:2
Oct 8 10:51:54 xplane kernel: disk 0, wo:0, o:1, dev:sdh3
Oct 8 10:51:54 xplane kernel: disk 1, wo:0, o:1, dev:sdg3

md5sum for /mnt/lvtest/tmp/tmpfile02518.0xd6.tmp different at 11:30
during QC check

Nothing bad showed up in the logs, for disks, filesystems or the array.

From looking at the file, it appears a chunk of it didn't get written to
/dev/sdh3. When the second QC was run, the read was from /dev/sdh3 and the
sum was different.

I did a "check" and I got mismatch_cnt = 256; looked at the differences
between /dev/sdg3 and /dev/sdh3 and they're around 50GB into the 73GB
(I was expecting to find them near the tail end, although this might be
less meaningful with the bitmap).
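
For anyone wanting to do the same comparison, here is a small sketch that
walks two raid1 members and reports differing 64KiB chunks (an
illustration only, not necessarily how the comparison above was done; the
md superblock region of each member will legitimately differ):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv)
{
        static char buf_a[CHUNK], buf_b[CHUNK];
        long long off = 0;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <member1> <member2>\n", argv[0]);
                return 1;
        }

        int a = open(argv[1], O_RDONLY);
        int b = open(argv[2], O_RDONLY);
        if (a < 0 || b < 0) {
                perror("open");
                return 1;
        }

        for (;;) {
                ssize_t ra = read(a, buf_a, CHUNK);
                ssize_t rb = read(b, buf_b, CHUNK);

                if (ra <= 0 || rb <= 0)
                        break;
                if (ra != rb || memcmp(buf_a, buf_b, (size_t)ra) != 0)
                        printf("mismatch in chunk at ~%lld MiB\n", off >> 20);
                off += ra;
        }
        close(a);
        close(b);
        return 0;
}

mismatch_cnt is counted in 512-byte sectors, so 256 here should correspond
to roughly 128KiB of out-of-sync data.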

I'm now trying to reproduce these two failures, this time on 3.14.21 with
the 2 resync start patches, without LVM in the middle.

Thanks always for the help.
-Bill
BillStuff
2014-10-14 16:55:00 UTC
Permalink
[snip]
Post by NeilBrown
Write errors start happening.
You should only get a write error if no writes successfully completed to
in_sync, non-faulty devices.
It is possible that the write to sdg3 completed before it was marked in-sync,
and the write to sdh3 completed after it was marked as faulty.
How long after recovery completes do you fail the next device?
The logs suggest it is the next second, which could be anywhere from 1msec
to 1998 msecs.
NeilBrown
FYI Neil: Running through my logs, I noticed 6 of these failures in my
testing over the past few days. All "recovery completed, wait a second,
fail the other drive" cases. Same signature in the logs. Apparently things
kept working until the filesystem tripped on its journal and fell over.

So the upshot is this seems reasonably reproducible.

-Bill