| Age | Commit message (Collapse) | Author | Files | Lines |
|
The request_queue is frozen and quiesced while the elevator init_sched()
method runs, so queue_lock is not needed for BFQ cgroup initialization.
Signed-off-by: Yu Kuai <yukuai@fygo.io>
Link: https://patch.msgid.link/1965073ea20f33114a8d903816b986e483b9bb34.1780621988.git.yukuai@fygo.io
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Writing 0 to BFQ's low_latency attribute ends weight raising for active,
idle and async queues. The async cgroup path walks q->blkg_list, converts
each blkg to BFQ policy data and then reads bfqg->async_bfqq and
bfqg->async_idle_bfqq.
That walk was protected only by bfqd->lock. blkcg release work is
serialized by q->blkcg_mutex and q->queue_lock instead, and
blkg_free_workfn() can call BFQ's pd_free_fn before it removes
blkg->q_node from q->blkg_list. A low_latency reset can therefore still
find the blkg on the queue list after the BFQ policy data has been freed.
The buggy scenario involves two paths, with each column showing the order
within that path:
BFQ low_latency reset: blkcg blkg release work:
1. bfq_low_latency_store() 1. blkg_free_workfn() takes
calls bfq_end_wr(). q->blkcg_mutex.
2. bfq_end_wr_async() walks 2. BFQ pd_free_fn drops the
q->blkg_list. final bfq_group reference.
3. blkg_to_bfqg() returns 3. blkg->q_node remains on
the stale policy data. q->blkg_list until list_del_init().
4. bfq_end_wr_async_queues()
reads async queue fields.
Fix this by taking q->blkcg_mutex and q->queue_lock around the
q->blkg_list walk, then taking bfqd->lock before touching BFQ async
queues. The mutex serializes against policy-data free and queue_lock
stabilizes the list. Move the async reset out of bfq_end_wr()'s existing
bfqd->lock critical section so the lock order matches blkcg policy
callbacks.
Validation reproduced this kernel report:
BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340
Call Trace:
<TASK>
dump_stack_lvl+0x66/0xa0
print_report+0xce/0x630
? bfq_end_wr_async_queues+0x246/0x340
? srso_alias_return_thunk+0x5/0xfbef5
? __virt_addr_valid+0x20d/0x410
? bfq_end_wr_async_queues+0x246/0x340
kasan_report+0xe0/0x110
? bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async_queues+0x246/0x340
bfq_end_wr_async+0xba/0x180
bfq_low_latency_store+0x4e5/0x690
? 0xffffffffc02150da
? __pfx_bfq_low_latency_store+0x10/0x10
? __pfx_bfq_low_latency_store+0x10/0x10
elv_attr_store+0xc4/0x110
kernfs_fop_write_iter+0x2f5/0x4a0
vfs_write+0x604/0x11f0
? __pfx_locks_remove_posix+0x10/0x10
? __pfx_vfs_write+0x10/0x10
ksys_write+0xf9/0x1d0
? __pfx_ksys_write+0x10/0x10
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Allocated by task 544:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
bfq_pd_alloc+0xc0/0x1b0
blkg_alloc+0x346/0x960
blkg_create+0x8c2/0x10d0
bio_associate_blkg_from_css+0x9f3/0xfa0
bio_associate_blkg+0xd9/0x200
bio_init+0x303/0x640
__blkdev_direct_IO_simple+0x56b/0x8a0
blkdev_direct_IO+0x8e7/0x2580
blkdev_read_iter+0x205/0x400
vfs_read+0x7b0/0xda0
ksys_read+0xf9/0x1d0
do_syscall_64+0x115/0x6a0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 465:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x5f/0x80
kfree+0x307/0x580
blkg_free_workfn+0xef/0x460
process_one_work+0x8d0/0x1870
worker_thread+0x575/0xf80
kthread+0x2e7/0x3c0
ret_from_fork+0x576/0x810
ret_from_fork_asm+0x1a/0x30
Fixes: 44e44a1b329e ("block, bfq: improve responsiveness")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
Reviewed-by: Tao Cui <cuitao@kylinos.cn>
Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
The default limits is unchanged, and user can configure async_depth now.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq and mq-deadline consider sync writes as async requests and only
reserve tags for sync reads by async_depth, however, kyber doesn't
consider sync writes as async requests for now.
Consider the case there are lots of dirty pages, and user use fsync to
flush dirty pages. In this case sched_tags can be exhausted by sync writes
and sync reads can stuck waiting for tag. Hence let kyber follow what
mq-deadline and bfq did, and unify async requests checking for all
elevators.
Signed-off-by: Yu Kuai <yukuai@fnnas.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To check if a request queue is for a rotational device, a double
negation is needed with the pattern "!blk_queue_nonrot(q)". Simplify
this with the introduction of the helper blk_queue_rot() which tests
if a requests queue limit has the BLK_FEAT_ROTATIONAL feature set.
All call sites of blk_queue_nonrot() are modified to use blk_queue_rot()
and blk_queue_nonrot() definition removed.
No functional changes.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When wbt_enable_default() is moved out of queue freezing in elevator_change(),
it can cause the wbt inflight counter to become negative (-1), leading to hung
tasks in the writeback path. Tasks get stuck in wbt_wait() because the counter
is in an inconsistent state.
The issue occurs because wbt_enable_default() could race with IO submission,
allowing the counter to be decremented before proper initialization. This manifests
as:
rq_wait[0]:
inflight: -1
has_waiters: True
rwb_enabled() checks the state, which can be updated exactly between wbt_wait()
(rq_qos_throttle()) and wbt_track()(rq_qos_track()), then the inflight counter
will become negative.
And results in hung task warnings like:
task:kworker/u24:39 state:D stack:0 pid:14767
Call Trace:
rq_qos_wait+0xb4/0x150
wbt_wait+0xa9/0x100
__rq_qos_throttle+0x24/0x40
blk_mq_submit_bio+0x672/0x7b0
...
Fix this by:
1. Splitting wbt_enable_default() into:
- __wbt_enable_default(): Returns true if wbt_init() should be called
- wbt_enable_default(): Wrapper for existing callers (no init)
- wbt_init_enable_default(): New function that checks and inits WBT
2. Using wbt_init_enable_default() in blk_register_queue() to ensure
proper initialization during queue registration
3. Move wbt_init() out of wbt_enable_default() which is only for enabling
disabled wbt from bfq and iocost, and wbt_init() isn't needed. Then the
original lock warning can be avoided.
4. Removing the ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT flag and its handling
code since it's no longer needed
This ensures WBT is properly initialized before any IO can be submitted,
preventing the counter from going negative.
Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Yu Kuai <yukuai@fnnas.com>
Cc: Guangwu Zhang <guazhang@redhat.com>
Fixes: 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Current depth_updated has some problems:
1) depth_updated() will be called for each hctx, while all elevators
will update async_depth for the disk level, this is not related to hctx;
2) In blk_mq_update_nr_requests(), if previous hctx update succeed and
this hctx update failed, q->nr_requests will not be updated, while
async_depth is already updated with new nr_reqeuests in previous
depth_updated();
3) All elevators are using q->nr_requests to calculate async_depth now,
however, q->nr_requests is still the old value when depth_updated() is
called from blk_mq_update_nr_requests();
Those problems are first from error path, then mq-deadline, and recently
for bfq and kyber, fix those problems by:
- pass in request_queue instead of hctx;
- move depth_updated() after q->nr_requests is updated in
blk_mq_update_nr_requests();
- add depth_updated() call inside init_sched() method to initialize
async_depth;
- remove init_hctx() method for mq-deadline and bfq that is useless now;
Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821060612.1729939-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made
GFP_NOWAIT implicitly include __GFP_NOWARN.
Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g.,
`GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these
redundant flags across subsystems.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Link: https://lore.kernel.org/r/20250811081135.374315-1-rongqianfeng@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently elevators will record internal 'async_depth' to throttle
asynchronous requests, and they both calculate shallow_dpeth based on
sb->shift, with the respect that sb->shift is the available tags in one
word.
However, sb->shift is not the availbale tags in the last word, see
__map_depth:
if (index == sb->map_nr - 1)
return sb->depth - (index << sb->shift);
For consequence, if the last word is used, more tags can be get than
expected, for example, assume nr_requests=256 and there are four words,
in the worst case if user set nr_requests=32, then the first word is
the last word, and still use bits per word, which is 64, to calculate
async_depth is wrong.
One the ohter hand, due to cgroup qos, bfq can allow only one request
to be allocated, and set shallow_dpeth=1 will still allow the number
of words request to be allocated.
Fix this problems by using shallow_depth to the whole sbitmap instead
of per word, also change kyber, mq-deadline and bfq to follow this,
a new helper __map_depth_with_shallow() is introduced to calculate
available bits in each word.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In preparation for allocating sched_tags before freezing the request
queue and acquiring ->elevator_lock, move the elevator queue allocation
logic from the elevator ops ->init_sched callback into blk_mq_init_sched.
As elevator_alloc is now only invoked from block layer core, we don't
need to export it, so unexport elevator_alloc function.
This refactoring provides a centralized location for elevator queue
initialization, which makes it easier to store pre-allocated sched_tags
in the struct elevator_queue during later changes.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250730074614.2537382-2-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently issue io can grab queue_lock three times from bfq_bio_merge(),
bfq_limit_depth() and bfq_prepare_request(), the queue_lock is not
necessary if icq is already created because both queue and ioc can't be
freed before io issuing is done, hence remove the unnecessary queue_lock
and use rcu to protect radix tree lookup.
Noted this is also a prep patch to support request batch dispatching[1].
[1] https://lore.kernel.org/all/20250722072431.610354-1-yukuai1@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20250729023229.2944898-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
scheduler's ->exit() is called with queue frozen and elevator lock is held, and
wbt_enable_default() can't be called with queue frozen, otherwise the
following lockdep warning is triggered:
#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
#5 (&eq->sysfs_lock){+.+.}-{4:4}:
#4 (&q->elevator_lock){+.+.}-{4:4}:
#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
#2 (fs_reclaim){+.+.}-{0:0}:
#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
#0 (&q->debugfs_mutex){+.+.}-{4:4}:
Fix the issue by moving wbt_enable_default() out of bfq's exit(), and
call it from elevator_change_done().
Meantime add disk->rqos_state_mutex for covering wbt state change, which
matches the purpose more than ->elevator_lock.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250505141805.2751237-26-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is
in use. The flag is set in BFQ's init(), and cleared in BFQ's exit().
Making it as request queue flag, so that we can avoid to deal with elevator
switch race. Also it isn't graceful to checking one scheduler flag in
wbt_enable_default().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505141805.2751237-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.
Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.
Patch was created by using Coccinelle.
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/d0d57e1dab46b617856dfb93c721d221cc31ab0b.1738746821.git.namcao@linutronix.de
|
|
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
|
|
Our syzkaller report a following UAF for v6.6:
BUG: KASAN: slab-use-after-free in bfq_init_rq+0x175d/0x17a0 block/bfq-iosched.c:6958
Read of size 8 at addr ffff8881b57147d8 by task fsstress/232726
CPU: 2 PID: 232726 Comm: fsstress Not tainted 6.6.0-g3629d1885222 #39
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
print_report+0x3e/0x70 mm/kasan/report.c:475
kasan_report+0xb8/0xf0 mm/kasan/report.c:588
hlist_add_head include/linux/list.h:1023 [inline]
bfq_init_rq+0x175d/0x17a0 block/bfq-iosched.c:6958
bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271
bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323
blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660
blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143
__submit_bio+0xa0/0x6b0 block/blk-core.c:639
__submit_bio_noacct_mq block/blk-core.c:718 [inline]
submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747
submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847
__ext4_read_bh fs/ext4/super.c:205 [inline]
ext4_read_bh+0x15e/0x2e0 fs/ext4/super.c:230
__read_extent_tree_block+0x304/0x6f0 fs/ext4/extents.c:567
ext4_find_extent+0x479/0xd20 fs/ext4/extents.c:947
ext4_ext_map_blocks+0x1a3/0x2680 fs/ext4/extents.c:4182
ext4_map_blocks+0x929/0x15a0 fs/ext4/inode.c:660
ext4_iomap_begin_report+0x298/0x480 fs/ext4/inode.c:3569
iomap_iter+0x3dd/0x1010 fs/iomap/iter.c:91
iomap_fiemap+0x1f4/0x360 fs/iomap/fiemap.c:80
ext4_fiemap+0x181/0x210 fs/ext4/extents.c:5051
ioctl_fiemap.isra.0+0x1b4/0x290 fs/ioctl.c:220
do_vfs_ioctl+0x31c/0x11a0 fs/ioctl.c:811
__do_sys_ioctl fs/ioctl.c:869 [inline]
__se_sys_ioctl+0xae/0x190 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x78/0xe2
Allocated by task 232719:
kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
__kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook mm/slab.h:768 [inline]
slab_alloc_node mm/slub.c:3492 [inline]
kmem_cache_alloc_node+0x1b8/0x6f0 mm/slub.c:3537
bfq_get_queue+0x215/0x1f00 block/bfq-iosched.c:5869
bfq_get_bfqq_handle_split+0x167/0x5f0 block/bfq-iosched.c:6776
bfq_init_rq+0x13a4/0x17a0 block/bfq-iosched.c:6938
bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271
bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323
blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660
blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143
__submit_bio+0xa0/0x6b0 block/blk-core.c:639
__submit_bio_noacct_mq block/blk-core.c:718 [inline]
submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747
submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847
__ext4_read_bh fs/ext4/super.c:205 [inline]
ext4_read_bh_nowait+0x15a/0x240 fs/ext4/super.c:217
ext4_read_bh_lock+0xac/0xd0 fs/ext4/super.c:242
ext4_bread_batch+0x268/0x500 fs/ext4/inode.c:958
__ext4_find_entry+0x448/0x10f0 fs/ext4/namei.c:1671
ext4_lookup_entry fs/ext4/namei.c:1774 [inline]
ext4_lookup.part.0+0x359/0x6f0 fs/ext4/namei.c:1842
ext4_lookup+0x72/0x90 fs/ext4/namei.c:1839
__lookup_slow+0x257/0x480 fs/namei.c:1696
lookup_slow fs/namei.c:1713 [inline]
walk_component+0x454/0x5c0 fs/namei.c:2004
link_path_walk.part.0+0x773/0xda0 fs/namei.c:2331
link_path_walk fs/namei.c:3826 [inline]
path_openat+0x1b9/0x520 fs/namei.c:3826
do_filp_open+0x1b7/0x400 fs/namei.c:3857
do_sys_openat2+0x5dc/0x6e0 fs/open.c:1428
do_sys_open fs/open.c:1443 [inline]
__do_sys_openat fs/open.c:1459 [inline]
__se_sys_openat fs/open.c:1454 [inline]
__x64_sys_openat+0x148/0x200 fs/open.c:1454
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x78/0xe2
Freed by task 232726:
kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
__kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1827 [inline]
slab_free_freelist_hook mm/slub.c:1853 [inline]
slab_free mm/slub.c:3820 [inline]
kmem_cache_free+0x110/0x760 mm/slub.c:3842
bfq_put_queue+0x6a7/0xfb0 block/bfq-iosched.c:5428
bfq_forget_entity block/bfq-wf2q.c:634 [inline]
bfq_put_idle_entity+0x142/0x240 block/bfq-wf2q.c:645
bfq_forget_idle+0x189/0x1e0 block/bfq-wf2q.c:671
bfq_update_vtime block/bfq-wf2q.c:1280 [inline]
__bfq_lookup_next_entity block/bfq-wf2q.c:1374 [inline]
bfq_lookup_next_entity+0x350/0x480 block/bfq-wf2q.c:1433
bfq_update_next_in_service+0x1c0/0x4f0 block/bfq-wf2q.c:128
bfq_deactivate_entity+0x10a/0x240 block/bfq-wf2q.c:1188
bfq_deactivate_bfqq block/bfq-wf2q.c:1592 [inline]
bfq_del_bfqq_busy+0x2e8/0xad0 block/bfq-wf2q.c:1659
bfq_release_process_ref+0x1cc/0x220 block/bfq-iosched.c:3139
bfq_split_bfqq+0x481/0xdf0 block/bfq-iosched.c:6754
bfq_init_rq+0xf29/0x17a0 block/bfq-iosched.c:6934
bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271
bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323
blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660
blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143
__submit_bio+0xa0/0x6b0 block/blk-core.c:639
__submit_bio_noacct_mq block/blk-core.c:718 [inline]
submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747
submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847
__ext4_read_bh fs/ext4/super.c:205 [inline]
ext4_read_bh+0x15e/0x2e0 fs/ext4/super.c:230
__read_extent_tree_block+0x304/0x6f0 fs/ext4/extents.c:567
ext4_find_extent+0x479/0xd20 fs/ext4/extents.c:947
ext4_ext_map_blocks+0x1a3/0x2680 fs/ext4/extents.c:4182
ext4_map_blocks+0x929/0x15a0 fs/ext4/inode.c:660
ext4_iomap_begin_report+0x298/0x480 fs/ext4/inode.c:3569
iomap_iter+0x3dd/0x1010 fs/iomap/iter.c:91
iomap_fiemap+0x1f4/0x360 fs/iomap/fiemap.c:80
ext4_fiemap+0x181/0x210 fs/ext4/extents.c:5051
ioctl_fiemap.isra.0+0x1b4/0x290 fs/ioctl.c:220
do_vfs_ioctl+0x31c/0x11a0 fs/ioctl.c:811
__do_sys_ioctl fs/ioctl.c:869 [inline]
__se_sys_ioctl+0xae/0x190 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x78/0xe2
commit 1ba0403ac644 ("block, bfq: fix uaf for accessing waker_bfqq after
splitting") fix the problem that if waker_bfqq is in the merge chain,
and current is the only procress, waker_bfqq can be freed from
bfq_split_bfqq(). However, the case that waker_bfqq is not in the merge
chain is missed, and if the procress reference of waker_bfqq is 0,
waker_bfqq can be freed as well.
Fix the problem by checking procress reference if waker_bfqq is not in
the merge_chain.
Fixes: 1ba0403ac644 ("block, bfq: fix uaf for accessing waker_bfqq after splitting")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20250108084148.1549973-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The elevator core now allows instances of 'struct elv_fs_entry' to be
moved into read-only memory. Make use of that to protect them against
accidental or malicious modifications.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20250102-sysfs-const-attr-elevator-v1-3-9837d2058c60@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Set new allocated bfqq to bic or remove freed bfqq from bic are both
protected by bfqd->lock, however bfq_limit_depth() is deferencing bfqq
from bic without the lock, this can lead to UAF if the io_context is
shared by multiple tasks.
For example, test bfq with io_uring can trigger following UAF in v6.6:
==================================================================
BUG: KASAN: slab-use-after-free in bfqq_group+0x15/0x50
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x80
print_address_description.constprop.0+0x66/0x300
print_report+0x3e/0x70
kasan_report+0xb4/0xf0
bfqq_group+0x15/0x50
bfqq_request_over_limit+0x130/0x9a0
bfq_limit_depth+0x1b5/0x480
__blk_mq_alloc_requests+0x2b5/0xa00
blk_mq_get_new_requests+0x11d/0x1d0
blk_mq_submit_bio+0x286/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__block_write_full_folio+0x3d0/0x640
writepage_cb+0x3b/0xc0
write_cache_pages+0x254/0x6c0
write_cache_pages+0x254/0x6c0
do_writepages+0x192/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork_asm+0x1b/0x30
</TASK>
Allocated by task 808602:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_slab_alloc+0x83/0x90
kmem_cache_alloc_node+0x1b1/0x6d0
bfq_get_queue+0x138/0xfa0
bfq_get_bfqq_handle_split+0xe3/0x2c0
bfq_init_rq+0x196/0xbb0
bfq_insert_request.isra.0+0xb5/0x480
bfq_insert_requests+0x156/0x180
blk_mq_insert_request+0x15d/0x440
blk_mq_submit_bio+0x8a4/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__blkdev_direct_IO_async+0x2dd/0x330
blkdev_write_iter+0x39a/0x450
io_write+0x22a/0x840
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Freed by task 808589:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x27/0x40
__kasan_slab_free+0x126/0x1b0
kmem_cache_free+0x10c/0x750
bfq_put_queue+0x2dd/0x770
__bfq_insert_request.isra.0+0x155/0x7a0
bfq_insert_request.isra.0+0x122/0x480
bfq_insert_requests+0x156/0x180
blk_mq_dispatch_plug_list+0x528/0x7e0
blk_mq_flush_plug_list.part.0+0xe5/0x590
__blk_flush_plug+0x3b/0x90
blk_finish_plug+0x40/0x60
do_writepages+0x19d/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Fix the problem by protecting bic_to_bfqq() with bfqd->lock.
CC: Jan Kara <jack@suse.cz>
Fixes: 76f1df88bbc2 ("bfq: Limit number of requests consumed by each cgroup")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241129091509.2227136-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit bc3b1e9e7c50e1de0f573eea3871db61dd4787de.
The bic is associated with sync_bfqq, and bfq_release_process_ref cannot
be put into bfq_put_cooperator.
kasan report:
[ 400.347277] ==================================================================
[ 400.347287] BUG: KASAN: slab-use-after-free in bic_set_bfqq+0x200/0x230
[ 400.347420] Read of size 8 at addr ffff88881cab7d60 by task dockerd/5800
[ 400.347430]
[ 400.347436] CPU: 24 UID: 0 PID: 5800 Comm: dockerd Kdump: loaded Tainted: G E 6.12.0 #32
[ 400.347450] Tainted: [E]=UNSIGNED_MODULE
[ 400.347454] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.20192059.B64.2207280713 07/28/2022
[ 400.347460] Call Trace:
[ 400.347464] <TASK>
[ 400.347468] dump_stack_lvl+0x5d/0x80
[ 400.347490] print_report+0x174/0x505
[ 400.347521] kasan_report+0xe0/0x160
[ 400.347541] bic_set_bfqq+0x200/0x230
[ 400.347549] bfq_bic_update_cgroup+0x419/0x740
[ 400.347560] bfq_bio_merge+0x133/0x320
[ 400.347584] blk_mq_submit_bio+0x1761/0x1e20
[ 400.347625] __submit_bio+0x28b/0x7b0
[ 400.347664] submit_bio_noacct_nocheck+0x6b2/0xd30
[ 400.347690] iomap_readahead+0x50c/0x680
[ 400.347731] read_pages+0x17f/0x9c0
[ 400.347785] page_cache_ra_unbounded+0x366/0x4a0
[ 400.347795] filemap_fault+0x83d/0x2340
[ 400.347819] __xfs_filemap_fault+0x11a/0x7d0 [xfs]
[ 400.349256] __do_fault+0xf1/0x610
[ 400.349270] do_fault+0x977/0x11a0
[ 400.349281] __handle_mm_fault+0x5d1/0x850
[ 400.349314] handle_mm_fault+0x1f8/0x560
[ 400.349324] do_user_addr_fault+0x324/0x970
[ 400.349337] exc_page_fault+0x76/0xf0
[ 400.349350] asm_exc_page_fault+0x26/0x30
[ 400.349360] RIP: 0033:0x55a480d77375
[ 400.349384] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 49 3b 66 10 0f 86 ae 02 00 00 55 48 89 e5 48 83 ec 58 48 8b 10 <83> 7a 10 00 0f 84 27 02 00 00 44 0f b6 42 28 44 0f b6 4a 29 41 80
[ 400.349392] RSP: 002b:00007f18c37fd8b8 EFLAGS: 00010216
[ 400.349401] RAX: 00007f18c37fd9d0 RBX: 0000000000000000 RCX: 0000000000000000
[ 400.349407] RDX: 000055a484407d38 RSI: 000000c000e8b0c0 RDI: 0000000000000000
[ 400.349412] RBP: 00007f18c37fd910 R08: 000055a484017f60 R09: 000055a484066f80
[ 400.349417] R10: 0000000000194000 R11: 0000000000000005 R12: 0000000000000008
[ 400.349422] R13: 0000000000000000 R14: 000000c000476a80 R15: 0000000000000000
[ 400.349430] </TASK>
[ 400.349452]
[ 400.349454] Allocated by task 5800:
[ 400.349459] kasan_save_stack+0x30/0x50
[ 400.349469] kasan_save_track+0x14/0x30
[ 400.349475] __kasan_slab_alloc+0x89/0x90
[ 400.349482] kmem_cache_alloc_node_noprof+0xdc/0x2a0
[ 400.349492] bfq_get_queue+0x1ef/0x1100
[ 400.349502] __bfq_get_bfqq_handle_split+0x11a/0x510
[ 400.349511] bfq_insert_requests+0xf55/0x9030
[ 400.349519] blk_mq_flush_plug_list+0x446/0x14c0
[ 400.349527] __blk_flush_plug+0x27c/0x4e0
[ 400.349534] blk_finish_plug+0x52/0xa0
[ 400.349540] _xfs_buf_ioapply+0x739/0xc30 [xfs]
[ 400.350246] __xfs_buf_submit+0x1b2/0x640 [xfs]
[ 400.350967] xfs_buf_read_map+0x306/0xa20 [xfs]
[ 400.351672] xfs_trans_read_buf_map+0x285/0x7d0 [xfs]
[ 400.352386] xfs_imap_to_bp+0x107/0x270 [xfs]
[ 400.353077] xfs_iget+0x70d/0x1eb0 [xfs]
[ 400.353786] xfs_lookup+0x2ca/0x3a0 [xfs]
[ 400.354506] xfs_vn_lookup+0x14e/0x1a0 [xfs]
[ 400.355197] __lookup_slow+0x19c/0x340
[ 400.355204] lookup_one_unlocked+0xfc/0x120
[ 400.355211] ovl_lookup_single+0x1b3/0xcf0 [overlay]
[ 400.355255] ovl_lookup_layer+0x316/0x490 [overlay]
[ 400.355295] ovl_lookup+0x844/0x1fd0 [overlay]
[ 400.355351] lookup_one_qstr_excl+0xef/0x150
[ 400.355357] do_unlinkat+0x22a/0x620
[ 400.355366] __x64_sys_unlinkat+0x109/0x1e0
[ 400.355375] do_syscall_64+0x82/0x160
[ 400.355384] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 400.355393]
[ 400.355395] Freed by task 5800:
[ 400.355400] kasan_save_stack+0x30/0x50
[ 400.355407] kasan_save_track+0x14/0x30
[ 400.355413] kasan_save_free_info+0x3b/0x70
[ 400.355422] __kasan_slab_free+0x4f/0x70
[ 400.355429] kmem_cache_free+0x176/0x520
[ 400.355438] bfq_put_queue+0x67e/0x980
[ 400.355447] bfq_bic_update_cgroup+0x407/0x740
[ 400.355454] bfq_bio_merge+0x133/0x320
[ 400.355460] blk_mq_submit_bio+0x1761/0x1e20
[ 400.355467] __submit_bio+0x28b/0x7b0
[ 400.355473] submit_bio_noacct_nocheck+0x6b2/0xd30
[ 400.355480] iomap_readahead+0x50c/0x680
[ 400.355490] read_pages+0x17f/0x9c0
[ 400.355498] page_cache_ra_unbounded+0x366/0x4a0
[ 400.355505] filemap_fault+0x83d/0x2340
[ 400.355514] __xfs_filemap_fault+0x11a/0x7d0 [xfs]
[ 400.356204] __do_fault+0xf1/0x610
[ 400.356213] do_fault+0x977/0x11a0
[ 400.356221] __handle_mm_fault+0x5d1/0x850
[ 400.356230] handle_mm_fault+0x1f8/0x560
[ 400.356238] do_user_addr_fault+0x324/0x970
[ 400.356248] exc_page_fault+0x76/0xf0
[ 400.356258] asm_exc_page_fault+0x26/0x30
[ 400.356266]
[ 400.356269] The buggy address belongs to the object at ffff88881cab7bc0
which belongs to the cache bfq_queue of size 576
[ 400.356276] The buggy address is located 416 bytes inside of
freed 576-byte region [ffff88881cab7bc0, ffff88881cab7e00)
[ 400.356285]
[ 400.356287] The buggy address belongs to the physical page:
[ 400.356292] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88881cab0b00 pfn:0x81cab0
[ 400.356300] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[ 400.356323] flags: 0x50000000000040(head|node=1|zone=2)
[ 400.356331] page_type: f5(slab)
[ 400.356340] raw: 0050000000000040 ffff88880a00c280 dead000000000122 0000000000000000
[ 400.356347] raw: ffff88881cab0b00 00000000802e0025 00000001f5000000 0000000000000000
[ 400.356354] head: 0050000000000040 ffff88880a00c280 dead000000000122 0000000000000000
[ 400.356359] head: ffff88881cab0b00 00000000802e0025 00000001f5000000 0000000000000000
[ 400.356365] head: 0050000000000003 ffffea002072ac01 ffffffffffffffff 0000000000000000
[ 400.356370] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
[ 400.356378] page dumped because: kasan: bad access detected
[ 400.356381]
[ 400.356383] Memory state around the buggy address:
[ 400.356387] ffff88881cab7c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 400.356392] ffff88881cab7c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 400.356397] >ffff88881cab7d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 400.356400] ^
[ 400.356405] ffff88881cab7d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 400.356409] ffff88881cab7e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 400.356413] ==================================================================
Cc: stable@vger.kernel.org
Fixes: bc3b1e9e7c50 ("block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()")
Signed-off-by: Zach Wade <zachwade.k@gmail.com>
Cc: Ding Hui <dinghui@sangfor.com.cn>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20241119153410.2546-1-zachwade.k@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make code cleaner, there are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-8-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that 'bfqq_already_existing' is only used in one branch, it can be
removed. There are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The local variable is used to call bfq_bfqq_resume_state() later,
since 'bfqd->lock' is held, and bfqq status will not change between
setting 'split' and calling bfq_bfqq_resume_state(), move forward
bfq_bfqq_resume_state() so that 'split' can be removed. There are no
functional chagnes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Because bfq_put_cooperator() is always followed by
bfq_release_process_ref().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Original state:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge
chain in bfq_setup_cooperator()"), if P1 issues a new IO:
Without the patch:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\------------------------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
bfqq3 will be used to handle IO from P1, this is not expected, IO
should be redirected to bfqq4;
With the patch:
-------------------------------------------
| |
Process 1 Process 2 Process 3 | Process 4
(BIC1) (BIC2) (BIC3) | (BIC4)
| | | |
\-------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
IO is redirected to bfqq4, however, procress reference of bfqq3 is still
2, while there is only P2 using it.
Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge
chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify
code.
Fixes: 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After commit 42c306ed7233 ("block, bfq: don't break merge chain in
bfq_split_bfqq()"), if the current procress is the last holder of bfqq,
the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and
then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq
may in the merge chain of bfqq, hence just recored waker_bfqq is still
not safe.
Fix the problem by adding a helper bfq_waker_bfqq() to check if
bfqq->waker_bfqq is in the merge chain, and current procress is the only
holder.
Fixes: 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of open coding it, there are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Consider the following scenario:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\-------------\ \-------------\ \--------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq()
decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference
of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will
break the merge chain.
Expected result: caller will allocate a new bfqq for BIC1
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
| | |
\-------------\ \--------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 1 3
Since the condition is only used for the last bfqq4 when the previous
bfqq2 and bfqq3 are already splited. Fix the problem by checking if
bfqq is the last one in the merge chain as well.
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Consider the following merge chain:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
IO from Process 1 will get bfqf2 from BIC1 first, then
bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then
handle this IO from bfqq3. However, the merge chain can be much deeper
and bfqq3 can be merged to other bfqq as well.
Fix this problem by iterating to the last bfqq in
bfq_setup_cooperator().
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
1) initial state, three tasks:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
| Λ | Λ | Λ
| | | | | |
V | V | V |
bfqq1 bfqq2 bfqq3
process ref: 1 1 1
2) bfqq1 merged to bfqq2:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
| | | Λ
\--------------\| | |
V V |
bfqq1--------->bfqq2 bfqq3
process ref: 0 2 1
3) bfqq2 merged to bfqq3:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
here -> Λ | |
\--------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3
process ref: 0 1 3
In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then
get bfqq3 through merge chain, and finially handle IO by bfqq3.
Howerver, current code will think bfqq2 is owned by BIC1, like initial
state, and set bfqq2->bic to BIC1.
bfq_insert_request
-> by Process 1
bfqq = bfq_init_rq(rq)
bfqq = bfq_get_bfqq_handle_split
bfqq = bic_to_bfqq
-> get bfqq2 from BIC1
bfqq->ref++
rq->elv.priv[0] = bic
rq->elv.priv[1] = bfqq
if (bfqq_process_refs(bfqq) == 1)
bfqq->bic = bic
-> record BIC1 to bfqq2
__bfq_insert_request
new_bfqq = bfq_setup_cooperator
-> get bfqq3 from bfqq2->new_bfqq
bfqq_request_freed(bfqq)
new_bfqq->ref++
rq->elv.priv[1] = new_bfqq
-> handle IO by bfqq3
Fix the problem by checking bfqq is from merge chain fist. And this
might fix a following problem reported by our syzkaller(unreproducible):
==================================================================
BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595
CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x10d/0x610 mm/kasan/report.c:475
kasan_report+0x8e/0xc0 mm/kasan/report.c:588
bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757
bfq_init_rq block/bfq-iosched.c:6876 [inline]
bfq_insert_request block/bfq-iosched.c:6254 [inline]
bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304
blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593
blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
</TASK>
Allocated by task 20776:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
__kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook mm/slab.h:763 [inline]
slab_alloc_node mm/slub.c:3458 [inline]
kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503
ioc_create_icq block/blk-ioc.c:370 [inline]
ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436
bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812
blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403
__blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517
blk_mq_get_new_requests block/blk-mq.c:2940 [inline]
blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042
__submit_bio+0x331/0x6f0 block/blk-core.c:624
__submit_bio_noacct_mq block/blk-core.c:703 [inline]
submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732
submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826
xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958
xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619
xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
Freed by task 946:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
__kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1815 [inline]
slab_free_freelist_hook mm/slub.c:1841 [inline]
slab_free mm/slub.c:3786 [inline]
kmem_cache_free+0x118/0x6f0 mm/slub.c:3808
rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189
rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462
__do_softirq+0x1b0/0x7a2 kernel/softirq.c:553
Last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
Second to last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
The buggy address belongs to the object at ffff888123839d68
which belongs to the cache bfq_io_cq of size 1360
The buggy address is located 336 bytes inside of
freed 1360-byte region [ffff888123839d68, ffff88812383a2b8)
The buggy address belongs to the physical page:
page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838
head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488
raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently building for C=1 generates the following warning:
block/bfq-iosched.c:5498:9: warning: context imbalance in 'bfq_exit_icq' - different lock contexts for basic block
Refactor bfq_exit_icq() into a core part which loops for the actuators,
and only lock calling this routine when necessary.
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240614090345.655716-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Convert any user of ktime_get_ns() to use blk_time_get_ns(), and
ktime_get() to blk_time_get(), so we have a unified API for querying the
current time in nanoseconds or as ktime.
No functional changes intended, this patch just wraps ktime_get_ns()
and ktime_get() with a block helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi,
lpfc, qla2xxx).
We have a couple of major core changes impacting other systems:
- Command Duration Limits, which spills into block and ATA
- block level Persistent Reservation Operations, which touches block,
nvme, target and dm
Both of these are added with merge commits containing a cover letter
explaining what's going on"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits)
scsi: core: Improve warning message in scsi_device_block()
scsi: core: Replace scsi_target_block() with scsi_block_targets()
scsi: core: Don't wait for quiesce in scsi_device_block()
scsi: core: Don't wait for quiesce in scsi_stop_queue()
scsi: core: Merge scsi_internal_device_block() and device_block()
scsi: sg: Increase number of devices
scsi: bsg: Increase number of devices
scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue
scsi: ufs: ufs-pci: Add support for Intel Arrow Lake
scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT
scsi: ufs: wb: Add explicit flush_threshold sysfs attribute
scsi: ufs: ufs-qcom: Switch to the new ICE API
scsi: ufs: dt-bindings: qcom: Add ICE phandle
scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk
scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk
scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC
scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR
scsi: ufs: core: Remove dedicated hwq for dev command
scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command
scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes
...
|
|
The I/O priority user interface defines the 16-bits ioprio values as the
combination of the upper 3-bits for an I/O priority class and the lower
13-bits as priority data. However, the kernel only uses the lower 3-bits of
the priority data to define priority levels for the RT and BE priority
classes. The data part of an ioprio value is completely ignored for the
IDLE and NONE classes. This is enforced by checks done in
ioprio_check_cap(), which is called for all paths that allow defining an
I/O priority for I/Os: the per-context ioprio_set() system call, aio
interface and io_uring interface.
Clarify this fact in the uapi ioprio.h header file and introduce the
IOPRIO_PRIO_LEVEL_MASK and IOPRIO_PRIO_LEVEL() macros for users to define
and get priority levels in an ioprio value. The coarser macro
IOPRIO_PRIO_DATA() is retained for backward compatibility with old
applications already using it. There is no functional change introduced
with this.
In-kernel users of the IOPRIO_PRIO_DATA() macro which are explicitly
handling I/O priority data as a priority level are modified to use the new
IOPRIO_PRIO_LEVEL() macro without any functional change. Since f2fs is the
only user of this macro not explicitly using that value as a priority
level, it is left unchanged.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Link: https://lore.kernel.org/r/20230511011356.227789-2-nks@flawful.org
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
|
|
Check bfqq->dispatched for each BFQ queue instead of checking it for an
invalid bfqq pointer.
Fixes: 3e49c1e4a615 ("block: BFQ: Add several invariant checks")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230519220347.3643295-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If anything goes wrong with the counters that track the number of
requests, I/O locks up. Make such scenarios easier to debug by adding
invariant checks for the request counters. Additionally, check that
BFQ queues are empty before these are freed.
Cc: Jan Kara <jack@suse.cz>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20230516223853.1385255-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When the weighted sum is zero the calculation of limit causes
a division by zero error. Fix this by continuing to the next level.
This was discovered by running as root:
stress-ng --ioprio 0
Fixes divison by error oops:
[ 521.450556] divide error: 0000 [#1] SMP NOPTI
[ 521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1
[ 521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014
[ 521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400
[ 521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44
[ 521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046
[ 521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000
[ 521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978
[ 521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0
[ 521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18
[ 521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970
[ 521.454549] FS: 00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000
[ 521.454912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0
[ 521.455491] PKRU: 55555554
[ 521.455619] Call Trace:
[ 521.455736] <TASK>
[ 521.455837] ? bfq_request_merge+0x3a/0xc0
[ 521.456027] ? elv_merge+0x115/0x140
[ 521.456191] bfq_limit_depth+0xc8/0x240
[ 521.456366] __blk_mq_alloc_requests+0x21a/0x2c0
[ 521.456577] blk_mq_submit_bio+0x23c/0x6c0
[ 521.456766] __submit_bio+0xb8/0x140
[ 521.457236] submit_bio_noacct_nocheck+0x212/0x300
[ 521.457748] submit_bio_noacct+0x1a6/0x580
[ 521.458220] submit_bio+0x43/0x80
[ 521.458660] ext4_io_submit+0x23/0x80
[ 521.459116] ext4_do_writepages+0x40a/0xd00
[ 521.459596] ext4_writepages+0x65/0x100
[ 521.460050] do_writepages+0xb7/0x1c0
[ 521.460492] __filemap_fdatawrite_range+0xa6/0x100
[ 521.460979] file_write_and_wait_range+0xbf/0x140
[ 521.461452] ext4_sync_file+0x105/0x340
[ 521.461882] __x64_sys_fsync+0x67/0x100
[ 521.462305] ? syscall_exit_to_user_mode+0x2c/0x1c0
[ 521.462768] do_syscall_64+0x3b/0xc0
[ 521.463165] entry_SYSCALL_64_after_hwframe+0x5a/0xc4
[ 521.463621] RIP: 0033:0x5640b6c56590
[ 521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Instead of passing a bool at_head, pass down the full flags from the
blk_mq_insert_request interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk-mq-tag.h is always included by blk-mq.h, and causes recursive
inclusion hell with further changes. Just merge it into blk-mq.h
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Before commit fd571df0ac5b ("block, bfq: turn bfqq_data into an array
in bfq_io_cq"), process reference is read before bfq_put_stable_ref(),
and it's safe if bfq_put_stable_ref() put the last reference, because
process reference will be 0 and 'stable_merge_bfqq' won't be accessed
in this case. However, the commit changed the order and will cause
uaf for 'stable_merge_bfqq'.
In order to emphasize that bfq_put_stable_ref() can drop the last
reference, fix the problem by moving bfq_put_stable_ref() to the end of
bfq_setup_stable_merge().
Fixes: fd571df0ac5b ("block, bfq: turn bfqq_data into an array in bfq_io_cq")
Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/linux-block/20230307071448.rzihxbm4jhbf5krj@shindev/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block updates from Jens Axboe:
- NVMe updates via Christoph:
- Small improvements to the logging functionality (Amit Engel)
- Authentication cleanups (Hannes Reinecke)
- Cleanup and optimize the DMA mapping cod in the PCIe driver
(Keith Busch)
- Work around the command effects for Format NVM (Keith Busch)
- Misc cleanups (Keith Busch, Christoph Hellwig)
- Fix and cleanup freeing single sgl (Keith Busch)
- MD updates via Song:
- Fix a rare crash during the takeover process
- Don't update recovery_cp when curr_resync is ACTIVE
- Free writes_pending in md_stop
- Change active_io to percpu
- Updates to drbd, inching us closer to unifying the out-of-tree driver
with the in-tree one (Andreas, Christoph, Lars, Robert)
- BFQ update adding support for multi-actuator drives (Paolo, Federico,
Davide)
- Make brd compliant with REQ_NOWAIT (me)
- Fix for IOPOLL and queue entering, fixing stalled IO waiting on
timeouts (me)
- Fix for REQ_NOWAIT with multiple bios (me)
- Fix memory leak in blktrace cleanup (Greg)
- Clean up sbitmap and fix a potential hang (Kemeng)
- Clean up some bits in BFQ, and fix a bug in the request injection
(Kemeng)
- Clean up the request allocation and issue code, and fix some bugs
related to that (Kemeng)
- ublk updates and fixes:
- Add support for unprivileged ublk (Ming)
- Improve device deletion handling (Ming)
- Misc (Liu, Ziyang)
- s390 dasd fixes (Alexander, Qiheng)
- Improve utility of request caching and fixes (Anuj, Xiao)
- zoned cleanups (Pankaj)
- More constification for kobjs (Thomas)
- blk-iocost cleanups (Yu)
- Remove bio splitting from drivers that don't need it (Christoph)
- Switch blk-cgroups to use struct gendisk. Some of this is now
incomplete as select late reverts were done. (Christoph)
- Add bvec initialization helpers, and convert callers to use that
rather than open-coding it (Christoph)
- Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin,
Matthew, Ulf, Zhong)
* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
brd: use radix_tree_maybe_preload instead of radix_tree_preload
block: use proper return value from bio_failfast()
block: bio-integrity: Copy flags when bio_integrity_payload is cloned
block: Fix io statistics for cgroup in throttle path
brd: mark as nowait compatible
brd: check for REQ_NOWAIT and set correct page allocation mask
brd: return 0/-error from brd_insert_page()
block: sync mixed merged request's failfast with 1st bio's
Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
Revert "blk-cgroup: pass a gendisk to blkg_lookup"
Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
Revert "blk-cgroup: move the cgroup information to struct gendisk"
nvme-pci: remove iod use_sgls
nvme-pci: fix freeing single sgl
block: ublk: check IO buffer based on flag need_get_data
s390/dasd: Fix potential memleak in dasd_eckd_init()
s390/dasd: sort out physical vs virtual pointers usage
block: Remove the ALLOC_CACHE_SLACK constant
block: make kobj_type structures constant
...
|
|
Prepare for storing the blkcg information in the gendisk instead of
the request_queue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pass a gendisk to wbt_enable_default and wbt_disable_default to
prepare for phasing out usage of the request_queue in the blk-cgroup
code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed.
It is only used in bfq_wr_duration when bfq_wr_max_time > 0 which never
meets, so bfqd->bfq_wr_max_time is not used actually. Just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We jump to tag only for returning current rq. Return directly to
remove this tag.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We have already avoided a circular list in bfq_setup_merge (see comments
in bfq_setup_merge() for details), so bfq_queue will not appear in it's
new_bfqq list. Just remove this check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use helper macro RQ_BFQQ to get bfqq of request.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Inject limit is updated or reset when time_is_before_eq_jiffies(
decrease_time_jif + several msecs) or think-time state changes.
decrease_time_jif is initialized to 0 and will be set to current jiffies
when inject limit is updated or reset. If the jiffies is slightly greater
than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as
time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the
think-time state never chages, then the injection will not work as expected
for long time.
To be more specific:
Function bfq_update_inject_limit maybe triggered when jiffies pasts
decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting
bfqd->wait_dispatch to true.
Function bfq_reset_inject_limit are called in two conditions:
1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in
function bfq_add_request.
2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or
bfq think-time state change from short to long.
Fix this by initializing bfqq->decrease_time_jif to current jiffies
to trigger service injection soon when service injection conditions
are met.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Parameter reason is never used, just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Function bfq_choose_bfqq_for_injection may temporarily raise inject limit
to one request if current inject_limit is 0 before search of the source
queue for injection. However the search below will reset inject limit to
bfqd->in_service_queue which is zero for raised inject limit. Then the
temporarily raised inject limit never works as expected.
Assigment limit to bfqd->in_service_queue in search is needed as limit
maybe overwriten to min_t(unsigned int, 1, limit) for condition that
a large in-flight request is on non-rotational devices in found queue.
So we need to reset limit to bfqd->in_service_queue for normal case.
Actually, we have already make sure bfqd->rq_in_driver is < limit before
search, then
-Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int,
1, limit) is always 1. So we can simply check bfqd->rq_in_driver with
1 instead of result of min_t(unsigned int, 1, limit) for larget request in
non-rotational device case to avoid overwritting limit and the bug is gone.
-For normal case, we have already check bfqd->rq_in_driver is < limit,
so we can return found bfqq unconditionally to remove unncessary check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'"),
bic->bfqq will be accessed in bic_set_bfqq(), however, in some context
bic->bfqq will be freed, and bic_set_bfqq() is called with the freed
bic->bfqq.
Fix the problem by always freeing bfqq after bic_set_bfqq().
Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'")
Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230130014136.591038-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Upon the invocation of its dispatch function, BFQ returns the next I/O
request of the in-service bfq_queue, unless some exception holds. One
such exception is that there is some underutilized actuator, different
from the actuator for which the in-service queue contains I/O, and
that some other bfq_queue happens to contain I/O for such an
actuator. In this case, the next I/O request of the latter bfq_queue,
and not of the in-service bfq_queue, is returned (I/O is injected from
that bfq_queue). To find such an actuator, a linear scan, in
increasing index order, is performed among actuators.
Performing a linear scan entails a prioritization among actuators: an
underutilized actuator may be considered for injection only if all
actuators with a lower index are currently fully utilized, or if there
is no pending I/O for any lower-index actuator that happens to be
underutilized.
This commits breaks this prioritization and tends to distribute
injection uniformly across actuators. This is obtained by adding the
following condition to the linear scan: even if an actuator A is
underutilized, A is however skipped if its load is higher than that of
the next actuator.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-9-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The main service scheme of BFQ for sync I/O is serving one sync
bfq_queue at a time, for a while. In particular, BFQ enforces this
scheme when it deems the latter necessary to boost throughput or
to preserve service guarantees. Unfortunately, when BFQ enforces
this policy, only one actuator at a time gets served for a while,
because each bfq_queue contains I/O only for one actuator. The
other actuators may remain underutilized.
Actually, BFQ may serve (inject) extra I/O, taken from other
bfq_queues, in parallel with that of the in-service queue. This
injection mechanism may provide the ground for dealing also with
the above actuator-underutilization problem. Yet BFQ does not take
the actuator load into account when choosing which queue to pick
extra I/O from. In addition, BFQ may happen to inject extra I/O
only when the in-service queue is temporarily empty.
In view of these facts, this commit extends the
injection mechanism in such a way that the latter:
(1) takes into account also the actuator load;
(2) checks such a load on each dispatch, and injects I/O for an
underutilized actuator, if there is one and there is I/O for it.
To perform the check in (2), this commit introduces a load
threshold, currently set to 4. A linear scan of each actuator is
performed, until an actuator is found for which the following two
conditions hold: the load of the actuator is below the threshold,
and there is at least one non-in-service queue that contains I/O
for that actuator. If such a pair (actuator, queue) is found, then
the head request of that queue is returned for dispatch, instead
of the head request of the in-service queue.
We have set the threshold, empirically, to the minimum possible
value for which an actuator is fully utilized, or close to be
fully utilized. By doing so, injected I/O 'steals' as few
drive-queue slots as possibile to the in-service queue. This
reduces as much as possible the probability that the service of
I/O from the in-service bfq_queue gets delayed because of slot
exhaustion, i.e., because all the slots of the drive queue are
filled with I/O injected from other queues (NCQ provides for 32
slots).
This new mechanism also counters actuator underutilization in the
case of asymmetric configurations of bfq_queues. Namely if there
are few bfq_queues containing I/O for some actuators and many
bfq_queues containing I/O for other actuators. Or if the
bfq_queues containing I/O for some actuators have lower weights
than the other bfq_queues.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch implements the code to gather the content of the
independent_access_ranges structure from the request_queue and copy
it into the queue's bfq_data. This copy is done at queue initialization.
We copy the access ranges into the bfq_data to avoid taking the queue
lock each time we access the ranges.
This implementation, however, puts a limit to the maximum independent
ranges supported by the scheduler. Such a limit is equal to the constant
BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of
dynamic memory.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Rory Chen <rory.c.chen@seagate.com>
Signed-off-by: Rory Chen <rory.c.chen@seagate.com>
Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Similarly to sync bfq_queues, also async bfq_queues need to be split
on a per-actuator basis.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When a bfq_queue Q is merged with another queue, several pieces of
information are saved about Q. These pieces are stored in the
bfqq_data field in the bfq_io_cq data structure of the process
associated with Q.
Yet, with a multi-actuator drive, a process may get associated with
multiple bfq_queues: one queue for each of the N actuators. Each of
these queues may undergo a merge. So, the bfq_io_cq data structure
must be able to accommodate the above information for N queues.
This commit solves this problem by turning the bfqq_data scalar field
into an array of N elements (and by changing code so as to handle
this array).
This solution is written under the assumption that bfq_queues
associated with different actuators cannot be cross-merged. This
assumption holds naturally with basic queue merging: the latter is
triggered by spatial locality, and sectors for different actuators are
not close to each other (apart from the corner case of the last
sectors served by a given actuator and the first sectors served by the
next actuator). As for stable cross-merging, the assumption here is
that it is disabled.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gabriele Felici <felicigb@gmail.com>
Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net>
Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com>
Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With a multi-actuator drive, a process may get associated with multiple
bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq
data structure must be able to accommodate its per-queue persistent
information for N queues. Currently it stores this information for
just one queue, in several scalar fields.
This is a preparatory commit for moving to accommodating persistent
information for N queues. In particular, this commit packs all the
above scalar fields into a single data structure. Then there is now
only one field, in bfq_io_cq, that stores all the above information. This
scalar field will then be turned into an array by a following commit.
Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net>
Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com>
Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If queues associated with different actuators are merged, then control
is lost on each actuator. Therefore some actuator may be
underutilized, and throughput may decrease. This problem cannot occur
with basic queue merging, because the latter is triggered by spatial
locality, and sectors for different actuators are not close to each
other. Yet it may happen with stable merging. To address this issue,
this commit prevents stable merging from occurring among queues
associated with different actuators.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Single-LUN multi-actuator SCSI drives, as well as all multi-actuator
SATA drives appear as a single device to the I/O subsystem [1]. Yet
they address commands to different actuators internally, as a function
of Logical Block Addressing (LBAs). A given sector is reachable by
only one of the actuators. For example, Seagate’s Serial Advanced
Technology Attachment (SATA) version contains two actuators and maps
the lower half of the SATA LBA space to the lower actuator and the
upper half to the upper actuator.
Evidently, to fully utilize actuators, no actuator must be left idle
or underutilized while there is pending I/O for it. The block layer
must somehow control the load of each actuator individually. This
commit lays the ground for allowing BFQ to provide such a per-actuator
control.
BFQ associates an I/O-request sync bfq_queue with each process doing
synchronous I/O, or with a group of processes, in case of queue
merging. Then BFQ serves one bfq_queue at a time. While in service, a
bfq_queue is emptied in request-position order. Yet the same process,
or group of processes, may generate I/O for different actuators. In
this case, different streams of I/O (each for a different actuator)
get all inserted into the same sync bfq_queue. So there is basically
no individual control on when each stream is served, i.e., on when the
I/O requests of the stream are picked from the bfq_queue and
dispatched to the drive.
This commit enables BFQ to control the service of each actuator
individually for synchronous I/O, by simply splitting each sync
bfq_queue into N queues, one for each actuator. In other words, a sync
bfq_queue is now associated to a pair (process, actuator). As a
consequence of this split, the per-queue proportional-share policy
implemented by BFQ will guarantee that the sync I/O generated for each
actuator, by each process, receives its fair share of service.
This is just a preparatory patch. If the I/O of the same process
happens to be sent to different queues, then each of these queues may
undergo queue merging. To handle this event, the bfq_io_cq data
structure must be properly extended. In addition, stable merging must
be disabled to avoid loss of control on individual actuators. Finally,
also async queues must be split. These issues are described in detail
and addressed in next commits. As for this commit, although multiple
per-process bfq_queues are provided, the I/O of each process or group
of processes is still sent to only one queue, regardless of the
actuator the I/O is for. The forwarding to distinct bfq_queues will be
enabled after addressing the above issues.
[1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Gabriele Felici <felicigb@gmail.com>
Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'")
will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq()
can free bfqq first, and then call bic_set_bfqq(), which will cause uaf.
Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq().
Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The 'bfqd->num_groups_with_pending_reqs' is used when
CONFIG_BFQ_GROUP_IOSCHED is enabled, so let the variables and processes
take effect when CONFIG_BFQ_GROUP_IOSCHED is enabled.
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221110112622.389332-1-Yuwei.Guan@zeekrlife.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Just to make the code a litter cleaner, there are no functional changes.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221214033155.3455754-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Our test report a uaf for 'bfqq->bic' in 5.10:
==================================================================
BUG: KASAN: use-after-free in bfq_select_queue+0x378/0xa30
CPU: 6 PID: 2318352 Comm: fsstress Kdump: loaded Not tainted 5.10.0-60.18.0.50.h602.kasan.eulerosv2r11.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220320_160524-szxrtosci10000 04/01/2014
Call Trace:
bfq_select_queue+0x378/0xa30
bfq_dispatch_request+0xe8/0x130
blk_mq_do_dispatch_sched+0x62/0xb0
__blk_mq_sched_dispatch_requests+0x215/0x2a0
blk_mq_sched_dispatch_requests+0x8f/0xd0
__blk_mq_run_hw_queue+0x98/0x180
__blk_mq_delay_run_hw_queue+0x22b/0x240
blk_mq_run_hw_queue+0xe3/0x190
blk_mq_sched_insert_requests+0x107/0x200
blk_mq_flush_plug_list+0x26e/0x3c0
blk_finish_plug+0x63/0x90
__iomap_dio_rw+0x7b5/0x910
iomap_dio_rw+0x36/0x80
ext4_dio_read_iter+0x146/0x190 [ext4]
ext4_file_read_iter+0x1e2/0x230 [ext4]
new_sync_read+0x29f/0x400
vfs_read+0x24e/0x2d0
ksys_read+0xd5/0x1b0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x61/0xc6
Commit 3bc5e683c67d ("bfq: Split shared queues on move between cgroups")
changes that move process to a new cgroup will allocate a new bfqq to
use, however, the old bfqq and new bfqq can point to the same bic:
1) Initial state, two process with io in the same cgroup.
Process 1 Process 2
(BIC1) (BIC2)
| Λ | Λ
| | | |
V | V |
bfqq1 bfqq2
2) bfqq1 is merged to bfqq2.
Process 1 Process 2
(BIC1) (BIC2)
| |
\-------------\|
V
bfqq1 bfqq2(coop)
3) Process 1 exit, then issue new io(denoce IOA) from Process 2.
(BIC2)
| Λ
| |
V |
bfqq2(coop)
4) Before IOA is completed, move Process 2 to another cgroup and issue io.
Process 2
(BIC2)
Λ
|\--------------\
| V
bfqq2 bfqq3
Now that BIC2 points to bfqq3, while bfqq2 and bfqq3 both point to BIC2.
If all the requests are completed, and Process 2 exit, BIC2 will be
freed while there is no guarantee that bfqq2 will be freed before BIC2.
Fix the problem by clearing bfqq->bic while bfqq is detached from bic.
Fixes: 3bc5e683c67d ("bfq: Split shared queues on move between cgroups")
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221214030430.3304151-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
oom_bfqq is just a fallback bfqq, so shouldn't be used with waker
detection.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221108181030.1611703-2-khazhy@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL,
but woken_list_node still being hashed. This would happen when
bfq_init_rq() expects a brand new allocated queue to be returned from
bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq
without resetting woken_list_node. Since we can always return oom_bfqq
when attempting to allocate, we cannot assume waker_bfqq starts as NULL.
Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful.
Crashes would have a stacktrace like:
[160595.656560] bfq_add_bfqq_busy+0x110/0x1ec
[160595.661142] bfq_add_request+0x6bc/0x980
[160595.666602] bfq_insert_request+0x8ec/0x1240
[160595.671762] bfq_insert_requests+0x58/0x9c
[160595.676420] blk_mq_sched_insert_request+0x11c/0x198
[160595.682107] blk_mq_submit_bio+0x270/0x62c
[160595.686759] __submit_bio_noacct_mq+0xec/0x178
[160595.691926] submit_bio+0x120/0x184
[160595.695990] ext4_mpage_readpages+0x77c/0x7c8
[160595.701026] ext4_readpage+0x60/0xb0
[160595.705158] filemap_read_page+0x54/0x114
[160595.711961] filemap_fault+0x228/0x5f4
[160595.716272] do_read_fault+0xe0/0x1f0
[160595.720487] do_fault+0x40/0x1c8
Tested by injecting random failures into bfq_get_queue, crashes go away
completely.
Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers")
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221108181030.1611703-1-khazhy@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Such code are not even compiled since they are inside marco "#if 0".
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's the same with bfq_weights_tree_remove() now.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The 'bfq_data' and 'rb_root_cached' can both be accessed through
'bfq_queue', thus only pass 'bfq_queue' as parameter.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that root group is counted into 'num_groups_with_pending_reqs',
'num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario(). Thus change the condition to '> 1'.
On the other hand, this change can enable concurrent sync io if only
one group is activated.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, bfq can't handle sync io concurrently as long as they
are not issued from root group. This is because
'bfqd->num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario().
The way that bfqg is counted into 'num_groups_with_pending_reqs':
Before this patch:
1) root group will never be counted.
2) Count if bfqg or it's child bfqgs have pending requests.
3) Don't count if bfqg and it's child bfqgs complete all the requests.
After this patch:
1) root group is counted.
2) Count if bfqg have pending requests.
3) Don't count if bfqg complete all the requests.
With this change, the occasion that only one group is activated can be
detected, and next patch will support concurrent sync io in the
occasion.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If entity belongs to bfqq, then entity->in_groups_with_pending_reqs
is not used currently. This patch use it to track if bfqq has pending
requests through callers of weights_tree insertion and removal.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit b5dc5d4d1f4f ("block,bfq: Disable writeback throttling") tries to
disable wbt for bfq, it's done by calling wbt_disable_default() in
bfq_init_queue(). However, wbt is still enabled if default elevator is
bfq:
device_add_disk
elevator_init_mq
bfq_init_queue
wbt_disable_default -> done nothing
blk_register_queue
wbt_enable_default -> wbt is enabled
Fix the problem by adding a new flag ELEVATOR_FLAG_DISBALE_WBT, bfq
will set the flag in bfq_init_queue, and following wbt_enable_default()
won't enable wbt while the flag is set.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221019121518.3865235-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
'bfqd' can be accessed through 'bfqq->bfqd', there is no need to pass
it as a parameter separately.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
'bfqq->bfqd' is ensured to set in bfq_init_queue(), and it will never
change afterwards.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220816015631.1323948-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the new blk_opf_t type for arguments and variables that represent
request flags or a bitwise combination of a request operation and
request flags. Rename those variables from 'op' into 'opf'.
This patch does not change any functionality.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-8-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ uses io_start_time_ns. That member variable is only set if I/O
statistics are enabled. Hence this patch that enables I/O statistics
at the time BFQ is associated with a request queue.
Compile-tested only.
Reported-by: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Cixi Geng <cixi.geng1@unisoc.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Paolo Valente <paolo.valente@unimore.it>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
q->elevator is referred in blk_mq_has_sqsched() without any protection,
no .q_usage_counter is held, no queue srcu and rcu read lock is held,
so potential use-after-free may be triggered.
Fix the issue by adding one queue flag for checking if the elevator
uses single queue style dispatch. Meantime the elevator feature flag
of ELEVATOR_F_MQ_AWARE isn't needed any more.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block updates from Jens Axboe:
"Here are the core block changes for 5.19. This contains:
- blk-throttle accounting fix (Laibin)
- Series removing redundant assignments (Michal)
- Expose bio cache via the bio_set, so that DM can use it (Mike)
- Finish off the bio allocation interface cleanups by dealing with
the weirdest member of the family. bio_kmalloc combines a kmalloc
for the bio and bio_vecs with a hidden bio_init call and magic
cleanup semantics (Christoph)
- Clean up the block layer API so that APIs consumed by file systems
are (almost) only struct block_device based, so that file systems
don't have to poke into block layer internals like the
request_queue (Christoph)
- Clean up the blk_execute_rq* API (Christoph)
- Clean up various lose end in the blk-cgroup code to make it easier
to follow in preparation of reworking the blkcg assignment for bios
(Christoph)
- Fix use-after-free issues in BFQ when processes with merged queues
get moved to different cgroups (Jan)
- BFQ fixes (Jan)
- Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming,
Wolfgang, me)"
* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
blk-mq: fix typo in comment
bfq: Remove bfq_requeue_request_body()
bfq: Remove superfluous conversion from RQ_BIC()
bfq: Allow current waker to defend against a tentative one
bfq: Relax waker detection for shared queues
blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
blk-throttle: Set BIO_THROTTLED when bio has been throttled
blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
blk-cgroup: always terminate io.stat lines
block, bfq: make bfq_has_work() more accurate
block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
block: cleanup the VM accounting in submit_bio
block: Fix the bio.bi_opf comment
block: reorder the REQ_ flags
blk-iocost: combine local_stat and desc_stat to stat
block: improve the error message from bio_check_eod
block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
block: remove superfluous calls to blkcg_bio_issue_init
kthread: unexport kthread_blkcg
blk-cgroup: cleanup blkcg_maybe_throttle_current
...
|
|
The function has only a single caller and two lines. Just remove it
since it is pointless and just harming readability.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We store struct bfq_io_cq pointer in rq->elv.priv[0] in bfq_init_rq().
Thus a call to icq_to_bic() in RQ_BIC() is wrong. Luckily it does no
harm currently because struct io_iq is the first one in struct
bfq_io_cq.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The code in bfq_check_waker() ignores wake up events from the current
waker. This makes it more likely we select a new tentative waker
although the current one is generating more wake up events. Treat
current waker the same way as any other process and allow it to reset
the waker detection logic.
Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently we look for waker only if current queue has no requests. This
makes sense for bfq queues with a single process however for shared
queues when there is a larger number of processes the condition that
queue has no requests is difficult to meet because often at least one
process has some request in flight although all the others are waiting
for the waker to do the work and this harms throughput. Relax the "no
queued request for bfq queue" condition to "the current task has no
queued requests yet". For this, we also need to start tracking number of
requests in flight for each task.
This patch (together with the following one) restores the performance
for dbench with 128 clients that regressed with commit c65e6fd460b4
("bfq: Do not let waker requests skip proper accounting") because
this commit makes requests of wakers properly enter BFQ queues and thus
these queues become ineligible for the old waker detection logic.
Dbench results:
Vanilla 5.18-rc3 5.18-rc3 + revert 5.18-rc3 patched
Mean 1237.36 ( 0.00%) 950.16 * 23.21%* 988.35 * 20.12%*
Numbers are time to complete workload so lower is better.
Fixes: c65e6fd460b4 ("bfq: Do not let waker requests skip proper accounting")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq_has_work() is using busy_queues currently, which is not accurate
because bfq_queue is busy doesn't represent that it has requests. Since
bfqd aready has a counter 'queued' to record how many requests are in
bfq, use it instead of busy_queues.
Noted that bfq_has_work() can be called with 'bfqd->lock' held, thus the
lock can't be held in bfq_has_work() to protect 'bfqd->queued'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220513023507.2625717-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If bfq_schedule_dispatch() is called from bfq_idle_slice_timer_body(),
then 'bfqd->queued' is read without holding 'bfqd->lock'. This is
wrong since it can be wrote concurrently.
Fix the problem by holding 'bfqd->lock' in such case.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220513023507.2625717-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
People are occasionally reporting a warning bfqq_request_over_limit()
triggering reporting that BFQ's idea of cgroup hierarchy (and its depth)
does not match what generic blkcg code thinks. This can actually happen
when bfqq gets moved between BFQ groups while bfqq_request_over_limit()
is running. Make sure the code is safe against BFQ queue being moved to
a different BFQ group.
Fixes: 76f1df88bbc2 ("bfq: Limit number of requests consumed by each cgroup")
CC: stable@vger.kernel.org
Link: https://lore.kernel.org/all/CAJCQCtTw_2C7ZSz7as5Gvq=OmnDiio=HRkQekqWpKot84sQhFA@mail.gmail.com/
Reported-by: Chris Murphy <lists@colorremedies.com>
Reported-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220407140738.9723-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ usage of __bio_blkcg() is a relict from the past. Furthermore if bio
would not be associated with any blkcg, the usage of __bio_blkcg() in
BFQ is prone to races with the task being migrated between cgroups as
__bio_blkcg() calls at different places could return different blkcgs.
Convert BFQ to the new situation where bio->bi_blkg is initialized in
bio_set_dev() and thus practically always valid. This allows us to save
blkcg_gq lookup and noticeably simplify the code.
CC: stable@vger.kernel.org
Fixes: 0fe061b9f03c ("blkcg: fix ref count issue with bio_blkcg() using task_css")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We call bfq_init_rq() from request merging functions where requests we
get should have already gone through bfq_init_rq() during insert and
anyway we want to do anything only if the request is already tracked by
BFQ. So replace calls to bfq_init_rq() with RQ_BFQQ() instead to simply
skip requests untracked by BFQ. We move bfq_init_rq() call in
bfq_insert_request() a bit earlier to cover request merging and thus
can transfer FIFO position in case of a merge.
CC: stable@vger.kernel.org
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bfq_insert_request() we unlock bfqd->lock only to call
trace_block_rq_insert() and then lock bfqd->lock again. This is really
pointless since tracing is disabled if we really care about performance
and even if the tracepoint is enabled, it is a quick call.
CC: stable@vger.kernel.org
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When the process is migrated to a different cgroup (or in case of
writeback just starts submitting bios associated with a different
cgroup) bfq_merge_bio() can operate with stale cgroup information in
bic. Thus the bio can be merged to a request from a different cgroup or
it can result in merging of bfqqs for different cgroups or bfqqs of
already dead cgroups and causing possible use-after-free issues. Fix the
problem by updating cgroup information in bfq_merge_bio().
CC: stable@vger.kernel.org
Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When bfqq is shared by multiple processes it can happen that one of the
processes gets moved to a different cgroup (or just starts submitting IO
for different cgroup). In case that happens we need to split the merged
bfqq as otherwise we will have IO for multiple cgroups in one bfqq and
we will just account IO time to wrong entities etc.
Similarly if the bfqq is scheduled to merge with another bfqq but the
merge didn't happen yet, cancel the merge as it need not be valid
anymore.
CC: stable@vger.kernel.org
Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It can happen that the parent of a bfqq changes between the moment we
decide two queues are worth to merge (and set bic->stable_merge_bfqq)
and the moment bfq_setup_merge() is called. This can happen e.g. because
the process submitted IO for a different cgroup and thus bfqq got
reparented. It can even happen that the bfqq we are merging with has
parent cgroup that is already offline and going to be destroyed in which
case the merge can lead to use-after-free issues such as:
BUG: KASAN: use-after-free in __bfq_deactivate_entity+0x9cb/0xa50
Read of size 8 at addr ffff88800693c0c0 by task runc:[2:INIT]/10544
CPU: 0 PID: 10544 Comm: runc:[2:INIT] Tainted: G E 5.15.2-0.g5fb85fd-default #1 openSUSE Tumbleweed (unreleased) f1f3b891c72369aebecd2e43e4641a6358867c70
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
Call Trace:
<IRQ>
dump_stack_lvl+0x46/0x5a
print_address_description.constprop.0+0x1f/0x140
? __bfq_deactivate_entity+0x9cb/0xa50
kasan_report.cold+0x7f/0x11b
? __bfq_deactivate_entity+0x9cb/0xa50
__bfq_deactivate_entity+0x9cb/0xa50
? update_curr+0x32f/0x5d0
bfq_deactivate_entity+0xa0/0x1d0
bfq_del_bfqq_busy+0x28a/0x420
? resched_curr+0x116/0x1d0
? bfq_requeue_bfqq+0x70/0x70
? check_preempt_wakeup+0x52b/0xbc0
__bfq_bfqq_expire+0x1a2/0x270
bfq_bfqq_expire+0xd16/0x2160
? try_to_wake_up+0x4ee/0x1260
? bfq_end_wr_async_queues+0xe0/0xe0
? _raw_write_unlock_bh+0x60/0x60
? _raw_spin_lock_irq+0x81/0xe0
bfq_idle_slice_timer+0x109/0x280
? bfq_dispatch_request+0x4870/0x4870
__hrtimer_run_queues+0x37d/0x700
? enqueue_hrtimer+0x1b0/0x1b0
? kvm_clock_get_cycles+0xd/0x10
? ktime_get_update_offsets_now+0x6f/0x280
hrtimer_interrupt+0x2c8/0x740
Fix the problem by checking that the parent of the two bfqqs we are
merging in bfq_setup_merge() is the same.
Link: https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/
CC: stable@vger.kernel.org
Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq_setup_cooperator() can mark bic as stably merged even though it
decides to not merge its bfqqs (when bfq_setup_merge() returns NULL).
Make sure to mark bic as stably merged only if we are really going to
merge bfqqs.
CC: stable@vger.kernel.org
Tested-by: "yukuai (C)" <yukuai3@huawei.com>
Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220401102752.8599-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Merge updates from Andrew Morton:
- A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs
- Most the MM patches which precede the patches in Willy's tree: kasan,
pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
Docs/ABI/testing: add DAMON sysfs interface ABI document
Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
selftests/damon: add a test for DAMON sysfs interface
mm/damon/sysfs: support DAMOS stats
mm/damon/sysfs: support DAMOS watermarks
mm/damon/sysfs: support schemes prioritization
mm/damon/sysfs: support DAMOS quotas
mm/damon/sysfs: support DAMON-based Operation Schemes
mm/damon/sysfs: support the physical address space monitoring
mm/damon/sysfs: link DAMON for virtual address spaces monitoring
mm/damon: implement a minimal stub for sysfs-based DAMON interface
mm/damon/core: add number of each enum type values
mm/damon/core: allow non-exclusive DAMON start/stop
Docs/damon: update outdated term 'regions update interval'
Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
Docs/vm/damon: call low level monitoring primitives the operations
mm/damon: remove unnecessary CONFIG_DAMON option
mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
mm/damon/dbgfs-test: fix is_target_id() change
...
|
|
bfq_get_queue() expects a "bool" for the third arg, so pass "false"
rather than "BLK_RW_ASYNC" which will soon be removed.
Link: https://lkml.kernel.org/r/164549983746.9187.7949730109246767909.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jens Axboe <axboe@kernel.dk>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull block updates from Jens Axboe:
- BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo)
- blk-rq-qos completion fix (Tejun)
- blk-cgroup merge fix (Tejun)
- Add offline error return value to distinguish it from an IO error on
the device (Song)
- IO stats fixes (Zhang, Christoph)
- blkcg refcount fixes (Ming, Yu)
- Fix for indefinite dispatch loop softlockup (Shin'ichiro)
- blk-mq hardware queue management improvements (Ming)
- sbitmap dead code removal (Ming, John)
- Plugging merge improvements (me)
- Show blk-crypto capabilities in sysfs (Eric)
- Multiple delayed queue run improvement (David)
- Block throttling fixes (Ming)
- Start deprecating auto module loading based on dev_t (Christoph)
- bio allocation improvements (Christoph, Chaitanya)
- Get rid of bio_devname (Christoph)
- bio clone improvements (Christoph)
- Block plugging improvements (Christoph)
- Get rid of genhd.h header (Christoph)
- Ensure drivers use appropriate flush helpers (Christoph)
- Refcounting improvements (Christoph)
- Queue initialization and teardown improvements (Ming, Christoph)
- Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng,
Lukas, Nian, Yang, Eric, Chengming)
* tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits)
block: cancel all throttled bios in del_gendisk()
block: let blkcg_gq grab request queue's refcnt
block: avoid use-after-free on throttle data
block: limit request dispatch loop duration
block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"
sr: simplify the local variable initialization in sr_block_open()
block: don't merge across cgroup boundaries if blkcg is enabled
block: fix rq-qos breakage from skipping rq_qos_done_bio()
block: flush plug based on hardware and software queue order
block: ensure plug merging checks the correct queue at least once
block: move rq_qos_exit() into disk_release()
block: do more work in elevator_exit
block: move blk_exit_queue into disk_release
block: move q_usage_counter release into blk_queue_release
block: don't remove hctx debugfs dir from blk_mq_exit_queue
block: move blkcg initialization/destroy into disk allocation/release handler
sr: implement ->free_disk to simplify refcounting
sd: implement ->free_disk to simplify refcounting
sd: delay calling free_opal_dev
sd: call sd_zbc_release_disk before releasing the scsi_device reference
...
|
|
There is a spelling mistake in a bfq_log_bfqq message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220315221539.2959167-1-colin.i.king@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A crash [1] happened to be triggered in conjunction with commit
2d52c58b9c9b ("block, bfq: honor already-setup queue merges"). The
latter was then reverted by commit ebc69e897e17 ("Revert "block, bfq:
honor already-setup queue merges""). Yet, the reverted commit was not
the one introducing the bug. In fact, it actually triggered a UAF
introduced by a different commit, and now fixed by commit d29bd41428cf
("block, bfq: reset last_bfqq_created on group change").
So, there is no point in keeping commit 2d52c58b9c9b ("block, bfq:
honor already-setup queue merges") out. This commit restores it.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=214503
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
KASAN reports a use-after-free report when doing normal scsi-mq test
[69832.239032] ==================================================================
[69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0
[69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155
[69832.244656]
[69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8
[69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[69832.249069] Workqueue: kblockd blk_mq_run_work_fn
[69832.250022] Call Trace:
[69832.250541] dump_stack+0x9b/0xce
[69832.251232] ? bfq_dispatch_request+0x1045/0x44b0
[69832.252243] print_address_description.constprop.6+0x3e/0x60
[69832.253381] ? __cpuidle_text_end+0x5/0x5
[69832.254211] ? vprintk_func+0x6b/0x120
[69832.254994] ? bfq_dispatch_request+0x1045/0x44b0
[69832.255952] ? bfq_dispatch_request+0x1045/0x44b0
[69832.256914] kasan_report.cold.9+0x22/0x3a
[69832.257753] ? bfq_dispatch_request+0x1045/0x44b0
[69832.258755] check_memory_region+0x1c1/0x1e0
[69832.260248] bfq_dispatch_request+0x1045/0x44b0
[69832.261181] ? bfq_bfqq_expire+0x2440/0x2440
[69832.262032] ? blk_mq_delay_run_hw_queues+0xf9/0x170
[69832.263022] __blk_mq_do_dispatch_sched+0x52f/0x830
[69832.264011] ? blk_mq_sched_request_inserted+0x100/0x100
[69832.265101] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[69832.266206] ? blk_mq_do_dispatch_ctx+0x570/0x570
[69832.267147] ? __switch_to+0x5f4/0xee0
[69832.267898] blk_mq_sched_dispatch_requests+0xdf/0x140
[69832.268946] __blk_mq_run_hw_queue+0xc0/0x270
[69832.269840] blk_mq_run_work_fn+0x51/0x60
[69832.278170] process_one_work+0x6d4/0xfe0
[69832.278984] worker_thread+0x91/0xc80
[69832.279726] ? __kthread_parkme+0xb0/0x110
[69832.280554] ? process_one_work+0xfe0/0xfe0
[69832.281414] kthread+0x32d/0x3f0
[69832.282082] ? kthread_park+0x170/0x170
[69832.282849] ret_from_fork+0x1f/0x30
[69832.283573]
[69832.283886] Allocated by task 7725:
[69832.284599] kasan_save_stack+0x19/0x40
[69832.285385] __kasan_kmalloc.constprop.2+0xc1/0xd0
[69832.286350] kmem_cache_alloc_node+0x13f/0x460
[69832.287237] bfq_get_queue+0x3d4/0x1140
[69832.287993] bfq_get_bfqq_handle_split+0x103/0x510
[69832.289015] bfq_init_rq+0x337/0x2d50
[69832.289749] bfq_insert_requests+0x304/0x4e10
[69832.290634] blk_mq_sched_insert_requests+0x13e/0x390
[69832.291629] blk_mq_flush_plug_list+0x4b4/0x760
[69832.292538] blk_flush_plug_list+0x2c5/0x480
[69832.293392] io_schedule_prepare+0xb2/0xd0
[69832.294209] io_schedule_timeout+0x13/0x80
[69832.295014] wait_for_common_io.constprop.1+0x13c/0x270
[69832.296137] submit_bio_wait+0x103/0x1a0
[69832.296932] blkdev_issue_discard+0xe6/0x160
[69832.297794] blk_ioctl_discard+0x219/0x290
[69832.298614] blkdev_common_ioctl+0x50a/0x1750
[69832.304715] blkdev_ioctl+0x470/0x600
[69832.305474] block_ioctl+0xde/0x120
[69832.306232] vfs_ioctl+0x6c/0xc0
[69832.306877] __se_sys_ioctl+0x90/0xa0
[69832.307629] do_syscall_64+0x2d/0x40
[69832.308362] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[69832.309382]
[69832.309701] Freed by task 155:
[69832.310328] kasan_save_stack+0x19/0x40
[69832.311121] kasan_set_track+0x1c/0x30
[69832.311868] kasan_set_free_info+0x1b/0x30
[69832.312699] __kasan_slab_free+0x111/0x160
[69832.313524] kmem_cache_free+0x94/0x460
[69832.314367] bfq_put_queue+0x582/0x940
[69832.315112] __bfq_bfqd_reset_in_service+0x166/0x1d0
[69832.317275] bfq_bfqq_expire+0xb27/0x2440
[69832.318084] bfq_dispatch_request+0x697/0x44b0
[69832.318991] __blk_mq_do_dispatch_sched+0x52f/0x830
[69832.319984] __blk_mq_sched_dispatch_requests+0x398/0x4f0
[69832.321087] blk_mq_sched_dispatch_requests+0xdf/0x140
[69832.322225] __blk_mq_run_hw_queue+0xc0/0x270
[69832.323114] blk_mq_run_work_fn+0x51/0x60
[69832.323942] process_one_work+0x6d4/0xfe0
[69832.324772] worker_thread+0x91/0xc80
[69832.325518] kthread+0x32d/0x3f0
[69832.326205] ret_from_fork+0x1f/0x30
[69832.326932]
[69832.338297] The buggy address belongs to the object at ffff88802622b968
[69832.338297] which belongs to the cache bfq_queue of size 512
[69832.340766] The buggy address is located 288 bytes inside of
[69832.340766] 512-byte region [ffff88802622b968, ffff88802622bb68)
[69832.343091] The buggy address belongs to the page:
[69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228
[69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0
[69832.347719] flags: 0x1fffff80010200(slab|head)
[69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840
[69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000
[69832.356547] page dumped because: kasan: bad access detected
[69832.357652]
[69832.357970] Memory state around the buggy address:
[69832.358926] ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.360358] ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[69832.363273] ^
[69832.363975] ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
[69832.375960] ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[69832.377405] ==================================================================
In bfq_dispatch_requestfunction, it may have function call:
bfq_dispatch_request
__bfq_dispatch_request
bfq_select_queue
bfq_bfqq_expire
__bfq_bfqd_reset_in_service
bfq_put_queue
kmem_cache_free
In this function call, in_serv_queue has beed expired and meet the
conditions to free. In the function bfq_dispatch_request, the address
of in_serv_queue pointing to has been released. For getting the value
of idle_timer_disabled, it will get flags value from the address which
in_serv_queue pointing to, then the problem of use-after-free happens;
Fix the problem by check in_serv_queue == bfqd->in_service_queue, to
get the value of idle_timer_disabled if in_serve_queue is equel to
bfqd->in_service_queue. If the space of in_serv_queue pointing has
been released, this judge will aviod use-after-free problem.
And if in_serv_queue may be expired or finished, the idle_timer_disabled
will be false which would not give effects to bfq_update_dispatch_stats.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use bfq_group() instead, which do the same thing.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in
wbt_disable_default() when switch elevator to bfq. And when
we remove scsi device, wbt will be enabled by wbt_enable_default.
If it become false positive between wbt_wait() and wbt_track()
when submit write request.
The following is the scenario that triggered the problem.
T1 T2 T3
elevator_switch_mq
bfq_init_queue
wbt_disable_default <= Set
rwb->enable_state (OFF)
Submit_bio
blk_mq_make_request
rq_qos_throttle
<= rwb->enable_state (OFF)
scsi_remove_device
sd_remove
del_gendisk
blk_unregister_queue
elv_unregister_queue
wbt_enable_default
<= Set rwb->enable_state (ON)
q_qos_track
<= rwb->enable_state (ON)
^^^^^^ this request will mark WBT_TRACKED without inflight add and will
lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung.
Fix this by move wbt_enable_default() from elv_unregister to
bfq_exit_queue(). Only re-enable wbt when bfq exit.
Fixes: 76a8040817b4b ("blk-wbt: make sure throttle is enabled properly")
Remove oneline stale comment, and kill one oneshot local variable.
Signed-off-by: Ming Lei <ming.lei@rehdat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the ioc argument as it always points to current->io_context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
After the prepare side has been moved to the only I/O scheduler that
cares, do the same for the cleanup and the NULL initialization.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move blk_mq_sched_assign_ioc so that many interfaces from the file can
be marked static. Rename the function to ioc_find_get_icq as well and
return the icq to simplify the interface.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
No need to create a new I/O context if there is none present yet in
->limit_depth.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the unused bfqd argument, and hardcode ioc to current->io_context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe it's
current IO is not needed for current queue to make progress. Furthermore
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service is waker's queue
getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too few
service.
Consider for example the following job file:
[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200
[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200
Despite processes have very different IO priorities, they get the same
about of service. The reason is that bfq identifies these processes as
having waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.
Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.
Fixes: 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Waker - wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However the detection heuristic tends to
frequently give false positives when A and B are fighting IO bandwidth
and other processes aren't doing much IO as we are basically deemed to
eventually accumulate three occurences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, cancel the waker detection also if we didn't
accumulate three detected wakeups within given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.
This significantly reduces false waker detection for workload like:
[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync
[slowwriter]
numjobs=1
fsync=200
[fastwriter]
numjobs=1
fsync=200
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When cgroup IO scheduling is used with BFQ it does not really provide
service differentiation if the cgroup drives a big IO depth. That for
example happens with writeback which asynchronously submits lots of IO
but it can happen with AIO as well. The problem is that if we have two
cgroups that submit IO with different weights, the cgroup with higher
weight properly gets more IO time and is able to dispatch more IO.
However this causes lower weight cgroup to accumulate more requests
inside BFQ and eventually lower weight cgroup consumes most of IO
scheduler tags. At that point higher weight cgroup stops getting better
service as it is mostly blocked waiting for a scheduler tag while its
queues inside BFQ are empty and thus lower weight cgroup gets served.
Check how many requests submitting cgroup has allocated in
bfq_limit_depth() and if it consumes more requests than what would
correspond to its weight limit available depth to 1 so that the cgroup
cannot consume many more requests. With this limitation the higher
weight cgroup gets proper service even with writeback.
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.
Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Now that we use shared tags for shared sbitmap support, we don't require
the tags sbitmap pointers, so drop them.
This essentially reverts commit 222a5ae03cdd ("blk-mq: Use pointers for
blk_mq_tags bitmap tags").
Function blk_mq_init_bitmap_tags() is removed also, since it would be only
a wrappper for blk_mq_init_bitmaps().
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Except for the features passed to blk_queue_required_elevator_features,
elevator.h is only needed internally to the block layer. Move the
ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to
block/, dropping all the spurious includes outside of that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b.
We have had several folks complain that this causes hangs for them, which
is especially problematic as the commit has also hit stable already.
As no resolution seems to be forthcoming right now, revert the patch.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503
Fixes: 2d52c58b9c9b ("block, bfq: honor already-setup queue merges")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The function bfq_setup_merge prepares the merging between two
bfq_queues, say bfqq and new_bfqq. To this goal, it assigns
bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives,
the process that generated that I/O is disassociated from bfqq and
associated with new_bfqq (merging is actually a redirection). In this
respect, bfq_setup_merge increases new_bfqq->ref in advance, adding
the number of processes that are expected to be associated with
new_bfqq.
Unfortunately, the stable-merging mechanism interferes with this
setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and
before all the expected processes have been associated with
bfqq->new_bfqq, bfqq may happen to be stably merged with a different
queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq
gets changed. So, some of the processes that have been already
accounted for in the ref counter of the previous new_bfqq will not be
associated with that queue. This creates an unbalance, because those
references will never be decremented.
This commit fixes this issue by reestablishing the previous, natural
behaviour: once bfqq->new_bfqq has been set, it will not be changed
until all expected redirections have occurred.
Signed-off-by: Davide Zini <davidezini2@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the magic lookup through the kobject tree with an explicit
backpointer, given that the device model links are set up and torn
down at times when I/O is still possible, leading to potential
NULL or invalid pointer dereferences.
Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The default IO priority is the best effort (BE) class with the
normal priority level IOPRIO_NORM (4). However, get_task_ioprio()
returns IOPRIO_CLASS_NONE/IOPRIO_NORM as the default priority and
get_current_ioprio() returns IOPRIO_CLASS_NONE/0. Let's be consistent
with the defined default and have both of these functions return the
default priority IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM) when
the user did not define another default IO priority for the task.
In include/uapi/linux/ioprio.h, introduce the IOPRIO_BE_NORM macro as
an alias to IOPRIO_NORM to clarify that this default level applies to
the BE priotity class. In include/linux/ioprio.h, define the macro
IOPRIO_DEFAULT as IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM)
and use this new macro when setting a priority to the default.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-7-damien.lemoal@wdc.com
[axboe: drop unnecessary lightnvm change]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The BFQ scheduler and ioprio_check_cap() both assume that the RT
priority class (IOPRIO_CLASS_RT) can have up to 8 different priority
levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is
controlled using the IOPRIO_BE_NR macro , which is badly named as the
number of levels also applies to the RT class.
Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8,
to make things clear. Keep the old IOPRIO_BE_NR macro definition as an
alias for IOPRIO_NR_LEVELS.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For a request that has a priority level equal to or larger than
IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but
defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This
is not consistent with the warning and the allowed values for priority
levels. Fix this by setting the request new_ioprio field to
IOPRIO_BE_NR - 1, the lowest priority level allowed.
Cc: <stable@vger.kernel.org>
Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When merging one bio to request, if they are discard IO and the queue
supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE
because both block core and related drivers(nvme, virtio-blk) doesn't
handle mixed discard io merge(traditional IO merge together with
discard merge) well.
Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation,
so both blk-mq and drivers just need to handle multi-range discard.
Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Fixes: 2705dfb20947 ("block: fix discard request merge")
Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The backing device information only makes sense for file system I/O,
and thus belongs into the gendisk and not the lower level request_queue
structure. Move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Lockdep complains about lock inversion between ioc->lock and bfqd->lock:
bfqd -> ioc:
put_io_context+0x33/0x90 -> ioc->lock grabbed
blk_mq_free_request+0x51/0x140
blk_put_request+0xe/0x10
blk_attempt_req_merge+0x1d/0x30
elv_attempt_insert_merge+0x56/0xa0
blk_mq_sched_try_insert_merge+0x4b/0x60
bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
blk_mq_sched_insert_requests+0xd6/0x2b0
blk_mq_flush_plug_list+0x154/0x280
blk_finish_plug+0x40/0x60
ext4_writepages+0x696/0x1320
do_writepages+0x1c/0x80
__filemap_fdatawrite_range+0xd7/0x120
sync_file_range+0xac/0xf0
ioc->bfqd:
bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
exit_io_context+0x48/0x50
do_exit+0x7e9/0xdd0
do_group_exit+0x54/0xc0
To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
free the merged request but rather leave that upto the caller similarly
to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
to free all the merged requests after dropping bfqd->lock.
Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently, bfq does very little in bfq_requests_merged() and handles all
the request cleanup in bfq_finish_requeue_request() called from
blk_mq_free_request(). That is currently safe only because
blk_mq_free_request() is called shortly after bfq_requests_merged()
while bfqd->lock is still held. However to fix a lock inversion between
bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after
dropping bfqd->lock. That would mean that already merged request could
be seen by other processes inside bfq queues and possibly dispatched to
the device which is wrong. So move cleanup of the request from
bfq_finish_requeue_request() to bfq_requests_merged().
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 85686d0dc194 ("block, bfq: keep shared queues out of the waker
mechanism") leaves shared bfq_queues out of the waker-detection
mechanism. It attains this goal by not updating the pointer
last_completed_rq_bfqq, if the last request completed belongs to a
shared bfq_queue (so that the pointer will not point to the shared
bfq_queue).
Yet this has a side effect: the pointer last_completed_rq_bfqq keeps
pointing, deceptively, to a bfq_queue that actually is not the last
one to have had a request completed. As a consequence, such a
bfq_queue may deceptively be considered as a waker of some bfq_queue,
even of some shared bfq_queue.
To address this issue, reset last_completed_rq_bfqq if the last
request completed belongs to a shared queue.
Fixes: 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-8-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Consider two bfq_queues, say Q1 and Q2, with Q2 empty. If a request of
Q1 gets completed shortly before a new request arrives for Q2, then
BFQ flags Q1 as a candidate waker for Q2. Yet, the arrival of this new
request may have a different cause, in the following case. If also Q2
has requests in flight while waiting for the arrival of a new request,
then the completion of its own requests may be the actual cause of the
awakening of the process that sends I/O to Q2. So Q1 may be flagged
wrongly as a candidate waker.
This commit avoids this deceptive flagging, by disabling
candidate-waker flagging for Q2, if Q2 has in-flight I/O.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
queues"), BFQ may schedule a merge between a newly created sync
bfq_queue, say Q2, and the last sync bfq_queue created, say Q1. To this
goal, BFQ stores the address of Q1 in the field bic->stable_merge_bfqq
of the bic associated with Q2. So, when the time for the possible merge
arrives, BFQ knows which bfq_queue to merge Q2 with. In particular,
BFQ checks for possible merges on request arrivals.
Yet the same bic may also be associated with an async bfq_queue, say
Q3. So, if a request for Q3 arrives, then the above check may happen
to be executed while the bfq_queue at hand is Q3, instead of Q2. In
this case, Q1 happens to be merged with an async bfq_queue. This is
not only a conceptual mistake, because async queues are to be kept out
of queue merging, but also a bug that leads to inconsistent states.
This commits simply filters async queues out of delayed merges.
Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-6-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
One of the methods with which bfq boosts throughput is by merging queues.
One of the merging variants in bfq is the stable merge.
This mechanism is activated between two queues only if they are created
within a certain maximum time T1 from each other.
Merging can happen soon or be delayed. In the second case, before
merging, bfq needs to evaluate a throughput-boost parameter that
indicates whether the queue generates a high throughput is served alone.
Merging occurs when this throughput-boost is not high enough.
In particular, this parameter is evaluated and late merging may occur
only after at least a time T2 from the creation of the queue.
Currently T1 and T2 are set to 180ms and 200ms, respectively.
In this way the merging mechanism rarely occurs because time is not
enough. This results in a noticeable lowering of the overall throughput
with some workloads (see the example below).
This commit introduces two constants bfq_activation_stable_merging and
bfq_late_stable_merging in order to increase the duration of T1 and T2.
Both the stable merging activation time and the late merging
time are set to 600ms. This value has been experimentally evaluated
using sqlite benchmark in the Phoronix Test Suite on a HDD.
The duration of the benchmark before this fix was 111.02s, while now
it has reached 97.02s, a better result than that of all the other
schedulers.
Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-5-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created
queues"), BFQ may schedule a merge between a newly created sync
bfq_queue and the last sync bfq_queue created. Such a merging is not
performed immediately, because BFQ needs first to find out whether the
newly created queue actually reaches a higher throughput if not merged
at all (and in that case BFQ will not perform any stable merging). To
check that, a little time must be waited after the creation of the new
queue, so that some I/O can flow in the queue, and statistics on such
I/O can be computed.
Yet, to evaluate the above waiting time, the last split time is
considered as start time, instead of the creation time of the
queue. This is a mistake, because considering the split time is
correct only in the following scenario.
The queue undergoes a non-stable merges on the arrival of its very
first I/O request, due to close I/O with some other queue. While the
queue is merged for close I/O, stable merging is not considered. Yet
the queue may then happen to be split, if the close I/O finishes (or
happens to be a false positive). From this time on, the queue can
again be considered for stable merging. But, again, a little time must
elapse, to let some new I/O flow in the queue and to get updated
statistics. To wait for this time, the split time is to be taken into
account.
Yet, if the queue does not undergo a non-stable merge on the arrival
of its very first request, then BFQ immediately checks whether the
stable merge is to be performed. It happens because the split time for
a queue is initialized to minus infinity when the queue is created.
This commit fixes this mistake by adding the missing condition. Now
the check for delayed stable-merge is performed after a little time is
elapsed not only from the last queue split time, but also from the
creation time of the queue.
Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-4-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When attempting to schedule a merge of a given bfq_queue with the currently
in-service bfq_queue or with a cooperating bfq_queue among the scheduled
bfq_queues, delayed stable merge is checked for rotational or non-queueing
devs. For this stable merge to be performed, some conditions must be met.
If the current bfq_queue underwent some split from some merged bfq_queue,
one of these conditions is that two hundred milliseconds must elapse from
split, otherwise this condition is always met.
Unfortunately, by mistake, time_is_after_jiffies() was written instead of
time_is_before_jiffies() for this check, verifying that less than two
hundred milliseconds have elapsed instead of verifying that at least two
hundred milliseconds have elapsed.
Fix this issue by replacing time_is_after_jiffies() with
time_is_before_jiffies().
Signed-off-by: Luca Mariotti <mariottiluca1@hotmail.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
Link: https://lore.kernel.org/r/20210619140948.98712-3-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Merged bfq_queues are kept out of weight-raising (low-latency)
mechanisms. The reason is that these queues are usually created for
non-interactive and non-soft-real-time tasks. Yet this is not the case
for stably-merged queues. These queues are merged just because they
are created shortly after each other. So they may easily serve the I/O
of an interactive or soft-real time application, if the application
happens to spawn multiple processes.
To address this issue, this commits lets also stably-merged queued
enjoy weight raising.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210619140948.98712-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ may merge a new bfq_queue, stably, with the last bfq_queue
created. In particular, BFQ first waits a little bit for some I/O to
flow inside the new queue, say Q2, if this is needed to understand
whether it is better or worse to merge Q2 with the last queue created,
say Q1. This delayed stable merge is performed by assigning
bic->stable_merge_bfqq = Q1, for the bic associated with Q1.
Yet, while waiting for some I/O to flow in Q2, a non-stable queue
merge of Q2 with Q1 may happen, causing the bic previously associated
with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
be recycled as a non-shared bfq_queue. In that case, Q1 may then
happen to undergo a stable merge with the bfq_queue pointed by
bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
Q1. So Q1 would be merged with itself.
This commit fixes this error by intercepting this situation, and
canceling the schedule of the stable merge.
Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues")
Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
for the current CPU again and uses that to get the corresponding Kyber
context in the passed hctx. However, the thread may be preempted between
the two calls to blk_mq_get_ctx(), and the ctx returned the second time
may no longer correspond to the passed hctx. This "works" accidentally
most of the time, but it can cause us to read garbage if the second ctx
came from an hctx with more ctx's than the first one (i.e., if
ctx->index_hw[hctx->type] > hctx->nr_ctx).
This manifested as this UBSAN array index out of bounds error reported
by Jakub:
UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
index 13106 is out of range for type 'long unsigned int [128]'
Call Trace:
dump_stack+0xa4/0xe5
ubsan_epilogue+0x5/0x40
__ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
queued_spin_lock_slowpath+0x476/0x480
do_raw_spin_lock+0x1c2/0x1d0
kyber_bio_merge+0x112/0x180
blk_mq_submit_bio+0x1f5/0x1100
submit_bio_noacct+0x7b0/0x870
submit_bio+0xc2/0x3a0
btrfs_map_bio+0x4f0/0x9d0
btrfs_submit_data_bio+0x24e/0x310
submit_one_bio+0x7f/0xb0
submit_extent_page+0xc4/0x440
__extent_writepage_io+0x2b8/0x5e0
__extent_writepage+0x28d/0x6e0
extent_write_cache_pages+0x4d7/0x7a0
extent_writepages+0xa2/0x110
do_writepages+0x8f/0x180
__writeback_single_inode+0x99/0x7f0
writeback_sb_inodes+0x34e/0x790
__writeback_inodes_wb+0x9e/0x120
wb_writeback+0x4d2/0x660
wb_workfn+0x64d/0xa10
process_one_work+0x53a/0xa80
worker_thread+0x69/0x5b0
kthread+0x20b/0x240
ret_from_fork+0x1f/0x30
Only Kyber uses the hctx, so fix it by passing the request_queue to
->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
map the queues itself to avoid the mismatch.
Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit 01e99aeca39796003 'blk-mq: insert passthrough request into
hctx->dispatch directly', passthrough request should not appear in
IO-scheduler any more, so blk_rq_is_passthrough checking in addon IO
schedulers is redundant.
(Notes: this patch passes generic IO load test with hdds under SAS
controller and hdds under AHCI controller but obviously not covers all.
Not sure if passthrough request can still escape into IO scheduler from
blk_mq_sched_insert_requests, which is used by blk_mq_flush_plug_list and
has lots of indirect callers.)
Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Many throughput-sensitive workloads are made of several parallel I/O
flows, with all flows generated by the same application, or more
generically by the same task (e.g., system boot). The most
counterproductive action with these workloads is plugging I/O dispatch
when one of the bfq_queues associated with these flows remains
temporarily empty.
To avoid this plugging, BFQ has been using a burst-handling mechanism
for years now. This mechanism has proven effective for throughput, and
not detrimental for service guarantees. This commit pushes this
mechanism a little bit further, basing on the following two facts.
First, all the I/O flows of a the same application or task contribute
to the execution/completion of that common application or task. So the
performance figures that matter are total throughput of the flows and
task-wide I/O latency. In particular, these flows do not need to be
protected from each other, in terms of individual bandwidth or
latency.
Second, the above fact holds regardless of the number of flows.
Putting these two facts together, this commits merges stably the
bfq_queues associated with these I/O flows, i.e., with the processes
that generate these IO/ flows, regardless of how many the involved
processes are.
To decide whether a set of bfq_queues is actually associated with the
I/O flows of a common application or task, and to merge these queues
stably, this commit operates as follows: given a bfq_queue, say Q2,
currently being created, and the last bfq_queue, say Q1, created
before Q2, Q2 is merged stably with Q1 if
- very little time has elapsed since when Q1 was created
- Q2 has the same ioprio as Q1
- Q2 belongs to the same group as Q1
Merging bfq_queues also reduces scheduling overhead. A fio test with
ten random readers on /dev/nullb shows a throughput boost of 40%, with
a quadcore. Since BFQ's execution time amounts to ~50% of the total
per-request processing time, the above throughput boost implies that
BFQ's overhead is reduced by more than 50%.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Shared queues are likely to receive I/O at a high rate. This may
deceptively let them be considered as wakers of other queues. But a
false waker will unjustly steal bandwidth to its supposedly woken
queue. So considering also shared queues in the waking mechanism may
cause more control troubles than throughput benefits. This commit
keeps shared queues out of the waker-detection mechanism.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When the io_latency heuristic is off, bfq_queues must not start to be
weight-raised. Unfortunately, by mistake, this may happen when the
state of a previously weight-raised bfq_queue is resumed after a queue
split. This commit fixes this error.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-5-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Consider a bfq_queue bfqq that is about to be merged with another
bfq_queue new_bfqq. The processes associated with bfqq are cooperators
of the processes associated with new_bfqq. So, if bfqq has a waker,
then it is reasonable (and beneficial for throughput) to assume that
all these processes will be happy to let bfqq's waker freely inject
I/O when they have no I/O. So this commit makes new_bfqq inherit
bfqq's waker.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-4-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Consider a new I/O request that arrives for a bfq_queue bfqq. If, when
this happens, the only active bfq_queues are bfqq and either its waker
bfq_queue or one of its woken bfq_queues, then there is no point in
queueing this new I/O request in bfqq for service. In fact, the
in-service queue and bfqq agree on serving this new I/O request as
soon as possible. So this commit puts this new I/O request directly
into the dispatch list.
Tested-by: Jan Kara <jack@suse.cz>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-3-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Suppose that I/O dispatch is plugged, to wait for new I/O for the
in-service bfq-queue, say bfqq. Suppose then that there is a further
bfq_queue woken by bfqq, and that this woken queue has pending I/O. A
woken queue does not steal bandwidth from bfqq, because it remains
soon without I/O if bfqq is not served. So there is virtually no risk
of loss of bandwidth for bfqq if this woken queue has I/O dispatched
while bfqq is waiting for new I/O. In contrast, this extra I/O
injection boosts throughput. This commit performs this extra
injection.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Correct the comments since bfq_fifo_expire[0] is for async request,
while bfq_fifo_expire[1] is for sync request.
Also update docs, according the source code, the default
fifo_expire_async is 250ms, and fifo_expire_sync is 125ms.
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Get rid of the wrapper for trace_block_rq_insert() and call the function
directly.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull core block updates from Jens Axboe:
"Another nice round of removing more code than what is added, mostly
due to Christoph's relentless pursuit of tech debt removal/cleanups.
This pull request contains:
- Two series of BFQ improvements (Paolo, Jan, Jia)
- Block iov_iter improvements (Pavel)
- bsg error path fix (Pan)
- blk-mq scheduler improvements (Jan)
- -EBUSY discard fix (Jan)
- bvec allocation improvements (Ming, Christoph)
- bio allocation and init improvements (Christoph)
- Store bdev pointer in bio instead of gendisk + partno (Christoph)
- Block trace point cleanups (Christoph)
- hard read-only vs read-only split (Christoph)
- Block based swap cleanups (Christoph)
- Zoned write granularity support (Damien)
- Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"
* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
mm: simplify swapdev_block
sd_zbc: clear zone resources for non-zoned case
block: introduce blk_queue_clear_zone_settings()
zonefs: use zone write granularity as block size
block: introduce zone_write_granularity limit
block: use blk_queue_set_zoned in add_partition()
nullb: use blk_queue_set_zoned() to setup zoned devices
nvme: cleanup zone information initialization
block: document zone_append_max_bytes attribute
block: use bi_max_vecs to find the bvec pool
md/raid10: remove dead code in reshape_request
block: mark the bio as cloned in bio_iov_bvec_set
block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
block: remove a layer of indentation in bio_iov_iter_get_pages
block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
block: remove the 1 and 4 vec bvec_slabs entries
block: streamline bvec_alloc
block: factor out a bvec_alloc_gfp helper
block: move struct biovec_slab to bio.c
block: reuse BIO_INLINE_VECS for integrity bvecs
...
|
|
This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a.
bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core
sbitmap_get_shallow, which uses just the number to limit the scan depth of
each bitmap word, formula:
scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100%
That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct.
But after commit patch 'bfq: Fix computation of shallow depth', we use
sbitmap.depth instead, as a example in following case:
sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64.
The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and
three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit
nothing.
Signed-off-by: Lin Feng <linf@wangsu.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently whenever bfq queue has a request queued we add now -
last_completion_time to the think time statistics. This is however
misleading in case the process is able to submit several requests in
parallel because e.g. if the queue has request completed at time T0 and
then queues new requests at times T1, T2, then we will add T1-T0 and
T2-T0 to think time statistics which just doesn't make any sence (the
queue's think time is penalized by the queue being able to submit more
IO). So add to think time statistics only time intervals when the queue
had no IO pending.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
[axboe: fix whitespace on empty line]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use local variable 'ttime' instead of dereferencing bfqq.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it
makes sense to merge current bfq queue with the in-service queue.
However if the in-service queue is freshly scheduled and didn't dispatch
any requests yet, bfqd->in_serv_last_pos is stale and contains value
from the previously scheduled bfq queue which can thus result in a bogus
decision that the two queues should be merged. This bug can be observed
for example with the following fio jobfile:
[global]
direct=0
ioengine=sync
invalidate=1
size=1g
rw=read
[reader]
numjobs=4
directory=/mnt
where the 4 processes will end up in the one shared bfq queue although
they do IO to physically very distant files (for some reason I was able to
observe this only with slice_idle=1ms setting).
Fix the problem by invalidating bfqd->in_serv_last_pos when switching
in-service queue.
Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It's only used in the same file, mark is appropriately static.
Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In the presence of many parallel I/O flows, the detection of waker
bfq_queues suffers from false positives. This commits addresses this
issue by making the filtering of actual wakers more selective. In more
detail, a candidate waker must be found to meet waker requirements
three times before being promoted to actual waker.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To prevent injection information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
To prevent weight-raising information from being lost on bfq_queue merging,
also the amount of service that a bfq_queue receives must be saved and
restored when the bfq_queue is merged and split, respectively.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A bfq_queue may happen to be deemed as soft real-time while it is
still enjoying interactive weight-raising. If this happens because of
a false positive, then the bfq_queue is likely to loose its soft
real-time status soon. Upon losing such a status, the bfq_queue must
get back its interactive weight-raising, if its interactive period is
not over yet. But this case is not handled. This commit corrects this
error.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Upon an I/O-dispatch attempt, BFQ may detect that it was better to
plug I/O dispatch, and to wait for a new request to arrive for the
currently in-service queue. But the arrival of a new request for an
empty bfq_queue, and thus the switch from idle to busy of the
bfq_queue, may cause the scenario to change, and make plugging no
longer needed for service guarantees, or more convenient for
throughput. In this case, keeping I/O-dispatch plugged would certainly
lower throughput.
To address this issue, this commit makes such a check, and stops
plugging I/O if it is better to stop plugging I/O.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Some BFQ mechanisms make their decisions on a bfq_queue basing also on
whether the bfq_queue is I/O bound. In this respect, the current logic
for evaluating whether a bfq_queue is I/O bound is rather rough. This
commits replaces this logic with a more effective one.
The new logic measures the percentage of time during which a bfq_queue
is active, and marks the bfq_queue as I/O bound if the latter if this
percentage is above a fixed threshold.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95.
Since both mq-deadline and BFQ completely ignore hctx they are passed to
their dispatch function and dispatch whatever request they deem fit
checking whether any request for a particular hctx is queued is just
pointless since we'll very likely get a request from a different hctx
anyway. In the following commit we'll deal with lock contention in these
IO schedulers in presence of multiple HW queues in a different way.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This commits preserves I/O-dispatch plugging for a special symmetric
case that may suddenly turn into asymmetric: the case where only one
bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not
cause any harm to any other queues in terms of service guarantees. In
contrast, it avoids the following unlucky sequence of events: (1) bfqq
is expired, (2) a new queue with a lower weight than bfqq becomes busy
(or more queues), (3) the new queue is served until a new request
arrives for bfqq, (4) when bfqq is finally served, there are so many
requests of the new queue in the drive that the pending requests for
bfqq take a lot of time to be served. In particular, event (2) may
case even already dispatched requests of bfqq to be delayed, inside
the drive. So, to avoid this series of events, the scenario is
preventively declared as asymmetric also if bfqq is the only busy
queues. By doing so, I/O-dispatch plugging is performed for bfqq.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ tags some bfq_queues as interactive or soft_rt if it deems that
these bfq_queues contain the I/O of, respectively, interactive or soft
real-time applications. BFQ privileges both these special types of
bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ
mainly raises the weight of the bfq_queue. In particular, soft_rt
bfq_queues get a higher weight than interactive bfq_queues.
A bfq_queue may turn from interactive to soft_rt. And this leads to a
tricky issue. Soft real-time applications usually start with an
I/O-bound, interactive phase, in which they load themselves into main
memory. BFQ correctly detects this phase, and keeps the bfq_queues
associated with the application in interactive mode for a
while. Problems arise when the I/O pattern of the application finally
switches to soft real-time. One of the conditions for a bfq_queue to
be deemed as soft_rt is that the bfq_queue does not consume too much
bandwidth. But the bfq_queues associated with a soft real-time
application consume as much bandwidth as they can in the loading phase
of the application. So, after the application becomes truly soft
real-time, a lot of time should pass before the average bandwidth
consumed by its bfq_queues finally drops to a value acceptable for
soft_rt bfq_queues. As a consequence, there might be a time gap during
which the application is not privileged at all, because its bfq_queues
are not interactive any longer, but cannot be deemed as soft_rt yet.
To avoid this problem, BFQ pretends that an interactive bfq_queue
consumes zero bandwidth, and allows an interactive bfq_queue to switch
to soft_rt. Yet, this fake zero-bandwidth consumption easily causes
the bfq_queue to often switch to soft_rt deceptively, during its
loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth
correctly computed, and therefore soon switches back to
interactive. Then it switches again to soft_rt, and so on. These
spurious fluctuations usually cause losses of throughput, because they
deceive BFQ's mechanisms for boosting throughput (injection,
I/O-plugging avoidance, ...).
This commit addresses this issue as follows:
1) It does compute actual bandwidth consumption also for interactive
bfq_queues. This avoids the above false positives.
2) When a bfq_queue switches from interactive to normal mode, the
consumed bandwidth is reset (forgotten). This allows the
bfq_queue to enjoy soft_rt very quickly. In particular, two
alternatives are possible in this switch:
- the bfq_queue still has backlog, and therefore there is a budget
already scheduled to serve the bfq_queue; in this case, the
scheduling of the current budget of the bfq_queue is not
hindered, because only the scheduling of the next budget will
be affected by the weight drop. After that, if the bfq_queue is
actually in a soft_rt phase, and becomes empty during the
service of its current budget, which is the natural behavior of
a soft_rt bfq_queue, then the bfq_queue will be considered as
soft_rt when its next I/O arrives. If, in contrast, the
bfq_queue remains constantly non-empty, then its next budget
will be scheduled with a low weight, which is the natural
treatment for an I/O-bound (non soft_rt) bfq_queue.
- the bfq_queue is empty; in this case, the bfq_queue may be
considered unjustly soft_rt when its new I/O arrives. Yet
the problem is now much smaller than before, because it is
unlikely that more than one spurious fluctuation occurs.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ heuristics try to detect interactive I/O, and raise the weight of
the queues containing such an I/O. Yet, if also the user changes the
weight of a queue (i.e., the user changes the ioprio of the process
associated with that queue), then it is most likely better to prevent
BFQ heuristics from silently changing the same weight.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Tests on slower machines showed current window to be way too
small. This commit increases it.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit c5089591c3ba ("block, bfq: detect wakers and
unconditionally inject their I/O"), when the in-service bfq_queue, say
Q, is temporarily empty, BFQ checks whether there are I/O requests to
inject (also) from the waker bfq_queue for Q. To this goal, the value
pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the
current implementation mistakenly looks at bfqq->next_rq, which
instead points to the next request of the currently served queue.
This mistake evidently causes losses of throughput in scenarios with
waker bfq_queues.
This commit corrects this mistake.
Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O")
Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The value of the I/O plugging (idling) timeout is used also as the
think-time threshold to decide whether a process has a short think
time. In this respect, a good value of this timeout for rotational
drives is un the order of several ms. Yet, this is often too long a
time interval to be effective as a think-time threshold. This commit
mitigates this problem (by a lot, according to tests), by halving the
threshold.
Tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ computes number of tags it allows to be allocated for each request type
based on tag bitmap. However it uses 1 << bitmap.shift as number of
available tags which is wrong. 'shift' is just an internal bitmap value
containing logarithm of how many bits bitmap uses in each bitmap word.
Thus number of tags allowed for some request types can be far to low.
Use proper bitmap.depth which has the number of tags instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block updates from Jens Axboe:
- Series of merge handling cleanups (Baolin, Christoph)
- Series of blk-throttle fixes and cleanups (Baolin)
- Series cleaning up BDI, seperating the block device from the
backing_dev_info (Christoph)
- Removal of bdget() as a generic API (Christoph)
- Removal of blkdev_get() as a generic API (Christoph)
- Cleanup of is-partition checks (Christoph)
- Series reworking disk revalidation (Christoph)
- Series cleaning up bio flags (Christoph)
- bio crypt fixes (Eric)
- IO stats inflight tweak (Gabriel)
- blk-mq tags fixes (Hannes)
- Buffer invalidation fixes (Jan)
- Allow soft limits for zone append (Johannes)
- Shared tag set improvements (John, Kashyap)
- Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)
- DM no-wait support (Mike, Konstantin)
- Request allocation improvements (Ming)
- Allow md/dm/bcache to use IO stat helpers (Song)
- Series improving blk-iocost (Tejun)
- Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
Xianting, Yang, Yufen, yangerkun)
* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
block: fix uapi blkzoned.h comments
blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
blk-mq: get rid of the dead flush handle code path
block: get rid of unnecessary local variable
block: fix comment and add lockdep assert
blk-mq: use helper function to test hw stopped
block: use helper function to test queue register
block: remove redundant mq check
block: invoke blk_mq_exit_sched no matter whether have .exit_sched
percpu_ref: don't refer to ref->data if it isn't allocated
block: ratelimit handle_bad_sector() message
blk-throttle: Re-use the throtl_set_slice_end()
blk-throttle: Open code __throtl_de/enqueue_tg()
blk-throttle: Move service tree validation out of the throtl_rb_first()
blk-throttle: Move the list operation after list validation
blk-throttle: Fix IO hang for a corner case
blk-throttle: Avoid tracking latency if low limit is invalid
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
block: Remove redundant 'return' statement
...
|
|
Pull block fixes from Jens Axboe:
- Fix a regression in bdev partition locking (Christoph)
- NVMe pull request from Christoph:
- cancel async events before freeing them (David Milburn)
- revert a broken race fix (James Smart)
- fix command processing during resets (Sagi Grimberg)
- Fix a kyber crash with requeued flushes (Omar)
- Fix __bio_try_merge_page() same_page error for no merging (Ritesh)
* tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block:
block: Set same_page to false in __bio_try_merge_page if ret is false
nvme-fabrics: allow to queue requests for live queues
block: only call sched requeue_request() for scheduled requests
nvme-tcp: cancel async events before freeing event struct
nvme-rdma: cancel async events before freeing event struct
nvme-fc: cancel async events before freeing event struct
nvme: Revert: Fix controller creation races with teardown flow
block: restore a specific error code in bdev_del_partition
|
|
Yang Yang reported the following crash caused by requeueing a flush
request in Kyber:
[ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00
...
[ 2.517468] pc : clear_bit+0x18/0x2c
[ 2.517502] lr : sbitmap_queue_clear+0x40/0x228
[ 2.517503] sp : ffffff800832bc60 pstate : 00c00145
...
[ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000)
[ 2.517602] Call trace:
[ 2.517606] clear_bit+0x18/0x2c
[ 2.517619] kyber_finish_request+0x74/0x80
[ 2.517627] blk_mq_requeue_request+0x3c/0xc0
[ 2.517637] __scsi_queue_insert+0x11c/0x148
[ 2.517640] scsi_softirq_done+0x114/0x130
[ 2.517643] blk_done_softirq+0x7c/0xb0
[ 2.517651] __do_softirq+0x208/0x3bc
[ 2.517657] run_ksoftirqd+0x34/0x60
[ 2.517663] smpboot_thread_fn+0x1c4/0x2c0
[ 2.517667] kthread+0x110/0x120
[ 2.517669] ret_from_fork+0x10/0x18
This happens because Kyber doesn't track flush requests, so
kyber_finish_request() reads a garbage domain token. Only call the
scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for
the finish_request() hook in blk_mq_free_request()). Now that we're
handling it in blk-mq, also remove the check from BFQ.
Reported-by: Yang Yang <yang.yang@vivo.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible for mq-deadline and bfq IO schedulers
when nr_hw_queues is more than one.
It is because kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.
The elevator callback .has_work for mq-deadline and bfq scheduler considers
pending work if there are any IOs on request queue but it does not account
hctx context.
Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering
the elevator even though there are no requests queued.
[jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap]
Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.
Signed-off-by: John Garry <john.garry@huawei.com>
Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used
Tested-by: Douglas Gilbert <dgilbert@interlog.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
|
|
Change "at at" to "at a".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
None of the I/O schedulers actually needs it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the common interface bdi_dev_name() to get device name.
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Add missing <linux/backing-dev.h> include BFQ
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
goal of this put is to release a process reference to a bfq_queue. But
process-reference releases may trigger also some extra operation, and,
to this goal, are handled through bfq_release_process_ref(). So, turn
the invocation of bfq_put_queue() into an invocation of
bfq_release_process_ref().
Tested-by: cki-project@redhat.com
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
In bfq_idle_slice_timer func, bfqq = bfqd->in_service_queue is
not in bfqd-lock critical section. The bfqq, which is not
equal to NULL in bfq_idle_slice_timer, may be freed after passing
to bfq_idle_slice_timer_body. So we will access the freed memory.
In addition, considering the bfqq may be in race, we should
firstly check whether bfqq is in service before doing something
on it in bfq_idle_slice_timer_body func. If the bfqq in race is
not in service, it means the bfqq has been expired through
__bfq_bfqq_expire func, and wait_request flags has been cleared in
__bfq_bfqd_reset_in_service func. So we do not need to re-clear the
wait_request of bfqq which is not in service.
KASAN log is given as follows:
[13058.354613] ==================================================================
[13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
[13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
[13058.354646]
[13058.354655] CPU: 96 PID: 19767 Comm: fork13
[13058.354661] Call trace:
[13058.354667] dump_backtrace+0x0/0x310
[13058.354672] show_stack+0x28/0x38
[13058.354681] dump_stack+0xd8/0x108
[13058.354687] print_address_description+0x68/0x2d0
[13058.354690] kasan_report+0x124/0x2e0
[13058.354697] __asan_load8+0x88/0xb0
[13058.354702] bfq_idle_slice_timer+0xac/0x290
[13058.354707] __hrtimer_run_queues+0x298/0x8b8
[13058.354710] hrtimer_interrupt+0x1b8/0x678
[13058.354716] arch_timer_handler_phys+0x4c/0x78
[13058.354722] handle_percpu_devid_irq+0xf0/0x558
[13058.354731] generic_handle_irq+0x50/0x70
[13058.354735] __handle_domain_irq+0x94/0x110
[13058.354739] gic_handle_irq+0x8c/0x1b0
[13058.354742] el1_irq+0xb8/0x140
[13058.354748] do_wp_page+0x260/0xe28
[13058.354752] __handle_mm_fault+0x8ec/0x9b0
[13058.354756] handle_mm_fault+0x280/0x460
[13058.354762] do_page_fault+0x3ec/0x890
[13058.354765] do_mem_abort+0xc0/0x1b0
[13058.354768] el0_da+0x24/0x28
[13058.354770]
[13058.354773] Allocated by task 19731:
[13058.354780] kasan_kmalloc+0xe0/0x190
[13058.354784] kasan_slab_alloc+0x14/0x20
[13058.354788] kmem_cache_alloc_node+0x130/0x440
[13058.354793] bfq_get_queue+0x138/0x858
[13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
[13058.354801] bfq_init_rq+0x1f4/0x1180
[13058.354806] bfq_insert_requests+0x264/0x1c98
[13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
[13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
[13058.354826] blk_flush_plug_list+0x230/0x548
[13058.354830] blk_finish_plug+0x60/0x80
[13058.354838] read_pages+0xec/0x2c0
[13058.354842] __do_page_cache_readahead+0x374/0x438
[13058.354846] ondemand_readahead+0x24c/0x6b0
[13058.354851] page_cache_sync_readahead+0x17c/0x2f8
[13058.354858] generic_file_buffered_read+0x588/0xc58
[13058.354862] generic_file_read_iter+0x1b4/0x278
[13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
[13058.354972] __vfs_read+0x238/0x320
[13058.354976] vfs_read+0xbc/0x1c0
[13058.354980] ksys_read+0xdc/0x1b8
[13058.354984] __arm64_sys_read+0x50/0x60
[13058.354990] el0_svc_common+0xb4/0x1d8
[13058.354994] el0_svc_handler+0x50/0xa8
[13058.354998] el0_svc+0x8/0xc
[13058.354999]
[13058.355001] Freed by task 19731:
[13058.355007] __kasan_slab_free+0x120/0x228
[13058.355010] kasan_slab_free+0x10/0x18
[13058.355014] kmem_cache_free+0x288/0x3f0
[13058.355018] bfq_put_queue+0x134/0x208
[13058.355022] bfq_exit_icq_bfqq+0x164/0x348
[13058.355026] bfq_exit_icq+0x28/0x40
[13058.355030] ioc_exit_icq+0xa0/0x150
[13058.355035] put_io_context_active+0x250/0x438
[13058.355038] exit_io_context+0xd0/0x138
[13058.355045] do_exit+0x734/0xc58
[13058.355050] do_group_exit+0x78/0x220
[13058.355054] __wake_up_parent+0x0/0x50
[13058.355058] el0_svc_common+0xb4/0x1d8
[13058.355062] el0_svc_handler+0x50/0xa8
[13058.355066] el0_svc+0x8/0xc
[13058.355067]
[13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464
[13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
[13058.355077] The buggy address belongs to the page:
[13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
[13058.366175] flags: 0x2ffffe0000008100(slab|head)
[13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
[13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
[13058.370789] page dumped because: kasan: bad access detected
[13058.370791]
[13058.370792] Memory state around the buggy address:
[13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
[13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370808] ^
[13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[13058.370817] ==================================================================
[13058.370820] Disabling lock debugging due to kernel taint
Here, we directly pass the bfqd to bfq_idle_slice_timer_body func.
--
V2->V3: rewrite the comment as suggested by Paolo Valente
V1->V2: add one comment, and add Fixes and Reported-by tag.
Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reported-by: Wang Wang <wangwang2@huawei.com>
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Signed-off-by: Feilong Lin <linfeilong@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The exact, general goal of the function bfq_split_bfqq() is not that
apparent. Add a comment to make it clear.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ifdefs around gets and puts of bfq groups reduce readability, remove them.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The flag on_st in the bfq_entity data structure is true if the entity
is on a service tree or is in service. Yet the name of the field,
confusingly, does not mention the second, very important case. Extend
the name to mention the second case too.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ maintains an ordered list, implemented with an RB tree, of
head-request positions of non-empty bfq_queues. This position tree,
inherited from CFQ, is used to find bfq_queues that contain I/O close
to each other. BFQ merges these bfq_queues into a single shared queue,
if this boosts throughput on the device at hand.
There is however a special-purpose bfq_queue that does not participate
in queue merging, the oom bfq_queue. Yet, also this bfq_queue could be
wrongly added to the position tree. So bfqq_find_close() could return
the oom bfq_queue, which is a source of further troubles in an
out-of-memory situation. This commit prevents the oom bfq_queue from
being inserted into the position tree.
Tested-by: Patrick Dung <patdung100@gmail.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not
referred by any process") fixed commit 3726112ec731 ("block, bfq:
re-schedule empty queues if they deserve I/O plugging") by
descheduling an empty bfq_queue when it remains with not process
reference. Yet, this still left a case uncovered: an empty bfq_queue
with not process reference that remains in service. This happens for
an in-service sync bfq_queue that is deemed to deserve I/O-dispatch
plugging when it remains empty. Yet no new requests will arrive for
such a bfq_queue if no process sends requests to it any longer. Even
worse, the bfq_queue may happen to be prematurely freed while still in
service (because there may remain no reference to it any longer).
This commit solves this problem by preventing I/O dispatch from being
plugged for the in-service bfq_queue, if the latter has no process
reference (the bfq_queue is then prevented from remaining in service).
Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Reported-by: Patrick Dung <patdung100@gmail.com>
Tested-by: Patrick Dung <patdung100@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This macro is never used after introduced from commit aee69d78dec0
("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Better to remove it.
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull core block updates from Jens Axboe:
"Due to more granular branches, this one is small and will be followed
with other core branches that add specific features. I meant to just
have a core and drivers branch, but external dependencies we ended up
adding a few more that are also core.
The changes are:
- Fixes and improvements for the zoned device support (Ajay, Damien)
- sed-opal table writing and datastore UID (Revanth)
- blk-cgroup (and bfq) blk-cgroup stat fixes (Tejun)
- Improvements to the block stats tracking (Pavel)
- Fix for overruning sysfs buffer for large number of CPUs (Ming)
- Optimization for small IO (Ming, Christoph)
- Fix typo in RWH lifetime hint (Eugene)
- Dead code removal and documentation (Bart)
- Reduction in memory usage for queue and tag set (Bart)
- Kerneldoc header documentation (André)
- Device/partition revalidation fixes (Jan)
- Stats tracking for flush requests (Konstantin)
- Various other little fixes here and there (et al)"
* tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits)
Revert "block: split bio if the only bvec's length is > SZ_4K"
block: add iostat counters for flush requests
block,bfq: Skip tracing hooks if possible
block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64'
blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1
block: Don't disable interrupts in trigger_softirq()
sbitmap: Delete sbitmap_any_bit_clear()
blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()
block: split bio if the only bvec's length is > SZ_4K
block: still try to split bio if the bvec crosses pages
blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT
blk-cgroup: reimplement basic IO stats using cgroup rstat
blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive()
blk-throtl: stop using blkg->stat_bytes and ->stat_ios
bfq-iosched: stop using blkg->stat_bytes and ->stat_ios
bfq-iosched: relocate bfqg_*rwstat*() helpers
block: add zone open, close and finish ioctl support
block: add zone open, close and finish operations
block: Simplify REQ_OP_ZONE_RESET_ALL handling
block: Remove REQ_OP_ZONE_RESET plugging
...
|
|
Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if
they deserve I/O plugging"), to prevent the service guarantees of a
bfq_queue from being violated, the bfq_queue may be left busy, i.e.,
scheduled for service, even if empty (see comments in
__bfq_bfqq_expire() for details). But, if no process will send
requests to the bfq_queue any longer, then there is no point in
keeping the bfq_queue scheduled for service.
In addition, keeping the bfq_queue scheduled for service, but with no
process reference any longer, may cause the bfq_queue to be freed when
descheduled from service. But this is assumed to never happen, and
causes a UAF if it happens. This, in turn, caused crashes [1, 2].
This commit fixes this issue by descheduling an empty bfq_queue when
it remains with not process reference.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539
[2] https://bugzilla.kernel.org/show_bug.cgi?id=205447
Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
Reported-by: Chris Evich <cevich@redhat.com>
Reported-by: Patrick Dung <patdung100@gmail.com>
Reported-by: Thorsten Schubert <tschubert@bafh.org>
Tested-by: Thorsten Schubert <tschubert@bafh.org>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios
from blk-cgroup core to populate six stat knobs. blk-cgroup core is
moving away from blkg_rwstat to improve scalability and won't be able
to support this usage.
It isn't like the sharing gains all that much. Let's break it out to
dedicated rwstat counters which are updated when on cgroup1. This
makes use of bfqg_*rwstat*() helpers outside of
CONFIG_BFQ_CGROUP_DEBUG. Move them out.
v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If equal to 0, the injection limit for a bfq_queue is pushed to 1
after a first sample of the total service time of the I/O requests of
the queue is computed (to allow injection to start). Yet, because of a
mistake in the branch that performs this action, the push may happen
also in some other case. This commit fixes this issue.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The update period of the injection limit has been tentatively set to
100 ms, to reduce fluctuations. This value however proved to cause,
occasionally, the limit to be decremented for some bfq_queue only
after the queue underwent excessive injection for a lot of time. This
commit reduces the period to 10 ms.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Upon an increment attempt of the injection limit, the latter is
constrained not to become higher than twice the maximum number
max_rq_in_driver of I/O requests that have happened to be in service
in the drive. This high bound allows the injection limit to grow
beyond max_rq_in_driver, which may then cause max_rq_in_driver itself
to grow.
However, since the limit is incremented by only one unit at a time,
there is no need for such a high bound, and just max_rq_in_driver+1 is
enough.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ updates the injection limit of each bfq_queue as a function of how
much the limit inflates the service times experienced by the I/O
requests of the queue. So only service times affected by injection
must be taken into account. Unfortunately, in the current
implementation of this update scheme, the service time of an I/O
request rq not affected by injection may happen to be considered in
the following case: there is no I/O request in service when rq
arrives.
This commit fixes this issue by making sure that only service times
affected by injection are considered for updating the injection
limit. In particular, the service time of an I/O request rq is now
considered only if at least one of the following two conditions holds:
- the destination bfq_queue for rq underwent injection before rq
arrival, and there is still I/O in service in the drive on rq arrival
(the service of such unfinished I/O may delay the service of rq);
- injection occurs between the arrival and the completion time of rq.
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
As reported in [1], the call bfq_init_rq(rq) may return NULL in case
of OOM (in particular, if rq->elv.icq is NULL because memory
allocation failed in failed in ioc_create_icq()).
This commit handles this circumstance.
[1] https://lkml.org/lkml/2019/7/22/824
Cc: Hsin-Yi Wang <hsinyi@google.com>
Cc: Nicolas Boichat <drinkcat@chromium.org>
Cc: Doug Anderson <dianders@chromium.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Reported-by: Hsin-Yi Wang <hsinyi@google.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
unconditionally inject their I/O"), every bfq_queue has a pointer to a
waker bfq_queue and a list of the bfq_queues it may wake. In this
respect, when a bfq_queue, say Q, remains with no I/O source attached
to it, Q cannot be woken by any other bfq_queue, and cannot wake any
other bfq_queue. Then Q must be removed from the woken list of its
possible waker bfq_queue, and all bfq_queues in the woken list of Q
must stop having a waker bfq_queue.
Q remains with no I/O source in two cases: when the last process
associated with Q exits or when such a process gets associated with a
different bfq_queue. Unfortunately, commit 13a857a4c4e8 ("block, bfq:
detect wakers and unconditionally inject their I/O") performed the
above updates only in the first case.
This commit fixes this bug by moving these updates to when Q gets
freed. This is a simple and safe way to handle all cases, as both the
above events, process exit and re-association, lead to Q being freed
soon, and because dangling references would come out only after Q gets
freed (if no update were performed).
Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
Reported-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Since commit 13a857a4c4e8 ("block, bfq: detect wakers and
unconditionally inject their I/O"), BFQ stores, in a per-device
pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O
request completed. If some bfq_queue receives new I/O right after the
last request of last_completed_rq_bfqq has been completed, then
last_completed_rq_bfqq may be a waker bfq_queue.
But if the bfq_queue last_completed_rq_bfqq points to is freed, then
last_completed_rq_bfqq becomes a dangling reference. This commit
resets last_completed_rq_bfqq if the pointed bfq_queue is freed.
Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O")
Reported-by: Douglas Anderson <dianders@chromium.org>
Tested-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block fixes from Jens Axboe:
- Several io_uring fixes/improvements:
- Blocking fix for O_DIRECT (me)
- Latter page slowness for registered buffers (me)
- Fix poll hang under certain conditions (me)
- Defer sequence check fix for wrapped rings (Zhengyuan)
- Mismatch in async inc/dec accounting (Zhengyuan)
- Memory ordering issue that could cause stall (Zhengyuan)
- Track sequential defer in bytes, not pages (Zhengyuan)
- NVMe pull request from Christoph
- Set of hang fixes for wbt (Josef)
- Redundant error message kill for libahci (Ding)
- Remove unused blk_mq_sched_started_request() and related ops (Marcos)
- drbd dynamic alloc shash descriptor to reduce stack use (Arnd)
- blkcg ->pd_stat() non-debug print (Tejun)
- bcache memory leak fix (Wei)
- Comment fix (Akinobu)
- BFQ perf regression fix (Paolo)
* tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
io_uring: ensure ->list is initialized for poll commands
Revert "nvme-pci: don't create a read hctx mapping without read queues"
nvme: fix multipath crash when ANA is deactivated
nvme: fix memory leak caused by incorrect subsystem free
nvme: ignore subnqn for ADATA SX6000LNP
drbd: dynamically allocate shash descriptor
block: blk-mq: Remove blk_mq_sched_started_request and started_request
bcache: fix possible memory leak in bch_cached_dev_run()
io_uring: track io length in async_list based on bytes
io_uring: don't use iov_iter_advance() for fixed buffers
block: properly handle IOCB_NOWAIT for async O_DIRECT IO
blk-mq: allow REQ_NOWAIT to return an error inline
io_uring: add a memory barrier before atomic_read
rq-qos: use a mb for got_token
rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
rq-qos: don't reset has_sleepers on spurious wakeups
rq-qos: fix missed wake-ups in rq_qos_throttle
wait: add wq_has_single_sleeper helper
block, bfq: check also in-flight I/O in dispatch plugging
block: fix sysfs module parameters directory path in comment
...
|
|
Consider a sync bfq_queue Q that remains empty while in service, and
suppose that, when this happens, there is a fair amount of already
in-flight I/O not belonging to Q. In such a situation, I/O dispatching
may need to be plugged (until new I/O arrives for Q), for the
following reason.
The drive may decide to serve in-flight non-Q's I/O requests before
Q's ones, thereby delaying the arrival of new I/O requests for Q
(recall that Q is sync). If I/O-dispatching is not plugged, then,
while Q remains empty, a basically uncontrolled amount of I/O from
other queues may be dispatched too, possibly causing the service of
Q's I/O to be delayed even longer in the drive. This problem gets more
and more serious as the speed and the queue depth of the drive grow,
because, as these two quantities grow, the probability to find no
queue busy but many requests in flight grows too.
If Q has the same weight and priority as the other queues, then the
above delay is unlikely to cause any issue, because all queues tend to
undergo the same treatment. So, since not plugging I/O dispatching is
convenient for throughput, it is better not to plug. Things change in
case Q has a higher weight or priority than some other queue, because
Q's service guarantees may simply be violated. For this reason,
commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric
scenarios") does plug I/O in such an asymmetric scenario. Plugging
minimizes the delay induced by already in-flight I/O, and enables Q to
recover the bandwidth it may lose because of this delay.
Yet the above commit does not cover the case of weight-raised queues,
for efficiency concerns. For weight-raised queues, I/O-dispatch
plugging is activated simply if not all bfq_queues are
weight-raised. But this check does not handle the case of in-flight
requests, because a bfq_queue may become non busy *before* all its
in-flight requests are completed.
This commit performs I/O-dispatch plugging for weight-raised queues if
there are some in-flight requests.
As a practical example of the resulting recover of control, under
write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5
seconds after this fix, against 15 seconds before the fix (as a
reference, gnome-terminal takes about 35 seconds to start with any of
the other I/O schedulers).
Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Rename the block documentation files to ReST, add an
index for them and adjust in order to produce a nice html
output via the Sphinx build system.
At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
|
|
Pull block updates from Jens Axboe:
"This is the main block updates for 5.3. Nothing earth shattering or
major in here, just fixes, additions, and improvements all over the
map. This contains:
- Series of documentation fixes (Bart)
- Optimization of the blk-mq ctx get/put (Bart)
- null_blk removal race condition fix (Bob)
- req/bio_op() cleanups (Chaitanya)
- Series cleaning up the segment accounting, and request/bio mapping
(Christoph)
- Series cleaning up the page getting/putting for bios (Christoph)
- block cgroup cleanups and moving it to where it is used (Christoph)
- block cgroup fixes (Tejun)
- Series of fixes and improvements to bcache, most notably a write
deadlock fix (Coly)
- blk-iolatency STS_AGAIN and accounting fixes (Dennis)
- Series of improvements and fixes to BFQ (Douglas, Paolo)
- debugfs_create() return value check removal for drbd (Greg)
- Use struct_size(), where appropriate (Gustavo)
- Two lighnvm fixes (Heiner, Geert)
- MD fixes, including a read balance and corruption fix (Guoqing,
Marcos, Xiao, Yufen)
- block opal shadow mbr additions (Jonas, Revanth)
- sbitmap compare-and-exhange improvemnts (Pavel)
- Fix for potential bio->bi_size overflow (Ming)
- NVMe pull requests:
- improved PCIe suspent support (Keith Busch)
- error injection support for the admin queue (Akinobu Mita)
- Fibre Channel discovery improvements (James Smart)
- tracing improvements including nvmetc tracing support (Minwoo Im)
- misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya
Kulkarni)"
- Various little fixes and improvements to drivers and core"
* tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits)
blk-iolatency: fix STS_AGAIN handling
block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES
blk-mq: simplify blk_mq_make_request()
blk-mq: remove blk_mq_put_ctx()
sbitmap: Replace cmpxchg with xchg
block: fix .bi_size overflow
block: sed-opal: check size of shadow mbr
block: sed-opal: ioctl for writing to shadow mbr
block: sed-opal: add ioctl for done-mark of shadow mbr
block: never take page references for ITER_BVEC
direct-io: use bio_release_pages in dio_bio_complete
block_dev: use bio_release_pages in bio_unmap_user
block_dev: use bio_release_pages in blkdev_bio_end_io
iomap: use bio_release_pages in iomap_dio_bio_end_io
block: use bio_release_pages in bio_map_user_iov
block: use bio_release_pages in bio_unmap_user
block: optionally mark pages dirty in bio_release_pages
block: move the BIO_NO_PAGE_REF check into bio_release_pages
block: skd_main.c: Remove call to memset after dma_alloc_coherent
block: mtip32xx: Remove call to memset after dma_alloc_coherent
...
|
|
In reboot tests on several devices we were seeing a "use after free"
when slub_debug or KASAN was enabled. The kernel complained about:
Unable to handle kernel paging request at virtual address 6b6b6c2b
...which is a classic sign of use after free under slub_debug. The
stack crawl in kgdb looked like:
0 test_bit (addr=<optimized out>, nr=<optimized out>)
1 bfq_bfqq_busy (bfqq=<optimized out>)
2 bfq_select_queue (bfqd=<optimized out>)
3 __bfq_dispatch_request (hctx=<optimized out>)
4 bfq_dispatch_request (hctx=<optimized out>)
5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440)
6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440)
7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440)
8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>)
9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480)
10 0xc024cff4 in worker_thread (__worker=0xec6d4640)
Digging in kgdb, it could be found that, though bfqq looked fine,
bfqq->bic had been freed.
Through further digging, I postulated that perhaps it is illegal to
access a "bic" (AKA an "icq") after bfq_exit_icq() had been called
because the "bic" can be freed at some point in time after this call
is made. I confirmed that there certainly were cases where the exact
crashing code path would access the "bic" after bfq_exit_icq() had
been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and
saw that the bic was 0x7 at the time of the crash.
To understand a bit more about why this crash was fairly uncommon (I
saw it only once in a few hundred reboots), you can see that much of
the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't
access the ->bic anymore. The only case it doesn't is if
bfq_put_queue() sees a reference still held.
However, even in the case when bfqq isn't freed, the crash is still
rare. Why? I tracked what happened to the "bic" after the exit
routine. It doesn't get freed right away. Rather,
put_io_context_active() eventually called put_io_context() which
queued up freeing on a workqueue. The freeing then actually happened
later than that through call_rcu(). Despite all these delays, some
extra debugging showed that all the hoops could be jumped through in
time and the memory could be freed causing the original crash. Phew!
To make a long story short, assuming it truly is illegal to access an
icq after the "exit_icq" callback is finished, this patch is needed.
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Some debug code suggested by Paolo was tripping when I did reboot
stress tests. Specifically in bfq_bfqq_resume_state()
"bic->saved_wr_start_at_switch_to_srt" was later than the current
value of "jiffies". A bit of debugging showed that
"bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more
debugging showed that was because we had run through the "unlikely"
case in the bfq_bfqq_save_state() function.
Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to
something sane.
NOTE: this fixes no known real-world errors.
Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
By mistake, there is a '&' instead of a '==' in the definition of the
macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with
the correct one.
Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt")
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Consider, on one side, a bfq_queue Q that remains empty while in
service, and, on the other side, the pending I/O of bfq_queues that,
according to their timestamps, have to be served after Q. If an
uncontrolled amount of I/O from the latter bfq_queues were dispatched
while Q is waiting for its new I/O to arrive, then Q's bandwidth
guarantees would be violated. To prevent this, I/O dispatch is plugged
until Q receives new I/O (except for a properly controlled amount of
injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
for the following reason.
Preemption is performed in two steps. First, Q is expired and
re-scheduled. Second, the new bfq_queue to serve is chosen. The first
step is needed by the second, as the second can be performed only
after Q's timestamps have been properly updated (done in the
expiration step), and Q has been re-queued for service. This
dependency is a consequence of the way how BFQ's scheduling algorithm
is currently implemented.
But Q is not re-scheduled at all in the first step, because Q is
empty. As a consequence, an uncontrolled amount of I/O may be
dispatched until Q becomes non empty again. This breaks Q's service
guarantees.
This commit addresses this issue by re-scheduling Q even if it is
empty. This in turn breaks the assumption that all scheduled queues
are non empty. Then a few extra checks are now needed.
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
BFQ enqueues the I/O coming from each process into a separate
bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
served for at most timeout_sync milliseconds (default: 125 ms). This
service scheme is prone to the following inaccuracy.
While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
receive I/O, and, according to BFQ's scheduling policy, may become the
right bfq_queue to serve, in place of the currently in-service
bfq_queue. In this respect, postponing the service of Q2 to after the
service of Q1 finishes may delay the completion of Q2's I/O, compared
with an ideal service in which all non-empty bfq_queues are served in
parallel, and every non-empty bfq_queue is served at a rate
proportional to the bfq_queue's weight. This additional delay is equal
at most to the time Q1 may unjustly remain in service before switching
to Q2.
If Q1 and Q2 have the same weight, then this time is most likely
negligible compared with the completion time to be guaranteed to Q2's
I/O. In addition, first, one of the reasons why BFQ may want to serve
Q1 for a while is that this boosts throughput and, second, serving Q1
longer reduces BFQ's overhead. As a conclusion, it is usually better
not to preempt Q1 if both Q1 and Q2 have the same weight.
In contrast, as Q2's weight or priority becomes higher and higher
compared with that of Q1, the above delay becomes larger and larger,
compared with the I/O completion times that have to be guaranteed to
Q2 according to Q2's weight. So reducing this delay may be more
important than avoiding the costs of preempting Q1.
Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
triggers a new choice of the next bfq_queue to serve. If Q2 really is
the next bfq_queue to serve, then Q2 will be set in service
immediately.
This change reduces the component of the I/O latency caused by the
above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
a process doing sporadic random reads while another process is doing
continuous sequential reads.
Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A bfq_queue Q may happen to be synchronized with another
bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
receive new I/O. We call Q2 "waker queue".
If I/O plugging is being performed for Q, and Q is not receiving any
more I/O because of the above synchronization, then, thanks to BFQ's
injection mechanism, the waker queue is likely to get served before
the I/O-plugging timeout fires.
Unfortunately, this fact may not be sufficient to guarantee a high
throughput during the I/O plugging, because the inject limit for Q may
be too low to guarantee a lot of injected I/O. In addition, the
duration of the plugging, i.e., the time before Q finally receives new
I/O, may not be minimized, because the waker queue may happen to be
served only after other queues.
To address these issues, this commit introduces the explicit detection
of the waker queue, and the unconditional injection of a pending I/O
request of the waker queue on each invocation of
bfq_dispatch_request().
One may be concerned that this systematic injection of I/O from the
waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
the contrary, next Q's I/O is brought forward dramatically, for it is
not blocked for milliseconds.
Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|