aboutsummaryrefslogtreecommitdiffstats
path: root/block/bfq-iosched.c
AgeCommit message (Collapse)AuthorFilesLines
7 daysblock, bfq: don't grab queue_lock to initialize bfqYu Kuai1-5/+0
The request_queue is frozen and quiesced while the elevator init_sched() method runs, so queue_lock is not needed for BFQ cgroup initialization. Signed-off-by: Yu Kuai <yukuai@fygo.io> Link: https://patch.msgid.link/1965073ea20f33114a8d903816b986e483b9bb34.1780621988.git.yukuai@fygo.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
9 daysblock, bfq: protect async queue reset with blkcg locksCen Zhang1-1/+2
Writing 0 to BFQ's low_latency attribute ends weight raising for active, idle and async queues. The async cgroup path walks q->blkg_list, converts each blkg to BFQ policy data and then reads bfqg->async_bfqq and bfqg->async_idle_bfqq. That walk was protected only by bfqd->lock. blkcg release work is serialized by q->blkcg_mutex and q->queue_lock instead, and blkg_free_workfn() can call BFQ's pd_free_fn before it removes blkg->q_node from q->blkg_list. A low_latency reset can therefore still find the blkg on the queue list after the BFQ policy data has been freed. The buggy scenario involves two paths, with each column showing the order within that path: BFQ low_latency reset: blkcg blkg release work: 1. bfq_low_latency_store() 1. blkg_free_workfn() takes calls bfq_end_wr(). q->blkcg_mutex. 2. bfq_end_wr_async() walks 2. BFQ pd_free_fn drops the q->blkg_list. final bfq_group reference. 3. blkg_to_bfqg() returns 3. blkg->q_node remains on the stale policy data. q->blkg_list until list_del_init(). 4. bfq_end_wr_async_queues() reads async queue fields. Fix this by taking q->blkcg_mutex and q->queue_lock around the q->blkg_list walk, then taking bfqd->lock before touching BFQ async queues. The mutex serializes against policy-data free and queue_lock stabilizes the list. Move the async reset out of bfq_end_wr()'s existing bfqd->lock critical section so the lock order matches blkcg policy callbacks. Validation reproduced this kernel report: BUG: KASAN: slab-use-after-free in bfq_end_wr_async_queues+0x246/0x340 Call Trace: <TASK> dump_stack_lvl+0x66/0xa0 print_report+0xce/0x630 ? bfq_end_wr_async_queues+0x246/0x340 ? srso_alias_return_thunk+0x5/0xfbef5 ? __virt_addr_valid+0x20d/0x410 ? bfq_end_wr_async_queues+0x246/0x340 kasan_report+0xe0/0x110 ? bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async_queues+0x246/0x340 bfq_end_wr_async+0xba/0x180 bfq_low_latency_store+0x4e5/0x690 ? 0xffffffffc02150da ? __pfx_bfq_low_latency_store+0x10/0x10 ? __pfx_bfq_low_latency_store+0x10/0x10 elv_attr_store+0xc4/0x110 kernfs_fop_write_iter+0x2f5/0x4a0 vfs_write+0x604/0x11f0 ? __pfx_locks_remove_posix+0x10/0x10 ? __pfx_vfs_write+0x10/0x10 ksys_write+0xf9/0x1d0 ? __pfx_ksys_write+0x10/0x10 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Allocated by task 544: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 __kasan_kmalloc+0xaa/0xb0 bfq_pd_alloc+0xc0/0x1b0 blkg_alloc+0x346/0x960 blkg_create+0x8c2/0x10d0 bio_associate_blkg_from_css+0x9f3/0xfa0 bio_associate_blkg+0xd9/0x200 bio_init+0x303/0x640 __blkdev_direct_IO_simple+0x56b/0x8a0 blkdev_direct_IO+0x8e7/0x2580 blkdev_read_iter+0x205/0x400 vfs_read+0x7b0/0xda0 ksys_read+0xf9/0x1d0 do_syscall_64+0x115/0x6a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f Freed by task 465: kasan_save_stack+0x33/0x60 kasan_save_track+0x14/0x30 kasan_save_free_info+0x3b/0x60 __kasan_slab_free+0x5f/0x80 kfree+0x307/0x580 blkg_free_workfn+0xef/0x460 process_one_work+0x8d0/0x1870 worker_thread+0x575/0xf80 kthread+0x2e7/0x3c0 ret_from_fork+0x576/0x810 ret_from_fork_asm+0x1a/0x30 Fixes: 44e44a1b329e ("block, bfq: improve responsiveness") Assisted-by: Codex:gpt-5.5 Signed-off-by: Cen Zhang <zzzccc427@gmail.com> Reviewed-by: Tao Cui <cuitao@kylinos.cn> Link: https://patch.msgid.link/20260621135930.2657810-1-zzzccc427@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-3/+3
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-03block, bfq: convert to use request_queue->async_depthYu Kuai1-26/+17
The default limits is unchanged, and user can configure async_depth now. Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-02-03blk-mq-sched: unify elevators checking for async requestsYu Kuai1-1/+1
bfq and mq-deadline consider sync writes as async requests and only reserve tags for sync reads by async_depth, however, kyber doesn't consider sync writes as async requests for now. Consider the case there are lots of dirty pages, and user use fsync to flush dirty pages. In this case sched_tags can be exhausted by sync writes and sync reads can stuck waiting for tag. Hence let kyber follow what mq-deadline and bfq did, and unify async requests checking for all elevators. Signed-off-by: Yu Kuai <yukuai@fnnas.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-01-29block: introduce blk_queue_rot()Damien Le Moal1-10/+10
To check if a request queue is for a rotational device, a double negation is needed with the pattern "!blk_queue_nonrot(q)". Simplify this with the introduction of the helper blk_queue_rot() which tests if a requests queue limit has the BLK_FEAT_ROTATIONAL feature set. All call sites of blk_queue_nonrot() are modified to use blk_queue_rot() and blk_queue_nonrot() definition removed. No functional changes. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-12block: fix race between wbt_enable_default and IO submissionMing Lei1-1/+1
When wbt_enable_default() is moved out of queue freezing in elevator_change(), it can cause the wbt inflight counter to become negative (-1), leading to hung tasks in the writeback path. Tasks get stuck in wbt_wait() because the counter is in an inconsistent state. The issue occurs because wbt_enable_default() could race with IO submission, allowing the counter to be decremented before proper initialization. This manifests as: rq_wait[0]: inflight: -1 has_waiters: True rwb_enabled() checks the state, which can be updated exactly between wbt_wait() (rq_qos_throttle()) and wbt_track()(rq_qos_track()), then the inflight counter will become negative. And results in hung task warnings like: task:kworker/u24:39 state:D stack:0 pid:14767 Call Trace: rq_qos_wait+0xb4/0x150 wbt_wait+0xa9/0x100 __rq_qos_throttle+0x24/0x40 blk_mq_submit_bio+0x672/0x7b0 ... Fix this by: 1. Splitting wbt_enable_default() into: - __wbt_enable_default(): Returns true if wbt_init() should be called - wbt_enable_default(): Wrapper for existing callers (no init) - wbt_init_enable_default(): New function that checks and inits WBT 2. Using wbt_init_enable_default() in blk_register_queue() to ensure proper initialization during queue registration 3. Move wbt_init() out of wbt_enable_default() which is only for enabling disabled wbt from bfq and iocost, and wbt_init() isn't needed. Then the original lock warning can be avoided. 4. Removing the ELEVATOR_FLAG_ENABLE_WBT_ON_EXIT flag and its handling code since it's no longer needed This ensures WBT is properly initialized before any IO can be submitted, preventing the counter from going negative. Cc: Nilay Shroff <nilay@linux.ibm.com> Cc: Yu Kuai <yukuai@fnnas.com> Cc: Guangwu Zhang <guazhang@redhat.com> Fixes: 78c271344b6f ("block: move wbt_enable_default() out of queue freezing from sched ->exit()") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-05blk-mq: fix elevator depth_updated methodYu Kuai1-17/+5
Current depth_updated has some problems: 1) depth_updated() will be called for each hctx, while all elevators will update async_depth for the disk level, this is not related to hctx; 2) In blk_mq_update_nr_requests(), if previous hctx update succeed and this hctx update failed, q->nr_requests will not be updated, while async_depth is already updated with new nr_reqeuests in previous depth_updated(); 3) All elevators are using q->nr_requests to calculate async_depth now, however, q->nr_requests is still the old value when depth_updated() is called from blk_mq_update_nr_requests(); Those problems are first from error path, then mq-deadline, and recently for bfq and kyber, fix those problems by: - pass in request_queue instead of hctx; - move depth_updated() after q->nr_requests is updated in blk_mq_update_nr_requests(); - add depth_updated() call inside init_sched() method to initialize async_depth; - remove init_hctx() method for mq-deadline and bfq that is useless now; Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes") Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code") Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Li Nan <linan122@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250821060612.1729939-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-11block, bfq: remove redundant __GFP_NOWARNQianfeng Rong1-2/+1
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://lore.kernel.org/r/20250811081135.374315-1-rongqianfeng@vivo.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-07lib/sbitmap: convert shallow_depth from one word to the whole sbitmapYu Kuai1-19/+16
Currently elevators will record internal 'async_depth' to throttle asynchronous requests, and they both calculate shallow_dpeth based on sb->shift, with the respect that sb->shift is the available tags in one word. However, sb->shift is not the availbale tags in the last word, see __map_depth: if (index == sb->map_nr - 1) return sb->depth - (index << sb->shift); For consequence, if the last word is used, more tags can be get than expected, for example, assume nr_requests=256 and there are four words, in the worst case if user set nr_requests=32, then the first word is the last word, and still use bits per word, which is 64, to calculate async_depth is wrong. One the ohter hand, due to cgroup qos, bfq can allow only one request to be allocated, and set shallow_dpeth=1 will still allow the number of words request to be allocated. Fix this problems by using shallow_depth to the whole sbitmap instead of per word, also change kyber, mq-deadline and bfq to follow this, a new helper __map_depth_with_shallow() is introduced to calculate available bits in each word. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250807032413.1469456-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-30block: move elevator queue allocation logic into blk_mq_init_schedNilay Shroff1-10/+3
In preparation for allocating sched_tags before freezing the request queue and acquiring ->elevator_lock, move the elevator queue allocation logic from the elevator ops ->init_sched callback into blk_mq_init_sched. As elevator_alloc is now only invoked from block layer core, we don't need to export it, so unexport elevator_alloc function. This refactoring provides a centralized location for elevator queue initialization, which makes it easier to store pre-allocated sched_tags in the struct elevator_queue during later changes. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250730074614.2537382-2-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-29blk-ioc: don't hold queue_lock for ioc_lookup_icq()Yu Kuai1-16/+2
Currently issue io can grab queue_lock three times from bfq_bio_merge(), bfq_limit_depth() and bfq_prepare_request(), the queue_lock is not necessary if icq is already created because both queue and ioc can't be freed before io issuing is done, hence remove the unnecessary queue_lock and use rcu to protect radix tree lookup. Noted this is also a prep patch to support request batch dispatching[1]. [1] https://lore.kernel.org/all/20250722072431.610354-1-yukuai1@huaweicloud.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250729023229.2944898-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move wbt_enable_default() out of queue freezing from sched ->exit()Ming Lei1-1/+1
scheduler's ->exit() is called with queue frozen and elevator lock is held, and wbt_enable_default() can't be called with queue frozen, otherwise the following lockdep warning is triggered: #6 (&q->rq_qos_mutex){+.+.}-{4:4}: #5 (&eq->sysfs_lock){+.+.}-{4:4}: #4 (&q->elevator_lock){+.+.}-{4:4}: #3 (&q->q_usage_counter(io)#3){++++}-{0:0}: #2 (fs_reclaim){+.+.}-{0:0}: #1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: #0 (&q->debugfs_mutex){+.+.}-{4:4}: Fix the issue by moving wbt_enable_default() out of bfq's exit(), and call it from elevator_change_done(). Meantime add disk->rqos_state_mutex for covering wbt state change, which matches the purpose more than ->elevator_lock. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-26-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move ELEVATOR_FLAG_DISABLE_WBT a request queue flagMing Lei1-2/+2
ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is in use. The flag is set in BFQ's init(), and cleared in BFQ's exit(). Making it as request queue flag, so that we can avoid to deal with elevator switch race. Also it isn't graceful to checking one scheduler flag in wbt_enable_default(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-18block, bfq: Switch to use hrtimer_setup()Nam Cao1-3/+2
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/d0d57e1dab46b617856dfb93c721d221cc31ab0b.1738746821.git.namcao@linutronix.de
2025-01-20Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...
2025-01-09block, bfq: fix waker_bfqq UAF after bfq_split_bfqq()Yu Kuai1-2/+10
Our syzkaller report a following UAF for v6.6: BUG: KASAN: slab-use-after-free in bfq_init_rq+0x175d/0x17a0 block/bfq-iosched.c:6958 Read of size 8 at addr ffff8881b57147d8 by task fsstress/232726 CPU: 2 PID: 232726 Comm: fsstress Not tainted 6.6.0-g3629d1885222 #39 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 print_report+0x3e/0x70 mm/kasan/report.c:475 kasan_report+0xb8/0xf0 mm/kasan/report.c:588 hlist_add_head include/linux/list.h:1023 [inline] bfq_init_rq+0x175d/0x17a0 block/bfq-iosched.c:6958 bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271 bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323 blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660 blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143 __submit_bio+0xa0/0x6b0 block/blk-core.c:639 __submit_bio_noacct_mq block/blk-core.c:718 [inline] submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747 submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847 __ext4_read_bh fs/ext4/super.c:205 [inline] ext4_read_bh+0x15e/0x2e0 fs/ext4/super.c:230 __read_extent_tree_block+0x304/0x6f0 fs/ext4/extents.c:567 ext4_find_extent+0x479/0xd20 fs/ext4/extents.c:947 ext4_ext_map_blocks+0x1a3/0x2680 fs/ext4/extents.c:4182 ext4_map_blocks+0x929/0x15a0 fs/ext4/inode.c:660 ext4_iomap_begin_report+0x298/0x480 fs/ext4/inode.c:3569 iomap_iter+0x3dd/0x1010 fs/iomap/iter.c:91 iomap_fiemap+0x1f4/0x360 fs/iomap/fiemap.c:80 ext4_fiemap+0x181/0x210 fs/ext4/extents.c:5051 ioctl_fiemap.isra.0+0x1b4/0x290 fs/ioctl.c:220 do_vfs_ioctl+0x31c/0x11a0 fs/ioctl.c:811 __do_sys_ioctl fs/ioctl.c:869 [inline] __se_sys_ioctl+0xae/0x190 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 Allocated by task 232719: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook mm/slab.h:768 [inline] slab_alloc_node mm/slub.c:3492 [inline] kmem_cache_alloc_node+0x1b8/0x6f0 mm/slub.c:3537 bfq_get_queue+0x215/0x1f00 block/bfq-iosched.c:5869 bfq_get_bfqq_handle_split+0x167/0x5f0 block/bfq-iosched.c:6776 bfq_init_rq+0x13a4/0x17a0 block/bfq-iosched.c:6938 bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271 bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323 blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660 blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143 __submit_bio+0xa0/0x6b0 block/blk-core.c:639 __submit_bio_noacct_mq block/blk-core.c:718 [inline] submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747 submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847 __ext4_read_bh fs/ext4/super.c:205 [inline] ext4_read_bh_nowait+0x15a/0x240 fs/ext4/super.c:217 ext4_read_bh_lock+0xac/0xd0 fs/ext4/super.c:242 ext4_bread_batch+0x268/0x500 fs/ext4/inode.c:958 __ext4_find_entry+0x448/0x10f0 fs/ext4/namei.c:1671 ext4_lookup_entry fs/ext4/namei.c:1774 [inline] ext4_lookup.part.0+0x359/0x6f0 fs/ext4/namei.c:1842 ext4_lookup+0x72/0x90 fs/ext4/namei.c:1839 __lookup_slow+0x257/0x480 fs/namei.c:1696 lookup_slow fs/namei.c:1713 [inline] walk_component+0x454/0x5c0 fs/namei.c:2004 link_path_walk.part.0+0x773/0xda0 fs/namei.c:2331 link_path_walk fs/namei.c:3826 [inline] path_openat+0x1b9/0x520 fs/namei.c:3826 do_filp_open+0x1b7/0x400 fs/namei.c:3857 do_sys_openat2+0x5dc/0x6e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x148/0x200 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 Freed by task 232726: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1827 [inline] slab_free_freelist_hook mm/slub.c:1853 [inline] slab_free mm/slub.c:3820 [inline] kmem_cache_free+0x110/0x760 mm/slub.c:3842 bfq_put_queue+0x6a7/0xfb0 block/bfq-iosched.c:5428 bfq_forget_entity block/bfq-wf2q.c:634 [inline] bfq_put_idle_entity+0x142/0x240 block/bfq-wf2q.c:645 bfq_forget_idle+0x189/0x1e0 block/bfq-wf2q.c:671 bfq_update_vtime block/bfq-wf2q.c:1280 [inline] __bfq_lookup_next_entity block/bfq-wf2q.c:1374 [inline] bfq_lookup_next_entity+0x350/0x480 block/bfq-wf2q.c:1433 bfq_update_next_in_service+0x1c0/0x4f0 block/bfq-wf2q.c:128 bfq_deactivate_entity+0x10a/0x240 block/bfq-wf2q.c:1188 bfq_deactivate_bfqq block/bfq-wf2q.c:1592 [inline] bfq_del_bfqq_busy+0x2e8/0xad0 block/bfq-wf2q.c:1659 bfq_release_process_ref+0x1cc/0x220 block/bfq-iosched.c:3139 bfq_split_bfqq+0x481/0xdf0 block/bfq-iosched.c:6754 bfq_init_rq+0xf29/0x17a0 block/bfq-iosched.c:6934 bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271 bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323 blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660 blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143 __submit_bio+0xa0/0x6b0 block/blk-core.c:639 __submit_bio_noacct_mq block/blk-core.c:718 [inline] submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747 submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847 __ext4_read_bh fs/ext4/super.c:205 [inline] ext4_read_bh+0x15e/0x2e0 fs/ext4/super.c:230 __read_extent_tree_block+0x304/0x6f0 fs/ext4/extents.c:567 ext4_find_extent+0x479/0xd20 fs/ext4/extents.c:947 ext4_ext_map_blocks+0x1a3/0x2680 fs/ext4/extents.c:4182 ext4_map_blocks+0x929/0x15a0 fs/ext4/inode.c:660 ext4_iomap_begin_report+0x298/0x480 fs/ext4/inode.c:3569 iomap_iter+0x3dd/0x1010 fs/iomap/iter.c:91 iomap_fiemap+0x1f4/0x360 fs/iomap/fiemap.c:80 ext4_fiemap+0x181/0x210 fs/ext4/extents.c:5051 ioctl_fiemap.isra.0+0x1b4/0x290 fs/ioctl.c:220 do_vfs_ioctl+0x31c/0x11a0 fs/ioctl.c:811 __do_sys_ioctl fs/ioctl.c:869 [inline] __se_sys_ioctl+0xae/0x190 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 commit 1ba0403ac644 ("block, bfq: fix uaf for accessing waker_bfqq after splitting") fix the problem that if waker_bfqq is in the merge chain, and current is the only procress, waker_bfqq can be freed from bfq_split_bfqq(). However, the case that waker_bfqq is not in the merge chain is missed, and if the procress reference of waker_bfqq is 0, waker_bfqq can be freed as well. Fix the problem by checking procress reference if waker_bfqq is not in the merge_chain. Fixes: 1ba0403ac644 ("block, bfq: fix uaf for accessing waker_bfqq after splitting") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250108084148.1549973-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-02block, bfq: constify sysfs attributesThomas Weißschuh1-1/+1
The elevator core now allows instances of 'struct elv_fs_entry' to be moved into read-only memory. Make use of that to protect them against accidental or malicious modifications. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250102-sysfs-const-attr-elevator-v1-3-9837d2058c60@weissschuh.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-29block, bfq: fix bfqq uaf in bfq_limit_depth()Yu Kuai1-13/+24
Set new allocated bfqq to bic or remove freed bfqq from bic are both protected by bfqd->lock, however bfq_limit_depth() is deferencing bfqq from bic without the lock, this can lead to UAF if the io_context is shared by multiple tasks. For example, test bfq with io_uring can trigger following UAF in v6.6: ================================================================== BUG: KASAN: slab-use-after-free in bfqq_group+0x15/0x50 Call Trace: <TASK> dump_stack_lvl+0x47/0x80 print_address_description.constprop.0+0x66/0x300 print_report+0x3e/0x70 kasan_report+0xb4/0xf0 bfqq_group+0x15/0x50 bfqq_request_over_limit+0x130/0x9a0 bfq_limit_depth+0x1b5/0x480 __blk_mq_alloc_requests+0x2b5/0xa00 blk_mq_get_new_requests+0x11d/0x1d0 blk_mq_submit_bio+0x286/0xb00 submit_bio_noacct_nocheck+0x331/0x400 __block_write_full_folio+0x3d0/0x640 writepage_cb+0x3b/0xc0 write_cache_pages+0x254/0x6c0 write_cache_pages+0x254/0x6c0 do_writepages+0x192/0x310 filemap_fdatawrite_wbc+0x95/0xc0 __filemap_fdatawrite_range+0x99/0xd0 filemap_write_and_wait_range.part.0+0x4d/0xa0 blkdev_read_iter+0xef/0x1e0 io_read+0x1b6/0x8a0 io_issue_sqe+0x87/0x300 io_wq_submit_work+0xeb/0x390 io_worker_handle_work+0x24d/0x550 io_wq_worker+0x27f/0x6c0 ret_from_fork_asm+0x1b/0x30 </TASK> Allocated by task 808602: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_slab_alloc+0x83/0x90 kmem_cache_alloc_node+0x1b1/0x6d0 bfq_get_queue+0x138/0xfa0 bfq_get_bfqq_handle_split+0xe3/0x2c0 bfq_init_rq+0x196/0xbb0 bfq_insert_request.isra.0+0xb5/0x480 bfq_insert_requests+0x156/0x180 blk_mq_insert_request+0x15d/0x440 blk_mq_submit_bio+0x8a4/0xb00 submit_bio_noacct_nocheck+0x331/0x400 __blkdev_direct_IO_async+0x2dd/0x330 blkdev_write_iter+0x39a/0x450 io_write+0x22a/0x840 io_issue_sqe+0x87/0x300 io_wq_submit_work+0xeb/0x390 io_worker_handle_work+0x24d/0x550 io_wq_worker+0x27f/0x6c0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1b/0x30 Freed by task 808589: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x27/0x40 __kasan_slab_free+0x126/0x1b0 kmem_cache_free+0x10c/0x750 bfq_put_queue+0x2dd/0x770 __bfq_insert_request.isra.0+0x155/0x7a0 bfq_insert_request.isra.0+0x122/0x480 bfq_insert_requests+0x156/0x180 blk_mq_dispatch_plug_list+0x528/0x7e0 blk_mq_flush_plug_list.part.0+0xe5/0x590 __blk_flush_plug+0x3b/0x90 blk_finish_plug+0x40/0x60 do_writepages+0x19d/0x310 filemap_fdatawrite_wbc+0x95/0xc0 __filemap_fdatawrite_range+0x99/0xd0 filemap_write_and_wait_range.part.0+0x4d/0xa0 blkdev_read_iter+0xef/0x1e0 io_read+0x1b6/0x8a0 io_issue_sqe+0x87/0x300 io_wq_submit_work+0xeb/0x390 io_worker_handle_work+0x24d/0x550 io_wq_worker+0x27f/0x6c0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1b/0x30 Fix the problem by protecting bic_to_bfqq() with bfqd->lock. CC: Jan Kara <jack@suse.cz> Fixes: 76f1df88bbc2 ("bfq: Limit number of requests consumed by each cgroup") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241129091509.2227136-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-11-19Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"Zach Wade1-2/+4
This reverts commit bc3b1e9e7c50e1de0f573eea3871db61dd4787de. The bic is associated with sync_bfqq, and bfq_release_process_ref cannot be put into bfq_put_cooperator. kasan report: [ 400.347277] ================================================================== [ 400.347287] BUG: KASAN: slab-use-after-free in bic_set_bfqq+0x200/0x230 [ 400.347420] Read of size 8 at addr ffff88881cab7d60 by task dockerd/5800 [ 400.347430] [ 400.347436] CPU: 24 UID: 0 PID: 5800 Comm: dockerd Kdump: loaded Tainted: G E 6.12.0 #32 [ 400.347450] Tainted: [E]=UNSIGNED_MODULE [ 400.347454] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.20192059.B64.2207280713 07/28/2022 [ 400.347460] Call Trace: [ 400.347464] <TASK> [ 400.347468] dump_stack_lvl+0x5d/0x80 [ 400.347490] print_report+0x174/0x505 [ 400.347521] kasan_report+0xe0/0x160 [ 400.347541] bic_set_bfqq+0x200/0x230 [ 400.347549] bfq_bic_update_cgroup+0x419/0x740 [ 400.347560] bfq_bio_merge+0x133/0x320 [ 400.347584] blk_mq_submit_bio+0x1761/0x1e20 [ 400.347625] __submit_bio+0x28b/0x7b0 [ 400.347664] submit_bio_noacct_nocheck+0x6b2/0xd30 [ 400.347690] iomap_readahead+0x50c/0x680 [ 400.347731] read_pages+0x17f/0x9c0 [ 400.347785] page_cache_ra_unbounded+0x366/0x4a0 [ 400.347795] filemap_fault+0x83d/0x2340 [ 400.347819] __xfs_filemap_fault+0x11a/0x7d0 [xfs] [ 400.349256] __do_fault+0xf1/0x610 [ 400.349270] do_fault+0x977/0x11a0 [ 400.349281] __handle_mm_fault+0x5d1/0x850 [ 400.349314] handle_mm_fault+0x1f8/0x560 [ 400.349324] do_user_addr_fault+0x324/0x970 [ 400.349337] exc_page_fault+0x76/0xf0 [ 400.349350] asm_exc_page_fault+0x26/0x30 [ 400.349360] RIP: 0033:0x55a480d77375 [ 400.349384] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 49 3b 66 10 0f 86 ae 02 00 00 55 48 89 e5 48 83 ec 58 48 8b 10 <83> 7a 10 00 0f 84 27 02 00 00 44 0f b6 42 28 44 0f b6 4a 29 41 80 [ 400.349392] RSP: 002b:00007f18c37fd8b8 EFLAGS: 00010216 [ 400.349401] RAX: 00007f18c37fd9d0 RBX: 0000000000000000 RCX: 0000000000000000 [ 400.349407] RDX: 000055a484407d38 RSI: 000000c000e8b0c0 RDI: 0000000000000000 [ 400.349412] RBP: 00007f18c37fd910 R08: 000055a484017f60 R09: 000055a484066f80 [ 400.349417] R10: 0000000000194000 R11: 0000000000000005 R12: 0000000000000008 [ 400.349422] R13: 0000000000000000 R14: 000000c000476a80 R15: 0000000000000000 [ 400.349430] </TASK> [ 400.349452] [ 400.349454] Allocated by task 5800: [ 400.349459] kasan_save_stack+0x30/0x50 [ 400.349469] kasan_save_track+0x14/0x30 [ 400.349475] __kasan_slab_alloc+0x89/0x90 [ 400.349482] kmem_cache_alloc_node_noprof+0xdc/0x2a0 [ 400.349492] bfq_get_queue+0x1ef/0x1100 [ 400.349502] __bfq_get_bfqq_handle_split+0x11a/0x510 [ 400.349511] bfq_insert_requests+0xf55/0x9030 [ 400.349519] blk_mq_flush_plug_list+0x446/0x14c0 [ 400.349527] __blk_flush_plug+0x27c/0x4e0 [ 400.349534] blk_finish_plug+0x52/0xa0 [ 400.349540] _xfs_buf_ioapply+0x739/0xc30 [xfs] [ 400.350246] __xfs_buf_submit+0x1b2/0x640 [xfs] [ 400.350967] xfs_buf_read_map+0x306/0xa20 [xfs] [ 400.351672] xfs_trans_read_buf_map+0x285/0x7d0 [xfs] [ 400.352386] xfs_imap_to_bp+0x107/0x270 [xfs] [ 400.353077] xfs_iget+0x70d/0x1eb0 [xfs] [ 400.353786] xfs_lookup+0x2ca/0x3a0 [xfs] [ 400.354506] xfs_vn_lookup+0x14e/0x1a0 [xfs] [ 400.355197] __lookup_slow+0x19c/0x340 [ 400.355204] lookup_one_unlocked+0xfc/0x120 [ 400.355211] ovl_lookup_single+0x1b3/0xcf0 [overlay] [ 400.355255] ovl_lookup_layer+0x316/0x490 [overlay] [ 400.355295] ovl_lookup+0x844/0x1fd0 [overlay] [ 400.355351] lookup_one_qstr_excl+0xef/0x150 [ 400.355357] do_unlinkat+0x22a/0x620 [ 400.355366] __x64_sys_unlinkat+0x109/0x1e0 [ 400.355375] do_syscall_64+0x82/0x160 [ 400.355384] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 400.355393] [ 400.355395] Freed by task 5800: [ 400.355400] kasan_save_stack+0x30/0x50 [ 400.355407] kasan_save_track+0x14/0x30 [ 400.355413] kasan_save_free_info+0x3b/0x70 [ 400.355422] __kasan_slab_free+0x4f/0x70 [ 400.355429] kmem_cache_free+0x176/0x520 [ 400.355438] bfq_put_queue+0x67e/0x980 [ 400.355447] bfq_bic_update_cgroup+0x407/0x740 [ 400.355454] bfq_bio_merge+0x133/0x320 [ 400.355460] blk_mq_submit_bio+0x1761/0x1e20 [ 400.355467] __submit_bio+0x28b/0x7b0 [ 400.355473] submit_bio_noacct_nocheck+0x6b2/0xd30 [ 400.355480] iomap_readahead+0x50c/0x680 [ 400.355490] read_pages+0x17f/0x9c0 [ 400.355498] page_cache_ra_unbounded+0x366/0x4a0 [ 400.355505] filemap_fault+0x83d/0x2340 [ 400.355514] __xfs_filemap_fault+0x11a/0x7d0 [xfs] [ 400.356204] __do_fault+0xf1/0x610 [ 400.356213] do_fault+0x977/0x11a0 [ 400.356221] __handle_mm_fault+0x5d1/0x850 [ 400.356230] handle_mm_fault+0x1f8/0x560 [ 400.356238] do_user_addr_fault+0x324/0x970 [ 400.356248] exc_page_fault+0x76/0xf0 [ 400.356258] asm_exc_page_fault+0x26/0x30 [ 400.356266] [ 400.356269] The buggy address belongs to the object at ffff88881cab7bc0 which belongs to the cache bfq_queue of size 576 [ 400.356276] The buggy address is located 416 bytes inside of freed 576-byte region [ffff88881cab7bc0, ffff88881cab7e00) [ 400.356285] [ 400.356287] The buggy address belongs to the physical page: [ 400.356292] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88881cab0b00 pfn:0x81cab0 [ 400.356300] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 400.356323] flags: 0x50000000000040(head|node=1|zone=2) [ 400.356331] page_type: f5(slab) [ 400.356340] raw: 0050000000000040 ffff88880a00c280 dead000000000122 0000000000000000 [ 400.356347] raw: ffff88881cab0b00 00000000802e0025 00000001f5000000 0000000000000000 [ 400.356354] head: 0050000000000040 ffff88880a00c280 dead000000000122 0000000000000000 [ 400.356359] head: ffff88881cab0b00 00000000802e0025 00000001f5000000 0000000000000000 [ 400.356365] head: 0050000000000003 ffffea002072ac01 ffffffffffffffff 0000000000000000 [ 400.356370] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 [ 400.356378] page dumped because: kasan: bad access detected [ 400.356381] [ 400.356383] Memory state around the buggy address: [ 400.356387] ffff88881cab7c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356392] ffff88881cab7c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356397] >ffff88881cab7d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356400] ^ [ 400.356405] ffff88881cab7d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356409] ffff88881cab7e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 400.356413] ================================================================== Cc: stable@vger.kernel.org Fixes: bc3b1e9e7c50 ("block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()") Signed-off-by: Zach Wade <zachwade.k@gmail.com> Cc: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241119153410.2546-1-zachwade.k@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10block, bfq: factor out a helper to split bfqq in bfq_init_rq()Yu Kuai1-51/+58
Make code cleaner, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-8-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()Yu Kuai1-21/+16
Now that 'bfqq_already_existing' is only used in one branch, it can be removed. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10block, bfq: remove local variable 'split' in bfq_init_rq()Yu Kuai1-15/+8
The local variable is used to call bfq_bfqq_resume_state() later, since 'bfqd->lock' is held, and bfqq status will not change between setting 'split' and calling bfq_bfqq_resume_state(), move forward bfq_bfqq_resume_state() so that 'split' can be removed. There are no functional chagnes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()Yu Kuai1-4/+2
Because bfq_put_cooperator() is always followed by bfq_release_process_ref(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10block, bfq: fix procress reference leakage for bfqq in merge chainYu Kuai1-20/+17
Original state: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \--------------\ \-------------\ \-------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 1 2 4 After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()"), if P1 issues a new IO: Without the patch: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \------------------------------\ \-------------\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 2 4 bfqq3 will be used to handle IO from P1, this is not expected, IO should be redirected to bfqq4; With the patch: ------------------------------------------- | | Process 1 Process 2 Process 3 | Process 4 (BIC1) (BIC2) (BIC3) | (BIC4) | | | | \-------------\ \-------------\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 2 4 IO is redirected to bfqq4, however, procress reference of bfqq3 is still 2, while there is only P2 using it. Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify code. Fixes: 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-10block, bfq: fix uaf for accessing waker_bfqq after splittingYu Kuai1-3/+28
After commit 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()"), if the current procress is the last holder of bfqq, the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq may in the merge chain of bfqq, hence just recored waker_bfqq is still not safe. Fix the problem by adding a helper bfq_waker_bfqq() to check if bfqq->waker_bfqq is in the merge chain, and current procress is the only holder. Fixes: 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: use bfq_reassign_last_bfqq() in bfq_bfqq_move()Yu Kuai1-2/+2
Instead of open coding it, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: don't break merge chain in bfq_split_bfqq()Yu Kuai1-1/+1
Consider the following scenario: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \-------------\ \-------------\ \--------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 1 2 4 If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq() decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will break the merge chain. Expected result: caller will allocate a new bfqq for BIC1 Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) | | | \-------------\ \--------------\| V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 ref 0 0 1 3 Since the condition is only used for the last bfqq4 when the previous bfqq2 and bfqq3 are already splited. Fix the problem by checking if bfqq is the last one in the merge chain as well. Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()Yu Kuai1-2/+6
Consider the following merge chain: Process 1 Process 2 Process 3 Process 4 (BIC1) (BIC2) (BIC3) (BIC4) Λ | | | \--------------\ \-------------\ \-------------\| V V V bfqq1--------->bfqq2---------->bfqq3----------->bfqq4 IO from Process 1 will get bfqf2 from BIC1 first, then bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then handle this IO from bfqq3. However, the merge chain can be much deeper and bfqq3 can be merged to other bfqq as well. Fix this problem by iterating to the last bfqq in bfq_setup_cooperator(). Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-09-03block, bfq: fix possible UAF for bfqq->bic with merge chainYu Kuai1-1/+2
1) initial state, three tasks: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) | Λ | Λ | Λ | | | | | | V | V | V | bfqq1 bfqq2 bfqq3 process ref: 1 1 1 2) bfqq1 merged to bfqq2: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) | | | Λ \--------------\| | | V V | bfqq1--------->bfqq2 bfqq3 process ref: 0 2 1 3) bfqq2 merged to bfqq3: Process 1 Process 2 Process 3 (BIC1) (BIC2) (BIC3) here -> Λ | | \--------------\ \-------------\| V V bfqq1--------->bfqq2---------->bfqq3 process ref: 0 1 3 In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then get bfqq3 through merge chain, and finially handle IO by bfqq3. Howerver, current code will think bfqq2 is owned by BIC1, like initial state, and set bfqq2->bic to BIC1. bfq_insert_request -> by Process 1 bfqq = bfq_init_rq(rq) bfqq = bfq_get_bfqq_handle_split bfqq = bic_to_bfqq -> get bfqq2 from BIC1 bfqq->ref++ rq->elv.priv[0] = bic rq->elv.priv[1] = bfqq if (bfqq_process_refs(bfqq) == 1) bfqq->bic = bic -> record BIC1 to bfqq2 __bfq_insert_request new_bfqq = bfq_setup_cooperator -> get bfqq3 from bfqq2->new_bfqq bfqq_request_freed(bfqq) new_bfqq->ref++ rq->elv.priv[1] = new_bfqq -> handle IO by bfqq3 Fix the problem by checking bfqq is from merge chain fist. And this might fix a following problem reported by our syzkaller(unreproducible): ================================================================== BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595 CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Workqueue: kblockd blk_mq_requeue_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0x10d/0x610 mm/kasan/report.c:475 kasan_report+0x8e/0xc0 mm/kasan/report.c:588 bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline] bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline] bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889 bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757 bfq_init_rq block/bfq-iosched.c:6876 [inline] bfq_insert_request block/bfq-iosched.c:6254 [inline] bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304 blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593 blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 </TASK> Allocated by task 20776: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook mm/slab.h:763 [inline] slab_alloc_node mm/slub.c:3458 [inline] kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503 ioc_create_icq block/blk-ioc.c:370 [inline] ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436 bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812 blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403 __blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517 blk_mq_get_new_requests block/blk-mq.c:2940 [inline] blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042 __submit_bio+0x331/0x6f0 block/blk-core.c:624 __submit_bio_noacct_mq block/blk-core.c:703 [inline] submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732 submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826 xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958 xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619 xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Freed by task 946: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1815 [inline] slab_free_freelist_hook mm/slub.c:1841 [inline] slab_free mm/slub.c:3786 [inline] kmem_cache_free+0x118/0x6f0 mm/slub.c:3808 rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189 rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462 __do_softirq+0x1b0/0x7a2 kernel/softirq.c:553 Last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 Second to last potentially related work creation: kasan_save_stack+0x20/0x40 mm/kasan/common.c:45 __kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2712 [inline] call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826 ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105 ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700 worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781 kthread+0x33c/0x440 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305 The buggy address belongs to the object at ffff888123839d68 which belongs to the cache bfq_io_cq of size 1360 The buggy address is located 336 bytes inside of freed 1360-byte region [ffff888123839d68, ffff88812383a2b8) The buggy address belongs to the physical page: page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838 head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488 raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fixes: 36eca8948323 ("block, bfq: add Early Queue Merge (EQM)") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240902130329.3787024-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-16block: BFQ: Refactor bfq_exit_icq() to silence sparse warningJohn Garry1-18/+20
Currently building for C=1 generates the following warning: block/bfq-iosched.c:5498:9: warning: context imbalance in 'bfq_exit_icq' - different lock contexts for basic block Refactor bfq_exit_icq() into a core part which loops for the actuators, and only lock calling this routine when necessary. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240614090345.655716-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-05block: add blk_time_get_ns() and blk_time_get() helpersJens Axboe1-14/+14
Convert any user of ktime_get_ns() to use blk_time_get_ns(), and ktime_get() to blk_time_get(), so we have a unified API for querying the current time in nanoseconds or as ktime. No functional changes intended, this patch just wraps ktime_get_ns() and ktime_get() with a block helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-30Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds1-4/+4
Pull SCSI updates from James Bottomley: "Updates to the usual drivers (ufs, pm80xx, libata-scsi, smartpqi, lpfc, qla2xxx). We have a couple of major core changes impacting other systems: - Command Duration Limits, which spills into block and ATA - block level Persistent Reservation Operations, which touches block, nvme, target and dm Both of these are added with merge commits containing a cover letter explaining what's going on" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (187 commits) scsi: core: Improve warning message in scsi_device_block() scsi: core: Replace scsi_target_block() with scsi_block_targets() scsi: core: Don't wait for quiesce in scsi_device_block() scsi: core: Don't wait for quiesce in scsi_stop_queue() scsi: core: Merge scsi_internal_device_block() and device_block() scsi: sg: Increase number of devices scsi: bsg: Increase number of devices scsi: qla2xxx: Remove unused nvme_ls_waitq wait queue scsi: ufs: ufs-pci: Add support for Intel Arrow Lake scsi: sd: sd_zbc: Use PAGE_SECTORS_SHIFT scsi: ufs: wb: Add explicit flush_threshold sysfs attribute scsi: ufs: ufs-qcom: Switch to the new ICE API scsi: ufs: dt-bindings: qcom: Add ICE phandle scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_RTC quirk scsi: ufs: ufs-mediatek: Set UFSHCD_QUIRK_MCQ_BROKEN_INTR quirk scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_RTC scsi: ufs: core: Add host quirk UFSHCD_QUIRK_MCQ_BROKEN_INTR scsi: ufs: core: Remove dedicated hwq for dev command scsi: ufs: core: mcq: Fix the incorrect OCS value for the device command scsi: ufs: dt-bindings: samsung,exynos: Drop unneeded quotes ...
2023-05-22scsi: block: ioprio: Clean up interface definitionDamien Le Moal1-4/+4
The I/O priority user interface defines the 16-bits ioprio values as the combination of the upper 3-bits for an I/O priority class and the lower 13-bits as priority data. However, the kernel only uses the lower 3-bits of the priority data to define priority levels for the RT and BE priority classes. The data part of an ioprio value is completely ignored for the IDLE and NONE classes. This is enforced by checks done in ioprio_check_cap(), which is called for all paths that allow defining an I/O priority for I/Os: the per-context ioprio_set() system call, aio interface and io_uring interface. Clarify this fact in the uapi ioprio.h header file and introduce the IOPRIO_PRIO_LEVEL_MASK and IOPRIO_PRIO_LEVEL() macros for users to define and get priority levels in an ioprio value. The coarser macro IOPRIO_PRIO_DATA() is retained for backward compatibility with old applications already using it. There is no functional change introduced with this. In-kernel users of the IOPRIO_PRIO_DATA() macro which are explicitly handling I/O priority data as a priority level are modified to use the new IOPRIO_PRIO_LEVEL() macro without any functional change. Since f2fs is the only user of this macro not explicitly using that value as a priority level, it is left unchanged. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230511011356.227789-2-nks@flawful.org Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2023-05-19block: BFQ: Move an invariant checkBart Van Assche1-1/+1
Check bfqq->dispatched for each BFQ queue instead of checking it for an invalid bfqq pointer. Fixes: 3e49c1e4a615 ("block: BFQ: Add several invariant checks") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230519220347.3643295-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: BFQ: Add several invariant checksBart Van Assche1-0/+9
If anything goes wrong with the counters that track the number of requests, I/O locks up. Make such scenarios easier to debug by adding invariant checks for the request counters. Additionally, check that BFQ queues are empty before these are freed. Cc: Jan Kara <jack@suse.cz> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230516223853.1385255-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-16block, bfq: Fix division by zero error on zero wsumColin Ian King1-0/+2
When the weighted sum is zero the calculation of limit causes a division by zero error. Fix this by continuing to the next level. This was discovered by running as root: stress-ng --ioprio 0 Fixes divison by error oops: [ 521.450556] divide error: 0000 [#1] SMP NOPTI [ 521.450766] CPU: 2 PID: 2684464 Comm: stress-ng-iopri Not tainted 6.2.1-1280.native #1 [ 521.451117] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.1-0-g3208b098f51a-prebuilt.qemu.org 04/01/2014 [ 521.451627] RIP: 0010:bfqq_request_over_limit+0x207/0x400 [ 521.451875] Code: 01 48 8d 0c c8 74 0b 48 8b 82 98 00 00 00 48 8d 0c c8 8b 85 34 ff ff ff 48 89 ca 41 0f af 41 50 48 d1 ea 48 98 48 01 d0 31 d2 <48> f7 f1 41 39 41 48 89 85 34 ff ff ff 0f 8c 7b 01 00 00 49 8b 44 [ 521.452699] RSP: 0018:ffffb1af84eb3948 EFLAGS: 00010046 [ 521.452938] RAX: 000000000000003c RBX: 0000000000000000 RCX: 0000000000000000 [ 521.453262] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb1af84eb3978 [ 521.453584] RBP: ffffb1af84eb3a30 R08: 0000000000000001 R09: ffff8f88ab8a4ba0 [ 521.453905] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8f88ab8a4b18 [ 521.454224] R13: ffff8f8699093000 R14: 0000000000000001 R15: ffffb1af84eb3970 [ 521.454549] FS: 00005640b6b0b580(0000) GS:ffff8f88b3880000(0000) knlGS:0000000000000000 [ 521.454912] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 521.455170] CR2: 00007ffcbcae4e38 CR3: 00000002e46de001 CR4: 0000000000770ee0 [ 521.455491] PKRU: 55555554 [ 521.455619] Call Trace: [ 521.455736] <TASK> [ 521.455837] ? bfq_request_merge+0x3a/0xc0 [ 521.456027] ? elv_merge+0x115/0x140 [ 521.456191] bfq_limit_depth+0xc8/0x240 [ 521.456366] __blk_mq_alloc_requests+0x21a/0x2c0 [ 521.456577] blk_mq_submit_bio+0x23c/0x6c0 [ 521.456766] __submit_bio+0xb8/0x140 [ 521.457236] submit_bio_noacct_nocheck+0x212/0x300 [ 521.457748] submit_bio_noacct+0x1a6/0x580 [ 521.458220] submit_bio+0x43/0x80 [ 521.458660] ext4_io_submit+0x23/0x80 [ 521.459116] ext4_do_writepages+0x40a/0xd00 [ 521.459596] ext4_writepages+0x65/0x100 [ 521.460050] do_writepages+0xb7/0x1c0 [ 521.460492] __filemap_fdatawrite_range+0xa6/0x100 [ 521.460979] file_write_and_wait_range+0xbf/0x140 [ 521.461452] ext4_sync_file+0x105/0x340 [ 521.461882] __x64_sys_fsync+0x67/0x100 [ 521.462305] ? syscall_exit_to_user_mode+0x2c/0x1c0 [ 521.462768] do_syscall_64+0x3b/0xc0 [ 521.463165] entry_SYSCALL_64_after_hwframe+0x5a/0xc4 [ 521.463621] RIP: 0033:0x5640b6c56590 [ 521.464006] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 71 70 0e 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20230413133009.1605335-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: pass a flags argument to elevator_type->insert_requestsChristoph Hellwig1-8/+8
Instead of passing a bool at_head, pass down the full flags from the blk_mq_insert_request interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13blk-mq: remove blk-mq-tag.hChristoph Hellwig1-1/+0
blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-08block, bfq: fix uaf for 'stable_merge_bfqq'Yu Kuai1-9/+9
Before commit fd571df0ac5b ("block, bfq: turn bfqq_data into an array in bfq_io_cq"), process reference is read before bfq_put_stable_ref(), and it's safe if bfq_put_stable_ref() put the last reference, because process reference will be 0 and 'stable_merge_bfqq' won't be accessed in this case. However, the commit changed the order and will cause uaf for 'stable_merge_bfqq'. In order to emphasize that bfq_put_stable_ref() can drop the last reference, fix the problem by moving bfq_put_stable_ref() to the end of bfq_setup_stable_merge(). Fixes: fd571df0ac5b ("block, bfq: turn bfqq_data into an array in bfq_io_cq") Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/linux-block/20230307071448.rzihxbm4jhbf5krj@shindev/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-20Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linuxLinus Torvalds1-210/+415
Pull block updates from Jens Axboe: - NVMe updates via Christoph: - Small improvements to the logging functionality (Amit Engel) - Authentication cleanups (Hannes Reinecke) - Cleanup and optimize the DMA mapping cod in the PCIe driver (Keith Busch) - Work around the command effects for Format NVM (Keith Busch) - Misc cleanups (Keith Busch, Christoph Hellwig) - Fix and cleanup freeing single sgl (Keith Busch) - MD updates via Song: - Fix a rare crash during the takeover process - Don't update recovery_cp when curr_resync is ACTIVE - Free writes_pending in md_stop - Change active_io to percpu - Updates to drbd, inching us closer to unifying the out-of-tree driver with the in-tree one (Andreas, Christoph, Lars, Robert) - BFQ update adding support for multi-actuator drives (Paolo, Federico, Davide) - Make brd compliant with REQ_NOWAIT (me) - Fix for IOPOLL and queue entering, fixing stalled IO waiting on timeouts (me) - Fix for REQ_NOWAIT with multiple bios (me) - Fix memory leak in blktrace cleanup (Greg) - Clean up sbitmap and fix a potential hang (Kemeng) - Clean up some bits in BFQ, and fix a bug in the request injection (Kemeng) - Clean up the request allocation and issue code, and fix some bugs related to that (Kemeng) - ublk updates and fixes: - Add support for unprivileged ublk (Ming) - Improve device deletion handling (Ming) - Misc (Liu, Ziyang) - s390 dasd fixes (Alexander, Qiheng) - Improve utility of request caching and fixes (Anuj, Xiao) - zoned cleanups (Pankaj) - More constification for kobjs (Thomas) - blk-iocost cleanups (Yu) - Remove bio splitting from drivers that don't need it (Christoph) - Switch blk-cgroups to use struct gendisk. Some of this is now incomplete as select late reverts were done. (Christoph) - Add bvec initialization helpers, and convert callers to use that rather than open-coding it (Christoph) - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin, Matthew, Ulf, Zhong) * tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits) brd: use radix_tree_maybe_preload instead of radix_tree_preload block: use proper return value from bio_failfast() block: bio-integrity: Copy flags when bio_integrity_payload is cloned block: Fix io statistics for cgroup in throttle path brd: mark as nowait compatible brd: check for REQ_NOWAIT and set correct page allocation mask brd: return 0/-error from brd_insert_page() block: sync mixed merged request's failfast with 1st bio's Revert "blk-cgroup: pin the gendisk in struct blkcg_gq" Revert "blk-cgroup: pass a gendisk to blkg_lookup" Revert "blk-cgroup: delay blk-cgroup initialization until add_disk" Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release" Revert "blk-cgroup: move the cgroup information to struct gendisk" nvme-pci: remove iod use_sgls nvme-pci: fix freeing single sgl block: ublk: check IO buffer based on flag need_get_data s390/dasd: Fix potential memleak in dasd_eckd_init() s390/dasd: sort out physical vs virtual pointers usage block: Remove the ALLOC_CACHE_SLACK constant block: make kobj_type structures constant ...
2023-02-03blk-cgroup: pass a gendisk to blkcg_{de,}activate_policyChristoph Hellwig1-1/+1
Prepare for storing the blkcg information in the gendisk instead of the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-03blk-wbt: pass a gendisk to wbt_{enable,disable}_defaultChristoph Hellwig1-2/+2
Pass a gendisk to wbt_enable_default and wbt_disable_default to prepare for phasing out usage of the request_queue in the blk-cgroup code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Herrmann <aherrmann@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20230203150400.3199230-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: remove unused bfq_wr_max_time in struct bfq_dataKemeng Shi1-4/+0
bfqd->bfq_wr_max_time is set to 0 in bfq_init_queue and is never changed. It is only used in bfq_wr_duration when bfq_wr_max_time > 0 which never meets, so bfqd->bfq_wr_max_time is not used actually. Just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-9-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: remove unnecessary goto tag in bfq_dispatch_rq_from_bfqqKemeng Shi1-6/+3
We jump to tag only for returning current rq. Return directly to remove this tag. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230116095153.3810101-8-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: remove redundant check in bfq_put_cooperatorKemeng Shi1-2/+0
We have already avoided a circular list in bfq_setup_merge (see comments in bfq_setup_merge() for details), so bfq_queue will not appear in it's new_bfqq list. Just remove this check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-7-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: remove unnecessary dereference to get async_bfqqKemeng Shi1-1/+1
The async_bfqq is assigned with bfqq->bic->bfqq[0], use it directly. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-6-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: use helper macro RQ_BFQQ to get bfqq of requestKemeng Shi1-3/+3
Use helper macro RQ_BFQQ to get bfqq of request. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-5-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: initialize bfqq->decrease_time_jif correctlyKemeng Shi1-0/+2
Inject limit is updated or reset when time_is_before_eq_jiffies( decrease_time_jif + several msecs) or think-time state changes. decrease_time_jif is initialized to 0 and will be set to current jiffies when inject limit is updated or reset. If the jiffies is slightly greater than LONG_MAX, time_is_after_eq_jiffies(0) will keep for a long time, so as time_is_after_eq_jiffies(decrease_time_jif + several msecs). If the think-time state never chages, then the injection will not work as expected for long time. To be more specific: Function bfq_update_inject_limit maybe triggered when jiffies pasts decrease_time_jif + msecs_to_jiffies(10) in bfq_add_request by setting bfqd->wait_dispatch to true. Function bfq_reset_inject_limit are called in two conditions: 1. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(1000) in function bfq_add_request. 2. jiffies pasts bfqq->decrease_time_jif + msecs_to_jiffies(100) or bfq think-time state change from short to long. Fix this by initializing bfqq->decrease_time_jif to current jiffies to trigger service injection soon when service injection conditions are met. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-4-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: remove unsed parameter reason in bfq_bfqq_is_slowKemeng Shi1-3/+2
Parameter reason is never used, just remove it. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-3-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: correctly raise inject limit in bfq_choose_bfqq_for_injectionKemeng Shi1-6/+4
Function bfq_choose_bfqq_for_injection may temporarily raise inject limit to one request if current inject_limit is 0 before search of the source queue for injection. However the search below will reset inject limit to bfqd->in_service_queue which is zero for raised inject limit. Then the temporarily raised inject limit never works as expected. Assigment limit to bfqd->in_service_queue in search is needed as limit maybe overwriten to min_t(unsigned int, 1, limit) for condition that a large in-flight request is on non-rotational devices in found queue. So we need to reset limit to bfqd->in_service_queue for normal case. Actually, we have already make sure bfqd->rq_in_driver is < limit before search, then -Limit is >= 1 as bfqd->rq_in_driver is >= 0. Then min_t(unsigned int, 1, limit) is always 1. So we can simply check bfqd->rq_in_driver with 1 instead of result of min_t(unsigned int, 1, limit) for larget request in non-rotational device case to avoid overwritting limit and the bug is gone. -For normal case, we have already check bfqd->rq_in_driver is < limit, so we can return found bfqq unconditionally to remove unncessary check. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230116095153.3810101-2-shikemeng@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: fix uaf for bfqq in bic_set_bfqq()Yu Kuai1-1/+3
After commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'"), bic->bfqq will be accessed in bic_set_bfqq(), however, in some context bic->bfqq will be freed, and bic_set_bfqq() is called with the freed bic->bfqq. Fix the problem by always freeing bfqq after bic_set_bfqq(). Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-and-tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230130014136.591038-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: balance I/O injection among underutilized actuatorsDavide Zini1-5/+13
Upon the invocation of its dispatch function, BFQ returns the next I/O request of the in-service bfq_queue, unless some exception holds. One such exception is that there is some underutilized actuator, different from the actuator for which the in-service queue contains I/O, and that some other bfq_queue happens to contain I/O for such an actuator. In this case, the next I/O request of the latter bfq_queue, and not of the in-service bfq_queue, is returned (I/O is injected from that bfq_queue). To find such an actuator, a linear scan, in increasing index order, is performed among actuators. Performing a linear scan entails a prioritization among actuators: an underutilized actuator may be considered for injection only if all actuators with a lower index are currently fully utilized, or if there is no pending I/O for any lower-index actuator that happens to be underutilized. This commits breaks this prioritization and tends to distribute injection uniformly across actuators. This is obtained by adding the following condition to the linear scan: even if an actuator A is underutilized, A is however skipped if its load is higher than that of the next actuator. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-9-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: inject I/O to underutilized actuatorsDavide Zini1-35/+101
The main service scheme of BFQ for sync I/O is serving one sync bfq_queue at a time, for a while. In particular, BFQ enforces this scheme when it deems the latter necessary to boost throughput or to preserve service guarantees. Unfortunately, when BFQ enforces this policy, only one actuator at a time gets served for a while, because each bfq_queue contains I/O only for one actuator. The other actuators may remain underutilized. Actually, BFQ may serve (inject) extra I/O, taken from other bfq_queues, in parallel with that of the in-service queue. This injection mechanism may provide the ground for dealing also with the above actuator-underutilization problem. Yet BFQ does not take the actuator load into account when choosing which queue to pick extra I/O from. In addition, BFQ may happen to inject extra I/O only when the in-service queue is temporarily empty. In view of these facts, this commit extends the injection mechanism in such a way that the latter: (1) takes into account also the actuator load; (2) checks such a load on each dispatch, and injects I/O for an underutilized actuator, if there is one and there is I/O for it. To perform the check in (2), this commit introduces a load threshold, currently set to 4. A linear scan of each actuator is performed, until an actuator is found for which the following two conditions hold: the load of the actuator is below the threshold, and there is at least one non-in-service queue that contains I/O for that actuator. If such a pair (actuator, queue) is found, then the head request of that queue is returned for dispatch, instead of the head request of the in-service queue. We have set the threshold, empirically, to the minimum possible value for which an actuator is fully utilized, or close to be fully utilized. By doing so, injected I/O 'steals' as few drive-queue slots as possibile to the in-service queue. This reduces as much as possible the probability that the service of I/O from the in-service bfq_queue gets delayed because of slot exhaustion, i.e., because all the slots of the drive queue are filled with I/O injected from other queues (NCQ provides for 32 slots). This new mechanism also counters actuator underutilization in the case of asymmetric configurations of bfq_queues. Namely if there are few bfq_queues containing I/O for some actuators and many bfq_queues containing I/O for other actuators. Or if the bfq_queues containing I/O for some actuators have lower weights than the other bfq_queues. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: retrieve independent access ranges from request queueFederico Gavioli1-8/+51
This patch implements the code to gather the content of the independent_access_ranges structure from the request_queue and copy it into the queue's bfq_data. This copy is done at queue initialization. We copy the access ranges into the bfq_data to avoid taking the queue lock each time we access the ranges. This implementation, however, puts a limit to the maximum independent ranges supported by the scheduler. Such a limit is equal to the constant BFQ_MAX_ACTUATORS. This limit was placed to avoid the allocation of dynamic memory. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Co-developed-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Rory Chen <rory.c.chen@seagate.com> Signed-off-by: Federico Gavioli <f.gavioli97@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: split also async bfq_queues on a per-actuator basisDavide Zini1-17/+22
Similarly to sync bfq_queues, also async bfq_queues need to be split on a per-actuator basis. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Davide Zini <davidezini2@gmail.com> Link: https://lore.kernel.org/r/20230103145503.71712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: turn bfqq_data into an array in bfq_io_cqPaolo Valente1-41/+59
When a bfq_queue Q is merged with another queue, several pieces of information are saved about Q. These pieces are stored in the bfqq_data field in the bfq_io_cq data structure of the process associated with Q. Yet, with a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. Each of these queues may undergo a merge. So, the bfq_io_cq data structure must be able to accommodate the above information for N queues. This commit solves this problem by turning the bfqq_data scalar field into an array of N elements (and by changing code so as to handle this array). This solution is written under the assumption that bfq_queues associated with different actuators cannot be cross-merged. This assumption holds naturally with basic queue merging: the latter is triggered by spatial locality, and sectors for different actuators are not close to each other (apart from the corner case of the last sectors served by a given actuator and the first sectors served by the next actuator). As for stable cross-merging, the assumption here is that it is disabled. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: move io_cq-persistent bfqq data into a dedicated structPaolo Valente1-58/+78
With a multi-actuator drive, a process may get associated with multiple bfq_queues: one queue for each of the N actuators. So, the bfq_io_cq data structure must be able to accommodate its per-queue persistent information for N queues. Currently it stores this information for just one queue, in several scalar fields. This is a preparatory commit for moving to accommodating persistent information for N queues. In particular, this commit packs all the above scalar fields into a single data structure. Then there is now only one field, in bfq_io_cq, that stores all the above information. This scalar field will then be turned into an array by a following commit. Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gianmarco Lusvardi <glusvardi@posteo.net> Signed-off-by: Giulio Barabino <giuliobarabino99@gmail.com> Signed-off-by: Emiliano Maccaferri <inbox@emilianomaccaferri.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: forbid stable merging of queues associated with different actuatorsPaolo Valente1-4/+9
If queues associated with different actuators are merged, then control is lost on each actuator. Therefore some actuator may be underutilized, and throughput may decrease. This problem cannot occur with basic queue merging, because the latter is triggered by spatial locality, and sectors for different actuators are not close to each other. Yet it may happen with stable merging. To address this issue, this commit prevents stable merging from occurring among queues associated with different actuators. Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-01-29block, bfq: split sync bfq_queues on a per-actuator basisPaolo Valente1-55/+105
Single-LUN multi-actuator SCSI drives, as well as all multi-actuator SATA drives appear as a single device to the I/O subsystem [1]. Yet they address commands to different actuators internally, as a function of Logical Block Addressing (LBAs). A given sector is reachable by only one of the actuators. For example, Seagate’s Serial Advanced Technology Attachment (SATA) version contains two actuators and maps the lower half of the SATA LBA space to the lower actuator and the upper half to the upper actuator. Evidently, to fully utilize actuators, no actuator must be left idle or underutilized while there is pending I/O for it. The block layer must somehow control the load of each actuator individually. This commit lays the ground for allowing BFQ to provide such a per-actuator control. BFQ associates an I/O-request sync bfq_queue with each process doing synchronous I/O, or with a group of processes, in case of queue merging. Then BFQ serves one bfq_queue at a time. While in service, a bfq_queue is emptied in request-position order. Yet the same process, or group of processes, may generate I/O for different actuators. In this case, different streams of I/O (each for a different actuator) get all inserted into the same sync bfq_queue. So there is basically no individual control on when each stream is served, i.e., on when the I/O requests of the stream are picked from the bfq_queue and dispatched to the drive. This commit enables BFQ to control the service of each actuator individually for synchronous I/O, by simply splitting each sync bfq_queue into N queues, one for each actuator. In other words, a sync bfq_queue is now associated to a pair (process, actuator). As a consequence of this split, the per-queue proportional-share policy implemented by BFQ will guarantee that the sync I/O generated for each actuator, by each process, receives its fair share of service. This is just a preparatory patch. If the I/O of the same process happens to be sent to different queues, then each of these queues may undergo queue merging. To handle this event, the bfq_io_cq data structure must be properly extended. In addition, stable merging must be disabled to avoid loss of control on individual actuators. Finally, also async queues must be split. These issues are described in detail and addressed in next commits. As for this commit, although multiple per-process bfq_queues are provided, the I/O of each process or group of processes is still sent to only one queue, regardless of the actuator the I/O is for. The forwarding to distinct bfq_queues will be enabled after addressing the above issues. [1] https://www.linaro.org/blog/budget-fair-queueing-bfq-linux-io-scheduler-optimizations-for-multi-actuator-sata-hard-drives/ Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Gabriele Felici <felicigb@gmail.com> Signed-off-by: Carmine Zaccagnino <carmine@carminezacc.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20230103145503.71712-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-26block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqqYu Kuai1-1/+1
Commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq() can free bfqq first, and then call bic_set_bfqq(), which will cause uaf. Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq(). Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-15block, bfq: only do counting of pending-request for BFQ_GROUP_IOSCHEDYuwei Guan1-0/+2
The 'bfqd->num_groups_with_pending_reqs' is used when CONFIG_BFQ_GROUP_IOSCHED is enabled, so let the variables and processes take effect when CONFIG_BFQ_GROUP_IOSCHED is enabled. Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Yuwei Guan <Yuwei.Guan@zeekrlife.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20221110112622.389332-1-Yuwei.Guan@zeekrlife.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block, bfq: replace 0/1 with false/true in bic apisYu Kuai1-2/+2
Just to make the code a litter cleaner, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214033155.3455754-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-14block, bfq: fix possible uaf for 'bfqq->bic'Yu Kuai1-1/+6
Our test report a uaf for 'bfqq->bic' in 5.10: ================================================================== BUG: KASAN: use-after-free in bfq_select_queue+0x378/0xa30 CPU: 6 PID: 2318352 Comm: fsstress Kdump: loaded Not tainted 5.10.0-60.18.0.50.h602.kasan.eulerosv2r11.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220320_160524-szxrtosci10000 04/01/2014 Call Trace: bfq_select_queue+0x378/0xa30 bfq_dispatch_request+0xe8/0x130 blk_mq_do_dispatch_sched+0x62/0xb0 __blk_mq_sched_dispatch_requests+0x215/0x2a0 blk_mq_sched_dispatch_requests+0x8f/0xd0 __blk_mq_run_hw_queue+0x98/0x180 __blk_mq_delay_run_hw_queue+0x22b/0x240 blk_mq_run_hw_queue+0xe3/0x190 blk_mq_sched_insert_requests+0x107/0x200 blk_mq_flush_plug_list+0x26e/0x3c0 blk_finish_plug+0x63/0x90 __iomap_dio_rw+0x7b5/0x910 iomap_dio_rw+0x36/0x80 ext4_dio_read_iter+0x146/0x190 [ext4] ext4_file_read_iter+0x1e2/0x230 [ext4] new_sync_read+0x29f/0x400 vfs_read+0x24e/0x2d0 ksys_read+0xd5/0x1b0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x61/0xc6 Commit 3bc5e683c67d ("bfq: Split shared queues on move between cgroups") changes that move process to a new cgroup will allocate a new bfqq to use, however, the old bfqq and new bfqq can point to the same bic: 1) Initial state, two process with io in the same cgroup. Process 1 Process 2 (BIC1) (BIC2) | Λ | Λ | | | | V | V | bfqq1 bfqq2 2) bfqq1 is merged to bfqq2. Process 1 Process 2 (BIC1) (BIC2) | | \-------------\| V bfqq1 bfqq2(coop) 3) Process 1 exit, then issue new io(denoce IOA) from Process 2. (BIC2) | Λ | | V | bfqq2(coop) 4) Before IOA is completed, move Process 2 to another cgroup and issue io. Process 2 (BIC2) Λ |\--------------\ | V bfqq2 bfqq3 Now that BIC2 points to bfqq3, while bfqq2 and bfqq3 both point to BIC2. If all the requests are completed, and Process 2 exit, BIC2 will be freed while there is no guarantee that bfqq2 will be freed before BIC2. Fix the problem by clearing bfqq->bic while bfqq is detached from bic. Fixes: 3bc5e683c67d ("bfq: Split shared queues on move between cgroups") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221214030430.3304151-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-09bfq: ignore oom_bfqq in bfq_check_wakerKhazhismel Kumykov1-1/+3
oom_bfqq is just a fallback bfqq, so shouldn't be used with waker detection. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-2-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-09bfq: fix waker_bfqq inconsistency crashKhazhismel Kumykov1-2/+7
This fixes crashes in bfq_add_bfqq_busy due to waker_bfqq being NULL, but woken_list_node still being hashed. This would happen when bfq_init_rq() expects a brand new allocated queue to be returned from bfq_get_bfqq_handle_split() and unconditionally updates waker_bfqq without resetting woken_list_node. Since we can always return oom_bfqq when attempting to allocate, we cannot assume waker_bfqq starts as NULL. Avoid setting woken_bfqq for oom_bfqq entirely, as it's not useful. Crashes would have a stacktrace like: [160595.656560] bfq_add_bfqq_busy+0x110/0x1ec [160595.661142] bfq_add_request+0x6bc/0x980 [160595.666602] bfq_insert_request+0x8ec/0x1240 [160595.671762] bfq_insert_requests+0x58/0x9c [160595.676420] blk_mq_sched_insert_request+0x11c/0x198 [160595.682107] blk_mq_submit_bio+0x270/0x62c [160595.686759] __submit_bio_noacct_mq+0xec/0x178 [160595.691926] submit_bio+0x120/0x184 [160595.695990] ext4_mpage_readpages+0x77c/0x7c8 [160595.701026] ext4_readpage+0x60/0xb0 [160595.705158] filemap_read_page+0x54/0x114 [160595.711961] filemap_fault+0x228/0x5f4 [160595.716272] do_read_fault+0xe0/0x1f0 [160595.720487] do_fault+0x40/0x1c8 Tested by injecting random failures into bfq_get_queue, crashes go away completely. Fixes: 8ef3fc3a043c ("block, bfq: make shared queues inherit wakers") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20221108181030.1611703-1-khazhy@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: remove dead code for updating 'rq_in_driver'Yu Kuai1-16/+0
Such code are not even compiled since they are inside marco "#if 0". Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@unimore.it> Link: https://lore.kernel.org/r/20221102022542.3621219-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: cleanup __bfq_weights_tree_remove()Yu Kuai1-10/+1
It's the same with bfq_weights_tree_remove() now. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: cleanup bfq_weights_tree add/remove apisYu Kuai1-10/+9
The 'bfq_data' and 'rb_root_cached' can both be accessed through 'bfq_queue', thus only pass 'bfq_queue' as parameter. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-6-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: do not idle if only one group is activatedYu Kuai1-2/+2
Now that root group is counted into 'num_groups_with_pending_reqs', 'num_groups_with_pending_reqs > 0' is always true in bfq_asymmetric_scenario(). Thus change the condition to '> 1'. On the other hand, this change can enable concurrent sync io if only one group is activated. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-5-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: refactor the counting of 'num_groups_with_pending_reqs'Yu Kuai1-42/+0
Currently, bfq can't handle sync io concurrently as long as they are not issued from root group. This is because 'bfqd->num_groups_with_pending_reqs > 0' is always true in bfq_asymmetric_scenario(). The way that bfqg is counted into 'num_groups_with_pending_reqs': Before this patch: 1) root group will never be counted. 2) Count if bfqg or it's child bfqgs have pending requests. 3) Don't count if bfqg and it's child bfqgs complete all the requests. After this patch: 1) root group is counted. 2) Count if bfqg have pending requests. 3) Don't count if bfqg complete all the requests. With this change, the occasion that only one group is activated can be detected, and next patch will support concurrent sync io in the occasion. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-01block, bfq: support to track if bfqq has pending requestsYu Kuai1-0/+1
If entity belongs to bfqq, then entity->in_groups_with_pending_reqs is not used currently. This patch use it to track if bfqq has pending requests through callers of weights_tree insertion and removal. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220916071942.214222-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-23blk-wbt: don't enable throttling if default elevator is bfqYu Kuai1-0/+2
Commit b5dc5d4d1f4f ("block,bfq: Disable writeback throttling") tries to disable wbt for bfq, it's done by calling wbt_disable_default() in bfq_init_queue(). However, wbt is still enabled if default elevator is bfq: device_add_disk elevator_init_mq bfq_init_queue wbt_disable_default -> done nothing blk_register_queue wbt_enable_default -> wbt is enabled Fix the problem by adding a new flag ELEVATOR_FLAG_DISBALE_WBT, bfq will set the flag in bfq_init_queue, and following wbt_enable_default() won't enable wbt while the flag is set. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20221019121518.3865235-7-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22block, bfq: remove useless parameter for bfq_add/del_bfqq_busy()Yu Kuai1-4/+4
'bfqd' can be accessed through 'bfqq->bfqd', there is no need to pass it as a parameter separately. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220816015631.1323948-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22block, bfq: remove useless checking in bfq_put_queue()Yu Kuai1-4/+2
'bfqq->bfqd' is ensured to set in bfq_init_queue(), and it will never change afterwards. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220816015631.1323948-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block/bfq: Use the new blk_opf_t typeBart Van Assche1-8/+8
Use the new blk_opf_t type for arguments and variables that represent request flags or a bitwise combination of a request operation and request flags. Rename those variables from 'op' into 'opf'. This patch does not change any functionality. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16block/bfq: Enable I/O statisticsBart Van Assche1-0/+3
BFQ uses io_start_time_ns. That member variable is only set if I/O statistics are enabled. Hence this patch that enables I/O statistics at the time BFQ is associated with a request queue. Compile-tested only. Reported-by: Cixi Geng <cixi.geng1@unisoc.com> Cc: Cixi Geng <cixi.geng1@unisoc.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Paolo Valente <paolo.valente@unimore.it> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-16blk-mq: avoid to touch q->elevator without any protectionMing Lei1-0/+3
q->elevator is referred in blk_mq_has_sqsched() without any protection, no .q_usage_counter is held, no queue srcu and rcu read lock is held, so potential use-after-free may be triggered. Fix the issue by adding one queue flag for checking if the elevator uses single queue style dispatch. Meantime the elevator feature flag of ELEVATOR_F_MQ_AWARE isn't needed any more. Cc: Jan Kara <jack@suse.cz> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220616014401.817001-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-23Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds1-42/+53
Pull block updates from Jens Axboe: "Here are the core block changes for 5.19. This contains: - blk-throttle accounting fix (Laibin) - Series removing redundant assignments (Michal) - Expose bio cache via the bio_set, so that DM can use it (Mike) - Finish off the bio allocation interface cleanups by dealing with the weirdest member of the family. bio_kmalloc combines a kmalloc for the bio and bio_vecs with a hidden bio_init call and magic cleanup semantics (Christoph) - Clean up the block layer API so that APIs consumed by file systems are (almost) only struct block_device based, so that file systems don't have to poke into block layer internals like the request_queue (Christoph) - Clean up the blk_execute_rq* API (Christoph) - Clean up various lose end in the blk-cgroup code to make it easier to follow in preparation of reworking the blkcg assignment for bios (Christoph) - Fix use-after-free issues in BFQ when processes with merged queues get moved to different cgroups (Jan) - BFQ fixes (Jan) - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming, Wolfgang, me)" * tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits) blk-mq: fix typo in comment bfq: Remove bfq_requeue_request_body() bfq: Remove superfluous conversion from RQ_BIC() bfq: Allow current waker to defend against a tentative one bfq: Relax waker detection for shared queues blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE() blk-throttle: Set BIO_THROTTLED when bio has been throttled blk-cgroup: Remove unnecessary rcu_read_lock/unlock() blk-cgroup: always terminate io.stat lines block, bfq: make bfq_has_work() more accurate block, bfq: protect 'bfqd->queued' by 'bfqd->lock' block: cleanup the VM accounting in submit_bio block: Fix the bio.bi_opf comment block: reorder the REQ_ flags blk-iocost: combine local_stat and desc_stat to stat block: improve the error message from bio_check_eod block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone block: remove superfluous calls to blkcg_bio_issue_init kthread: unexport kthread_blkcg blk-cgroup: cleanup blkcg_maybe_throttle_current ...
2022-05-19bfq: Remove bfq_requeue_request_body()Jan Kara1-7/+2
The function has only a single caller and two lines. Just remove it since it is pointless and just harming readability. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-4-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19bfq: Remove superfluous conversion from RQ_BIC()Jan Kara1-1/+1
We store struct bfq_io_cq pointer in rq->elv.priv[0] in bfq_init_rq(). Thus a call to icq_to_bic() in RQ_BIC() is wrong. Luckily it does no harm currently because struct io_iq is the first one in struct bfq_io_cq. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19bfq: Allow current waker to defend against a tentative oneJan Kara1-2/+1
The code in bfq_check_waker() ignores wake up events from the current waker. This makes it more likely we select a new tentative waker although the current one is generating more wake up events. Treat current waker the same way as any other process and allow it to reset the waker detection logic. Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-19bfq: Relax waker detection for shared queuesJan Kara1-2/+3
Currently we look for waker only if current queue has no requests. This makes sense for bfq queues with a single process however for shared queues when there is a larger number of processes the condition that queue has no requests is difficult to meet because often at least one process has some request in flight although all the others are waiting for the waker to do the work and this harms throughput. Relax the "no queued request for bfq queue" condition to "the current task has no queued requests yet". For this, we also need to start tracking number of requests in flight for each task. This patch (together with the following one) restores the performance for dbench with 128 clients that regressed with commit c65e6fd460b4 ("bfq: Do not let waker requests skip proper accounting") because this commit makes requests of wakers properly enter BFQ queues and thus these queues become ineligible for the old waker detection logic. Dbench results: Vanilla 5.18-rc3 5.18-rc3 + revert 5.18-rc3 patched Mean 1237.36 ( 0.00%) 950.16 * 23.21%* 988.35 * 20.12%* Numbers are time to complete workload so lower is better. Fixes: c65e6fd460b4 ("bfq: Do not let waker requests skip proper accounting") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16block, bfq: make bfq_has_work() more accurateYu Kuai1-4/+12
bfq_has_work() is using busy_queues currently, which is not accurate because bfq_queue is busy doesn't represent that it has requests. Since bfqd aready has a counter 'queued' to record how many requests are in bfq, use it instead of busy_queues. Noted that bfq_has_work() can be called with 'bfqd->lock' held, thus the lock can't be held in bfq_has_work() to protect 'bfqd->queued'. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220513023507.2625717-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-16block, bfq: protect 'bfqd->queued' by 'bfqd->lock'Yu Kuai1-1/+3
If bfq_schedule_dispatch() is called from bfq_idle_slice_timer_body(), then 'bfqd->queued' is read without holding 'bfqd->lock'. This is wrong since it can be wrote concurrently. Fix the problem by holding 'bfqd->lock' in such case. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220513023507.2625717-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-29bfq: Fix warning in bfqq_request_over_limit()Jan Kara1-3/+9
People are occasionally reporting a warning bfqq_request_over_limit() triggering reporting that BFQ's idea of cgroup hierarchy (and its depth) does not match what generic blkcg code thinks. This can actually happen when bfqq gets moved between BFQ groups while bfqq_request_over_limit() is running. Make sure the code is safe against BFQ queue being moved to a different BFQ group. Fixes: 76f1df88bbc2 ("bfq: Limit number of requests consumed by each cgroup") CC: stable@vger.kernel.org Link: https://lore.kernel.org/all/CAJCQCtTw_2C7ZSz7as5Gvq=OmnDiio=HRkQekqWpKot84sQhFA@mail.gmail.com/ Reported-by: Chris Murphy <lists@colorremedies.com> Reported-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220407140738.9723-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17bfq: Get rid of __bio_blkcg() usageJan Kara1-10/+1
BFQ usage of __bio_blkcg() is a relict from the past. Furthermore if bio would not be associated with any blkcg, the usage of __bio_blkcg() in BFQ is prone to races with the task being migrated between cgroups as __bio_blkcg() calls at different places could return different blkcgs. Convert BFQ to the new situation where bio->bi_blkg is initialized in bio_set_dev() and thus practically always valid. This allows us to save blkcg_gq lookup and noticeably simplify the code. CC: stable@vger.kernel.org Fixes: 0fe061b9f03c ("blkcg: fix ref count issue with bio_blkcg() using task_css") Tested-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220401102752.8599-8-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17bfq: Remove pointless bfq_init_rq() callsJan Kara1-6/+6
We call bfq_init_rq() from request merging functions where requests we get should have already gone through bfq_init_rq() during insert and anyway we want to do anything only if the request is already tracked by BFQ. So replace calls to bfq_init_rq() with RQ_BFQQ() instead to simply skip requests untracked by BFQ. We move bfq_init_rq() call in bfq_insert_request() a bit earlier to cover request merging and thus can transfer FIFO position in case of a merge. CC: stable@vger.kernel.org Tested-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220401102752.8599-6-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17bfq: Drop pointless unlock-lock pairJan Kara1-3/+0
In bfq_insert_request() we unlock bfqd->lock only to call trace_block_rq_insert() and then lock bfqd->lock again. This is really pointless since tracing is disabled if we really care about performance and even if the tracepoint is enabled, it is a quick call. CC: stable@vger.kernel.org Tested-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220401102752.8599-5-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17bfq: Update cgroup information before merging bioJan Kara1-2/+9
When the process is migrated to a different cgroup (or in case of writeback just starts submitting bios associated with a different cgroup) bfq_merge_bio() can operate with stale cgroup information in bic. Thus the bio can be merged to a request from a different cgroup or it can result in merging of bfqqs for different cgroups or bfqqs of already dead cgroups and causing possible use-after-free issues. Fix the problem by updating cgroup information in bfq_merge_bio(). CC: stable@vger.kernel.org Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support") Tested-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220401102752.8599-4-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17bfq: Split shared queues on move between cgroupsJan Kara1-1/+1
When bfqq is shared by multiple processes it can happen that one of the processes gets moved to a different cgroup (or just starts submitting IO for different cgroup). In case that happens we need to split the merged bfqq as otherwise we will have IO for multiple cgroups in one bfqq and we will just account IO time to wrong entities etc. Similarly if the bfqq is scheduled to merge with another bfqq but the merge didn't happen yet, cancel the merge as it need not be valid anymore. CC: stable@vger.kernel.org Fixes: e21b7a0b9887 ("block, bfq: add full hierarchical scheduling and cgroups support") Tested-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220401102752.8599-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17bfq: Avoid merging queues with different parentsJan Kara1-0/+8
It can happen that the parent of a bfqq changes between the moment we decide two queues are worth to merge (and set bic->stable_merge_bfqq) and the moment bfq_setup_merge() is called. This can happen e.g. because the process submitted IO for a different cgroup and thus bfqq got reparented. It can even happen that the bfqq we are merging with has parent cgroup that is already offline and going to be destroyed in which case the merge can lead to use-after-free issues such as: BUG: KASAN: use-after-free in __bfq_deactivate_entity+0x9cb/0xa50 Read of size 8 at addr ffff88800693c0c0 by task runc:[2:INIT]/10544 CPU: 0 PID: 10544 Comm: runc:[2:INIT] Tainted: G E 5.15.2-0.g5fb85fd-default #1 openSUSE Tumbleweed (unreleased) f1f3b891c72369aebecd2e43e4641a6358867c70 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0x46/0x5a print_address_description.constprop.0+0x1f/0x140 ? __bfq_deactivate_entity+0x9cb/0xa50 kasan_report.cold+0x7f/0x11b ? __bfq_deactivate_entity+0x9cb/0xa50 __bfq_deactivate_entity+0x9cb/0xa50 ? update_curr+0x32f/0x5d0 bfq_deactivate_entity+0xa0/0x1d0 bfq_del_bfqq_busy+0x28a/0x420 ? resched_curr+0x116/0x1d0 ? bfq_requeue_bfqq+0x70/0x70 ? check_preempt_wakeup+0x52b/0xbc0 __bfq_bfqq_expire+0x1a2/0x270 bfq_bfqq_expire+0xd16/0x2160 ? try_to_wake_up+0x4ee/0x1260 ? bfq_end_wr_async_queues+0xe0/0xe0 ? _raw_write_unlock_bh+0x60/0x60 ? _raw_spin_lock_irq+0x81/0xe0 bfq_idle_slice_timer+0x109/0x280 ? bfq_dispatch_request+0x4870/0x4870 __hrtimer_run_queues+0x37d/0x700 ? enqueue_hrtimer+0x1b0/0x1b0 ? kvm_clock_get_cycles+0xd/0x10 ? ktime_get_update_offsets_now+0x6f/0x280 hrtimer_interrupt+0x2c8/0x740 Fix the problem by checking that the parent of the two bfqqs we are merging in bfq_setup_merge() is the same. Link: https://lore.kernel.org/linux-block/20211125172809.GC19572@quack2.suse.cz/ CC: stable@vger.kernel.org Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Tested-by: "yukuai (C)" <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220401102752.8599-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17bfq: Avoid false marking of bic as stably mergedJan Kara1-3/+6
bfq_setup_cooperator() can mark bic as stably merged even though it decides to not merge its bfqqs (when bfq_setup_merge() returns NULL). Make sure to mark bic as stably merged only if we are really going to merge bfqqs. CC: stable@vger.kernel.org Tested-by: "yukuai (C)" <yukuai3@huawei.com> Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220401102752.8599-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+1
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22block/bfq-iosched.c: use "false" rather than "BLK_RW_ASYNC"NeilBrown1-1/+1
bfq_get_queue() expects a "bool" for the third arg, so pass "false" rather than "BLK_RW_ASYNC" which will soon be removed. Link: https://lkml.kernel.org/r/164549983746.9187.7949730109246767909.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-21Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds1-13/+24
Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...
2022-03-16block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"Colin Ian King1-1/+1
There is a spelling mistake in a bfq_log_bfqq message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220315221539.2959167-1-colin.i.king@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-08Revert "Revert "block, bfq: honor already-setup queue merges""Paolo Valente1-3/+13
A crash [1] happened to be triggered in conjunction with commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges"). The latter was then reverted by commit ebc69e897e17 ("Revert "block, bfq: honor already-setup queue merges""). Yet, the reverted commit was not the one introducing the bug. In fact, it actually triggered a UAF introduced by a different commit, and now fixed by commit d29bd41428cf ("block, bfq: reset last_bfqq_created on group change"). So, there is no point in keeping commit 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") out. This commit restores it. [1] https://bugzilla.kernel.org/show_bug.cgi?id=214503 Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20211125181510.15004-1-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-05bfq: fix use-after-free in bfq_dispatch_requestZhang Wensheng1-7/+8
KASAN reports a use-after-free report when doing normal scsi-mq test [69832.239032] ================================================================== [69832.241810] BUG: KASAN: use-after-free in bfq_dispatch_request+0x1045/0x44b0 [69832.243267] Read of size 8 at addr ffff88802622ba88 by task kworker/3:1H/155 [69832.244656] [69832.245007] CPU: 3 PID: 155 Comm: kworker/3:1H Not tainted 5.10.0-10295-g576c6382529e #8 [69832.246626] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [69832.249069] Workqueue: kblockd blk_mq_run_work_fn [69832.250022] Call Trace: [69832.250541] dump_stack+0x9b/0xce [69832.251232] ? bfq_dispatch_request+0x1045/0x44b0 [69832.252243] print_address_description.constprop.6+0x3e/0x60 [69832.253381] ? __cpuidle_text_end+0x5/0x5 [69832.254211] ? vprintk_func+0x6b/0x120 [69832.254994] ? bfq_dispatch_request+0x1045/0x44b0 [69832.255952] ? bfq_dispatch_request+0x1045/0x44b0 [69832.256914] kasan_report.cold.9+0x22/0x3a [69832.257753] ? bfq_dispatch_request+0x1045/0x44b0 [69832.258755] check_memory_region+0x1c1/0x1e0 [69832.260248] bfq_dispatch_request+0x1045/0x44b0 [69832.261181] ? bfq_bfqq_expire+0x2440/0x2440 [69832.262032] ? blk_mq_delay_run_hw_queues+0xf9/0x170 [69832.263022] __blk_mq_do_dispatch_sched+0x52f/0x830 [69832.264011] ? blk_mq_sched_request_inserted+0x100/0x100 [69832.265101] __blk_mq_sched_dispatch_requests+0x398/0x4f0 [69832.266206] ? blk_mq_do_dispatch_ctx+0x570/0x570 [69832.267147] ? __switch_to+0x5f4/0xee0 [69832.267898] blk_mq_sched_dispatch_requests+0xdf/0x140 [69832.268946] __blk_mq_run_hw_queue+0xc0/0x270 [69832.269840] blk_mq_run_work_fn+0x51/0x60 [69832.278170] process_one_work+0x6d4/0xfe0 [69832.278984] worker_thread+0x91/0xc80 [69832.279726] ? __kthread_parkme+0xb0/0x110 [69832.280554] ? process_one_work+0xfe0/0xfe0 [69832.281414] kthread+0x32d/0x3f0 [69832.282082] ? kthread_park+0x170/0x170 [69832.282849] ret_from_fork+0x1f/0x30 [69832.283573] [69832.283886] Allocated by task 7725: [69832.284599] kasan_save_stack+0x19/0x40 [69832.285385] __kasan_kmalloc.constprop.2+0xc1/0xd0 [69832.286350] kmem_cache_alloc_node+0x13f/0x460 [69832.287237] bfq_get_queue+0x3d4/0x1140 [69832.287993] bfq_get_bfqq_handle_split+0x103/0x510 [69832.289015] bfq_init_rq+0x337/0x2d50 [69832.289749] bfq_insert_requests+0x304/0x4e10 [69832.290634] blk_mq_sched_insert_requests+0x13e/0x390 [69832.291629] blk_mq_flush_plug_list+0x4b4/0x760 [69832.292538] blk_flush_plug_list+0x2c5/0x480 [69832.293392] io_schedule_prepare+0xb2/0xd0 [69832.294209] io_schedule_timeout+0x13/0x80 [69832.295014] wait_for_common_io.constprop.1+0x13c/0x270 [69832.296137] submit_bio_wait+0x103/0x1a0 [69832.296932] blkdev_issue_discard+0xe6/0x160 [69832.297794] blk_ioctl_discard+0x219/0x290 [69832.298614] blkdev_common_ioctl+0x50a/0x1750 [69832.304715] blkdev_ioctl+0x470/0x600 [69832.305474] block_ioctl+0xde/0x120 [69832.306232] vfs_ioctl+0x6c/0xc0 [69832.306877] __se_sys_ioctl+0x90/0xa0 [69832.307629] do_syscall_64+0x2d/0x40 [69832.308362] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [69832.309382] [69832.309701] Freed by task 155: [69832.310328] kasan_save_stack+0x19/0x40 [69832.311121] kasan_set_track+0x1c/0x30 [69832.311868] kasan_set_free_info+0x1b/0x30 [69832.312699] __kasan_slab_free+0x111/0x160 [69832.313524] kmem_cache_free+0x94/0x460 [69832.314367] bfq_put_queue+0x582/0x940 [69832.315112] __bfq_bfqd_reset_in_service+0x166/0x1d0 [69832.317275] bfq_bfqq_expire+0xb27/0x2440 [69832.318084] bfq_dispatch_request+0x697/0x44b0 [69832.318991] __blk_mq_do_dispatch_sched+0x52f/0x830 [69832.319984] __blk_mq_sched_dispatch_requests+0x398/0x4f0 [69832.321087] blk_mq_sched_dispatch_requests+0xdf/0x140 [69832.322225] __blk_mq_run_hw_queue+0xc0/0x270 [69832.323114] blk_mq_run_work_fn+0x51/0x60 [69832.323942] process_one_work+0x6d4/0xfe0 [69832.324772] worker_thread+0x91/0xc80 [69832.325518] kthread+0x32d/0x3f0 [69832.326205] ret_from_fork+0x1f/0x30 [69832.326932] [69832.338297] The buggy address belongs to the object at ffff88802622b968 [69832.338297] which belongs to the cache bfq_queue of size 512 [69832.340766] The buggy address is located 288 bytes inside of [69832.340766] 512-byte region [ffff88802622b968, ffff88802622bb68) [69832.343091] The buggy address belongs to the page: [69832.344097] page:ffffea0000988a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88802622a528 pfn:0x26228 [69832.346214] head:ffffea0000988a00 order:2 compound_mapcount:0 compound_pincount:0 [69832.347719] flags: 0x1fffff80010200(slab|head) [69832.348625] raw: 001fffff80010200 ffffea0000dbac08 ffff888017a57650 ffff8880179fe840 [69832.354972] raw: ffff88802622a528 0000000000120008 00000001ffffffff 0000000000000000 [69832.356547] page dumped because: kasan: bad access detected [69832.357652] [69832.357970] Memory state around the buggy address: [69832.358926] ffff88802622b980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [69832.360358] ffff88802622ba00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [69832.361810] >ffff88802622ba80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [69832.363273] ^ [69832.363975] ffff88802622bb00: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc [69832.375960] ffff88802622bb80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [69832.377405] ================================================================== In bfq_dispatch_requestfunction, it may have function call: bfq_dispatch_request __bfq_dispatch_request bfq_select_queue bfq_bfqq_expire __bfq_bfqd_reset_in_service bfq_put_queue kmem_cache_free In this function call, in_serv_queue has beed expired and meet the conditions to free. In the function bfq_dispatch_request, the address of in_serv_queue pointing to has been released. For getting the value of idle_timer_disabled, it will get flags value from the address which in_serv_queue pointing to, then the problem of use-after-free happens; Fix the problem by check in_serv_queue == bfqd->in_service_queue, to get the value of idle_timer_disabled if in_serve_queue is equel to bfqd->in_service_queue. If the space of in_serv_queue pointing has been released, this judge will aviod use-after-free problem. And if in_serv_queue may be expired or finished, the idle_timer_disabled will be false which would not give effects to bfq_update_dispatch_stats. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com> Link: https://lore.kernel.org/r/20220303070334.3020168-1-zhangwensheng5@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-18block, bfq: cleanup bfq_bfqq_to_bfqg()Yu Kuai1-2/+2
Use bfq_group() instead, which do the same thing. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20220129015924.3958918-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-17block/wbt: fix negative inflight counter when remove scsi deviceLaibin Qiu1-0/+2
Now that we disable wbt by set WBT_STATE_OFF_DEFAULT in wbt_disable_default() when switch elevator to bfq. And when we remove scsi device, wbt will be enabled by wbt_enable_default. If it become false positive between wbt_wait() and wbt_track() when submit write request. The following is the scenario that triggered the problem. T1 T2 T3 elevator_switch_mq bfq_init_queue wbt_disable_default <= Set rwb->enable_state (OFF) Submit_bio blk_mq_make_request rq_qos_throttle <= rwb->enable_state (OFF) scsi_remove_device sd_remove del_gendisk blk_unregister_queue elv_unregister_queue wbt_enable_default <= Set rwb->enable_state (ON) q_qos_track <= rwb->enable_state (ON) ^^^^^^ this request will mark WBT_TRACKED without inflight add and will lead to drop rqw->inflight to -1 in wbt_done() which will trigger IO hung. Fix this by move wbt_enable_default() from elv_unregister to bfq_exit_queue(). Only re-enable wbt when bfq exit. Fixes: 76a8040817b4b ("blk-wbt: make sure throttle is enabled properly") Remove oneline stale comment, and kill one oneshot local variable. Signed-off-by: Ming Lei <ming.lei@rehdat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/20211214133103.551813-1-qiulaibin@huawei.com/ Signed-off-by: Laibin Qiu <qiulaibin@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29block: simplify ioc_lookup_icqChristoph Hellwig1-1/+1
Remove the ioc argument as it always points to current->io_context. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211126115817.2087431-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29block: move the remaining elv.icq handling to the I/O schedulerChristoph Hellwig1-1/+11
After the prepare side has been moved to the only I/O scheduler that cares, do the same for the cleanup and the NULL initialization. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211126115817.2087431-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29block: move blk_mq_sched_assign_ioc to blk-ioc.cChristoph Hellwig1-1/+1
Move blk_mq_sched_assign_ioc so that many interfaces from the file can be marked static. Rename the function to ioc_find_get_icq as well and return the icq to simplify the interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211126115817.2087431-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: use bfq_bic_lookup in bfq_limit_depthChristoph Hellwig1-1/+1
No need to create a new I/O context if there is none present yet in ->limit_depth. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211126115817.2087431-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: simplify bfq_bic_lookupChristoph Hellwig1-15/+10
Remove the unused bfqd argument, and hardcode ioc to current->io_context. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211126115817.2087431-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: Do not let waker requests skip proper accountingJan Kara1-43/+1
Commit 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in dispatch list") added a condition to bfq_insert_request() which added waker's requests directly to dispatch list. The rationale was that completing waker's IO is needed to get more IO for the current queue. Although this rationale is valid, there is a hole in it. The waker does not necessarily serve the IO only for the current queue and maybe it's current IO is not needed for current queue to make progress. Furthermore injecting IO like this completely bypasses any service accounting within bfq and thus we do not properly track how much service is waker's queue getting or that the waker is actually doing any IO. Depending on the conditions this can result in the waker getting too much or too few service. Consider for example the following job file: [global] directory=/mnt/repro/ rw=write size=8g time_based runtime=30 ramp_time=10 blocksize=1m direct=0 ioengine=sync [slowwriter] numjobs=1 prioclass=2 prio=7 fsync=200 [fastwriter] numjobs=1 prioclass=2 prio=0 fsync=200 Despite processes have very different IO priorities, they get the same about of service. The reason is that bfq identifies these processes as having waker-wakee relationship and once that happens, IO from fastwriter gets injected during slowwriter's time slice. As a result bfq is not aware that fastwriter has any IO to do and constantly schedules only slowwriter's queue. Thus fastwriter is forced to compete with slowwriter's IO all the time instead of getting its share of time based on IO priority. Drop the special injection condition from bfq_insert_request(). As a result, requests will be tracked and queued in a normal way and on next dispatch bfq_select_queue() can decide whether the waker's inserted requests should be injected during the current queue's timeslice or not. Fixes: 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in dispatch list") Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: Log waker detectionsJan Kara1-0/+8
Waker - wakee relationships are important in deciding whether one queue can preempt the other one. Print information about detected waker-wakee relationships so that scheduling decisions can be better understood from block traces. Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: Limit waker detection in timeJan Kara1-15/+23
Currently, when process A starts issuing requests shortly after process B has completed some IO three times in a row, we decide that B is a "waker" of A meaning that completing IO of B is needed for A to make progress and generally stop separating A's and B's IO much. This logic is useful to avoid unnecessary idling and thus throughput loss for cases where workload needs to switch e.g. between the process and the journaling thread doing IO. However the detection heuristic tends to frequently give false positives when A and B are fighting IO bandwidth and other processes aren't doing much IO as we are basically deemed to eventually accumulate three occurences of a situation where one process starts issuing requests after the other has completed some IO. To reduce these false positives, cancel the waker detection also if we didn't accumulate three detected wakeups within given timeout. The rationale is that if wakeups are really rare, the pointless idling doesn't hurt throughput that much anyway. This significantly reduces false waker detection for workload like: [global] directory=/mnt/repro/ rw=write size=8g time_based runtime=30 ramp_time=10 blocksize=1m direct=0 ioengine=sync [slowwriter] numjobs=1 fsync=200 [fastwriter] numjobs=1 fsync=200 Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: Limit number of requests consumed by each cgroupJan Kara1-18/+117
When cgroup IO scheduling is used with BFQ it does not really provide service differentiation if the cgroup drives a big IO depth. That for example happens with writeback which asynchronously submits lots of IO but it can happen with AIO as well. The problem is that if we have two cgroups that submit IO with different weights, the cgroup with higher weight properly gets more IO time and is able to dispatch more IO. However this causes lower weight cgroup to accumulate more requests inside BFQ and eventually lower weight cgroup consumes most of IO scheduler tags. At that point higher weight cgroup stops getting better service as it is mostly blocked waiting for a scheduler tag while its queues inside BFQ are empty and thus lower weight cgroup gets served. Check how many requests submitting cgroup has allocated in bfq_limit_depth() and if it consumes more requests than what would correspond to its weight limit available depth to 1 so that the cgroup cannot consume many more requests. With this limitation the higher weight cgroup gets proper service even with writeback. Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: Store full bitmap depth in bfq_dataJan Kara1-4/+6
Store bitmap depth shift inside bfq_data so that we can use it in bfq_limit_depth() for proportioning when limiting number of available request tags for a cgroup. Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29bfq: Track number of allocated requests in bfq_entityJan Kara1-6/+22
When we want to limit number of requests used by each bfqq and also cgroup, we need to track also number of requests used by each cgroup. So track number of allocated requests for each bfq_entity. Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-29block: move io_context creation into where it's neededJens Axboe1-0/+2
The only user of the io_context for IO is BFQ, yet we put the checking and logic of it into the normal IO path. Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that helper. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18blk-mq: Stop using pointers for blk_mq_tags bitmap tagsJohn Garry1-2/+2
Now that we use shared tags for shared sbitmap support, we don't require the tags sbitmap pointers, so drop them. This essentially reverts commit 222a5ae03cdd ("blk-mq: Use pointers for blk_mq_tags bitmap tags"). Function blk_mq_init_bitmap_tags() is removed also, since it would be only a wrappper for blk_mq_init_bitmaps(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/1633429419-228500-14-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block: move elevator.h to block/Christoph Hellwig1-1/+1
Except for the features passed to blk_queue_required_elevator_features, elevator.h is only needed internally to the block layer. Move the ELEVATOR_F_* definitions to blkdev.h, and the move elevator.h to block/, dropping all the spurious includes outside of that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210920123328.1399408-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-28Revert "block, bfq: honor already-setup queue merges"Jens Axboe1-13/+3
This reverts commit 2d52c58b9c9bdae0ca3df6a1eab5745ab3f7d80b. We have had several folks complain that this causes hangs for them, which is especially problematic as the commit has also hit stable already. As no resolution seems to be forthcoming right now, revert the patch. Link: https://bugzilla.kernel.org/show_bug.cgi?id=214503 Fixes: 2d52c58b9c9b ("block, bfq: honor already-setup queue merges") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-09-02block, bfq: honor already-setup queue mergesPaolo Valente1-3/+13
The function bfq_setup_merge prepares the merging between two bfq_queues, say bfqq and new_bfqq. To this goal, it assigns bfqq->new_bfqq = new_bfqq. Then, each time some I/O for bfqq arrives, the process that generated that I/O is disassociated from bfqq and associated with new_bfqq (merging is actually a redirection). In this respect, bfq_setup_merge increases new_bfqq->ref in advance, adding the number of processes that are expected to be associated with new_bfqq. Unfortunately, the stable-merging mechanism interferes with this setup. After bfqq->new_bfqq has been set by bfq_setup_merge, and before all the expected processes have been associated with bfqq->new_bfqq, bfqq may happen to be stably merged with a different queue than the current bfqq->new_bfqq. In this case, bfqq->new_bfqq gets changed. So, some of the processes that have been already accounted for in the ref counter of the previous new_bfqq will not be associated with that queue. This creates an unbalance, because those references will never be decremented. This commit fixes this issue by reestablishing the previous, natural behaviour: once bfqq->new_bfqq has been set, it will not be changed until all expected redirections have occurred. Signed-off-by: Davide Zini <davidezini2@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210802141352.74353-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-23block: add an explicit ->disk backpointer to the request_queueChristoph Hellwig1-1/+1
Replace the magic lookup through the kobject tree with an explicit backpointer, given that the device model links are set up and torn down at times when I/O is still possible, leading to potential NULL or invalid pointer dereferences. Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk") Reported-by: syzbot <syzbot+aa0801b6b32dca9dda82@syzkaller.appspotmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Sven Schnelle <svens@linux.ibm.com> Link: https://lore.kernel.org/r/20210816134624.GA24234@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-18block: fix default IO priority handlingDamien Le Moal1-1/+1
The default IO priority is the best effort (BE) class with the normal priority level IOPRIO_NORM (4). However, get_task_ioprio() returns IOPRIO_CLASS_NONE/IOPRIO_NORM as the default priority and get_current_ioprio() returns IOPRIO_CLASS_NONE/0. Let's be consistent with the defined default and have both of these functions return the default priority IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_NORM) when the user did not define another default IO priority for the task. In include/uapi/linux/ioprio.h, introduce the IOPRIO_BE_NORM macro as an alias to IOPRIO_NORM to clarify that this default level applies to the BE priotity class. In include/linux/ioprio.h, define the macro IOPRIO_DEFAULT as IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, IOPRIO_BE_NORM) and use this new macro when setting a priority to the default. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Link: https://lore.kernel.org/r/20210811033702.368488-7-damien.lemoal@wdc.com [axboe: drop unnecessary lightnvm change] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-18block: Introduce IOPRIO_NR_LEVELSDamien Le Moal1-4/+4
The BFQ scheduler and ioprio_check_cap() both assume that the RT priority class (IOPRIO_CLASS_RT) can have up to 8 different priority levels, similarly to the BE class (IOPRIO_CLASS_iBE). This is controlled using the IOPRIO_BE_NR macro , which is badly named as the number of levels also applies to the RT class. Introduce the class independent IOPRIO_NR_LEVELS macro, defined to 8, to make things clear. Keep the old IOPRIO_BE_NR macro definition as an alias for IOPRIO_NR_LEVELS. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Link: https://lore.kernel.org/r/20210811033702.368488-6-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-18block: bfq: fix bfq_set_next_ioprio_data()Damien Le Moal1-1/+1
For a request that has a priority level equal to or larger than IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This is not consistent with the warning and the allowed values for priority levels. Fix this by setting the request new_ioprio field to IOPRIO_BE_NR - 1, the lowest priority level allowed. Cc: <stable@vger.kernel.org> Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: return ELEVATOR_DISCARD_MERGE if possibleMing Lei1-0/+3
When merging one bio to request, if they are discard IO and the queue supports multi-range discard, we need to return ELEVATOR_DISCARD_MERGE because both block core and related drivers(nvme, virtio-blk) doesn't handle mixed discard io merge(traditional IO merge together with discard merge) well. Fix the issue by returning ELEVATOR_DISCARD_MERGE in this situation, so both blk-mq and drivers just need to handle multi-range discard. Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 2705dfb20947 ("block: fix discard request merge") Link: https://lore.kernel.org/r/20210729034226.1591070-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09block: move the bdi from the request_queue to the gendiskChristoph Hellwig1-2/+2
The backing device information only makes sense for file system I/O, and thus belongs into the gendisk and not the lower level request_queue structure. Move it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24blk: Fix lock inversion between ioc lock and bfqd lockJan Kara1-2/+4
Lockdep complains about lock inversion between ioc->lock and bfqd->lock: bfqd -> ioc: put_io_context+0x33/0x90 -> ioc->lock grabbed blk_mq_free_request+0x51/0x140 blk_put_request+0xe/0x10 blk_attempt_req_merge+0x1d/0x30 elv_attempt_insert_merge+0x56/0xa0 blk_mq_sched_try_insert_merge+0x4b/0x60 bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed blk_mq_sched_insert_requests+0xd6/0x2b0 blk_mq_flush_plug_list+0x154/0x280 blk_finish_plug+0x40/0x60 ext4_writepages+0x696/0x1320 do_writepages+0x1c/0x80 __filemap_fdatawrite_range+0xd7/0x120 sync_file_range+0xac/0xf0 ioc->bfqd: bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed put_io_context_active+0x78/0xb0 -> ioc->lock grabbed exit_io_context+0x48/0x50 do_exit+0x7e9/0xdd0 do_group_exit+0x54/0xc0 To avoid this inversion we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that upto the caller similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock. Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24bfq: Remove merged request already in bfq_requests_merged()Jan Kara1-28/+13
Currently, bfq does very little in bfq_requests_merged() and handles all the request cleanup in bfq_finish_requeue_request() called from blk_mq_free_request(). That is currently safe only because blk_mq_free_request() is called shortly after bfq_requests_merged() while bfqd->lock is still held. However to fix a lock inversion between bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after dropping bfqd->lock. That would mean that already merged request could be seen by other processes inside bfq queues and possibly dispatched to the device which is wrong. So move cleanup of the request from bfq_finish_requeue_request() to bfq_requests_merged(). Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: reset waker pointer with shared queuesPaolo Valente1-2/+4
Commit 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism") leaves shared bfq_queues out of the waker-detection mechanism. It attains this goal by not updating the pointer last_completed_rq_bfqq, if the last request completed belongs to a shared bfq_queue (so that the pointer will not point to the shared bfq_queue). Yet this has a side effect: the pointer last_completed_rq_bfqq keeps pointing, deceptively, to a bfq_queue that actually is not the last one to have had a request completed. As a consequence, such a bfq_queue may deceptively be considered as a waker of some bfq_queue, even of some shared bfq_queue. To address this issue, reset last_completed_rq_bfqq if the last request completed belongs to a shared queue. Fixes: 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: check waker only for queues with no in-flight I/OPaolo Valente1-8/+13
Consider two bfq_queues, say Q1 and Q2, with Q2 empty. If a request of Q1 gets completed shortly before a new request arrives for Q2, then BFQ flags Q1 as a candidate waker for Q2. Yet, the arrival of this new request may have a different cause, in the following case. If also Q2 has requests in flight while waiting for the arrival of a new request, then the completion of its own requests may be the actual cause of the awakening of the process that sends I/O to Q2. So Q1 may be flagged wrongly as a candidate waker. This commit avoids this deceptive flagging, by disabling candidate-waker flagging for Q2, if Q2 has in-flight I/O. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: avoid delayed merge of async queuesPaolo Valente1-1/+7
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"), BFQ may schedule a merge between a newly created sync bfq_queue, say Q2, and the last sync bfq_queue created, say Q1. To this goal, BFQ stores the address of Q1 in the field bic->stable_merge_bfqq of the bic associated with Q2. So, when the time for the possible merge arrives, BFQ knows which bfq_queue to merge Q2 with. In particular, BFQ checks for possible merges on request arrivals. Yet the same bic may also be associated with an async bfq_queue, say Q3. So, if a request for Q3 arrives, then the above check may happen to be executed while the bfq_queue at hand is Q3, instead of Q2. In this case, Q1 happens to be merged with an async bfq_queue. This is not only a conceptual mistake, because async queues are to be kept out of queue merging, but also a bug that leads to inconsistent states. This commits simply filters async queues out of delayed merges. Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: boost throughput by extending queue-merging timesPietro Pedroni1-3/+13
One of the methods with which bfq boosts throughput is by merging queues. One of the merging variants in bfq is the stable merge. This mechanism is activated between two queues only if they are created within a certain maximum time T1 from each other. Merging can happen soon or be delayed. In the second case, before merging, bfq needs to evaluate a throughput-boost parameter that indicates whether the queue generates a high throughput is served alone. Merging occurs when this throughput-boost is not high enough. In particular, this parameter is evaluated and late merging may occur only after at least a time T2 from the creation of the queue. Currently T1 and T2 are set to 180ms and 200ms, respectively. In this way the merging mechanism rarely occurs because time is not enough. This results in a noticeable lowering of the overall throughput with some workloads (see the example below). This commit introduces two constants bfq_activation_stable_merging and bfq_late_stable_merging in order to increase the duration of T1 and T2. Both the stable merging activation time and the late merging time are set to 600ms. This value has been experimentally evaluated using sqlite benchmark in the Phoronix Test Suite on a HDD. The duration of the benchmark before this fix was 111.02s, while now it has reached 97.02s, a better result than that of all the other schedulers. Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: consider also creation time in delayed stable mergePaolo Valente1-1/+3
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"), BFQ may schedule a merge between a newly created sync bfq_queue and the last sync bfq_queue created. Such a merging is not performed immediately, because BFQ needs first to find out whether the newly created queue actually reaches a higher throughput if not merged at all (and in that case BFQ will not perform any stable merging). To check that, a little time must be waited after the creation of the new queue, so that some I/O can flow in the queue, and statistics on such I/O can be computed. Yet, to evaluate the above waiting time, the last split time is considered as start time, instead of the creation time of the queue. This is a mistake, because considering the split time is correct only in the following scenario. The queue undergoes a non-stable merges on the arrival of its very first I/O request, due to close I/O with some other queue. While the queue is merged for close I/O, stable merging is not considered. Yet the queue may then happen to be split, if the close I/O finishes (or happens to be a false positive). From this time on, the queue can again be considered for stable merging. But, again, a little time must elapse, to let some new I/O flow in the queue and to get updated statistics. To wait for this time, the split time is to be taken into account. Yet, if the queue does not undergo a non-stable merge on the arrival of its very first request, then BFQ immediately checks whether the stable merge is to be performed. It happens because the split time for a queue is initialized to minus infinity when the queue is created. This commit fixes this mistake by adding the missing condition. Now the check for delayed stable-merge is performed after a little time is elapsed not only from the last queue split time, but also from the creation time of the queue. Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: fix delayed stable merge checkLuca Mariotti1-1/+1
When attempting to schedule a merge of a given bfq_queue with the currently in-service bfq_queue or with a cooperating bfq_queue among the scheduled bfq_queues, delayed stable merge is checked for rotational or non-queueing devs. For this stable merge to be performed, some conditions must be met. If the current bfq_queue underwent some split from some merged bfq_queue, one of these conditions is that two hundred milliseconds must elapse from split, otherwise this condition is always met. Unfortunately, by mistake, time_is_after_jiffies() was written instead of time_is_before_jiffies() for this check, verifying that less than two hundred milliseconds have elapsed instead of verifying that at least two hundred milliseconds have elapsed. Fix this issue by replacing time_is_after_jiffies() with time_is_before_jiffies(). Signed-off-by: Luca Mariotti <mariottiluca1@hotmail.it> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Link: https://lore.kernel.org/r/20210619140948.98712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21block, bfq: let also stably merged queues enjoy weight raisingPaolo Valente1-1/+14
Merged bfq_queues are kept out of weight-raising (low-latency) mechanisms. The reason is that these queues are usually created for non-interactive and non-soft-real-time tasks. Yet this is not the case for stably-merged queues. These queues are merged just because they are created shortly after each other. So they may easily serve the I/O of an interactive or soft-real time application, if the application happens to spawn multiple processes. To address this issue, this commits lets also stably-merged queued enjoy weight raising. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-12block, bfq: avoid circular stable mergesPaolo Valente1-2/+29
BFQ may merge a new bfq_queue, stably, with the last bfq_queue created. In particular, BFQ first waits a little bit for some I/O to flow inside the new queue, say Q2, if this is needed to understand whether it is better or worse to merge Q2 with the last queue created, say Q1. This delayed stable merge is performed by assigning bic->stable_merge_bfqq = Q1, for the bic associated with Q1. Yet, while waiting for some I/O to flow in Q2, a non-stable queue merge of Q2 with Q1 may happen, causing the bic previously associated with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that, Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to be recycled as a non-shared bfq_queue. In that case, Q1 may then happen to undergo a stable merge with the bfq_queue pointed by bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to Q1. So Q1 would be merged with itself. This commit fixes this error by intercepting this situation, and canceling the schedule of the stable merge. Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-11kyber: fix out of bounds access when preemptedOmar Sandoval1-2/+1
__blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx for the current CPU again and uses that to get the corresponding Kyber context in the passed hctx. However, the thread may be preempted between the two calls to blk_mq_get_ctx(), and the ctx returned the second time may no longer correspond to the passed hctx. This "works" accidentally most of the time, but it can cause us to read garbage if the second ctx came from an hctx with more ctx's than the first one (i.e., if ctx->index_hw[hctx->type] > hctx->nr_ctx). This manifested as this UBSAN array index out of bounds error reported by Jakub: UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9 index 13106 is out of range for type 'long unsigned int [128]' Call Trace: dump_stack+0xa4/0xe5 ubsan_epilogue+0x5/0x40 __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34 queued_spin_lock_slowpath+0x476/0x480 do_raw_spin_lock+0x1c2/0x1d0 kyber_bio_merge+0x112/0x180 blk_mq_submit_bio+0x1f5/0x1100 submit_bio_noacct+0x7b0/0x870 submit_bio+0xc2/0x3a0 btrfs_map_bio+0x4f0/0x9d0 btrfs_submit_data_bio+0x24e/0x310 submit_one_bio+0x7f/0xb0 submit_extent_page+0xc4/0x440 __extent_writepage_io+0x2b8/0x5e0 __extent_writepage+0x28d/0x6e0 extent_write_cache_pages+0x4d7/0x7a0 extent_writepages+0xa2/0x110 do_writepages+0x8f/0x180 __writeback_single_inode+0x99/0x7f0 writeback_sb_inodes+0x34e/0x790 __writeback_inodes_wb+0x9e/0x120 wb_writeback+0x4d2/0x660 wb_workfn+0x64d/0xa10 process_one_work+0x53a/0xa80 worker_thread+0x69/0x5b0 kthread+0x20b/0x240 ret_from_fork+0x1f/0x30 Only Kyber uses the hctx, so fix it by passing the request_queue to ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can map the queues itself to avoid the mismatch. Fixes: a6088845c2bf ("block: kyber: make kyber more friendly with merging") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-16bfq/mq-deadline: remove redundant check for passthrough requestLin Feng1-2/+1
Since commit 01e99aeca39796003 'blk-mq: insert passthrough request into hctx->dispatch directly', passthrough request should not appear in IO-scheduler any more, so blk_rq_is_passthrough checking in addon IO schedulers is redundant. (Notes: this patch passes generic IO load test with hdds under SAS controller and hdds under AHCI controller but obviously not covers all. Not sure if passthrough request can still escape into IO scheduler from blk_mq_sched_insert_requests, which is used by blk_mq_flush_plug_list and has lots of indirect callers.) Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: merge bursts of newly-created queuesPaolo Valente1-10/+249
Many throughput-sensitive workloads are made of several parallel I/O flows, with all flows generated by the same application, or more generically by the same task (e.g., system boot). The most counterproductive action with these workloads is plugging I/O dispatch when one of the bfq_queues associated with these flows remains temporarily empty. To avoid this plugging, BFQ has been using a burst-handling mechanism for years now. This mechanism has proven effective for throughput, and not detrimental for service guarantees. This commit pushes this mechanism a little bit further, basing on the following two facts. First, all the I/O flows of a the same application or task contribute to the execution/completion of that common application or task. So the performance figures that matter are total throughput of the flows and task-wide I/O latency. In particular, these flows do not need to be protected from each other, in terms of individual bandwidth or latency. Second, the above fact holds regardless of the number of flows. Putting these two facts together, this commits merges stably the bfq_queues associated with these I/O flows, i.e., with the processes that generate these IO/ flows, regardless of how many the involved processes are. To decide whether a set of bfq_queues is actually associated with the I/O flows of a common application or task, and to merge these queues stably, this commit operates as follows: given a bfq_queue, say Q2, currently being created, and the last bfq_queue, say Q1, created before Q2, Q2 is merged stably with Q1 if - very little time has elapsed since when Q1 was created - Q2 has the same ioprio as Q1 - Q2 belongs to the same group as Q1 Merging bfq_queues also reduces scheduling overhead. A fio test with ten random readers on /dev/nullb shows a throughput boost of 40%, with a quadcore. Since BFQ's execution time amounts to ~50% of the total per-request processing time, the above throughput boost implies that BFQ's overhead is reduced by more than 50%. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: keep shared queues out of the waker mechanismPaolo Valente1-1/+11
Shared queues are likely to receive I/O at a high rate. This may deceptively let them be considered as wakers of other queues. But a false waker will unjustly steal bandwidth to its supposedly woken queue. So considering also shared queues in the waking mechanism may cause more control troubles than throughput benefits. This commit keeps shared queues out of the waker-detection mechanism. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: fix weight-raising resume with !low_latencyPaolo Valente1-2/+8
When the io_latency heuristic is off, bfq_queues must not start to be weight-raised. Unfortunately, by mistake, this may happen when the state of a previously weight-raised bfq_queue is resumed after a queue split. This commit fixes this error. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: make shared queues inherit wakersPaolo Valente1-3/+39
Consider a bfq_queue bfqq that is about to be merged with another bfq_queue new_bfqq. The processes associated with bfqq are cooperators of the processes associated with new_bfqq. So, if bfqq has a waker, then it is reasonable (and beneficial for throughput) to assume that all these processes will be happy to let bfqq's waker freely inject I/O when they have no I/O. So this commit makes new_bfqq inherit bfqq's waker. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: put reqs of waker and woken in dispatch listPaolo Valente1-1/+43
Consider a new I/O request that arrives for a bfq_queue bfqq. If, when this happens, the only active bfq_queues are bfqq and either its waker bfq_queue or one of its woken bfq_queues, then there is no point in queueing this new I/O request in bfqq for service. In fact, the in-service queue and bfqq agree on serving this new I/O request as soon as possible. So this commit puts this new I/O request directly into the dispatch list. Tested-by: Jan Kara <jack@suse.cz> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-25block, bfq: always inject I/O of queues blocked by wakersPaolo Valente1-5/+27
Suppose that I/O dispatch is plugged, to wait for new I/O for the in-service bfq-queue, say bfqq. Suppose then that there is a further bfq_queue woken by bfqq, and that this woken queue has pending I/O. A woken queue does not steal bandwidth from bfqq, because it remains soon without I/O if bfqq is not served. So there is virtually no risk of loss of bandwidth for bfqq if this woken queue has I/O dispatched while bfqq is waiting for new I/O. In contrast, this extra I/O injection boosts throughput. This commit performs this extra injection. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Link: https://lore.kernel.org/r/20210304174627.161-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-02block/bfq: update comments and default value in docs for fifo_expireJoseph Qi1-1/+1
Correct the comments since bfq_fifo_expire[0] is for async request, while bfq_fifo_expire[1] is for sync request. Also update docs, according the source code, the default fifo_expire_async is 250ms, and fifo_expire_sync is 125ms. Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-22block: get rid of the trace rq insert wrapperChaitanya Kulkarni1-1/+3
Get rid of the wrapper for trace_block_rq_insert() and call the function directly. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-02-21Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-blockLinus Torvalds1-169/+276
Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ...
2021-02-02bfq-iosched: Revert "bfq: Fix computation of shallow depth"Lin Feng1-4/+4
This reverts commit 6d4d273588378c65915acaf7b2ee74e9dd9c130a. bfq.limit_depth passes word_depths[] as shallow_depth down to sbitmap core sbitmap_get_shallow, which uses just the number to limit the scan depth of each bitmap word, formula: scan_percentage_for_each_word = shallow_depth / (1 << sbimap->shift) * 100% That means the comments's percentiles 50%, 75%, 18%, 37% of bfq are correct. But after commit patch 'bfq: Fix computation of shallow depth', we use sbitmap.depth instead, as a example in following case: sbitmap.depth = 256, map_nr = 4, shift = 6; sbitmap_word.depth = 64. The resulsts of computed bfqd->word_depths[] are {128, 192, 48, 96}, and three of the numbers exceed core dirver's 'sbitmap_word.depth=64' limit nothing. Signed-off-by: Lin Feng <linf@wangsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27bfq: Use only idle IO periods for think time calculationsJan Kara1-1/+9
Currently whenever bfq queue has a request queued we add now - last_completion_time to the think time statistics. This is however misleading in case the process is able to submit several requests in parallel because e.g. if the queue has request completed at time T0 and then queues new requests at times T1, T2, then we will add T1-T0 and T2-T0 to think time statistics which just doesn't make any sence (the queue's think time is penalized by the queue being able to submit more IO). So add to think time statistics only time intervals when the queue had no IO pending. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> [axboe: fix whitespace on empty line] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27bfq: Use 'ttime' local variableJan Kara1-1/+1
Use local variable 'ttime' instead of dereferencing bfqq. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-27bfq: Avoid false bfq queue mergingJan Kara1-0/+1
bfq_setup_cooperator() uses bfqd->in_serv_last_pos so detect whether it makes sense to merge current bfq queue with the in-service queue. However if the in-service queue is freshly scheduled and didn't dispatch any requests yet, bfqd->in_serv_last_pos is stale and contains value from the previously scheduled bfq queue which can thus result in a bogus decision that the two queues should be merged. This bug can be observed for example with the following fio jobfile: [global] direct=0 ioengine=sync invalidate=1 size=1g rw=read [reader] numjobs=4 directory=/mnt where the 4 processes will end up in the one shared bfq queue although they do IO to physically very distant files (for some reason I was able to observe this only with slice_idle=1ms setting). Fix the problem by invalidating bfqd->in_serv_last_pos when switching in-service queue. Fixes: 058fdecc6de7 ("block, bfq: fix in-service-queue check for queue merging") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25bfq: bfq_check_waker() should be staticJens Axboe1-1/+2
It's only used in the same file, mark is appropriately static. Fixes: 71217df39dc6 ("block, bfq: make waker-queue detection more robust") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: make waker-queue detection more robustPaolo Valente1-108/+103
In the presence of many parallel I/O flows, the detection of waker bfq_queues suffers from false positives. This commits addresses this issue by making the filtering of actual wakers more selective. In more detail, a candidate waker must be found to meet waker requirements three times before being promoted to actual waker. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: save also injection state on queue mergingPaolo Valente1-0/+8
To prevent injection information from being lost on bfq_queue merging, also the amount of service that a bfq_queue receives must be saved and restored when the bfq_queue is merged and split, respectively. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: save also weight-raised service on queue mergingPaolo Valente1-0/+2
To prevent weight-raising information from being lost on bfq_queue merging, also the amount of service that a bfq_queue receives must be saved and restored when the bfq_queue is merged and split, respectively. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: fix switch back from soft-rt weitgh-raisingPaolo Valente1-2/+20
A bfq_queue may happen to be deemed as soft real-time while it is still enjoying interactive weight-raising. If this happens because of a false positive, then the bfq_queue is likely to loose its soft real-time status soon. Upon losing such a status, the bfq_queue must get back its interactive weight-raising, if its interactive period is not over yet. But this case is not handled. This commit corrects this error. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: re-evaluate convenience of I/O plugging on rq arrivalsPaolo Valente1-5/+19
Upon an I/O-dispatch attempt, BFQ may detect that it was better to plug I/O dispatch, and to wait for a new request to arrive for the currently in-service queue. But the arrival of a new request for an empty bfq_queue, and thus the switch from idle to busy of the bfq_queue, may cause the scenario to change, and make plugging no longer needed for service guarantees, or more convenient for throughput. In this case, keeping I/O-dispatch plugged would certainly lower throughput. To address this issue, this commit makes such a check, and stops plugging I/O if it is better to stop plugging I/O. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-25block, bfq: replace mechanism for evaluating I/O intensityPaolo Valente1-19/+44
Some BFQ mechanisms make their decisions on a bfq_queue basing also on whether the bfq_queue is I/O bound. In this respect, the current logic for evaluating whether a bfq_queue is I/O bound is rather rough. This commits replaces this logic with a more effective one. The new logic measures the percentage of time during which a bfq_queue is active, and marks the bfq_queue as I/O bound if the latter if this percentage is above a fixed threshold. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24Revert "blk-mq, elevator: Count requests per hctx to improve performance"Jan Kara1-5/+0
This reverts commit b445547ec1bbd3e7bf4b1c142550942f70527d95. Since both mq-deadline and BFQ completely ignore hctx they are passed to their dispatch function and dispatch whatever request they deem fit checking whether any request for a particular hctx is queued is just pointless since we'll very likely get a request from a different hctx anyway. In the following commit we'll deal with lock contention in these IO schedulers in presence of multiple HW queues in a different way. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: do not expire a queue when it is the only busy onePaolo Valente1-2/+20
This commits preserves I/O-dispatch plugging for a special symmetric case that may suddenly turn into asymmetric: the case where only one bfq_queue, say bfqq, is busy. In this case, not expiring bfqq does not cause any harm to any other queues in terms of service guarantees. In contrast, it avoids the following unlucky sequence of events: (1) bfqq is expired, (2) a new queue with a lower weight than bfqq becomes busy (or more queues), (3) the new queue is served until a new request arrives for bfqq, (4) when bfqq is finally served, there are so many requests of the new queue in the drive that the pending requests for bfqq take a lot of time to be served. In particular, event (2) may case even already dispatched requests of bfqq to be delayed, inside the drive. So, to avoid this series of events, the scenario is preventively declared as asymmetric also if bfqq is the only busy queues. By doing so, I/O-dispatch plugging is performed for bfqq. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: avoid spurious switches to soft_rt of interactive queuesPaolo Valente1-20/+37
BFQ tags some bfq_queues as interactive or soft_rt if it deems that these bfq_queues contain the I/O of, respectively, interactive or soft real-time applications. BFQ privileges both these special types of bfq_queues over normal bfq_queues. To privilege a bfq_queue, BFQ mainly raises the weight of the bfq_queue. In particular, soft_rt bfq_queues get a higher weight than interactive bfq_queues. A bfq_queue may turn from interactive to soft_rt. And this leads to a tricky issue. Soft real-time applications usually start with an I/O-bound, interactive phase, in which they load themselves into main memory. BFQ correctly detects this phase, and keeps the bfq_queues associated with the application in interactive mode for a while. Problems arise when the I/O pattern of the application finally switches to soft real-time. One of the conditions for a bfq_queue to be deemed as soft_rt is that the bfq_queue does not consume too much bandwidth. But the bfq_queues associated with a soft real-time application consume as much bandwidth as they can in the loading phase of the application. So, after the application becomes truly soft real-time, a lot of time should pass before the average bandwidth consumed by its bfq_queues finally drops to a value acceptable for soft_rt bfq_queues. As a consequence, there might be a time gap during which the application is not privileged at all, because its bfq_queues are not interactive any longer, but cannot be deemed as soft_rt yet. To avoid this problem, BFQ pretends that an interactive bfq_queue consumes zero bandwidth, and allows an interactive bfq_queue to switch to soft_rt. Yet, this fake zero-bandwidth consumption easily causes the bfq_queue to often switch to soft_rt deceptively, during its loading phase. As in soft_rt mode, the bfq_queue gets its bandwidth correctly computed, and therefore soon switches back to interactive. Then it switches again to soft_rt, and so on. These spurious fluctuations usually cause losses of throughput, because they deceive BFQ's mechanisms for boosting throughput (injection, I/O-plugging avoidance, ...). This commit addresses this issue as follows: 1) It does compute actual bandwidth consumption also for interactive bfq_queues. This avoids the above false positives. 2) When a bfq_queue switches from interactive to normal mode, the consumed bandwidth is reset (forgotten). This allows the bfq_queue to enjoy soft_rt very quickly. In particular, two alternatives are possible in this switch: - the bfq_queue still has backlog, and therefore there is a budget already scheduled to serve the bfq_queue; in this case, the scheduling of the current budget of the bfq_queue is not hindered, because only the scheduling of the next budget will be affected by the weight drop. After that, if the bfq_queue is actually in a soft_rt phase, and becomes empty during the service of its current budget, which is the natural behavior of a soft_rt bfq_queue, then the bfq_queue will be considered as soft_rt when its next I/O arrives. If, in contrast, the bfq_queue remains constantly non-empty, then its next budget will be scheduled with a low weight, which is the natural treatment for an I/O-bound (non soft_rt) bfq_queue. - the bfq_queue is empty; in this case, the bfq_queue may be considered unjustly soft_rt when its new I/O arrives. Yet the problem is now much smaller than before, because it is unlikely that more than one spurious fluctuation occurs. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: do not raise non-default weightsPaolo Valente1-3/+7
BFQ heuristics try to detect interactive I/O, and raise the weight of the queues containing such an I/O. Yet, if also the user changes the weight of a queue (i.e., the user changes the ioprio of the process associated with that queue), then it is most likely better to prevent BFQ heuristics from silently changing the same weight. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: increase time window for waker detectionPaolo Valente1-1/+1
Tests on slower machines showed current window to be way too small. This commit increases it. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: set next_rq to waker_bfqq->next_rq in waker injectionJia Cheng Hu1-1/+1
Since commit c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O"), when the in-service bfq_queue, say Q, is temporarily empty, BFQ checks whether there are I/O requests to inject (also) from the waker bfq_queue for Q. To this goal, the value pointed by bfqq->waker_bfqq->next_rq must be controlled. However, the current implementation mistakenly looks at bfqq->next_rq, which instead points to the next request of the currently served queue. This mistake evidently causes losses of throughput in scenarios with waker bfq_queues. This commit corrects this mistake. Fixes: c5089591c3ba ("block, bfq: detect wakers and unconditionally inject their I/O") Signed-off-by: Jia Cheng Hu <jia.jiachenghu@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-24block, bfq: use half slice_idle as a threshold to check short ttimePaolo Valente1-3/+4
The value of the I/O plugging (idling) timeout is used also as the think-time threshold to decide whether a process has a short think time. In this respect, a good value of this timeout for rotational drives is un the order of several ms. Yet, this is often too long a time interval to be effective as a think-time threshold. This commit mitigates this problem (by a lot, according to tests), by halving the threshold. Tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-05bfq: Fix computation of shallow depthJan Kara1-4/+4
BFQ computes number of tags it allows to be allocated for each request type based on tag bitmap. However it uses 1 << bitmap.shift as number of available tags which is wrong. 'shift' is just an internal bitmap value containing logarithm of how many bits bitmap uses in each bitmap word. Thus number of tags allowed for some request types can be far to low. Use proper bitmap.depth which has the number of tags instead. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-13Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-blockLinus Torvalds1-2/+7
Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ...
2020-09-11Merge tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-blockLinus Torvalds1-12/+0
Pull block fixes from Jens Axboe: - Fix a regression in bdev partition locking (Christoph) - NVMe pull request from Christoph: - cancel async events before freeing them (David Milburn) - revert a broken race fix (James Smart) - fix command processing during resets (Sagi Grimberg) - Fix a kyber crash with requeued flushes (Omar) - Fix __bio_try_merge_page() same_page error for no merging (Ritesh) * tag 'block-5.9-2020-09-11' of git://git.kernel.dk/linux-block: block: Set same_page to false in __bio_try_merge_page if ret is false nvme-fabrics: allow to queue requests for live queues block: only call sched requeue_request() for scheduled requests nvme-tcp: cancel async events before freeing event struct nvme-rdma: cancel async events before freeing event struct nvme-fc: cancel async events before freeing event struct nvme: Revert: Fix controller creation races with teardown flow block: restore a specific error code in bdev_del_partition
2020-09-08block: only call sched requeue_request() for scheduled requestsOmar Sandoval1-12/+0
Yang Yang reported the following crash caused by requeueing a flush request in Kyber: [ 2.517297] Unable to handle kernel paging request at virtual address ffffffd8071c0b00 ... [ 2.517468] pc : clear_bit+0x18/0x2c [ 2.517502] lr : sbitmap_queue_clear+0x40/0x228 [ 2.517503] sp : ffffff800832bc60 pstate : 00c00145 ... [ 2.517599] Process ksoftirqd/5 (pid: 51, stack limit = 0xffffff8008328000) [ 2.517602] Call trace: [ 2.517606] clear_bit+0x18/0x2c [ 2.517619] kyber_finish_request+0x74/0x80 [ 2.517627] blk_mq_requeue_request+0x3c/0xc0 [ 2.517637] __scsi_queue_insert+0x11c/0x148 [ 2.517640] scsi_softirq_done+0x114/0x130 [ 2.517643] blk_done_softirq+0x7c/0xb0 [ 2.517651] __do_softirq+0x208/0x3bc [ 2.517657] run_ksoftirqd+0x34/0x60 [ 2.517663] smpboot_thread_fn+0x1c4/0x2c0 [ 2.517667] kthread+0x110/0x120 [ 2.517669] ret_from_fork+0x10/0x18 This happens because Kyber doesn't track flush requests, so kyber_finish_request() reads a garbage domain token. Only call the scheduler's requeue_request() hook if RQF_ELVPRIV is set (like we do for the finish_request() hook in blk_mq_free_request()). Now that we're handling it in blk-mq, also remove the check from BFQ. Reported-by: Yang Yang <yang.yang@vivo.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq, elevator: Count requests per hctx to improve performanceKashyap Desai1-0/+5
High CPU utilization on "native_queued_spin_lock_slowpath" due to lock contention is possible for mq-deadline and bfq IO schedulers when nr_hw_queues is more than one. It is because kblockd work queue can submit IO from all online CPUs (through blk_mq_run_hw_queues()) even though only one hctx has pending commands. The elevator callback .has_work for mq-deadline and bfq scheduler considers pending work if there are any IOs on request queue but it does not account hctx context. Add a per-hctx 'elevator_queued' count to the hctx to avoid triggering the elevator even though there are no requests queued. [jpg: Relocated atomic_dec() in dd_dispatch_request(), update commit message per Kashyap] Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Douglas Gilbert <dgilbert@interlog.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-03blk-mq: Use pointers for blk_mq_tags bitmap tagsJohn Garry1-2/+2
Introduce pointers for the blk_mq_tags regular and reserved bitmap tags, with the goal of later being able to use a common shared tag bitmap across all HW contexts in a set. Signed-off-by: John Garry <john.garry@huawei.com> Tested-by: Don Brace<don.brace@microsemi.com> #SCSI resv cmds patches used Tested-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-2/+2
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-07-31block: bfq-iosched: fix duplicated wordRandy Dunlap1-1/+1
Change "at at" to "at a". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: remove the bio argument to ->prepare_requestChristoph Hellwig1-1/+1
None of the I/O schedulers actually needs it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09bdi: use bdi_dev_name() to get device nameYufen Yu1-2/+4
Use the common interface bdi_dev_name() to get device name. Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Add missing <linux/backing-dev.h> include BFQ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-21block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroupPaolo Valente1-2/+0
A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The goal of this put is to release a process reference to a bfq_queue. But process-reference releases may trigger also some extra operation, and, to this goal, are handled through bfq_release_process_ref(). So, turn the invocation of bfq_put_queue() into an invocation of bfq_release_process_ref(). Tested-by: cki-project@redhat.com Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-21block, bfq: fix use-after-free in bfq_idle_slice_timer_bodyZhiqiang Liu1-4/+12
In bfq_idle_slice_timer func, bfqq = bfqd->in_service_queue is not in bfqd-lock critical section. The bfqq, which is not equal to NULL in bfq_idle_slice_timer, may be freed after passing to bfq_idle_slice_timer_body. So we will access the freed memory. In addition, considering the bfqq may be in race, we should firstly check whether bfqq is in service before doing something on it in bfq_idle_slice_timer_body func. If the bfqq in race is not in service, it means the bfqq has been expired through __bfq_bfqq_expire func, and wait_request flags has been cleared in __bfq_bfqd_reset_in_service func. So we do not need to re-clear the wait_request of bfqq which is not in service. KASAN log is given as follows: [13058.354613] ================================================================== [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290 [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767 [13058.354646] [13058.354655] CPU: 96 PID: 19767 Comm: fork13 [13058.354661] Call trace: [13058.354667] dump_backtrace+0x0/0x310 [13058.354672] show_stack+0x28/0x38 [13058.354681] dump_stack+0xd8/0x108 [13058.354687] print_address_description+0x68/0x2d0 [13058.354690] kasan_report+0x124/0x2e0 [13058.354697] __asan_load8+0x88/0xb0 [13058.354702] bfq_idle_slice_timer+0xac/0x290 [13058.354707] __hrtimer_run_queues+0x298/0x8b8 [13058.354710] hrtimer_interrupt+0x1b8/0x678 [13058.354716] arch_timer_handler_phys+0x4c/0x78 [13058.354722] handle_percpu_devid_irq+0xf0/0x558 [13058.354731] generic_handle_irq+0x50/0x70 [13058.354735] __handle_domain_irq+0x94/0x110 [13058.354739] gic_handle_irq+0x8c/0x1b0 [13058.354742] el1_irq+0xb8/0x140 [13058.354748] do_wp_page+0x260/0xe28 [13058.354752] __handle_mm_fault+0x8ec/0x9b0 [13058.354756] handle_mm_fault+0x280/0x460 [13058.354762] do_page_fault+0x3ec/0x890 [13058.354765] do_mem_abort+0xc0/0x1b0 [13058.354768] el0_da+0x24/0x28 [13058.354770] [13058.354773] Allocated by task 19731: [13058.354780] kasan_kmalloc+0xe0/0x190 [13058.354784] kasan_slab_alloc+0x14/0x20 [13058.354788] kmem_cache_alloc_node+0x130/0x440 [13058.354793] bfq_get_queue+0x138/0x858 [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328 [13058.354801] bfq_init_rq+0x1f4/0x1180 [13058.354806] bfq_insert_requests+0x264/0x1c98 [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488 [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0 [13058.354826] blk_flush_plug_list+0x230/0x548 [13058.354830] blk_finish_plug+0x60/0x80 [13058.354838] read_pages+0xec/0x2c0 [13058.354842] __do_page_cache_readahead+0x374/0x438 [13058.354846] ondemand_readahead+0x24c/0x6b0 [13058.354851] page_cache_sync_readahead+0x17c/0x2f8 [13058.354858] generic_file_buffered_read+0x588/0xc58 [13058.354862] generic_file_read_iter+0x1b4/0x278 [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4] [13058.354972] __vfs_read+0x238/0x320 [13058.354976] vfs_read+0xbc/0x1c0 [13058.354980] ksys_read+0xdc/0x1b8 [13058.354984] __arm64_sys_read+0x50/0x60 [13058.354990] el0_svc_common+0xb4/0x1d8 [13058.354994] el0_svc_handler+0x50/0xa8 [13058.354998] el0_svc+0x8/0xc [13058.354999] [13058.355001] Freed by task 19731: [13058.355007] __kasan_slab_free+0x120/0x228 [13058.355010] kasan_slab_free+0x10/0x18 [13058.355014] kmem_cache_free+0x288/0x3f0 [13058.355018] bfq_put_queue+0x134/0x208 [13058.355022] bfq_exit_icq_bfqq+0x164/0x348 [13058.355026] bfq_exit_icq+0x28/0x40 [13058.355030] ioc_exit_icq+0xa0/0x150 [13058.355035] put_io_context_active+0x250/0x438 [13058.355038] exit_io_context+0xd0/0x138 [13058.355045] do_exit+0x734/0xc58 [13058.355050] do_group_exit+0x78/0x220 [13058.355054] __wake_up_parent+0x0/0x50 [13058.355058] el0_svc_common+0xb4/0x1d8 [13058.355062] el0_svc_handler+0x50/0xa8 [13058.355066] el0_svc+0x8/0xc [13058.355067] [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70#012 which belongs to the cache bfq_queue of size 464 [13058.355075] The buggy address is located 264 bytes inside of#012 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040) [13058.355077] The buggy address belongs to the page: [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0 [13058.366175] flags: 0x2ffffe0000008100(slab|head) [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780 [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000 [13058.370789] page dumped because: kasan: bad access detected [13058.370791] [13058.370792] Memory state around the buggy address: [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370808] ^ [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [13058.370817] ================================================================== [13058.370820] Disabling lock debugging due to kernel taint Here, we directly pass the bfqd to bfq_idle_slice_timer_body func. -- V2->V3: rewrite the comment as suggested by Paolo Valente V1->V2: add one comment, and add Fixes and Reported-by tag. Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Acked-by: Paolo Valente <paolo.valente@linaro.org> Reported-by: Wang Wang <wangwang2@huawei.com> Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Signed-off-by: Feilong Lin <linfeilong@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03block, bfq: clarify the goal of bfq_split_bfqq()Paolo Valente1-0/+2
The exact, general goal of the function bfq_split_bfqq() is not that apparent. Add a comment to make it clear. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03block, bfq: remove ifdefs from around gets/puts of bfq groupsPaolo Valente1-5/+1
ifdefs around gets and puts of bfq groups reduce readability, remove them. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03block, bfq: extend incomplete name of field on_stPaolo Valente1-1/+1
The flag on_st in the bfq_entity data structure is true if the entity is on a service tree or is in service. Yet the name of the field, confusingly, does not mention the second, very important case. Extend the name to mention the second case too. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03block, bfq: do not insert oom queue into position treePaolo Valente1-0/+4
BFQ maintains an ordered list, implemented with an RB tree, of head-request positions of non-empty bfq_queues. This position tree, inherited from CFQ, is used to find bfq_queues that contain I/O close to each other. BFQ merges these bfq_queues into a single shared queue, if this boosts throughput on the device at hand. There is however a special-purpose bfq_queue that does not participate in queue merging, the oom bfq_queue. Yet, also this bfq_queue could be wrongly added to the position tree. So bfqq_find_close() could return the oom bfq_queue, which is a source of further troubles in an out-of-memory situation. This commit prevents the oom bfq_queue from being inserted into the position tree. Tested-by: Patrick Dung <patdung100@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03block, bfq: do not plug I/O for bfq_queues with no proc refsPaolo Valente1-0/+12
Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not referred by any process") fixed commit 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging") by descheduling an empty bfq_queue when it remains with not process reference. Yet, this still left a case uncovered: an empty bfq_queue with not process reference that remains in service. This happens for an in-service sync bfq_queue that is deemed to deserve I/O-dispatch plugging when it remains empty. Yet no new requests will arrive for such a bfq_queue if no process sends requests to it any longer. Even worse, the bfq_queue may happen to be prematurely freed while still in service (because there may remain no reference to it any longer). This commit solves this problem by preventing I/O dispatch from being plugged for the in-service bfq_queue, if the latter has no process reference (the bfq_queue is then prevented from remaining in service). Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Reported-by: Patrick Dung <patdung100@gmail.com> Tested-by: Patrick Dung <patdung100@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-22block/bfq: remove unused bfq_class_rt which never usedAlex Shi1-1/+0
This macro is never used after introduced from commit aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler") Better to remove it. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-25Merge tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-blockLinus Torvalds1-0/+4
Pull core block updates from Jens Axboe: "Due to more granular branches, this one is small and will be followed with other core branches that add specific features. I meant to just have a core and drivers branch, but external dependencies we ended up adding a few more that are also core. The changes are: - Fixes and improvements for the zoned device support (Ajay, Damien) - sed-opal table writing and datastore UID (Revanth) - blk-cgroup (and bfq) blk-cgroup stat fixes (Tejun) - Improvements to the block stats tracking (Pavel) - Fix for overruning sysfs buffer for large number of CPUs (Ming) - Optimization for small IO (Ming, Christoph) - Fix typo in RWH lifetime hint (Eugene) - Dead code removal and documentation (Bart) - Reduction in memory usage for queue and tag set (Bart) - Kerneldoc header documentation (André) - Device/partition revalidation fixes (Jan) - Stats tracking for flush requests (Konstantin) - Various other little fixes here and there (et al)" * tag 'for-5.5/block-20191121' of git://git.kernel.dk/linux-block: (48 commits) Revert "block: split bio if the only bvec's length is > SZ_4K" block: add iostat counters for flush requests block,bfq: Skip tracing hooks if possible block: sed-opal: Introduce SUM_SET_LIST parameter and append it using 'add_token_u64' blk-cgroup: cgroup_rstat_updated() shouldn't be called on cgroup1 block: Don't disable interrupts in trigger_softirq() sbitmap: Delete sbitmap_any_bit_clear() blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue() block: split bio if the only bvec's length is > SZ_4K block: still try to split bio if the bvec crosses pages blk-cgroup: separate out blkg_rwstat under CONFIG_BLK_CGROUP_RWSTAT blk-cgroup: reimplement basic IO stats using cgroup rstat blk-cgroup: remove now unused blkg_print_stat_{bytes|ios}_recursive() blk-throtl: stop using blkg->stat_bytes and ->stat_ios bfq-iosched: stop using blkg->stat_bytes and ->stat_ios bfq-iosched: relocate bfqg_*rwstat*() helpers block: add zone open, close and finish ioctl support block: add zone open, close and finish operations block: Simplify REQ_OP_ZONE_RESET_ALL handling block: Remove REQ_OP_ZONE_RESET plugging ...
2019-11-14block, bfq: deschedule empty bfq_queues not referred by any processPaolo Valente1-6/+26
Since commit 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging"), to prevent the service guarantees of a bfq_queue from being violated, the bfq_queue may be left busy, i.e., scheduled for service, even if empty (see comments in __bfq_bfqq_expire() for details). But, if no process will send requests to the bfq_queue any longer, then there is no point in keeping the bfq_queue scheduled for service. In addition, keeping the bfq_queue scheduled for service, but with no process reference any longer, may cause the bfq_queue to be freed when descheduled from service. But this is assumed to never happen, and causes a UAF if it happens. This, in turn, caused crashes [1, 2]. This commit fixes this issue by descheduling an empty bfq_queue when it remains with not process reference. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1767539 [2] https://bugzilla.kernel.org/show_bug.cgi?id=205447 Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging") Reported-by: Chris Evich <cevich@redhat.com> Reported-by: Patrick Dung <patdung100@gmail.com> Reported-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Thorsten Schubert <tschubert@bafh.org> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-07bfq-iosched: stop using blkg->stat_bytes and ->stat_iosTejun Heo1-0/+4
When used on cgroup1, bfq uses the blkg->stat_bytes and ->stat_ios from blk-cgroup core to populate six stat knobs. blk-cgroup core is moving away from blkg_rwstat to improve scalability and won't be able to support this usage. It isn't like the sharing gains all that much. Let's break it out to dedicated rwstat counters which are updated when on cgroup1. This makes use of bfqg_*rwstat*() helpers outside of CONFIG_BFQ_CGROUP_DEBUG. Move them out. v2: Compile fix when !CONFIG_BFQ_CGROUP_DEBUG. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: push up injection only after setting service timePaolo Valente1-5/+7
If equal to 0, the injection limit for a bfq_queue is pushed to 1 after a first sample of the total service time of the I/O requests of the queue is computed (to allow injection to start). Yet, because of a mistake in the branch that performs this action, the push may happen also in some other case. This commit fixes this issue. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: increase update frequency of inject limitPaolo Valente1-1/+1
The update period of the injection limit has been tentatively set to 100 ms, to reduce fluctuations. This value however proved to cause, occasionally, the limit to be decremented for some bfq_queue only after the queue underwent excessive injection for a lot of time. This commit reduces the period to 10 ms. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: reduce upper bound for inject limit to max_rq_in_driver+1Paolo Valente1-1/+1
Upon an increment attempt of the injection limit, the latter is constrained not to become higher than twice the maximum number max_rq_in_driver of I/O requests that have happened to be in service in the drive. This high bound allows the injection limit to grow beyond max_rq_in_driver, which may then cause max_rq_in_driver itself to grow. However, since the limit is incremented by only one unit at a time, there is no need for such a high bound, and just max_rq_in_driver+1 is enough. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17block, bfq: update inject limit only after injection occurredPaolo Valente1-2/+17
BFQ updates the injection limit of each bfq_queue as a function of how much the limit inflates the service times experienced by the I/O requests of the queue. So only service times affected by injection must be taken into account. Unfortunately, in the current implementation of this update scheme, the service time of an I/O request rq not affected by injection may happen to be considered in the following case: there is no I/O request in service when rq arrives. This commit fixes this issue by making sure that only service times affected by injection are considered for updating the injection limit. In particular, the service time of an I/O request rq is now considered only if at least one of the following two conditions holds: - the destination bfq_queue for rq underwent injection before rq arrival, and there is still I/O in service in the drive on rq arrival (the service of such unfinished I/O may delay the service of rq); - injection occurs between the arrival and the completion time of rq. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-08block, bfq: handle NULL return value by bfq_init_rq()Paolo Valente1-3/+11
As reported in [1], the call bfq_init_rq(rq) may return NULL in case of OOM (in particular, if rq->elv.icq is NULL because memory allocation failed in failed in ioc_create_icq()). This commit handles this circumstance. [1] https://lkml.org/lkml/2019/7/22/824 Cc: Hsin-Yi Wang <hsinyi@google.com> Cc: Nicolas Boichat <drinkcat@chromium.org> Cc: Doug Anderson <dianders@chromium.org> Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Hsin-Yi Wang <hsinyi@google.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-08block, bfq: move update of waker and woken list to queue freeingPaolo Valente1-15/+29
Since commit 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O"), every bfq_queue has a pointer to a waker bfq_queue and a list of the bfq_queues it may wake. In this respect, when a bfq_queue, say Q, remains with no I/O source attached to it, Q cannot be woken by any other bfq_queue, and cannot wake any other bfq_queue. Then Q must be removed from the woken list of its possible waker bfq_queue, and all bfq_queues in the woken list of Q must stop having a waker bfq_queue. Q remains with no I/O source in two cases: when the last process associated with Q exits or when such a process gets associated with a different bfq_queue. Unfortunately, commit 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O") performed the above updates only in the first case. This commit fixes this bug by moving these updates to when Q gets freed. This is a simple and safe way to handle all cases, as both the above events, process exit and re-association, lead to Q being freed soon, and because dangling references would come out only after Q gets freed (if no update were performed). Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O") Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-08block, bfq: reset last_completed_rq_bfqq if the pointed queue is freedPaolo Valente1-3/+7
Since commit 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O"), BFQ stores, in a per-device pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O request completed. If some bfq_queue receives new I/O right after the last request of last_completed_rq_bfqq has been completed, then last_completed_rq_bfqq may be a waker bfq_queue. But if the bfq_queue last_completed_rq_bfqq points to is freed, then last_completed_rq_bfqq becomes a dangling reference. This commit resets last_completed_rq_bfqq if the pointed bfq_queue is freed. Fixes: 13a857a4c4e8 ("block, bfq: detect wakers and unconditionally inject their I/O") Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-26Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-blockLinus Torvalds1-24/+43
Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...
2019-07-18block, bfq: check also in-flight I/O in dispatch pluggingPaolo Valente1-24/+43
Consider a sync bfq_queue Q that remains empty while in service, and suppose that, when this happens, there is a fair amount of already in-flight I/O not belonging to Q. In such a situation, I/O dispatching may need to be plugged (until new I/O arrives for Q), for the following reason. The drive may decide to serve in-flight non-Q's I/O requests before Q's ones, thereby delaying the arrival of new I/O requests for Q (recall that Q is sync). If I/O-dispatching is not plugged, then, while Q remains empty, a basically uncontrolled amount of I/O from other queues may be dispatched too, possibly causing the service of Q's I/O to be delayed even longer in the drive. This problem gets more and more serious as the speed and the queue depth of the drive grow, because, as these two quantities grow, the probability to find no queue busy but many requests in flight grows too. If Q has the same weight and priority as the other queues, then the above delay is unlikely to cause any issue, because all queues tend to undergo the same treatment. So, since not plugging I/O dispatching is convenient for throughput, it is better not to plug. Things change in case Q has a higher weight or priority than some other queue, because Q's service guarantees may simply be violated. For this reason, commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") does plug I/O in such an asymmetric scenario. Plugging minimizes the delay induced by already in-flight I/O, and enables Q to recover the bandwidth it may lose because of this delay. Yet the above commit does not cover the case of weight-raised queues, for efficiency concerns. For weight-raised queues, I/O-dispatch plugging is activated simply if not all bfq_queues are weight-raised. But this check does not handle the case of in-flight requests, because a bfq_queue may become non busy *before* all its in-flight requests are completed. This commit performs I/O-dispatch plugging for weight-raised queues if there are some in-flight requests. As a practical example of the resulting recover of control, under write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5 seconds after this fix, against 15 seconds before the fix (as a reference, gnome-terminal takes about 35 seconds to start with any of the other I/O schedulers). Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-15docs: block: convert to ReSTMauro Carvalho Chehab1-1/+1
Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2019-07-09Merge tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-blockLinus Torvalds1-296/+671
Pull block updates from Jens Axboe: "This is the main block updates for 5.3. Nothing earth shattering or major in here, just fixes, additions, and improvements all over the map. This contains: - Series of documentation fixes (Bart) - Optimization of the blk-mq ctx get/put (Bart) - null_blk removal race condition fix (Bob) - req/bio_op() cleanups (Chaitanya) - Series cleaning up the segment accounting, and request/bio mapping (Christoph) - Series cleaning up the page getting/putting for bios (Christoph) - block cgroup cleanups and moving it to where it is used (Christoph) - block cgroup fixes (Tejun) - Series of fixes and improvements to bcache, most notably a write deadlock fix (Coly) - blk-iolatency STS_AGAIN and accounting fixes (Dennis) - Series of improvements and fixes to BFQ (Douglas, Paolo) - debugfs_create() return value check removal for drbd (Greg) - Use struct_size(), where appropriate (Gustavo) - Two lighnvm fixes (Heiner, Geert) - MD fixes, including a read balance and corruption fix (Guoqing, Marcos, Xiao, Yufen) - block opal shadow mbr additions (Jonas, Revanth) - sbitmap compare-and-exhange improvemnts (Pavel) - Fix for potential bio->bi_size overflow (Ming) - NVMe pull requests: - improved PCIe suspent support (Keith Busch) - error injection support for the admin queue (Akinobu Mita) - Fibre Channel discovery improvements (James Smart) - tracing improvements including nvmetc tracing support (Minwoo Im) - misc fixes and cleanups (Anton Eidelman, Minwoo Im, Chaitanya Kulkarni)" - Various little fixes and improvements to drivers and core" * tag 'for-5.3/block-20190708' of git://git.kernel.dk/linux-block: (153 commits) blk-iolatency: fix STS_AGAIN handling block: nr_phys_segments needs to be zero for REQ_OP_WRITE_ZEROES blk-mq: simplify blk_mq_make_request() blk-mq: remove blk_mq_put_ctx() sbitmap: Replace cmpxchg with xchg block: fix .bi_size overflow block: sed-opal: check size of shadow mbr block: sed-opal: ioctl for writing to shadow mbr block: sed-opal: add ioctl for done-mark of shadow mbr block: never take page references for ITER_BVEC direct-io: use bio_release_pages in dio_bio_complete block_dev: use bio_release_pages in bio_unmap_user block_dev: use bio_release_pages in blkdev_bio_end_io iomap: use bio_release_pages in iomap_dio_bio_end_io block: use bio_release_pages in bio_map_user_iov block: use bio_release_pages in bio_unmap_user block: optionally mark pages dirty in bio_release_pages block: move the BIO_NO_PAGE_REF check into bio_release_pages block: skd_main.c: Remove call to memset after dma_alloc_coherent block: mtip32xx: Remove call to memset after dma_alloc_coherent ...
2019-06-28block, bfq: NULL out the bic when it's no longer validDouglas Anderson1-0/+1
In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-26block, bfq: Init saved_wr_start_at_switch_to_srt in unlikely caseDouglas Anderson1-0/+1
Some debug code suggested by Paolo was tripping when I did reboot stress tests. Specifically in bfq_bfqq_resume_state() "bic->saved_wr_start_at_switch_to_srt" was later than the current value of "jiffies". A bit of debugging showed that "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more debugging showed that was because we had run through the "unlikely" case in the bfq_bfqq_save_state() function. Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to something sane. NOTE: this fixes no known real-world errors. Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: fix operator in BFQQ_TOTALLY_SEEKYPaolo Valente1-1/+1
By mistake, there is a '&' instead of a '==' in the definition of the macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with the correct one. Fixes: 7074f076ff15 ("block, bfq: do not tag totally seeky queues as soft rt") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: re-schedule empty queues if they deserve I/O pluggingPaolo Valente1-184/+203
Consider, on one side, a bfq_queue Q that remains empty while in service, and, on the other side, the pending I/O of bfq_queues that, according to their timestamps, have to be served after Q. If an uncontrolled amount of I/O from the latter bfq_queues were dispatched while Q is waiting for its new I/O to arrive, then Q's bandwidth guarantees would be violated. To prevent this, I/O dispatch is plugged until Q receives new I/O (except for a properly controlled amount of injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging, for the following reason. Preemption is performed in two steps. First, Q is expired and re-scheduled. Second, the new bfq_queue to serve is chosen. The first step is needed by the second, as the second can be performed only after Q's timestamps have been properly updated (done in the expiration step), and Q has been re-queued for service. This dependency is a consequence of the way how BFQ's scheduling algorithm is currently implemented. But Q is not re-scheduled at all in the first step, because Q is empty. As a consequence, an uncontrolled amount of I/O may be dispatched until Q becomes non empty again. This breaks Q's service guarantees. This commit addresses this issue by re-scheduling Q even if it is empty. This in turn breaks the assumption that all scheduled queues are non empty. Then a few extra checks are now needed. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: preempt lower-weight or lower-priority queuesPaolo Valente1-20/+75
BFQ enqueues the I/O coming from each process into a separate bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be served for at most timeout_sync milliseconds (default: 125 ms). This service scheme is prone to the following inaccuracy. While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may receive I/O, and, according to BFQ's scheduling policy, may become the right bfq_queue to serve, in place of the currently in-service bfq_queue. In this respect, postponing the service of Q2 to after the service of Q1 finishes may delay the completion of Q2's I/O, compared with an ideal service in which all non-empty bfq_queues are served in parallel, and every non-empty bfq_queue is served at a rate proportional to the bfq_queue's weight. This additional delay is equal at most to the time Q1 may unjustly remain in service before switching to Q2. If Q1 and Q2 have the same weight, then this time is most likely negligible compared with the completion time to be guaranteed to Q2's I/O. In addition, first, one of the reasons why BFQ may want to serve Q1 for a while is that this boosts throughput and, second, serving Q1 longer reduces BFQ's overhead. As a conclusion, it is usually better not to preempt Q1 if both Q1 and Q2 have the same weight. In contrast, as Q2's weight or priority becomes higher and higher compared with that of Q1, the above delay becomes larger and larger, compared with the I/O completion times that have to be guaranteed to Q2 according to Q2's weight. So reducing this delay may be more important than avoiding the costs of preempting Q1. Accordingly, this commit preempts Q1 if Q2 has a higher weight or a higher priority than Q1. Preemption causes Q1 to be re-scheduled, and triggers a new choice of the next bfq_queue to serve. If Q2 really is the next bfq_queue to serve, then Q2 will be set in service immediately. This change reduces the component of the I/O latency caused by the above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5 SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for a process doing sporadic random reads while another process is doing continuous sequential reads. Signed-off-by: Nicola Bottura <bottura.nicola95@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-25block, bfq: detect wakers and unconditionally inject their I/OPaolo Valente1-33/+237
A bfq_queue Q may happen to be synchronized with another bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to receive new I/O. We call Q2 "waker queue". If I/O plugging is being performed for Q, and Q is not receiving any more I/O because of the above synchronization, then, thanks to BFQ's injection mechanism, the waker queue is likely to get served before the I/O-plugging timeout fires. Unfortunately, this fact may not be sufficient to guarantee a high throughput during the I/O plugging, because the inject limit for Q may be too low to guarantee a lot of injected I/O. In addition, the duration of the plugging, i.e., the time before Q finally receives new I/O, may not be minimized, because the waker queue may happen to be served only after other queues. To address these issues, this commit introduces the explicit detection of the waker queue, and the unconditional injection of a pending I/O request of the waker queue on each invocation of bfq_dispatch_request(). One may be concerned that this systematic injection of I/O from the waker queue delays the service of Q's I/O. Fortunately, it doesn't. On the contrary, next Q's I/O is brought forward dramatically, for it is not blocked for milliseconds. Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>