author:    Linus Torvalds <torvalds@linux-foundation.org>  2025-06-06 13:12:50 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org>  2025-06-06 13:12:50 -0700
commit:    6d8854216ebb60959ddb6f4ea4123bd449ba6cf6
tree:      065c7ffc29952ae309fcb815fbd4f81423655292
parent:    794a54920781162c4503acea62d88e725726e319
parent:    6f65947a1e684db28b9407ea51927ed5157caf41
download:  linux-6d8854216ebb60959ddb6f4ea4123bd449ba6cf6.tar.gz
Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Christoph:
- TCP error handling fix (Shin'ichiro Kawasaki)
- TCP I/O stall handling fixes (Hannes Reinecke)
- fix command limits status code (Keith Busch)
- support vectored buffers also for passthrough (Pavel Begunkov)
- spelling fixes (Yi Zhang)
- MD pull request via Yu:
- fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
- fix max_write_behind setting for dm-raid
- some minor cleanups
- Integrity data direction fix and cleanup
- bcache NULL pointer fix
- Fix for loop missing write start/end handling
- Decouple hardware queues and IO threads in ublk
- Slew of ublk selftests additions and updates
* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
nvme: spelling fixes
nvme-tcp: fix I/O stalls on congested sockets
nvme-tcp: sanitize request list handling
nvme-tcp: remove tag set when second admin queue config fails
nvme: enable vectored registered bufs for passthrough cmds
nvme: fix implicit bool to flags conversion
nvme: fix command limits status code
selftests: ublk: kublk: improve behavior on init failure
block: flip iter directions in blk_rq_integrity_map_user()
block: drop direction param from bio_integrity_copy_user()
selftests: ublk: cover PER_IO_DAEMON in more stress tests
Documentation: ublk: document UBLK_F_PER_IO_DAEMON
selftests: ublk: add stress test for per io daemons
selftests: ublk: add functional test for per io daemons
selftests: ublk: kublk: decouple ublk_queues from ublk server threads
selftests: ublk: kublk: move per-thread data out of ublk_queue
selftests: ublk: kublk: lift queue initialization out of thread
selftests: ublk: kublk: tie sqe allocation to io instead of queue
selftests: ublk: kublk: plumb q_id in io_uring user_data
ublk: have a per-io daemon instead of a per-queue daemon
...
48 files changed, 693 insertions, 378 deletions
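
The ublk work in this merge decouples I/O handler threads from hardware queues and advertises that via a new feature bit, UBLK_F_PER_IO_DAEMON (1ULL << 13 in the include/uapi/linux/ublk_cmd.h hunk below). The short userspace sketch that follows only illustrates the kind of gating a ublk server might do on that bit; pick_nr_threads() and the hard-coded values are illustrative assumptions, not part of the patch.

  #include <stdint.h>
  #include <stdio.h>

  /* Feature bit added in the include/uapi/linux/ublk_cmd.h hunk of this merge. */
  #define UBLK_F_PER_IO_DAEMON (1ULL << 13)

  /*
   * Hypothetical helper: with UBLK_F_PER_IO_DAEMON the server may run any
   * number of I/O threads and spread (qid, tag) pairs across them; without
   * it, every I/O of a queue must be handled by the same task, so fall back
   * to one thread per hardware queue.
   */
  static unsigned int pick_nr_threads(uint64_t dev_flags, unsigned int nr_hw_queues,
                                      unsigned int requested_threads)
  {
          if (dev_flags & UBLK_F_PER_IO_DAEMON)
                  return requested_threads;
          return nr_hw_queues;
  }

  int main(void)
  {
          /* Stand-in values; a real server reads them from ublksrv_ctrl_dev_info. */
          printf("threads to start: %u\n",
                 pick_nr_threads(UBLK_F_PER_IO_DAEMON, 4, 12));
          return 0;
  }

The fallback branch mirrors the pre-existing requirement, kept for drivers without the bit, that all I/Os of a queue share one daemon task.
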
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst index 854f823b46c2ad..c368e1081b4111 100644 --- a/Documentation/block/ublk.rst +++ b/Documentation/block/ublk.rst @@ -115,15 +115,15 @@ managing and controlling ublk devices with help of several control commands: - ``UBLK_CMD_START_DEV`` - After the server prepares userspace resources (such as creating per-queue - pthread & io_uring for handling ublk IO), this command is sent to the + After the server prepares userspace resources (such as creating I/O handler + threads & io_uring for handling ublk IO), this command is sent to the driver for allocating & exposing ``/dev/ublkb*``. Parameters set via ``UBLK_CMD_SET_PARAMS`` are applied for creating the device. - ``UBLK_CMD_STOP_DEV`` Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns, - ublk server will release resources (such as destroying per-queue pthread & + ublk server will release resources (such as destroying I/O handler threads & io_uring). - ``UBLK_CMD_DEL_DEV`` @@ -208,15 +208,15 @@ managing and controlling ublk devices with help of several control commands: modify how I/O is handled while the ublk server is dying/dead (this is called the ``nosrv`` case in the driver code). - With just ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublk server's io - handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole + With just ``UBLK_F_USER_RECOVERY`` set, after the ublk server exits, + ublk does not delete ``/dev/ublkb*`` during the whole recovery stage and ublk device ID is kept. It is ublk server's responsibility to recover the device context by its own knowledge. Requests which have not been issued to userspace are requeued. Requests which have been issued to userspace are aborted. - With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after one ubq_daemon - (ublk server's io handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``, + With ``UBLK_F_USER_RECOVERY_REISSUE`` additionally set, after the ublk server + exits, contrary to ``UBLK_F_USER_RECOVERY``, requests which have been issued to userspace are requeued and will be re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``. ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends who tolerate @@ -241,10 +241,11 @@ can be controlled/accessed just inside this container. Data plane ---------- -ublk server needs to create per-queue IO pthread & io_uring for handling IO -commands via io_uring passthrough. The per-queue IO pthread -focuses on IO handling and shouldn't handle any control & management -tasks. +The ublk server should create dedicated threads for handling I/O. Each +thread should have its own io_uring through which it is notified of new +I/O, and through which it can complete I/O. These dedicated threads +should focus on IO handling and shouldn't handle any control & +management tasks. The's IO is assigned by a unique tag, which is 1:1 mapping with IO request of ``/dev/ublkb*``. @@ -265,6 +266,18 @@ with specified IO tag in the command data: destined to ``/dev/ublkb*``. This command is sent only once from the server IO pthread for ublk driver to setup IO forward environment. + Once a thread issues this command against a given (qid,tag) pair, the thread + registers itself as that I/O's daemon. In the future, only that I/O's daemon + is allowed to issue commands against the I/O. If any other thread attempts + to issue a command against a (qid,tag) pair for which the thread is not the + daemon, the command will fail. 
Daemons can be reset only be going through + recovery. + + The ability for every (qid,tag) pair to have its own independent daemon task + is indicated by the ``UBLK_F_PER_IO_DAEMON`` feature. If this feature is not + supported by the driver, daemons must be per-queue instead - i.e. all I/Os + associated to a single qid must be handled by the same task. + - ``UBLK_IO_COMMIT_AND_FETCH_REQ`` When an IO request is destined to ``/dev/ublkb*``, the driver stores diff --git a/block/bio-integrity.c b/block/bio-integrity.c index cb94e9be26dc21..10912988c8f5c7 100644 --- a/block/bio-integrity.c +++ b/block/bio-integrity.c @@ -154,10 +154,9 @@ int bio_integrity_add_page(struct bio *bio, struct page *page, EXPORT_SYMBOL(bio_integrity_add_page); static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec, - int nr_vecs, unsigned int len, - unsigned int direction) + int nr_vecs, unsigned int len) { - bool write = direction == ITER_SOURCE; + bool write = op_is_write(bio_op(bio)); struct bio_integrity_payload *bip; struct iov_iter iter; void *buf; @@ -168,7 +167,7 @@ static int bio_integrity_copy_user(struct bio *bio, struct bio_vec *bvec, return -ENOMEM; if (write) { - iov_iter_bvec(&iter, direction, bvec, nr_vecs, len); + iov_iter_bvec(&iter, ITER_SOURCE, bvec, nr_vecs, len); if (!copy_from_iter_full(buf, len, &iter)) { ret = -EFAULT; goto free_buf; @@ -264,7 +263,7 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter) struct page *stack_pages[UIO_FASTIOV], **pages = stack_pages; struct bio_vec stack_vec[UIO_FASTIOV], *bvec = stack_vec; size_t offset, bytes = iter->count; - unsigned int direction, nr_bvecs; + unsigned int nr_bvecs; int ret, nr_vecs; bool copy; @@ -273,11 +272,6 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter) if (bytes >> SECTOR_SHIFT > queue_max_hw_sectors(q)) return -E2BIG; - if (bio_data_dir(bio) == READ) - direction = ITER_DEST; - else - direction = ITER_SOURCE; - nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS + 1); if (nr_vecs > BIO_MAX_VECS) return -E2BIG; @@ -300,8 +294,7 @@ int bio_integrity_map_user(struct bio *bio, struct iov_iter *iter) copy = true; if (copy) - ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes, - direction); + ret = bio_integrity_copy_user(bio, bvec, nr_bvecs, bytes); else ret = bio_integrity_init_user(bio, bvec, nr_bvecs, bytes); if (ret) diff --git a/block/blk-integrity.c b/block/blk-integrity.c index a1678f0a9f81f9..e4e2567061f9db 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -117,13 +117,8 @@ int blk_rq_integrity_map_user(struct request *rq, void __user *ubuf, { int ret; struct iov_iter iter; - unsigned int direction; - if (op_is_write(req_op(rq))) - direction = ITER_DEST; - else - direction = ITER_SOURCE; - iov_iter_ubuf(&iter, direction, ubuf, bytes); + iov_iter_ubuf(&iter, rq_data_dir(rq), ubuf, bytes); ret = bio_integrity_map_user(rq->bio, &iter); if (ret) return ret; diff --git a/drivers/block/loop.c b/drivers/block/loop.c index e2b1f377f58563..f8d136684109aa 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -308,11 +308,14 @@ end_io: static void lo_rw_aio_do_completion(struct loop_cmd *cmd) { struct request *rq = blk_mq_rq_from_pdu(cmd); + struct loop_device *lo = rq->q->queuedata; if (!atomic_dec_and_test(&cmd->ref)) return; kfree(cmd->bvec); cmd->bvec = NULL; + if (req_op(rq) == REQ_OP_WRITE) + file_end_write(lo->lo_backing_file); if (likely(!blk_should_fake_timeout(rq->q))) blk_mq_complete_request(rq); } @@ -387,9 +390,10 @@ static int lo_rw_aio(struct 
loop_device *lo, struct loop_cmd *cmd, cmd->iocb.ki_flags = 0; } - if (rw == ITER_SOURCE) + if (rw == ITER_SOURCE) { + file_start_write(lo->lo_backing_file); ret = file->f_op->write_iter(&cmd->iocb, &iter); - else + } else ret = file->f_op->read_iter(&cmd->iocb, &iter); lo_rw_aio_do_completion(cmd); diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c index 6f51072776f1d7..c637ea010d3474 100644 --- a/drivers/block/ublk_drv.c +++ b/drivers/block/ublk_drv.c @@ -69,7 +69,8 @@ | UBLK_F_USER_RECOVERY_FAIL_IO \ | UBLK_F_UPDATE_SIZE \ | UBLK_F_AUTO_BUF_REG \ - | UBLK_F_QUIESCE) + | UBLK_F_QUIESCE \ + | UBLK_F_PER_IO_DAEMON) #define UBLK_F_ALL_RECOVERY_FLAGS (UBLK_F_USER_RECOVERY \ | UBLK_F_USER_RECOVERY_REISSUE \ @@ -166,6 +167,8 @@ struct ublk_io { /* valid if UBLK_IO_FLAG_OWNED_BY_SRV is set */ struct request *req; }; + + struct task_struct *task; }; struct ublk_queue { @@ -173,11 +176,9 @@ struct ublk_queue { int q_depth; unsigned long flags; - struct task_struct *ubq_daemon; struct ublksrv_io_desc *io_cmd_buf; bool force_abort; - bool timeout; bool canceling; bool fail_io; /* copy of dev->state == UBLK_S_DEV_FAIL_IO */ unsigned short nr_io_ready; /* how many ios setup */ @@ -1099,11 +1100,6 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu( return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu); } -static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq) -{ - return !ubq->ubq_daemon || ubq->ubq_daemon->flags & PF_EXITING; -} - /* todo: handle partial completion */ static inline void __ublk_complete_rq(struct request *req) { @@ -1275,13 +1271,13 @@ static void ublk_dispatch_req(struct ublk_queue *ubq, /* * Task is exiting if either: * - * (1) current != ubq_daemon. + * (1) current != io->task. * io_uring_cmd_complete_in_task() tries to run task_work - * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING. + * in a workqueue if cmd's task is PF_EXITING. * * (2) current->flags & PF_EXITING. 
*/ - if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) { + if (unlikely(current != io->task || current->flags & PF_EXITING)) { __ublk_abort_rq(ubq, req); return; } @@ -1330,24 +1326,22 @@ static void ublk_cmd_list_tw_cb(struct io_uring_cmd *cmd, { struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); struct request *rq = pdu->req_list; - struct ublk_queue *ubq = pdu->ubq; struct request *next; do { next = rq->rq_next; rq->rq_next = NULL; - ublk_dispatch_req(ubq, rq, issue_flags); + ublk_dispatch_req(rq->mq_hctx->driver_data, rq, issue_flags); rq = next; } while (rq); } -static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l) +static void ublk_queue_cmd_list(struct ublk_io *io, struct rq_list *l) { - struct request *rq = rq_list_peek(l); - struct io_uring_cmd *cmd = ubq->ios[rq->tag].cmd; + struct io_uring_cmd *cmd = io->cmd; struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); - pdu->req_list = rq; + pdu->req_list = rq_list_peek(l); rq_list_init(l); io_uring_cmd_complete_in_task(cmd, ublk_cmd_list_tw_cb); } @@ -1355,13 +1349,10 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l) static enum blk_eh_timer_return ublk_timeout(struct request *rq) { struct ublk_queue *ubq = rq->mq_hctx->driver_data; + struct ublk_io *io = &ubq->ios[rq->tag]; if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) { - if (!ubq->timeout) { - send_sig(SIGKILL, ubq->ubq_daemon, 0); - ubq->timeout = true; - } - + send_sig(SIGKILL, io->task, 0); return BLK_EH_DONE; } @@ -1429,24 +1420,25 @@ static void ublk_queue_rqs(struct rq_list *rqlist) { struct rq_list requeue_list = { }; struct rq_list submit_list = { }; - struct ublk_queue *ubq = NULL; + struct ublk_io *io = NULL; struct request *req; while ((req = rq_list_pop(rqlist))) { struct ublk_queue *this_q = req->mq_hctx->driver_data; + struct ublk_io *this_io = &this_q->ios[req->tag]; - if (ubq && ubq != this_q && !rq_list_empty(&submit_list)) - ublk_queue_cmd_list(ubq, &submit_list); - ubq = this_q; + if (io && io->task != this_io->task && !rq_list_empty(&submit_list)) + ublk_queue_cmd_list(io, &submit_list); + io = this_io; - if (ublk_prep_req(ubq, req, true) == BLK_STS_OK) + if (ublk_prep_req(this_q, req, true) == BLK_STS_OK) rq_list_add_tail(&submit_list, req); else rq_list_add_tail(&requeue_list, req); } - if (ubq && !rq_list_empty(&submit_list)) - ublk_queue_cmd_list(ubq, &submit_list); + if (!rq_list_empty(&submit_list)) + ublk_queue_cmd_list(io, &submit_list); *rqlist = requeue_list; } @@ -1474,17 +1466,6 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq) /* All old ioucmds have to be completed */ ubq->nr_io_ready = 0; - /* - * old daemon is PF_EXITING, put it now - * - * It could be NULL in case of closing one quisced device. - */ - if (ubq->ubq_daemon) - put_task_struct(ubq->ubq_daemon); - /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */ - ubq->ubq_daemon = NULL; - ubq->timeout = false; - for (i = 0; i < ubq->q_depth; i++) { struct ublk_io *io = &ubq->ios[i]; @@ -1495,6 +1476,17 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq) io->flags &= UBLK_IO_FLAG_CANCELED; io->cmd = NULL; io->addr = 0; + + /* + * old task is PF_EXITING, put it now + * + * It could be NULL in case of closing one quiesced + * device. 
+ */ + if (io->task) { + put_task_struct(io->task); + io->task = NULL; + } } } @@ -1516,7 +1508,7 @@ static void ublk_reset_ch_dev(struct ublk_device *ub) for (i = 0; i < ub->dev_info.nr_hw_queues; i++) ublk_queue_reinit(ub, ublk_get_queue(ub, i)); - /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */ + /* set to NULL, otherwise new tasks cannot mmap io_cmd_buf */ ub->mm = NULL; ub->nr_queues_ready = 0; ub->nr_privileged_daemon = 0; @@ -1783,6 +1775,7 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd, struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd); struct ublk_queue *ubq = pdu->ubq; struct task_struct *task; + struct ublk_io *io; if (WARN_ON_ONCE(!ubq)) return; @@ -1791,13 +1784,14 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd, return; task = io_uring_cmd_get_task(cmd); - if (WARN_ON_ONCE(task && task != ubq->ubq_daemon)) + io = &ubq->ios[pdu->tag]; + if (WARN_ON_ONCE(task && task != io->task)) return; if (!ubq->canceling) ublk_start_cancel(ubq); - WARN_ON_ONCE(ubq->ios[pdu->tag].cmd != cmd); + WARN_ON_ONCE(io->cmd != cmd); ublk_cancel_cmd(ubq, pdu->tag, issue_flags); } @@ -1930,8 +1924,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq) { ubq->nr_io_ready++; if (ublk_queue_ready(ubq)) { - ubq->ubq_daemon = current; - get_task_struct(ubq->ubq_daemon); ub->nr_queues_ready++; if (capable(CAP_SYS_ADMIN)) @@ -2084,6 +2076,7 @@ static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_queue *ubq, } ublk_fill_io_cmd(io, cmd, buf_addr); + WRITE_ONCE(io->task, get_task_struct(current)); ublk_mark_io_ready(ub, ubq); out: mutex_unlock(&ub->mutex); @@ -2179,6 +2172,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd, const struct ublksrv_io_cmd *ub_cmd) { struct ublk_device *ub = cmd->file->private_data; + struct task_struct *task; struct ublk_queue *ubq; struct ublk_io *io; u32 cmd_op = cmd->cmd_op; @@ -2193,13 +2187,14 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd, goto out; ubq = ublk_get_queue(ub, ub_cmd->q_id); - if (ubq->ubq_daemon && ubq->ubq_daemon != current) - goto out; if (tag >= ubq->q_depth) goto out; io = &ubq->ios[tag]; + task = READ_ONCE(io->task); + if (task && task != current) + goto out; /* there is pending io cmd, something must be wrong */ if (io->flags & UBLK_IO_FLAG_ACTIVE) { @@ -2449,9 +2444,14 @@ static void ublk_deinit_queue(struct ublk_device *ub, int q_id) { int size = ublk_queue_cmd_buf_size(ub, q_id); struct ublk_queue *ubq = ublk_get_queue(ub, q_id); + int i; + + for (i = 0; i < ubq->q_depth; i++) { + struct ublk_io *io = &ubq->ios[i]; + if (io->task) + put_task_struct(io->task); + } - if (ubq->ubq_daemon) - put_task_struct(ubq->ubq_daemon); if (ubq->io_cmd_buf) free_pages((unsigned long)ubq->io_cmd_buf, get_order(size)); } @@ -2923,7 +2923,8 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header) ub->dev_info.flags &= UBLK_F_ALL; ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE | - UBLK_F_URING_CMD_COMP_IN_TASK; + UBLK_F_URING_CMD_COMP_IN_TASK | + UBLK_F_PER_IO_DAEMON; /* GET_DATA isn't needed any more with USER_COPY or ZERO COPY */ if (ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY | @@ -3188,14 +3189,14 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub, int ublksrv_pid = (int)header->data[0]; int ret = -EINVAL; - pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n", - __func__, ub->dev_info.nr_hw_queues, header->dev_id); - /* wait until new ubq_daemon sending all FETCH_REQ */ + 
pr_devel("%s: Waiting for all FETCH_REQs, dev id %d...\n", __func__, + header->dev_id); + if (wait_for_completion_interruptible(&ub->completion)) return -EINTR; - pr_devel("%s: All new ubq_daemons(nr: %d) are ready, dev id %d\n", - __func__, ub->dev_info.nr_hw_queues, header->dev_id); + pr_devel("%s: All FETCH_REQs received, dev id %d\n", __func__, + header->dev_id); mutex_lock(&ub->mutex); if (ublk_nosrv_should_stop_dev(ub)) diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index 2cc2eb24dc8af6..1d010067735730 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -89,8 +89,6 @@ * Test module load/unload */ -#define MAX_NEED_GC 64 -#define MAX_SAVE_PRIO 72 #define MAX_GC_TIMES 100 #define MIN_GC_NODES 100 #define GC_SLEEP_MS 100 diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c index 1a2ce1a4b4566a..1efb768b289058 100644 --- a/drivers/md/bcache/super.c +++ b/drivers/md/bcache/super.c @@ -1733,7 +1733,12 @@ static CLOSURE_CALLBACK(cache_set_flush) mutex_unlock(&b->write_lock); } - if (ca->alloc_thread) + /* + * If the register_cache_set() call to bch_cache_set_alloc() failed, + * ca has not been assigned a value and return error. + * So we need check ca is not NULL during bch_cache_set_unregister(). + */ + if (ca && ca->alloc_thread) kthread_stop(ca->alloc_thread); if (c->journal.cur) { @@ -2233,15 +2238,47 @@ static int cache_alloc(struct cache *ca) bio_init(&ca->journal.bio, NULL, ca->journal.bio.bi_inline_vecs, 8, 0); /* - * when ca->sb.njournal_buckets is not zero, journal exists, - * and in bch_journal_replay(), tree node may split, - * so bucket of RESERVE_BTREE type is needed, - * the worst situation is all journal buckets are valid journal, - * and all the keys need to replay, - * so the number of RESERVE_BTREE type buckets should be as much - * as journal buckets + * When the cache disk is first registered, ca->sb.njournal_buckets + * is zero, and it is assigned in run_cache_set(). + * + * When ca->sb.njournal_buckets is not zero, journal exists, + * and in bch_journal_replay(), tree node may split. + * The worst situation is all journal buckets are valid journal, + * and all the keys need to replay, so the number of RESERVE_BTREE + * type buckets should be as much as journal buckets. + * + * If the number of RESERVE_BTREE type buckets is too few, the + * bch_allocator_thread() may hang up and unable to allocate + * bucket. The situation is roughly as follows: + * + * 1. In bch_data_insert_keys(), if the operation is not op->replace, + * it will call the bch_journal(), which increments the journal_ref + * counter. This counter is only decremented after bch_btree_insert + * completes. + * + * 2. When calling bch_btree_insert, if the btree needs to split, + * it will call btree_split() and btree_check_reserve() to check + * whether there are enough reserved buckets in the RESERVE_BTREE + * slot. If not enough, bcache_btree_root() will repeatedly retry. + * + * 3. Normally, the bch_allocator_thread is responsible for filling + * the reservation slots from the free_inc bucket list. When the + * free_inc bucket list is exhausted, the bch_allocator_thread + * will call invalidate_buckets() until free_inc is refilled. + * Then bch_allocator_thread calls bch_prio_write() once. and + * bch_prio_write() will call bch_journal_meta() and waits for + * the journal write to complete. + * + * 4. During journal_write, journal_write_unlocked() is be called. 
+ * If journal full occurs, journal_reclaim() and btree_flush_write() + * will be called sequentially, then retry journal_write. + * + * 5. When 2 and 4 occur together, IO will hung up and cannot recover. + * + * Therefore, reserve more RESERVE_BTREE type buckets. */ - btree_buckets = ca->sb.njournal_buckets ?: 8; + btree_buckets = clamp_t(size_t, ca->sb.nbuckets >> 7, + 32, SB_JOURNAL_BUCKETS); free = roundup_pow_of_two(ca->sb.nbuckets) >> 10; if (!free) { ret = -EPERM; diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index 127138c61be58a..d296770478b299 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c @@ -1356,11 +1356,7 @@ static int parse_raid_params(struct raid_set *rs, struct dm_arg_set *as, return -EINVAL; } - /* - * In device-mapper, we specify things in sectors, but - * MD records this value in kB - */ - if (value < 0 || value / 2 > COUNTER_MAX) { + if (value < 0) { rs->ti->error = "Max write-behind limit out of range"; return -EINVAL; } diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index 37b08f26c62f5d..bd694910b01b0b 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -105,9 +105,19 @@ * */ +typedef __u16 bitmap_counter_t; + #define PAGE_BITS (PAGE_SIZE << 3) #define PAGE_BIT_SHIFT (PAGE_SHIFT + 3) +#define COUNTER_BITS 16 +#define COUNTER_BIT_SHIFT 4 +#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3) + +#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1))) +#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2))) +#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1) + #define NEEDED(x) (((bitmap_counter_t) x) & NEEDED_MASK) #define RESYNC(x) (((bitmap_counter_t) x) & RESYNC_MASK) #define COUNTER(x) (((bitmap_counter_t) x) & COUNTER_MAX) @@ -789,7 +799,7 @@ static int md_bitmap_new_disk_sb(struct bitmap *bitmap) * is a good choice? We choose COUNTER_MAX / 2 arbitrarily. 
*/ write_behind = bitmap->mddev->bitmap_info.max_write_behind; - if (write_behind > COUNTER_MAX) + if (write_behind > COUNTER_MAX / 2) write_behind = COUNTER_MAX / 2; sb->write_behind = cpu_to_le32(write_behind); bitmap->mddev->bitmap_info.max_write_behind = write_behind; @@ -1672,13 +1682,13 @@ __acquires(bitmap->lock) &(bitmap->bp[page].map[pageoff]); } -static int bitmap_startwrite(struct mddev *mddev, sector_t offset, - unsigned long sectors) +static void bitmap_start_write(struct mddev *mddev, sector_t offset, + unsigned long sectors) { struct bitmap *bitmap = mddev->bitmap; if (!bitmap) - return 0; + return; while (sectors) { sector_t blocks; @@ -1688,7 +1698,7 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset, bmc = md_bitmap_get_counter(&bitmap->counts, offset, &blocks, 1); if (!bmc) { spin_unlock_irq(&bitmap->counts.lock); - return 0; + return; } if (unlikely(COUNTER(*bmc) == COUNTER_MAX)) { @@ -1724,11 +1734,10 @@ static int bitmap_startwrite(struct mddev *mddev, sector_t offset, else sectors = 0; } - return 0; } -static void bitmap_endwrite(struct mddev *mddev, sector_t offset, - unsigned long sectors) +static void bitmap_end_write(struct mddev *mddev, sector_t offset, + unsigned long sectors) { struct bitmap *bitmap = mddev->bitmap; @@ -2205,9 +2214,9 @@ static struct bitmap *__bitmap_create(struct mddev *mddev, int slot) return ERR_PTR(err); } -static int bitmap_create(struct mddev *mddev, int slot) +static int bitmap_create(struct mddev *mddev) { - struct bitmap *bitmap = __bitmap_create(mddev, slot); + struct bitmap *bitmap = __bitmap_create(mddev, -1); if (IS_ERR(bitmap)) return PTR_ERR(bitmap); @@ -2670,7 +2679,7 @@ location_store(struct mddev *mddev, const char *buf, size_t len) } mddev->bitmap_info.offset = offset; - rv = bitmap_create(mddev, -1); + rv = bitmap_create(mddev); if (rv) goto out; @@ -3003,8 +3012,8 @@ static struct bitmap_operations bitmap_ops = { .end_behind_write = bitmap_end_behind_write, .wait_behind_writes = bitmap_wait_behind_writes, - .startwrite = bitmap_startwrite, - .endwrite = bitmap_endwrite, + .start_write = bitmap_start_write, + .end_write = bitmap_end_write, .start_sync = bitmap_start_sync, .end_sync = bitmap_end_sync, .cond_end_sync = bitmap_cond_end_sync, diff --git a/drivers/md/md-bitmap.h b/drivers/md/md-bitmap.h index 31c93019c76bf3..59e9dd45cfdec5 100644 --- a/drivers/md/md-bitmap.h +++ b/drivers/md/md-bitmap.h @@ -9,15 +9,6 @@ #define BITMAP_MAGIC 0x6d746962 -typedef __u16 bitmap_counter_t; -#define COUNTER_BITS 16 -#define COUNTER_BIT_SHIFT 4 -#define COUNTER_BYTE_SHIFT (COUNTER_BIT_SHIFT - 3) - -#define NEEDED_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 1))) -#define RESYNC_MASK ((bitmap_counter_t) (1 << (COUNTER_BITS - 2))) -#define COUNTER_MAX ((bitmap_counter_t) RESYNC_MASK - 1) - /* use these for bitmap->flags and bitmap->sb->state bit-fields */ enum bitmap_state { BITMAP_STALE = 1, /* the bitmap file is out of date or had -EIO */ @@ -72,7 +63,7 @@ struct md_bitmap_stats { struct bitmap_operations { bool (*enabled)(struct mddev *mddev); - int (*create)(struct mddev *mddev, int slot); + int (*create)(struct mddev *mddev); int (*resize)(struct mddev *mddev, sector_t blocks, int chunksize, bool init); @@ -89,10 +80,10 @@ struct bitmap_operations { void (*end_behind_write)(struct mddev *mddev); void (*wait_behind_writes)(struct mddev *mddev); - int (*startwrite)(struct mddev *mddev, sector_t offset, + void (*start_write)(struct mddev *mddev, sector_t offset, + unsigned long sectors); + void (*end_write)(struct 
mddev *mddev, sector_t offset, unsigned long sectors); - void (*endwrite)(struct mddev *mddev, sector_t offset, - unsigned long sectors); bool (*start_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks, bool degraded); void (*end_sync)(struct mddev *mddev, sector_t offset, sector_t *blocks); diff --git a/drivers/md/md.c b/drivers/md/md.c index 0fde115e921ff9..09042b060086bc 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -6225,7 +6225,7 @@ int md_run(struct mddev *mddev) } if (err == 0 && pers->sync_request && (mddev->bitmap_info.file || mddev->bitmap_info.offset)) { - err = mddev->bitmap_ops->create(mddev, -1); + err = mddev->bitmap_ops->create(mddev); if (err) pr_warn("%s: failed to create bitmap (%d)\n", mdname(mddev), err); @@ -7285,7 +7285,7 @@ static int set_bitmap_file(struct mddev *mddev, int fd) err = 0; if (mddev->pers) { if (fd >= 0) { - err = mddev->bitmap_ops->create(mddev, -1); + err = mddev->bitmap_ops->create(mddev); if (!err) err = mddev->bitmap_ops->load(mddev); @@ -7601,7 +7601,7 @@ static int update_array_info(struct mddev *mddev, mdu_array_info_t *info) mddev->bitmap_info.default_offset; mddev->bitmap_info.space = mddev->bitmap_info.default_space; - rv = mddev->bitmap_ops->create(mddev, -1); + rv = mddev->bitmap_ops->create(mddev); if (!rv) rv = mddev->bitmap_ops->load(mddev); @@ -8799,14 +8799,14 @@ static void md_bitmap_start(struct mddev *mddev, mddev->pers->bitmap_sector(mddev, &md_io_clone->offset, &md_io_clone->sectors); - mddev->bitmap_ops->startwrite(mddev, md_io_clone->offset, - md_io_clone->sectors); + mddev->bitmap_ops->start_write(mddev, md_io_clone->offset, + md_io_clone->sectors); } static void md_bitmap_end(struct mddev *mddev, struct md_io_clone *md_io_clone) { - mddev->bitmap_ops->endwrite(mddev, md_io_clone->offset, - md_io_clone->sectors); + mddev->bitmap_ops->end_write(mddev, md_io_clone->offset, + md_io_clone->sectors); } static void md_end_clone_io(struct bio *bio) diff --git a/drivers/md/raid1-10.c b/drivers/md/raid1-10.c index c7efd8aab675cc..b8b3a90697012c 100644 --- a/drivers/md/raid1-10.c +++ b/drivers/md/raid1-10.c @@ -293,3 +293,13 @@ static inline bool raid1_should_read_first(struct mddev *mddev, return false; } + +/* + * bio with REQ_RAHEAD or REQ_NOWAIT can fail at anytime, before such IO is + * submitted to the underlying disks, hence don't record badblocks or retry + * in this case. + */ +static inline bool raid1_should_handle_error(struct bio *bio) +{ + return !(bio->bi_opf & (REQ_RAHEAD | REQ_NOWAIT)); +} diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 657d481525be68..19c5a0ce5a408f 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -373,14 +373,16 @@ static void raid1_end_read_request(struct bio *bio) */ update_head_pos(r1_bio->read_disk, r1_bio); - if (uptodate) + if (uptodate) { set_bit(R1BIO_Uptodate, &r1_bio->state); - else if (test_bit(FailFast, &rdev->flags) && - test_bit(R1BIO_FailFast, &r1_bio->state)) + } else if (test_bit(FailFast, &rdev->flags) && + test_bit(R1BIO_FailFast, &r1_bio->state)) { /* This was a fail-fast read so we definitely * want to retry */ ; - else { + } else if (!raid1_should_handle_error(bio)) { + uptodate = 1; + } else { /* If all other devices have failed, we want to return * the error upwards rather than fail the last device. 
* Here we redefine "uptodate" to mean "Don't want to retry" @@ -451,16 +453,15 @@ static void raid1_end_write_request(struct bio *bio) struct bio *to_put = NULL; int mirror = find_bio_disk(r1_bio, bio); struct md_rdev *rdev = conf->mirrors[mirror].rdev; - bool discard_error; sector_t lo = r1_bio->sector; sector_t hi = r1_bio->sector + r1_bio->sectors; - - discard_error = bio->bi_status && bio_op(bio) == REQ_OP_DISCARD; + bool ignore_error = !raid1_should_handle_error(bio) || + (bio->bi_status && bio_op(bio) == REQ_OP_DISCARD); /* * 'one mirror IO has finished' event handler: */ - if (bio->bi_status && !discard_error) { + if (bio->bi_status && !ignore_error) { set_bit(WriteErrorSeen, &rdev->flags); if (!test_and_set_bit(WantReplacement, &rdev->flags)) set_bit(MD_RECOVERY_NEEDED, & @@ -511,7 +512,7 @@ static void raid1_end_write_request(struct bio *bio) /* Maybe we can clear some bad blocks. */ if (rdev_has_badblock(rdev, r1_bio->sector, r1_bio->sectors) && - !discard_error) { + !ignore_error) { r1_bio->bios[mirror] = IO_MADE_GOOD; set_bit(R1BIO_MadeGood, &r1_bio->state); } diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index dce06bf65016fb..b74780af4c220d 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -399,6 +399,8 @@ static void raid10_end_read_request(struct bio *bio) * wait for the 'master' bio. */ set_bit(R10BIO_Uptodate, &r10_bio->state); + } else if (!raid1_should_handle_error(bio)) { + uptodate = 1; } else { /* If all other devices that store this block have * failed, we want to return the error upwards rather @@ -456,9 +458,8 @@ static void raid10_end_write_request(struct bio *bio) int slot, repl; struct md_rdev *rdev = NULL; struct bio *to_put = NULL; - bool discard_error; - - discard_error = bio->bi_status && bio_op(bio) == REQ_OP_DISCARD; + bool ignore_error = !raid1_should_handle_error(bio) || + (bio->bi_status && bio_op(bio) == REQ_OP_DISCARD); dev = find_bio_disk(conf, r10_bio, bio, &slot, &repl); @@ -472,7 +473,7 @@ static void raid10_end_write_request(struct bio *bio) /* * this branch is our 'one mirror IO has finished' event handler: */ - if (bio->bi_status && !discard_error) { + if (bio->bi_status && !ignore_error) { if (repl) /* Never record new bad blocks to replacement, * just fail it. @@ -527,7 +528,7 @@ static void raid10_end_write_request(struct bio *bio) /* Maybe we can clear some bad blocks. */ if (rdev_has_badblock(rdev, r10_bio->devs[slot].addr, r10_bio->sectors) && - !discard_error) { + !ignore_error) { bio_put(bio); if (repl) r10_bio->devs[slot].repl_bio = IO_MADE_GOOD; diff --git a/drivers/nvme/common/auth.c b/drivers/nvme/common/auth.c index 3b6d759bcdf26c..91e273b89fea38 100644 --- a/drivers/nvme/common/auth.c +++ b/drivers/nvme/common/auth.c @@ -471,7 +471,7 @@ EXPORT_SYMBOL_GPL(nvme_auth_generate_key); * @c1: Value of challenge C1 * @c2: Value of challenge C2 * @hash_len: Hash length of the hash algorithm - * @ret_psk: Pointer too the resulting generated PSK + * @ret_psk: Pointer to the resulting generated PSK * @ret_len: length of @ret_psk * * Generate a PSK for TLS as specified in NVMe base specification, section @@ -759,8 +759,8 @@ int nvme_auth_derive_tls_psk(int hmac_id, u8 *psk, size_t psk_len, goto out_free_prk; /* - * 2 addtional bytes for the length field from HDKF-Expand-Label, - * 2 addtional bytes for the HMAC ID, and one byte for the space + * 2 additional bytes for the length field from HDKF-Expand-Label, + * 2 additional bytes for the HMAC ID, and one byte for the space * separator. 
*/ info_len = strlen(psk_digest) + strlen(psk_prefix) + 5; diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig index 7dca58f0a237bc..31974c7dd20c91 100644 --- a/drivers/nvme/host/Kconfig +++ b/drivers/nvme/host/Kconfig @@ -106,7 +106,7 @@ config NVME_TCP_TLS help Enables TLS encryption for NVMe TCP using the netlink handshake API. - The TLS handshake daemon is availble at + The TLS handshake daemon is available at https://github.com/oracle/ktls-utils. If unsure, say N. diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c index 2b9e6cfaf2a80a..1a0058be582104 100644 --- a/drivers/nvme/host/constants.c +++ b/drivers/nvme/host/constants.c @@ -145,7 +145,7 @@ static const char * const nvme_statuses[] = { [NVME_SC_BAD_ATTRIBUTES] = "Conflicting Attributes", [NVME_SC_INVALID_PI] = "Invalid Protection Information", [NVME_SC_READ_ONLY] = "Attempted Write to Read Only Range", - [NVME_SC_ONCS_NOT_SUPPORTED] = "ONCS Not Supported", + [NVME_SC_CMD_SIZE_LIM_EXCEEDED ] = "Command Size Limits Exceeded", [NVME_SC_ZONE_BOUNDARY_ERROR] = "Zoned Boundary Error", [NVME_SC_ZONE_FULL] = "Zone Is Full", [NVME_SC_ZONE_READ_ONLY] = "Zone Is Read Only", diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index f69a232a000ac8..92697f98c601d4 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -290,7 +290,6 @@ static blk_status_t nvme_error_status(u16 status) case NVME_SC_NS_NOT_READY: return BLK_STS_TARGET; case NVME_SC_BAD_ATTRIBUTES: - case NVME_SC_ONCS_NOT_SUPPORTED: case NVME_SC_INVALID_OPCODE: case NVME_SC_INVALID_FIELD: case NVME_SC_INVALID_NS: @@ -1027,7 +1026,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns, if (ns->head->ms) { /* - * If formated with metadata, the block layer always provides a + * If formatted with metadata, the block layer always provides a * metadata buffer if CONFIG_BLK_DEV_INTEGRITY is enabled. Else * we enable the PRACT bit for protection information or set the * namespace capacity to zero to prevent any I/O. diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 93e9041b9657eb..2e58a7ce109052 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -582,7 +582,7 @@ EXPORT_SYMBOL_GPL(nvmf_connect_io_queue); * Do not retry when: * * - the DNR bit is set and the specification states no further connect - * attempts with the same set of paramenters should be attempted. + * attempts with the same set of parameters should be attempted. * * - when the authentication attempt fails, because the key was invalid. * This error code is set on the host side. diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h index 9cf5b020adba27..1b58ee7d0dcee4 100644 --- a/drivers/nvme/host/fabrics.h +++ b/drivers/nvme/host/fabrics.h @@ -80,7 +80,7 @@ enum { * @transport: Holds the fabric transport "technology name" (for a lack of * better description) that will be used by an NVMe controller * being added. - * @subsysnqn: Hold the fully qualified NQN subystem name (format defined + * @subsysnqn: Hold the fully qualified NQN subsystem name (format defined * in the NVMe specification, "NVMe Qualified Names"). * @traddr: The transport-specific TRADDR field for a port on the * subsystem which is adding a controller. 
@@ -156,7 +156,7 @@ struct nvmf_ctrl_options { * @create_ctrl(): function pointer that points to a non-NVMe * implementation-specific fabric technology * that would go into starting up that fabric - * for the purpose of conneciton to an NVMe controller + * for the purpose of connection to an NVMe controller * using that fabric technology. * * Notes: @@ -165,7 +165,7 @@ struct nvmf_ctrl_options { * 2. create_ctrl() must be defined (even if it does nothing) * 3. struct nvmf_transport_ops must be statically allocated in the * modules .bss section so that a pure module_get on @module - * prevents the memory from beeing freed. + * prevents the memory from being freed. */ struct nvmf_transport_ops { struct list_head entry; diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index fdafa3e9e66fa9..014b387f1e8b1f 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -1955,7 +1955,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req) } /* - * For the linux implementation, if we have an unsuccesful + * For the linux implementation, if we have an unsucceesful * status, they blk-mq layer can typically be called with the * non-zero status and the content of the cqe isn't important. */ @@ -2479,7 +2479,7 @@ __nvme_fc_abort_outstanding_ios(struct nvme_fc_ctrl *ctrl, bool start_queues) * writing the registers for shutdown and polling (call * nvme_disable_ctrl()). Given a bunch of i/o was potentially * just aborted and we will wait on those contexts, and given - * there was no indication of how live the controlelr is on the + * there was no indication of how live the controller is on the * link, don't send more io to create more contexts for the * shutdown. Let the controller fail via keepalive failure if * its still present. diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c index ca86d3bf7ea49d..0b50da2f117531 100644 --- a/drivers/nvme/host/ioctl.c +++ b/drivers/nvme/host/ioctl.c @@ -493,13 +493,15 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, d.timeout_ms = READ_ONCE(cmd->timeout_ms); if (d.data_len && (ioucmd->flags & IORING_URING_CMD_FIXED)) { - /* fixedbufs is only for non-vectored io */ - if (vec) - return -EINVAL; + int ddir = nvme_is_write(&c) ? WRITE : READ; - ret = io_uring_cmd_import_fixed(d.addr, d.data_len, - nvme_is_write(&c) ? WRITE : READ, &iter, ioucmd, - issue_flags); + if (vec) + ret = io_uring_cmd_import_fixed_vec(ioucmd, + u64_to_user_ptr(d.addr), d.data_len, + ddir, &iter, issue_flags); + else + ret = io_uring_cmd_import_fixed(d.addr, d.data_len, + ddir, &iter, ioucmd, issue_flags); if (ret < 0) return ret; @@ -521,7 +523,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns, if (d.data_len) { ret = nvme_map_user_request(req, d.addr, d.data_len, nvme_to_user_ptr(d.metadata), d.metadata_len, - map_iter, vec); + map_iter, vec ? NVME_IOCTL_VEC : 0); if (ret) goto out_free_req; } @@ -727,7 +729,7 @@ int nvme_ns_head_ioctl(struct block_device *bdev, blk_mode_t mode, /* * Handle ioctls that apply to the controller instead of the namespace - * seperately and drop the ns SRCU reference early. This avoids a + * separately and drop the ns SRCU reference early. This avoids a * deadlock when deleting namespaces using the passthrough interface. 
*/ if (is_ctrl_ioctl(cmd)) diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c index 878ea8b1a0acf7..140079ff86e6b8 100644 --- a/drivers/nvme/host/multipath.c +++ b/drivers/nvme/host/multipath.c @@ -760,7 +760,7 @@ int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head) * controller's scan_work context. If a path error occurs here, the IO * will wait until a path becomes available or all paths are torn down, * but that action also occurs within scan_work, so it would deadlock. - * Defer the partion scan to a different context that does not block + * Defer the partition scan to a different context that does not block * scan_work. */ set_bit(GD_SUPPRESS_PART_SCAN, &head->disk->state); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index ad0c1f834f09eb..a468cdc5b5cb7b 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -523,7 +523,7 @@ static inline bool nvme_ns_head_multipath(struct nvme_ns_head *head) enum nvme_ns_features { NVME_NS_EXT_LBAS = 1 << 0, /* support extended LBA format */ NVME_NS_METADATA_SUPPORTED = 1 << 1, /* support getting generated md */ - NVME_NS_DEAC = 1 << 2, /* DEAC bit in Write Zeores supported */ + NVME_NS_DEAC = 1 << 2, /* DEAC bit in Write Zeroes supported */ }; struct nvme_ns { diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index e0bfe04a2bc2a7..8ff12e415cb5d1 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3015,7 +3015,7 @@ static void nvme_reset_work(struct work_struct *work) goto out; /* - * Freeze and update the number of I/O queues as thos might have + * Freeze and update the number of I/O queues as those might have * changed. If there are no I/O queues left after this reset, keep the * controller around but remove all namespaces. */ @@ -3186,7 +3186,7 @@ static unsigned long check_vendor_combination_bug(struct pci_dev *pdev) /* * Exclude some Kingston NV1 and A2000 devices from * NVME_QUIRK_SIMPLE_SUSPEND. Do a full suspend to save a - * lot fo energy with s2idle sleep on some TUXEDO platforms. + * lot of energy with s2idle sleep on some TUXEDO platforms. */ if (dmi_match(DMI_BOARD_NAME, "NS5X_NS7XAU") || dmi_match(DMI_BOARD_NAME, "NS5x_7xAU") || diff --git a/drivers/nvme/host/pr.c b/drivers/nvme/host/pr.c index cf2d2c5039ddbf..ca6a74607b1397 100644 --- a/drivers/nvme/host/pr.c +++ b/drivers/nvme/host/pr.c @@ -82,8 +82,6 @@ static int nvme_status_to_pr_err(int status) return PR_STS_SUCCESS; case NVME_SC_RESERVATION_CONFLICT: return PR_STS_RESERVATION_CONFLICT; - case NVME_SC_ONCS_NOT_SUPPORTED: - return -EOPNOTSUPP; case NVME_SC_BAD_ATTRIBUTES: case NVME_SC_INVALID_OPCODE: case NVME_SC_INVALID_FIELD: diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index b5a0295b5bf457..9bd3646568d03c 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -221,7 +221,7 @@ static struct nvme_rdma_qe *nvme_rdma_alloc_ring(struct ib_device *ibdev, /* * Bind the CQEs (post recv buffers) DMA mapping to the RDMA queue - * lifetime. It's safe, since any chage in the underlying RDMA device + * lifetime. It's safe, since any change in the underlying RDMA device * will issue error recovery and queue re-creation. */ for (i = 0; i < ib_queue_size; i++) { @@ -800,7 +800,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl, /* * Bind the async event SQE DMA mapping to the admin queue lifetime. 
- * It's safe, since any chage in the underlying RDMA device will issue + * It's safe, since any change in the underlying RDMA device will issue * error recovery and queue re-creation. */ error = nvme_rdma_alloc_qe(ctrl->device->dev, &ctrl->async_event_sqe, diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index f6379aa33d77fa..d924008c39491f 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -452,7 +452,8 @@ nvme_tcp_fetch_request(struct nvme_tcp_queue *queue) return NULL; } - list_del(&req->entry); + list_del_init(&req->entry); + init_llist_node(&req->lentry); return req; } @@ -565,6 +566,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set, req->queue = queue; nvme_req(rq)->ctrl = &ctrl->ctrl; nvme_req(rq)->cmd = &pdu->cmd; + init_llist_node(&req->lentry); + INIT_LIST_HEAD(&req->entry); return 0; } @@ -769,6 +772,14 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue, return -EPROTO; } + if (llist_on_list(&req->lentry) || + !list_empty(&req->entry)) { + dev_err(queue->ctrl->ctrl.device, + "req %d unexpected r2t while processing request\n", + rq->tag); + return -EPROTO; + } + req->pdu_len = 0; req->h2cdata_left = r2t_length; req->h2cdata_offset = r2t_offset; @@ -1355,7 +1366,7 @@ static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue) queue->nr_cqe = 0; consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb); release_sock(sk); - return consumed; + return consumed == -EAGAIN ? 0 : consumed; } static void nvme_tcp_io_work(struct work_struct *w) @@ -1383,6 +1394,11 @@ static void nvme_tcp_io_work(struct work_struct *w) else if (unlikely(result < 0)) return; + /* did we get some space after spending time in recv? */ + if (nvme_tcp_queue_has_pending(queue) && + sk_stream_is_writeable(queue->sock->sk)) + pending = true; + if (!pending || !queue->rd_enabled) return; @@ -2350,7 +2366,7 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new) nvme_tcp_teardown_admin_queue(ctrl, false); ret = nvme_tcp_configure_admin_queue(ctrl, false); if (ret) - return ret; + goto destroy_admin; } if (ctrl->icdoff) { @@ -2594,6 +2610,8 @@ static void nvme_tcp_submit_async_event(struct nvme_ctrl *arg) ctrl->async_req.offset = 0; ctrl->async_req.curr_bio = NULL; ctrl->async_req.data_len = 0; + init_llist_node(&ctrl->async_req.lentry); + INIT_LIST_HEAD(&ctrl->async_req.entry); nvme_tcp_queue_request(&ctrl->async_req, true); } diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index c7317299078d3a..3e378153a78174 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -1165,7 +1165,7 @@ static void nvmet_execute_identify(struct nvmet_req *req) * A "minimum viable" abort implementation: the command is mandatory in the * spec, but we are not required to do any useful work. We couldn't really * do a useful abort, so don't bother even with waiting for the command - * to be exectuted and return immediately telling the command to abort + * to be executed and return immediately telling the command to abort * wasn't found. 
*/ static void nvmet_execute_abort(struct nvmet_req *req) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index db7b17d1094e8b..175c5b6d4dd587 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -62,14 +62,7 @@ inline u16 errno_to_nvme_status(struct nvmet_req *req, int errno) return NVME_SC_LBA_RANGE | NVME_STATUS_DNR; case -EOPNOTSUPP: req->error_loc = offsetof(struct nvme_common_command, opcode); - switch (req->cmd->common.opcode) { - case nvme_cmd_dsm: - case nvme_cmd_write_zeroes: - return NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR; - default: - return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; - } - break; + return NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; case -ENODATA: req->error_loc = offsetof(struct nvme_rw_command, nsid); return NVME_SC_ACCESS_DENIED; @@ -651,7 +644,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns) * Now that we removed the namespaces from the lookup list, we * can kill the per_cpu ref and wait for any remaining references * to be dropped, as well as a RCU grace period for anyone only - * using the namepace under rcu_read_lock(). Note that we can't + * using the namespace under rcu_read_lock(). Note that we can't * use call_rcu here as we need to ensure the namespaces have * been fully destroyed before unloading the module. */ diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c index 254537b93e633c..25598a46bf0d66 100644 --- a/drivers/nvme/target/fc.c +++ b/drivers/nvme/target/fc.c @@ -1339,7 +1339,7 @@ nvmet_fc_portentry_rebind_tgt(struct nvmet_fc_tgtport *tgtport) /** * nvmet_fc_register_targetport - transport entry point called by an * LLDD to register the existence of a local - * NVME subystem FC port. + * NVME subsystem FC port. * @pinfo: pointer to information about the port to be registered * @template: LLDD entrypoints and operational parameters for the port * @dev: physical hardware device node port corresponds to. Will be diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c index 83be0657e6df4e..eba42df2f8215e 100644 --- a/drivers/nvme/target/io-cmd-bdev.c +++ b/drivers/nvme/target/io-cmd-bdev.c @@ -133,7 +133,7 @@ u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) * Right now there exists M : 1 mapping between block layer error * to the NVMe status code (see nvme_error_status()). For consistency, * when we reverse map we use most appropriate NVMe Status code from - * the group of the NVMe staus codes used in the nvme_error_status(). + * the group of the NVMe status codes used in the nvme_error_status(). 
*/ switch (blk_sts) { case BLK_STS_NOSPC: @@ -145,15 +145,8 @@ u16 blk_to_nvme_status(struct nvmet_req *req, blk_status_t blk_sts) req->error_loc = offsetof(struct nvme_rw_command, slba); break; case BLK_STS_NOTSUPP: + status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; req->error_loc = offsetof(struct nvme_common_command, opcode); - switch (req->cmd->common.opcode) { - case nvme_cmd_dsm: - case nvme_cmd_write_zeroes: - status = NVME_SC_ONCS_NOT_SUPPORTED | NVME_STATUS_DNR; - break; - default: - status = NVME_SC_INVALID_OPCODE | NVME_STATUS_DNR; - } break; case BLK_STS_MEDIUM: status = NVME_SC_ACCESS_DENIED; diff --git a/drivers/nvme/target/passthru.c b/drivers/nvme/target/passthru.c index 26e2907ce8bb09..b7515c53829b89 100644 --- a/drivers/nvme/target/passthru.c +++ b/drivers/nvme/target/passthru.c @@ -99,7 +99,7 @@ static u16 nvmet_passthru_override_id_ctrl(struct nvmet_req *req) /* * The passthru NVMe driver may have a limit on the number of segments - * which depends on the host's memory fragementation. To solve this, + * which depends on the host's memory fragmentation. To solve this, * ensure mdts is limited to the pages equal to the number of segments. */ max_hw_sectors = min_not_zero(pctrl->max_segments << PAGE_SECTORS_SHIFT, diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 51308f65b72fdd..b65a1b9f2116c2 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -2171,7 +2171,7 @@ enum { NVME_SC_BAD_ATTRIBUTES = 0x180, NVME_SC_INVALID_PI = 0x181, NVME_SC_READ_ONLY = 0x182, - NVME_SC_ONCS_NOT_SUPPORTED = 0x183, + NVME_SC_CMD_SIZE_LIM_EXCEEDED = 0x183, /* * I/O Command Set Specific - Fabrics commands: diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h index 56c7e3fc666fc5..77d9d6af46da87 100644 --- a/include/uapi/linux/ublk_cmd.h +++ b/include/uapi/linux/ublk_cmd.h @@ -272,6 +272,15 @@ */ #define UBLK_F_QUIESCE (1ULL << 12) +/* + * If this feature is set, ublk_drv supports each (qid,tag) pair having + * its own independent daemon task that is responsible for handling it. + * If it is not set, daemons are per-queue instead, so for two pairs + * (qid1,tag1) and (qid2,tag2), if qid1 == qid2, then the same task must + * be responsible for handling (qid1,tag1) and (qid2,tag2). 
+ */ +#define UBLK_F_PER_IO_DAEMON (1ULL << 13) + /* device state */ #define UBLK_S_DEV_DEAD 0 #define UBLK_S_DEV_LIVE 1 diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile index 4dde8838261d66..5d7f4ecfb81612 100644 --- a/tools/testing/selftests/ublk/Makefile +++ b/tools/testing/selftests/ublk/Makefile @@ -19,6 +19,7 @@ TEST_PROGS += test_generic_08.sh TEST_PROGS += test_generic_09.sh TEST_PROGS += test_generic_10.sh TEST_PROGS += test_generic_11.sh +TEST_PROGS += test_generic_12.sh TEST_PROGS += test_null_01.sh TEST_PROGS += test_null_02.sh diff --git a/tools/testing/selftests/ublk/fault_inject.c b/tools/testing/selftests/ublk/fault_inject.c index 5421774d7867cb..6e60f7d9712593 100644 --- a/tools/testing/selftests/ublk/fault_inject.c +++ b/tools/testing/selftests/ublk/fault_inject.c @@ -46,9 +46,9 @@ static int ublk_fault_inject_queue_io(struct ublk_queue *q, int tag) .tv_nsec = (long long)q->dev->private_data, }; - ublk_queue_alloc_sqes(q, &sqe, 1); + ublk_io_alloc_sqes(ublk_get_io(q, tag), &sqe, 1); io_uring_prep_timeout(sqe, &ts, 1, 0); - sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, 1); + sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, q->q_id, 1); ublk_queued_tgt_io(q, tag, 1); diff --git a/tools/testing/selftests/ublk/file_backed.c b/tools/testing/selftests/ublk/file_backed.c index 509842df9beefa..cfa59b63169379 100644 --- a/tools/testing/selftests/ublk/file_backed.c +++ b/tools/testing/selftests/ublk/file_backed.c @@ -18,11 +18,11 @@ static int loop_queue_flush_io(struct ublk_queue *q, const struct ublksrv_io_des unsigned ublk_op = ublksrv_get_op(iod); struct io_uring_sqe *sqe[1]; - ublk_queue_alloc_sqes(q, sqe, 1); + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1); io_uring_prep_fsync(sqe[0], 1 /*fds[1]*/, IORING_FSYNC_DATASYNC); io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE); /* bit63 marks us as tgt io */ - sqe[0]->user_data = build_user_data(tag, ublk_op, 0, 1); + sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); return 1; } @@ -36,7 +36,7 @@ static int loop_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_de void *addr = (zc | auto_zc) ? 
NULL : (void *)iod->addr; if (!zc || auto_zc) { - ublk_queue_alloc_sqes(q, sqe, 1); + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1); if (!sqe[0]) return -ENOMEM; @@ -48,26 +48,26 @@ static int loop_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_de sqe[0]->buf_index = tag; io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE); /* bit63 marks us as tgt io */ - sqe[0]->user_data = build_user_data(tag, ublk_op, 0, 1); + sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); return 1; } - ublk_queue_alloc_sqes(q, sqe, 3); + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 3); - io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag); + io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index); sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK; sqe[0]->user_data = build_user_data(tag, - ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1); + ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1); io_uring_prep_rw(op, sqe[1], 1 /*fds[1]*/, 0, iod->nr_sectors << 9, iod->start_sector << 9); sqe[1]->buf_index = tag; sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK; - sqe[1]->user_data = build_user_data(tag, ublk_op, 0, 1); + sqe[1]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1); - io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, tag); - sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, 1); + io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index); + sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1); return 2; } diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c index b5131a000795d6..e2d2042810d4bb 100644 --- a/tools/testing/selftests/ublk/kublk.c +++ b/tools/testing/selftests/ublk/kublk.c @@ -348,8 +348,8 @@ static void ublk_ctrl_dump(struct ublk_dev *dev) for (i = 0; i < info->nr_hw_queues; i++) { ublk_print_cpu_set(&affinity[i], buf, sizeof(buf)); - printf("\tqueue %u: tid %d affinity(%s)\n", - i, dev->q[i].tid, buf); + printf("\tqueue %u: affinity(%s)\n", + i, buf); } free(affinity); } @@ -412,16 +412,6 @@ static void ublk_queue_deinit(struct ublk_queue *q) int i; int nr_ios = q->q_depth; - io_uring_unregister_buffers(&q->ring); - - io_uring_unregister_ring_fd(&q->ring); - - if (q->ring.ring_fd > 0) { - io_uring_unregister_files(&q->ring); - close(q->ring.ring_fd); - q->ring.ring_fd = -1; - } - if (q->io_cmd_buf) munmap(q->io_cmd_buf, ublk_queue_cmd_buf_sz(q)); @@ -429,20 +419,30 @@ static void ublk_queue_deinit(struct ublk_queue *q) free(q->ios[i].buf_addr); } +static void ublk_thread_deinit(struct ublk_thread *t) +{ + io_uring_unregister_buffers(&t->ring); + + io_uring_unregister_ring_fd(&t->ring); + + if (t->ring.ring_fd > 0) { + io_uring_unregister_files(&t->ring); + close(t->ring.ring_fd); + t->ring.ring_fd = -1; + } +} + static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags) { struct ublk_dev *dev = q->dev; int depth = dev->dev_info.queue_depth; - int i, ret = -1; + int i; int cmd_buf_size, io_buf_size; unsigned long off; - int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth; q->tgt_ops = dev->tgt.ops; q->state = 0; q->q_depth = depth; - q->cmd_inflight = 0; - q->tid = gettid(); if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) { q->state |= UBLKSRV_NO_BUF; @@ -467,6 +467,7 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags) for (i = 0; i < q->q_depth; i++) { q->ios[i].buf_addr = NULL; q->ios[i].flags = UBLKSRV_NEED_FETCH_RQ | 
UBLKSRV_IO_FREE; + q->ios[i].tag = i; if (q->state & UBLKSRV_NO_BUF) continue; @@ -479,39 +480,57 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags) } } - ret = ublk_setup_ring(&q->ring, ring_depth, cq_depth, + return 0; + fail: + ublk_queue_deinit(q); + ublk_err("ublk dev %d queue %d failed\n", + dev->dev_info.dev_id, q->q_id); + return -ENOMEM; +} + +static int ublk_thread_init(struct ublk_thread *t) +{ + struct ublk_dev *dev = t->dev; + int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth; + int ret; + + ret = ublk_setup_ring(&t->ring, ring_depth, cq_depth, IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN); if (ret < 0) { - ublk_err("ublk dev %d queue %d setup io_uring failed %d\n", - q->dev->dev_info.dev_id, q->q_id, ret); + ublk_err("ublk dev %d thread %d setup io_uring failed %d\n", + dev->dev_info.dev_id, t->idx, ret); goto fail; } if (dev->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_AUTO_BUF_REG)) { - ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth); + unsigned nr_ios = dev->dev_info.queue_depth * dev->dev_info.nr_hw_queues; + unsigned max_nr_ios_per_thread = nr_ios / dev->nthreads; + max_nr_ios_per_thread += !!(nr_ios % dev->nthreads); + ret = io_uring_register_buffers_sparse( + &t->ring, max_nr_ios_per_thread); if (ret) { - ublk_err("ublk dev %d queue %d register spare buffers failed %d", - dev->dev_info.dev_id, q->q_id, ret); + ublk_err("ublk dev %d thread %d register spare buffers failed %d", + dev->dev_info.dev_id, t->idx, ret); goto fail; } } - io_uring_register_ring_fd(&q->ring); + io_uring_register_ring_fd(&t->ring); - ret = io_uring_register_files(&q->ring, dev->fds, dev->nr_fds); + ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds); if (ret) { - ublk_err("ublk dev %d queue %d register files failed %d\n", - q->dev->dev_info.dev_id, q->q_id, ret); + ublk_err("ublk dev %d thread %d register files failed %d\n", + t->dev->dev_info.dev_id, t->idx, ret); goto fail; } return 0; - fail: - ublk_queue_deinit(q); - ublk_err("ublk dev %d queue %d failed\n", - dev->dev_info.dev_id, q->q_id); +fail: + ublk_thread_deinit(t); + ublk_err("ublk dev %d thread %d init failed\n", + dev->dev_info.dev_id, t->idx); return -ENOMEM; } @@ -562,7 +581,7 @@ static void ublk_set_auto_buf_reg(const struct ublk_queue *q, if (q->tgt_ops->buf_index) buf.index = q->tgt_ops->buf_index(q, tag); else - buf.index = tag; + buf.index = q->ios[tag].buf_index; if (q->state & UBLKSRV_AUTO_BUF_REG_FALLBACK) buf.flags = UBLK_AUTO_BUF_REG_FALLBACK; @@ -570,8 +589,10 @@ static void ublk_set_auto_buf_reg(const struct ublk_queue *q, sqe->addr = ublk_auto_buf_reg_to_sqe_addr(&buf); } -int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag) +int ublk_queue_io_cmd(struct ublk_io *io) { + struct ublk_thread *t = io->t; + struct ublk_queue *q = ublk_io_to_queue(io); struct ublksrv_io_cmd *cmd; struct io_uring_sqe *sqe[1]; unsigned int cmd_op = 0; @@ -596,13 +617,13 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag) else if (io->flags & UBLKSRV_NEED_FETCH_RQ) cmd_op = UBLK_U_IO_FETCH_REQ; - if (io_uring_sq_space_left(&q->ring) < 1) - io_uring_submit(&q->ring); + if (io_uring_sq_space_left(&t->ring) < 1) + io_uring_submit(&t->ring); - ublk_queue_alloc_sqes(q, sqe, 1); + ublk_io_alloc_sqes(io, sqe, 1); if (!sqe[0]) { - ublk_err("%s: run out of sqe %d, tag %d\n", - __func__, q->q_id, tag); + ublk_err("%s: run out of sqe. 
thread %u, tag %d\n", + __func__, t->idx, io->tag); return -1; } @@ -617,7 +638,7 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag) sqe[0]->opcode = IORING_OP_URING_CMD; sqe[0]->flags = IOSQE_FIXED_FILE; sqe[0]->rw_flags = 0; - cmd->tag = tag; + cmd->tag = io->tag; cmd->q_id = q->q_id; if (!(q->state & UBLKSRV_NO_BUF)) cmd->addr = (__u64) (uintptr_t) io->buf_addr; @@ -625,37 +646,72 @@ int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag) cmd->addr = 0; if (q->state & UBLKSRV_AUTO_BUF_REG) - ublk_set_auto_buf_reg(q, sqe[0], tag); + ublk_set_auto_buf_reg(q, sqe[0], io->tag); - user_data = build_user_data(tag, _IOC_NR(cmd_op), 0, 0); + user_data = build_user_data(io->tag, _IOC_NR(cmd_op), 0, q->q_id, 0); io_uring_sqe_set_data64(sqe[0], user_data); io->flags = 0; - q->cmd_inflight += 1; + t->cmd_inflight += 1; - ublk_dbg(UBLK_DBG_IO_CMD, "%s: (qid %d tag %u cmd_op %u) iof %x stopping %d\n", - __func__, q->q_id, tag, cmd_op, - io->flags, !!(q->state & UBLKSRV_QUEUE_STOPPING)); + ublk_dbg(UBLK_DBG_IO_CMD, "%s: (thread %u qid %d tag %u cmd_op %u) iof %x stopping %d\n", + __func__, t->idx, q->q_id, io->tag, cmd_op, + io->flags, !!(t->state & UBLKSRV_THREAD_STOPPING)); return 1; } -static void ublk_submit_fetch_commands(struct ublk_queue *q) +static void ublk_submit_fetch_commands(struct ublk_thread *t) { - int i = 0; + struct ublk_queue *q; + struct ublk_io *io; + int i = 0, j = 0; - for (i = 0; i < q->q_depth; i++) - ublk_queue_io_cmd(q, &q->ios[i], i); + if (t->dev->per_io_tasks) { + /* + * Lexicographically order all the (qid,tag) pairs, with + * qid taking priority (so (1,0) > (0,1)). Then make + * this thread the daemon for every Nth entry in this + * list (N is the number of threads), starting at this + * thread's index. This ensures that each queue is + * handled by as many ublk server threads as possible, + * so that load that is concentrated on one or a few + * queues can make use of all ublk server threads. + */ + const struct ublksrv_ctrl_dev_info *dinfo = &t->dev->dev_info; + int nr_ios = dinfo->nr_hw_queues * dinfo->queue_depth; + for (i = t->idx; i < nr_ios; i += t->dev->nthreads) { + int q_id = i / dinfo->queue_depth; + int tag = i % dinfo->queue_depth; + q = &t->dev->q[q_id]; + io = &q->ios[tag]; + io->t = t; + io->buf_index = j++; + ublk_queue_io_cmd(io); + } + } else { + /* + * Service exclusively the queue whose q_id matches our + * thread index. 
+ */ + struct ublk_queue *q = &t->dev->q[t->idx]; + for (i = 0; i < q->q_depth; i++) { + io = &q->ios[i]; + io->t = t; + io->buf_index = i; + ublk_queue_io_cmd(io); + } + } } -static int ublk_queue_is_idle(struct ublk_queue *q) +static int ublk_thread_is_idle(struct ublk_thread *t) { - return !io_uring_sq_ready(&q->ring) && !q->io_inflight; + return !io_uring_sq_ready(&t->ring) && !t->io_inflight; } -static int ublk_queue_is_done(struct ublk_queue *q) +static int ublk_thread_is_done(struct ublk_thread *t) { - return (q->state & UBLKSRV_QUEUE_STOPPING) && ublk_queue_is_idle(q); + return (t->state & UBLKSRV_THREAD_STOPPING) && ublk_thread_is_idle(t); } static inline void ublksrv_handle_tgt_cqe(struct ublk_queue *q, @@ -673,14 +729,16 @@ static inline void ublksrv_handle_tgt_cqe(struct ublk_queue *q, q->tgt_ops->tgt_io_done(q, tag, cqe); } -static void ublk_handle_cqe(struct io_uring *r, +static void ublk_handle_cqe(struct ublk_thread *t, struct io_uring_cqe *cqe, void *data) { - struct ublk_queue *q = container_of(r, struct ublk_queue, ring); + struct ublk_dev *dev = t->dev; + unsigned q_id = user_data_to_q_id(cqe->user_data); + struct ublk_queue *q = &dev->q[q_id]; unsigned tag = user_data_to_tag(cqe->user_data); unsigned cmd_op = user_data_to_op(cqe->user_data); int fetch = (cqe->res != UBLK_IO_RES_ABORT) && - !(q->state & UBLKSRV_QUEUE_STOPPING); + !(t->state & UBLKSRV_THREAD_STOPPING); struct ublk_io *io; if (cqe->res < 0 && cqe->res != -ENODEV) @@ -691,7 +749,7 @@ static void ublk_handle_cqe(struct io_uring *r, __func__, cqe->res, q->q_id, tag, cmd_op, is_target_io(cqe->user_data), user_data_to_tgt_data(cqe->user_data), - (q->state & UBLKSRV_QUEUE_STOPPING)); + (t->state & UBLKSRV_THREAD_STOPPING)); /* Don't retrieve io in case of target io */ if (is_target_io(cqe->user_data)) { @@ -700,10 +758,10 @@ static void ublk_handle_cqe(struct io_uring *r, } io = &q->ios[tag]; - q->cmd_inflight--; + t->cmd_inflight--; if (!fetch) { - q->state |= UBLKSRV_QUEUE_STOPPING; + t->state |= UBLKSRV_THREAD_STOPPING; io->flags &= ~UBLKSRV_NEED_FETCH_RQ; } @@ -713,7 +771,7 @@ static void ublk_handle_cqe(struct io_uring *r, q->tgt_ops->queue_io(q, tag); } else if (cqe->res == UBLK_IO_RES_NEED_GET_DATA) { io->flags |= UBLKSRV_NEED_GET_DATA | UBLKSRV_IO_FREE; - ublk_queue_io_cmd(q, io, tag); + ublk_queue_io_cmd(io); } else { /* * COMMIT_REQ will be completed immediately since no fetching @@ -727,92 +785,93 @@ static void ublk_handle_cqe(struct io_uring *r, } } -static int ublk_reap_events_uring(struct io_uring *r) +static int ublk_reap_events_uring(struct ublk_thread *t) { struct io_uring_cqe *cqe; unsigned head; int count = 0; - io_uring_for_each_cqe(r, head, cqe) { - ublk_handle_cqe(r, cqe, NULL); + io_uring_for_each_cqe(&t->ring, head, cqe) { + ublk_handle_cqe(t, cqe, NULL); count += 1; } - io_uring_cq_advance(r, count); + io_uring_cq_advance(&t->ring, count); return count; } -static int ublk_process_io(struct ublk_queue *q) +static int ublk_process_io(struct ublk_thread *t) { int ret, reapped; - ublk_dbg(UBLK_DBG_QUEUE, "dev%d-q%d: to_submit %d inflight cmd %u stopping %d\n", - q->dev->dev_info.dev_id, - q->q_id, io_uring_sq_ready(&q->ring), - q->cmd_inflight, - (q->state & UBLKSRV_QUEUE_STOPPING)); + ublk_dbg(UBLK_DBG_THREAD, "dev%d-t%u: to_submit %d inflight cmd %u stopping %d\n", + t->dev->dev_info.dev_id, + t->idx, io_uring_sq_ready(&t->ring), + t->cmd_inflight, + (t->state & UBLKSRV_THREAD_STOPPING)); - if (ublk_queue_is_done(q)) + if (ublk_thread_is_done(t)) return -ENODEV; - ret = 
io_uring_submit_and_wait(&q->ring, 1); - reapped = ublk_reap_events_uring(&q->ring); + ret = io_uring_submit_and_wait(&t->ring, 1); + reapped = ublk_reap_events_uring(t); - ublk_dbg(UBLK_DBG_QUEUE, "submit result %d, reapped %d stop %d idle %d\n", - ret, reapped, (q->state & UBLKSRV_QUEUE_STOPPING), - (q->state & UBLKSRV_QUEUE_IDLE)); + ublk_dbg(UBLK_DBG_THREAD, "submit result %d, reapped %d stop %d idle %d\n", + ret, reapped, (t->state & UBLKSRV_THREAD_STOPPING), + (t->state & UBLKSRV_THREAD_IDLE)); return reapped; } -static void ublk_queue_set_sched_affinity(const struct ublk_queue *q, +static void ublk_thread_set_sched_affinity(const struct ublk_thread *t, cpu_set_t *cpuset) { if (sched_setaffinity(0, sizeof(*cpuset), cpuset) < 0) - ublk_err("ublk dev %u queue %u set affinity failed", - q->dev->dev_info.dev_id, q->q_id); + ublk_err("ublk dev %u thread %u set affinity failed", + t->dev->dev_info.dev_id, t->idx); } -struct ublk_queue_info { - struct ublk_queue *q; - sem_t *queue_sem; +struct ublk_thread_info { + struct ublk_dev *dev; + unsigned idx; + sem_t *ready; cpu_set_t *affinity; - unsigned char auto_zc_fallback; }; static void *ublk_io_handler_fn(void *data) { - struct ublk_queue_info *info = data; - struct ublk_queue *q = info->q; - int dev_id = q->dev->dev_info.dev_id; - unsigned extra_flags = 0; + struct ublk_thread_info *info = data; + struct ublk_thread *t = &info->dev->threads[info->idx]; + int dev_id = info->dev->dev_info.dev_id; int ret; - if (info->auto_zc_fallback) - extra_flags = UBLKSRV_AUTO_BUF_REG_FALLBACK; + t->dev = info->dev; + t->idx = info->idx; - ret = ublk_queue_init(q, extra_flags); + ret = ublk_thread_init(t); if (ret) { - ublk_err("ublk dev %d queue %d init queue failed\n", - dev_id, q->q_id); + ublk_err("ublk dev %d thread %u init failed\n", + dev_id, t->idx); return NULL; } /* IO perf is sensitive with queue pthread affinity on NUMA machine*/ - ublk_queue_set_sched_affinity(q, info->affinity); - sem_post(info->queue_sem); + if (info->affinity) + ublk_thread_set_sched_affinity(t, info->affinity); + sem_post(info->ready); - ublk_dbg(UBLK_DBG_QUEUE, "tid %d: ublk dev %d queue %d started\n", - q->tid, dev_id, q->q_id); + ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %u started\n", + gettid(), dev_id, t->idx); /* submit all io commands to ublk driver */ - ublk_submit_fetch_commands(q); + ublk_submit_fetch_commands(t); do { - if (ublk_process_io(q) < 0) + if (ublk_process_io(t) < 0) break; } while (1); - ublk_dbg(UBLK_DBG_QUEUE, "ublk dev %d queue %d exited\n", dev_id, q->q_id); - ublk_queue_deinit(q); + ublk_dbg(UBLK_DBG_THREAD, "tid %d: ublk dev %d thread %d exiting\n", + gettid(), dev_id, t->idx); + ublk_thread_deinit(t); return NULL; } @@ -855,20 +914,20 @@ static int ublk_send_dev_event(const struct dev_ctx *ctx, struct ublk_dev *dev, static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev) { const struct ublksrv_ctrl_dev_info *dinfo = &dev->dev_info; - struct ublk_queue_info *qinfo; + struct ublk_thread_info *tinfo; + unsigned extra_flags = 0; cpu_set_t *affinity_buf; void *thread_ret; - sem_t queue_sem; + sem_t ready; int ret, i; ublk_dbg(UBLK_DBG_DEV, "%s enter\n", __func__); - qinfo = (struct ublk_queue_info *)calloc(sizeof(struct ublk_queue_info), - dinfo->nr_hw_queues); - if (!qinfo) + tinfo = calloc(sizeof(struct ublk_thread_info), dev->nthreads); + if (!tinfo) return -ENOMEM; - sem_init(&queue_sem, 0, 0); + sem_init(&ready, 0, 0); ret = ublk_dev_prep(ctx, dev); if (ret) return ret; @@ -877,22 +936,44 @@ static int 
ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev) if (ret) return ret; + if (ctx->auto_zc_fallback) + extra_flags = UBLKSRV_AUTO_BUF_REG_FALLBACK; + for (i = 0; i < dinfo->nr_hw_queues; i++) { dev->q[i].dev = dev; dev->q[i].q_id = i; - qinfo[i].q = &dev->q[i]; - qinfo[i].queue_sem = &queue_sem; - qinfo[i].affinity = &affinity_buf[i]; - qinfo[i].auto_zc_fallback = ctx->auto_zc_fallback; - pthread_create(&dev->q[i].thread, NULL, + ret = ublk_queue_init(&dev->q[i], extra_flags); + if (ret) { + ublk_err("ublk dev %d queue %d init queue failed\n", + dinfo->dev_id, i); + goto fail; + } + } + + for (i = 0; i < dev->nthreads; i++) { + tinfo[i].dev = dev; + tinfo[i].idx = i; + tinfo[i].ready = &ready; + + /* + * If threads are not tied 1:1 to queues, setting thread + * affinity based on queue affinity makes little sense. + * However, thread CPU affinity has significant impact + * on performance, so to compare fairly, we'll still set + * thread CPU affinity based on queue affinity where + * possible. + */ + if (dev->nthreads == dinfo->nr_hw_queues) + tinfo[i].affinity = &affinity_buf[i]; + pthread_create(&dev->threads[i].thread, NULL, ublk_io_handler_fn, - &qinfo[i]); + &tinfo[i]); } - for (i = 0; i < dinfo->nr_hw_queues; i++) - sem_wait(&queue_sem); - free(qinfo); + for (i = 0; i < dev->nthreads; i++) + sem_wait(&ready); + free(tinfo); free(affinity_buf); /* everything is fine now, start us */ @@ -914,9 +995,11 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev) ublk_send_dev_event(ctx, dev, dev->dev_info.dev_id); /* wait until we are terminated */ - for (i = 0; i < dinfo->nr_hw_queues; i++) - pthread_join(dev->q[i].thread, &thread_ret); + for (i = 0; i < dev->nthreads; i++) + pthread_join(dev->threads[i].thread, &thread_ret); fail: + for (i = 0; i < dinfo->nr_hw_queues; i++) + ublk_queue_deinit(&dev->q[i]); ublk_dev_unprep(dev); ublk_dbg(UBLK_DBG_DEV, "%s exit\n", __func__); @@ -1022,13 +1105,14 @@ wait: static int __cmd_dev_add(const struct dev_ctx *ctx) { + unsigned nthreads = ctx->nthreads; unsigned nr_queues = ctx->nr_hw_queues; const char *tgt_type = ctx->tgt_type; unsigned depth = ctx->queue_depth; __u64 features; const struct ublk_tgt_ops *ops; struct ublksrv_ctrl_dev_info *info; - struct ublk_dev *dev; + struct ublk_dev *dev = NULL; int dev_id = ctx->dev_id; int ret, i; @@ -1036,29 +1120,55 @@ static int __cmd_dev_add(const struct dev_ctx *ctx) if (!ops) { ublk_err("%s: no such tgt type, type %s\n", __func__, tgt_type); - return -ENODEV; + ret = -ENODEV; + goto fail; } if (nr_queues > UBLK_MAX_QUEUES || depth > UBLK_QUEUE_DEPTH) { ublk_err("%s: invalid nr_queues or depth queues %u depth %u\n", __func__, nr_queues, depth); - return -EINVAL; + ret = -EINVAL; + goto fail; + } + + /* default to 1:1 threads:queues if nthreads is unspecified */ + if (!nthreads) + nthreads = nr_queues; + + if (nthreads > UBLK_MAX_THREADS) { + ublk_err("%s: %u is too many threads (max %u)\n", + __func__, nthreads, UBLK_MAX_THREADS); + ret = -EINVAL; + goto fail; + } + + if (nthreads != nr_queues && !ctx->per_io_tasks) { + ublk_err("%s: threads %u must be same as queues %u if " + "not using per_io_tasks\n", + __func__, nthreads, nr_queues); + ret = -EINVAL; + goto fail; } dev = ublk_ctrl_init(); if (!dev) { ublk_err("%s: can't alloc dev id %d, type %s\n", __func__, dev_id, tgt_type); - return -ENOMEM; + ret = -ENOMEM; + goto fail; } /* kernel doesn't support get_features */ ret = ublk_ctrl_get_features(dev, &features); - if (ret < 0) - return -EINVAL; + if (ret < 0) { + 
ret = -EINVAL; + goto fail; + } - if (!(features & UBLK_F_CMD_IOCTL_ENCODE)) - return -ENOTSUP; + if (!(features & UBLK_F_CMD_IOCTL_ENCODE)) { + ret = -ENOTSUP; + goto fail; + } info = &dev->dev_info; info->dev_id = ctx->dev_id; @@ -1068,6 +1178,8 @@ static int __cmd_dev_add(const struct dev_ctx *ctx) if ((features & UBLK_F_QUIESCE) && (info->flags & UBLK_F_USER_RECOVERY)) info->flags |= UBLK_F_QUIESCE; + dev->nthreads = nthreads; + dev->per_io_tasks = ctx->per_io_tasks; dev->tgt.ops = ops; dev->tgt.sq_depth = depth; dev->tgt.cq_depth = depth; @@ -1097,7 +1209,8 @@ static int __cmd_dev_add(const struct dev_ctx *ctx) fail: if (ret < 0) ublk_send_dev_event(ctx, dev, -1); - ublk_ctrl_deinit(dev); + if (dev) + ublk_ctrl_deinit(dev); return ret; } @@ -1159,6 +1272,8 @@ run: shmctl(ctx->_shmid, IPC_RMID, NULL); /* wait for child and detach from it */ wait(NULL); + if (exit_code == EXIT_FAILURE) + ublk_err("%s: command failed\n", __func__); exit(exit_code); } else { exit(EXIT_FAILURE); @@ -1266,6 +1381,7 @@ static int cmd_dev_get_features(void) [const_ilog2(UBLK_F_UPDATE_SIZE)] = "UPDATE_SIZE", [const_ilog2(UBLK_F_AUTO_BUF_REG)] = "AUTO_BUF_REG", [const_ilog2(UBLK_F_QUIESCE)] = "QUIESCE", + [const_ilog2(UBLK_F_PER_IO_DAEMON)] = "PER_IO_DAEMON", }; struct ublk_dev *dev; __u64 features = 0; @@ -1360,8 +1476,10 @@ static void __cmd_create_help(char *exe, bool recovery) exe, recovery ? "recover" : "add"); printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--auto_zc_fallback] [--debug_mask mask] [-r 0|1 ] [-g]\n"); printf("\t[-e 0|1 ] [-i 0|1]\n"); + printf("\t[--nthreads threads] [--per_io_tasks]\n"); printf("\t[target options] [backfile1] [backfile2] ...\n"); printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n"); + printf("\tdefault: nthreads=nr_queues"); for (i = 0; i < sizeof(tgt_ops_list) / sizeof(tgt_ops_list[0]); i++) { const struct ublk_tgt_ops *ops = tgt_ops_list[i]; @@ -1418,6 +1536,8 @@ int main(int argc, char *argv[]) { "auto_zc", 0, NULL, 0 }, { "auto_zc_fallback", 0, NULL, 0 }, { "size", 1, NULL, 's'}, + { "nthreads", 1, NULL, 0 }, + { "per_io_tasks", 0, NULL, 0 }, { 0, 0, 0, 0 } }; const struct ublk_tgt_ops *ops = NULL; @@ -1493,6 +1613,10 @@ int main(int argc, char *argv[]) ctx.flags |= UBLK_F_AUTO_BUF_REG; if (!strcmp(longopts[option_idx].name, "auto_zc_fallback")) ctx.auto_zc_fallback = 1; + if (!strcmp(longopts[option_idx].name, "nthreads")) + ctx.nthreads = strtol(optarg, NULL, 10); + if (!strcmp(longopts[option_idx].name, "per_io_tasks")) + ctx.per_io_tasks = 1; break; case '?': /* diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h index e34508bf5798b5..6be601536b3d2c 100644 --- a/tools/testing/selftests/ublk/kublk.h +++ b/tools/testing/selftests/ublk/kublk.h @@ -49,11 +49,14 @@ #define UBLKSRV_IO_IDLE_SECS 20 #define UBLK_IO_MAX_BYTES (1 << 20) -#define UBLK_MAX_QUEUES 32 +#define UBLK_MAX_QUEUES_SHIFT 5 +#define UBLK_MAX_QUEUES (1 << UBLK_MAX_QUEUES_SHIFT) +#define UBLK_MAX_THREADS_SHIFT 5 +#define UBLK_MAX_THREADS (1 << UBLK_MAX_THREADS_SHIFT) #define UBLK_QUEUE_DEPTH 1024 #define UBLK_DBG_DEV (1U << 0) -#define UBLK_DBG_QUEUE (1U << 1) +#define UBLK_DBG_THREAD (1U << 1) #define UBLK_DBG_IO_CMD (1U << 2) #define UBLK_DBG_IO (1U << 3) #define UBLK_DBG_CTRL_CMD (1U << 4) @@ -61,6 +64,7 @@ struct ublk_dev; struct ublk_queue; +struct ublk_thread; struct stripe_ctx { /* stripe */ @@ -76,6 +80,7 @@ struct dev_ctx { char tgt_type[16]; unsigned long flags; unsigned nr_hw_queues; + unsigned short nthreads; 
unsigned queue_depth; int dev_id; int nr_files; @@ -85,6 +90,7 @@ struct dev_ctx { unsigned int fg:1; unsigned int recovery:1; unsigned int auto_zc_fallback:1; + unsigned int per_io_tasks:1; int _evtfd; int _shmid; @@ -123,10 +129,14 @@ struct ublk_io { unsigned short flags; unsigned short refs; /* used by target code only */ + int tag; + int result; + unsigned short buf_index; unsigned short tgt_ios; void *private_data; + struct ublk_thread *t; }; struct ublk_tgt_ops { @@ -165,28 +175,39 @@ struct ublk_tgt { struct ublk_queue { int q_id; int q_depth; - unsigned int cmd_inflight; - unsigned int io_inflight; struct ublk_dev *dev; const struct ublk_tgt_ops *tgt_ops; struct ublksrv_io_desc *io_cmd_buf; - struct io_uring ring; + struct ublk_io ios[UBLK_QUEUE_DEPTH]; -#define UBLKSRV_QUEUE_STOPPING (1U << 0) -#define UBLKSRV_QUEUE_IDLE (1U << 1) #define UBLKSRV_NO_BUF (1U << 2) #define UBLKSRV_ZC (1U << 3) #define UBLKSRV_AUTO_BUF_REG (1U << 4) #define UBLKSRV_AUTO_BUF_REG_FALLBACK (1U << 5) unsigned state; - pid_t tid; +}; + +struct ublk_thread { + struct ublk_dev *dev; + struct io_uring ring; + unsigned int cmd_inflight; + unsigned int io_inflight; + pthread_t thread; + unsigned idx; + +#define UBLKSRV_THREAD_STOPPING (1U << 0) +#define UBLKSRV_THREAD_IDLE (1U << 1) + unsigned state; }; struct ublk_dev { struct ublk_tgt tgt; struct ublksrv_ctrl_dev_info dev_info; struct ublk_queue q[UBLK_MAX_QUEUES]; + struct ublk_thread threads[UBLK_MAX_THREADS]; + unsigned nthreads; + unsigned per_io_tasks; int fds[MAX_BACK_FILES + 1]; /* fds[0] points to /dev/ublkcN */ int nr_fds; @@ -211,7 +232,7 @@ struct ublk_dev { extern unsigned int ublk_dbg_mask; -extern int ublk_queue_io_cmd(struct ublk_queue *q, struct ublk_io *io, unsigned tag); +extern int ublk_queue_io_cmd(struct ublk_io *io); static inline int ublk_io_auto_zc_fallback(const struct ublksrv_io_desc *iod) @@ -225,11 +246,14 @@ static inline int is_target_io(__u64 user_data) } static inline __u64 build_user_data(unsigned tag, unsigned op, - unsigned tgt_data, unsigned is_target_io) + unsigned tgt_data, unsigned q_id, unsigned is_target_io) { - assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16)); + /* we only have 7 bits to encode q_id */ + _Static_assert(UBLK_MAX_QUEUES_SHIFT <= 7); + assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7)); - return tag | (op << 16) | (tgt_data << 24) | (__u64)is_target_io << 63; + return tag | (op << 16) | (tgt_data << 24) | + (__u64)q_id << 56 | (__u64)is_target_io << 63; } static inline unsigned int user_data_to_tag(__u64 user_data) @@ -247,6 +271,11 @@ static inline unsigned int user_data_to_tgt_data(__u64 user_data) return (user_data >> 24) & 0xffff; } +static inline unsigned int user_data_to_q_id(__u64 user_data) +{ + return (user_data >> 56) & 0x7f; +} + static inline unsigned short ublk_cmd_op_nr(unsigned int op) { return _IOC_NR(op); @@ -280,17 +309,23 @@ static inline void ublk_dbg(int level, const char *fmt, ...) 
} } -static inline int ublk_queue_alloc_sqes(struct ublk_queue *q, +static inline struct ublk_queue *ublk_io_to_queue(const struct ublk_io *io) +{ + return container_of(io, struct ublk_queue, ios[io->tag]); +} + +static inline int ublk_io_alloc_sqes(struct ublk_io *io, struct io_uring_sqe *sqes[], int nr_sqes) { - unsigned left = io_uring_sq_space_left(&q->ring); + struct io_uring *ring = &io->t->ring; + unsigned left = io_uring_sq_space_left(ring); int i; if (left < nr_sqes) - io_uring_submit(&q->ring); + io_uring_submit(ring); for (i = 0; i < nr_sqes; i++) { - sqes[i] = io_uring_get_sqe(&q->ring); + sqes[i] = io_uring_get_sqe(ring); if (!sqes[i]) return i; } @@ -373,7 +408,7 @@ static inline int ublk_complete_io(struct ublk_queue *q, unsigned tag, int res) ublk_mark_io_done(io, res); - return ublk_queue_io_cmd(q, io, tag); + return ublk_queue_io_cmd(io); } static inline void ublk_queued_tgt_io(struct ublk_queue *q, unsigned tag, int queued) @@ -383,7 +418,7 @@ static inline void ublk_queued_tgt_io(struct ublk_queue *q, unsigned tag, int qu else { struct ublk_io *io = ublk_get_io(q, tag); - q->io_inflight += queued; + io->t->io_inflight += queued; io->tgt_ios = queued; io->result = 0; } @@ -393,7 +428,7 @@ static inline int ublk_completed_tgt_io(struct ublk_queue *q, unsigned tag) { struct ublk_io *io = ublk_get_io(q, tag); - q->io_inflight--; + io->t->io_inflight--; return --io->tgt_ios == 0; } diff --git a/tools/testing/selftests/ublk/null.c b/tools/testing/selftests/ublk/null.c index 44aca31cf2b058..afe0b99d77eec7 100644 --- a/tools/testing/selftests/ublk/null.c +++ b/tools/testing/selftests/ublk/null.c @@ -43,7 +43,7 @@ static int ublk_null_tgt_init(const struct dev_ctx *ctx, struct ublk_dev *dev) } static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod, - struct io_uring_sqe *sqe) + struct io_uring_sqe *sqe, int q_id) { unsigned ublk_op = ublksrv_get_op(iod); @@ -52,7 +52,7 @@ static void __setup_nop_io(int tag, const struct ublksrv_io_desc *iod, sqe->flags |= IOSQE_FIXED_FILE; sqe->rw_flags = IORING_NOP_FIXED_BUFFER | IORING_NOP_INJECT_RESULT; sqe->len = iod->nr_sectors << 9; /* injected result */ - sqe->user_data = build_user_data(tag, ublk_op, 0, 1); + sqe->user_data = build_user_data(tag, ublk_op, 0, q_id, 1); } static int null_queue_zc_io(struct ublk_queue *q, int tag) @@ -60,18 +60,18 @@ static int null_queue_zc_io(struct ublk_queue *q, int tag) const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag); struct io_uring_sqe *sqe[3]; - ublk_queue_alloc_sqes(q, sqe, 3); + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 3); - io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag); + io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index); sqe[0]->user_data = build_user_data(tag, - ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1); + ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1); sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK; - __setup_nop_io(tag, iod, sqe[1]); + __setup_nop_io(tag, iod, sqe[1], q->q_id); sqe[1]->flags |= IOSQE_IO_HARDLINK; - io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, tag); - sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, 1); + io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index); + sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1); // buf register is marked as IOSQE_CQE_SKIP_SUCCESS return 2; @@ -82,8 +82,8 @@ static int null_queue_auto_zc_io(struct ublk_queue *q, int tag) const struct ublksrv_io_desc *iod = 
ublk_get_iod(q, tag); struct io_uring_sqe *sqe[1]; - ublk_queue_alloc_sqes(q, sqe, 1); - __setup_nop_io(tag, iod, sqe[0]); + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, 1); + __setup_nop_io(tag, iod, sqe[0], q->q_id); return 1; } @@ -136,7 +136,7 @@ static unsigned short ublk_null_buf_index(const struct ublk_queue *q, int tag) { if (q->state & UBLKSRV_AUTO_BUF_REG_FALLBACK) return (unsigned short)-1; - return tag; + return q->ios[tag].buf_index; } const struct ublk_tgt_ops null_tgt_ops = { diff --git a/tools/testing/selftests/ublk/stripe.c b/tools/testing/selftests/ublk/stripe.c index 404a143bf3d695..37d50bbf5f5e86 100644 --- a/tools/testing/selftests/ublk/stripe.c +++ b/tools/testing/selftests/ublk/stripe.c @@ -138,13 +138,13 @@ static int stripe_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_ io->private_data = s; calculate_stripe_array(conf, iod, s, base); - ublk_queue_alloc_sqes(q, sqe, s->nr + extra); + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, s->nr + extra); if (zc) { - io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, tag); + io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, io->buf_index); sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK; sqe[0]->user_data = build_user_data(tag, - ublk_cmd_op_nr(sqe[0]->cmd_op), 0, 1); + ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1); } for (i = zc; i < s->nr + extra - zc; i++) { @@ -162,13 +162,14 @@ static int stripe_queue_tgt_rw_io(struct ublk_queue *q, const struct ublksrv_io_ sqe[i]->flags |= IOSQE_IO_HARDLINK; } /* bit63 marks us as tgt io */ - sqe[i]->user_data = build_user_data(tag, ublksrv_get_op(iod), i - zc, 1); + sqe[i]->user_data = build_user_data(tag, ublksrv_get_op(iod), i - zc, q->q_id, 1); } if (zc) { struct io_uring_sqe *unreg = sqe[s->nr + 1]; - io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, tag); - unreg->user_data = build_user_data(tag, ublk_cmd_op_nr(unreg->cmd_op), 0, 1); + io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, io->buf_index); + unreg->user_data = build_user_data( + tag, ublk_cmd_op_nr(unreg->cmd_op), 0, q->q_id, 1); } /* register buffer is skip_success */ @@ -181,11 +182,11 @@ static int handle_flush(struct ublk_queue *q, const struct ublksrv_io_desc *iod, struct io_uring_sqe *sqe[NR_STRIPE]; int i; - ublk_queue_alloc_sqes(q, sqe, conf->nr_files); + ublk_io_alloc_sqes(ublk_get_io(q, tag), sqe, conf->nr_files); for (i = 0; i < conf->nr_files; i++) { io_uring_prep_fsync(sqe[i], i + 1, IORING_FSYNC_DATASYNC); io_uring_sqe_set_flags(sqe[i], IOSQE_FIXED_FILE); - sqe[i]->user_data = build_user_data(tag, UBLK_IO_OP_FLUSH, 0, 1); + sqe[i]->user_data = build_user_data(tag, UBLK_IO_OP_FLUSH, 0, q->q_id, 1); } return conf->nr_files; } diff --git a/tools/testing/selftests/ublk/test_common.sh b/tools/testing/selftests/ublk/test_common.sh index 0145569ee7e9a4..8a4dbd09feb0a8 100755 --- a/tools/testing/selftests/ublk/test_common.sh +++ b/tools/testing/selftests/ublk/test_common.sh @@ -278,6 +278,11 @@ __run_io_and_remove() fio --name=job1 --filename=/dev/ublkb"${dev_id}" --ioengine=libaio \ --rw=randrw --norandommap --iodepth=256 --size="${size}" --numjobs="$(nproc)" \ --runtime=20 --time_based > /dev/null 2>&1 & + fio --name=batchjob --filename=/dev/ublkb"${dev_id}" --ioengine=io_uring \ + --rw=randrw --norandommap --iodepth=256 --size="${size}" \ + --numjobs="$(nproc)" --runtime=20 --time_based \ + --iodepth_batch_submit=32 --iodepth_batch_complete_min=32 \ + --force_async=7 > /dev/null 2>&1 & sleep 2 if [ "${kill_server}" = "yes" ]; then local state diff --git 
a/tools/testing/selftests/ublk/test_generic_12.sh b/tools/testing/selftests/ublk/test_generic_12.sh
new file mode 100755
index 00000000000000..7abbb00d251df9
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_12.sh
@@ -0,0 +1,55 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_12"
+ERR_CODE=0
+
+if ! _have_program bpftrace; then
+	exit "$UBLK_SKIP_CODE"
+fi
+
+_prep_test "null" "do imbalanced load, it should be balanced over I/O threads"
+
+NTHREADS=6
+dev_id=$(_add_ublk_dev -t null -q 4 -d 16 --nthreads $NTHREADS --per_io_tasks)
+_check_add_dev $TID $?
+
+dev_t=$(_get_disk_dev_t "$dev_id")
+bpftrace trace/count_ios_per_tid.bt "$dev_t" > "$UBLK_TMP" 2>&1 &
+btrace_pid=$!
+sleep 2
+
+if ! kill -0 "$btrace_pid" > /dev/null 2>&1; then
+	_cleanup_test "null"
+	exit "$UBLK_SKIP_CODE"
+fi
+
+# do imbalanced I/O on the ublk device
+# pin to cpu 0 to prevent migration/only target one queue
+fio --name=write_seq \
+	--filename=/dev/ublkb"${dev_id}" \
+	--ioengine=libaio --iodepth=16 \
+	--rw=write \
+	--size=512M \
+	--direct=1 \
+	--bs=4k \
+	--cpus_allowed=0 > /dev/null 2>&1
+ERR_CODE=$?
+kill "$btrace_pid"
+wait
+
+# check that every task handles some I/O, even though all I/O was issued
+# from a single CPU. when ublk gets support for round-robin tag
+# allocation, this check can be strengthened to assert that every thread
+# handles the same number of I/Os
+NR_THREADS_THAT_HANDLED_IO=$(grep -c '@' ${UBLK_TMP})
+if [[ $NR_THREADS_THAT_HANDLED_IO -ne $NTHREADS ]]; then
+	echo "only $NR_THREADS_THAT_HANDLED_IO handled I/O! expected $NTHREADS"
+	cat "$UBLK_TMP"
+	ERR_CODE=255
+fi
+
+_cleanup_test "null"
+_show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/test_stress_03.sh b/tools/testing/selftests/ublk/test_stress_03.sh
index 7d728ce507740d..6eef282d569f87 100755
--- a/tools/testing/selftests/ublk/test_stress_03.sh
+++ b/tools/testing/selftests/ublk/test_stress_03.sh
@@ -41,5 +41,13 @@ if _have_feature "AUTO_BUF_REG"; then
 fi
 wait
 
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_remove 8G -t null -q 4 --auto_zc --nthreads 8 --per_io_tasks &
+	ublk_io_and_remove 256M -t loop -q 4 --auto_zc --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_remove 256M -t stripe -q 4 --auto_zc --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+	ublk_io_and_remove 8G -t null -q 4 -z --auto_zc --auto_zc_fallback --nthreads 8 --per_io_tasks &
+fi
+wait
+
 _cleanup_test "stress"
 _show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/test_stress_04.sh b/tools/testing/selftests/ublk/test_stress_04.sh
index 9bcfa64ea1f05c..40d1437ca298f7 100755
--- a/tools/testing/selftests/ublk/test_stress_04.sh
+++ b/tools/testing/selftests/ublk/test_stress_04.sh
@@ -38,6 +38,13 @@ if _have_feature "AUTO_BUF_REG"; then
 	ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
 	ublk_io_and_kill_daemon 8G -t null -q 4 -z --auto_zc --auto_zc_fallback &
 fi
+
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_kill_daemon 8G -t null -q 4 --nthreads 8 --per_io_tasks &
+	ublk_io_and_kill_daemon 256M -t loop -q 4 --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_kill_daemon 256M -t stripe -q 4 --nthreads 8 --per_io_tasks "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+	ublk_io_and_kill_daemon 8G -t null -q 4 --nthreads 8 --per_io_tasks &
+fi
 wait
 
 _cleanup_test "stress"
diff --git a/tools/testing/selftests/ublk/test_stress_05.sh b/tools/testing/selftests/ublk/test_stress_05.sh
index bcfc904cefc602..566cfd90d192ce 100755
--- a/tools/testing/selftests/ublk/test_stress_05.sh
+++ b/tools/testing/selftests/ublk/test_stress_05.sh
@@ -69,5 +69,12 @@ if _have_feature "AUTO_BUF_REG"; then
 	done
 fi
 
+if _have_feature "PER_IO_DAEMON"; then
+	ublk_io_and_remove 8G -t null -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" &
+	ublk_io_and_remove 256M -t loop -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" "${UBLK_BACKFILES[0]}" &
+	ublk_io_and_remove 8G -t null -q 4 --nthreads 8 --per_io_tasks -r 1 -i "$reissue" &
+fi
+wait
+
 _cleanup_test "stress"
 _show_result $TID $ERR_CODE
diff --git a/tools/testing/selftests/ublk/trace/count_ios_per_tid.bt b/tools/testing/selftests/ublk/trace/count_ios_per_tid.bt
new file mode 100644
index 00000000000000..f4aa63ff2938a4
--- /dev/null
+++ b/tools/testing/selftests/ublk/trace/count_ios_per_tid.bt
@@ -0,0 +1,11 @@
+/*
+ * Tabulates and prints I/O completions per thread for the given device
+ *
+ * $1: dev_t
+*/
+tracepoint:block:block_rq_complete
+{
+	if (args.dev == $1) {
+		@[tid] = count();
+	}
+}
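
The long comment in ublk_submit_fetch_commands() above explains the --per_io_tasks policy: lexicographically order all (qid, tag) pairs and hand every Nth entry to the same server thread. Below is a minimal, standalone sketch of that arithmetic only; the io_to_thread() helper and the parameters (which mirror test_generic_12.sh's 4 queues of depth 16 served by 6 threads) are invented for illustration and are not kublk code.

#include <stdio.h>

/*
 * Illustrative only: flatten (q_id, tag) into i = q_id * depth + tag and
 * stride over that list by the thread count, as the kublk comment above
 * describes.  Thread t serves i = t, t + N, t + 2N, ..., which is the
 * same as saying I/O i is served by thread i % N.
 */
static unsigned io_to_thread(unsigned q_id, unsigned tag,
			     unsigned depth, unsigned nthreads)
{
	return (q_id * depth + tag) % nthreads;
}

int main(void)
{
	const unsigned nr_queues = 4, depth = 16, nthreads = 6;
	unsigned ios_per_thread[6] = { 0 };
	unsigned q, tag, t;

	for (q = 0; q < nr_queues; q++)
		for (tag = 0; tag < depth; tag++)
			ios_per_thread[io_to_thread(q, tag, depth, nthreads)]++;

	/* 64 I/Os over 6 threads: four threads get 11, two get 10 */
	for (t = 0; t < nthreads; t++)
		printf("thread %u: %u ios\n", t, ios_per_thread[t]);
	return 0;
}

The same rounded-up division of nr_ios by nthreads is what ublk_thread_init() above uses to size each thread's sparse fixed-buffer table.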
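
The kublk.h hunk above widens build_user_data() so that the queue id rides along in each SQE's user_data, and ublk_handle_cqe() then recovers the queue with user_data_to_q_id() now that a single ring can carry I/O for several queues. A self-contained restatement of that bit layout for experimenting outside the selftest; the uint64_t casts and the main() driver are mine, while the field positions follow the header.

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Field layout mirrored from the kublk.h hunk above:
 * tag[15:0] op[23:16] tgt_data[39:24] q_id[62:56] target-io flag in bit 63
 */
static uint64_t build_user_data(unsigned tag, unsigned op, unsigned tgt_data,
				unsigned q_id, unsigned is_target_io)
{
	assert(!(tag >> 16) && !(op >> 8) && !(tgt_data >> 16) && !(q_id >> 7));
	return tag | (op << 16) | ((uint64_t)tgt_data << 24) |
	       ((uint64_t)q_id << 56) | ((uint64_t)is_target_io << 63);
}

static unsigned user_data_to_tag(uint64_t d)  { return d & 0xffff; }
static unsigned user_data_to_op(uint64_t d)   { return (d >> 16) & 0xff; }
static unsigned user_data_to_q_id(uint64_t d) { return (d >> 56) & 0x7f; }

int main(void)
{
	uint64_t d = build_user_data(5, 0x11, 0, 3, 1);

	printf("user_data 0x%" PRIx64 ": tag %u op 0x%x q_id %u tgt %u\n",
	       d, user_data_to_tag(d), user_data_to_op(d),
	       user_data_to_q_id(d), (unsigned)(d >> 63));
	return 0;
}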
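
ublk_io_to_queue() above recovers the owning queue from a struct ublk_io without storing a back-pointer: since each io records its own tag, container_of(io, struct ublk_queue, ios[io->tag]) subtracts the right offset. A toy version of the same idea follows; all types and the parent_of() macro are invented here, and the variable array index inside offsetof is a GCC/Clang extension, which the selftest's container_of appears to rely on as well.

#include <stddef.h>
#include <stdio.h>

/*
 * Toy illustration of the ublk_io_to_queue() idea: the element knows its
 * own index, so its address minus offsetof(parent, array[index]) yields
 * the parent.  The non-constant index inside offsetof needs GCC/Clang.
 * Types and names here are invented for the example.
 */
#define parent_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_io {
	int tag;
};

struct toy_queue {
	int q_id;
	struct toy_io ios[8];
};

int main(void)
{
	struct toy_queue q = { .q_id = 2 };
	struct toy_io *io = &q.ios[5];

	io->tag = 5;
	printf("recovered q_id: %d\n",
	       parent_of(io, struct toy_queue, ios[io->tag])->q_id);
	return 0;
}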
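
ublk_thread_init() above gives each server thread a private ring created with IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN, registers a sparse fixed-buffer table sized to the worst-case I/O count per thread, and registers the ring fd. A rough liburing-only sketch of that setup, assuming liburing 2.2+ and a kernel new enough for DEFER_TASKRUN; the depth, buffer count and error handling are simplified, and ublk_setup_ring() itself is not part of this diff.

#include <liburing.h>
#include <stdio.h>

/*
 * Rough sketch of a per-thread ring set up the way ublk_thread_init()
 * does above: single issuer + deferred task running, a sparse table of
 * fixed buffers, and the ring fd registered to cut per-call overhead.
 * Depth and buffer count are arbitrary here.
 */
static int setup_thread_ring(struct io_uring *ring, unsigned depth,
			     unsigned nr_sparse_bufs)
{
	struct io_uring_params p = { 0 };
	int ret;

	p.flags = IORING_SETUP_COOP_TASKRUN | IORING_SETUP_SINGLE_ISSUER |
		  IORING_SETUP_DEFER_TASKRUN;

	ret = io_uring_queue_init_params(depth, ring, &p);
	if (ret < 0)
		return ret;

	/* sparse table: the slots exist now, buffers get registered later */
	ret = io_uring_register_buffers_sparse(ring, nr_sparse_bufs);
	if (ret) {
		io_uring_queue_exit(ring);
		return ret;
	}

	io_uring_register_ring_fd(ring);
	return 0;
}

int main(void)
{
	struct io_uring ring;
	int ret = setup_thread_ring(&ring, 128, 16);

	if (ret) {
		fprintf(stderr, "ring setup failed: %d\n", ret);
		return 1;
	}
	io_uring_queue_exit(&ring);
	return 0;
}

SINGLE_ISSUER and DEFER_TASKRUN fit the per-thread model because each ring is created and driven by exactly one server thread.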