aboutsummaryrefslogtreecommitdiffstats
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
4 daysMerge tag 'timers-cleanups-2025-06-08' of ↵Linus Torvalds5-5/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanup from Thomas Gleixner: "The delayed from_timer() API cleanup: The renaming to the timer_*() namespace was delayed due massive conflicts against Linux-next. Now that everything is upstream finish the conversion" * tag 'timers-cleanups-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide, timers: Rename from_timer() to timer_container_of()
4 daysMerge tag 'x86-urgent-2025-06-08' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of x86 fixes: - Cure IO bitmap inconsistencies A failed fork cleans up all resources of the newly created thread via exit_thread(). exit_thread() invokes io_bitmap_exit() which does the IO bitmap cleanups, which unfortunately assume that the cleanup is related to the current task, which is obviously bogus. Make it work correctly - A lockdep fix in the resctrl code removed the clearing of the command buffer in two places, which keeps stale error messages around. Bring them back. - Remove unused trace events" * tag 'x86-urgent-2025-06-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: fs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex x86/iopl: Cure TIF_IO_BITMAP inconsistencies x86/fpu: Remove unused trace events
4 daysMerge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-42/+71
Pull mount fixes from Al Viro: "Various mount-related bugfixes: - split the do_move_mount() checks in subtree-of-our-ns and entire-anon cases and adapt detached mount propagation selftest for mount_setattr - allow clone_private_mount() for a path on real rootfs - fix a race in call of has_locked_children() - fix move_mount propagation graph breakage by MOVE_MOUNT_SET_GROUP - make sure clone_private_mnt() caller has CAP_SYS_ADMIN in the right userns - avoid false negatives in path_overmount() - don't leak MNT_LOCKED from parent to child in finish_automount() - do_change_type(): refuse to operate on unmounted/not ours mounts" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_change_type(): refuse to operate on unmounted/not ours mounts clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns selftests/mount_setattr: adapt detached mount propagation test do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases fs: allow clone_private_mount() for a path on real rootfs fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2) finish_automount(): don't leak MNT_LOCKED from parent to child path_overmount(): avoid false negatives fs/fhandle.c: fix a race in call of has_locked_children()
4 daysMerge tag '6.16-rc-part2-smb3-client-fixes' of ↵Linus Torvalds12-286/+421
git://git.samba.org/sfrench/cifs-2.6 Pull more smb client updates from Steve French: - multichannel/reconnect fixes - move smbdirect (smb over RDMA) defines to fs/smb/common so they will be able to be used in the future more broadly, and a documentation update explaining setting up smbdirect mounts - update email address for Paulo * tag '6.16-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal version number MAINTAINERS, mailmap: Update Paulo Alcantara's email address cifs: add documentation for smbdirect setup cifs: do not disable interface polling on failure cifs: serialize other channels when query server interfaces is pending cifs: deal with the channel loading lag while picking channels smb: client: make use of common smbdirect_socket_parameters smb: smbdirect: introduce smbdirect_socket_parameters smb: client: make use of common smbdirect_socket smb: smbdirect: add smbdirect_socket.h smb: client: make use of common smbdirect.h smb: smbdirect: add smbdirect.h with public structures smb: client: make use of common smbdirect_pdu.h smb: smbdirect: add smbdirect_pdu.h with protocol definitions
4 daystreewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar5-5/+6
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
5 daysMerge tag 'ubifs-for-linus-6.16-rc1' of ↵Linus Torvalds4-4/+13
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull JFFS2 and UBIFS fixes from Richard Weinberger: "JFFS2: - Correctly check return code of jffs2_prealloc_raw_node_refs() UBIFS: - Spelling fixes" * tag 'ubifs-for-linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: jffs2: check jffs2_prealloc_raw_node_refs() result in few other places jffs2: check that raw node were preallocated before writing summary ubifs: Fix grammar in error message
5 daysdo_change_type(): refuse to operate on unmounted/not ours mountsAl Viro1-0/+4
Ensure that propagation settings can only be changed for mounts located in the caller's mount namespace. This change aligns permission checking with the rest of mount(2). Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 07b20889e305 ("beginning of the shared-subtree proper") Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 daysclone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right usernsAl Viro1-0/+3
What we want is to verify there is that clone won't expose something hidden by a mount we wouldn't be able to undo. "Wouldn't be able to undo" may be a result of MNT_LOCKED on a child, but it may also come from lacking admin rights in the userns of the namespace mount belongs to. clone_private_mnt() checks the former, but not the latter. There's a number of rather confusing CAP_SYS_ADMIN checks in various userns during the mount, especially with the new mount API; they serve different purposes and in case of clone_private_mnt() they usually, but not always end up covering the missing check mentioned above. Reviewed-by: Christian Brauner <brauner@kernel.org> Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Fixes: 427215d85e8d ("ovl: prevent private clone if bind mount is not allowed") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 daysdo_move_mount(): split the checks in subtree-of-our-ns and entire-anon casesAl Viro1-21/+25
... and fix the breakage in anon-to-anon case. There are two cases acceptable for do_move_mount() and mixing checks for those is making things hard to follow. One case is move of a subtree in caller's namespace. * source and destination must be in caller's namespace * source must be detachable from parent Another is moving the entire anon namespace elsewhere * source must be the root of anon namespace * target must either in caller's namespace or in a suitable anon namespace (see may_use_mount() for details). * target must not be in the same namespace as source. It's really easier to follow if tests are *not* mixed together... Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 3b5260d12b1f ("Don't propagate mounts into detached trees") Reported-by: Allison Karlitskaya <lis@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 daysfs: allow clone_private_mount() for a path on real rootfsKONDO KAZUMA(近藤 和真)1-10/+11
Mounting overlayfs with a directory on real rootfs (initramfs) as upperdir has failed with following message since commit db04662e2f4f ("fs: allow detached mounts in clone_private_mount()"). [ 4.080134] overlayfs: failed to clone upperpath Overlayfs mount uses clone_private_mount() to create internal mount for the underlying layers. The commit made clone_private_mount() reject real rootfs because it does not have a parent mount and is in the initial mount namespace, that is not an anonymous mount namespace. This issue can be fixed by modifying the permission check of clone_private_mount() following [1]. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: db04662e2f4f ("fs: allow detached mounts in clone_private_mount()") Link: https://lore.kernel.org/all/20250514190252.GQ2023217@ZenIV/ [1] Link: https://lore.kernel.org/all/20250506194849.GT2023217@ZenIV/ Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 daysfix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)Al Viro1-1/+1
9ffb14ef61ba "move_mount: allow to add a mount into an existing group" breaks assertions on ->mnt_share/->mnt_slave. For once, the data structures in question are actually documented. Documentation/filesystem/sharedsubtree.rst: All vfsmounts in a peer group have the same ->mnt_master. If it is non-NULL, they form a contiguous (ordered) segment of slave list. do_set_group() puts a mount into the same place in propagation graph as the old one. As the result, if old mount gets events from somewhere and is not a pure event sink, new one needs to be placed next to the old one in the slave list the old one's on. If it is a pure event sink, we only need to make sure the new one doesn't end up in the middle of some peer group. "move_mount: allow to add a mount into an existing group" ends up putting the new one in the beginning of list; that's definitely not going to be in the middle of anything, so that's fine for case when old is not marked shared. In case when old one _is_ marked shared (i.e. is not a pure event sink), that breaks the assumptions of propagation graph iterators. Put the new mount next to the old one on the list - that does the right thing in "old is marked shared" case and is just as correct as the current behaviour if old is not marked shared (kudos to Pavel for pointing that out - my original suggested fix changed behaviour in the "nor marked" case, which complicated things for no good reason). Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 9ffb14ef61ba ("move_mount: allow to add a mount into an existing group") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 dayspath_overmount(): avoid false negativesAl Viro1-6/+13
Holding namespace_sem is enough to make sure that result remains valid. It is *not* enough to avoid false negatives from __lookup_mnt(). Mounts can be unhashed outside of namespace_sem (stuck children getting detached on final mntput() of lazy-umounted mount) and having an unrelated mount removed from the hash chain while we traverse it may end up with false negative from __lookup_mnt(). We need to sample and recheck the seqlock component of mount_lock... Bug predates the introduction of path_overmount() - it had come from the code in finish_automount() that got abstracted into that helper. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 26df6034fdb2 ("fix automount/automount race properly") Fixes: 6ac392815628 ("fs: allow to mount beneath top mount") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 daysfs/fhandle.c: fix a race in call of has_locked_children()Al Viro1-4/+14
may_decode_fh() is calling has_locked_children() while holding no locks. That's an oopsable race... The rest of the callers are safe since they are holding namespace_sem and are guaranteed a positive refcount on the mount in question. Rename the current has_locked_children() to __has_locked_children(), make it static and switch the fs/namespace.c users to it. Make has_locked_children() a wrapper for __has_locked_children(), calling the latter under read_seqlock_excl(&mount_lock). Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: 620c266f3949 ("fhandle: relax open_by_handle_at() permission checks") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
5 daysMerge tag 'ceph-for-6.16-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds4-11/+25
Pull ceph updates from Ilya Dryomov: - a one-liner that leads to a startling (but also very much rational) performance improvement in cases where an IMA policy with rules that are based on fsmagic matching is enforced - an encryption-related fixup that addresses generic/397 and other fstest failures - a couple of cleanups in CephFS * tag 'ceph-for-6.16-rc1' of https://github.com/ceph/ceph-client: ceph: fix variable dereferenced before check in ceph_umount_begin() ceph: set superblock s_magic for IMA fsmagic matching ceph: cleanup hardcoded constants of file handle size ceph: fix possible integer overflow in ceph_zero_objects() ceph: avoid kernel BUG for encrypted inode with unaligned file size
5 daysMerge tag 'ovl-update-v2-6.16' of ↵Linus Torvalds6-80/+77
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs update from Miklos Szeredi: - Fix a regression in getting the path of an open file (e.g. in /proc/PID/maps) for a nested overlayfs setup (André Almeida) - Support data-only layers and verity in a user namespace (unprivileged composefs use case) - Fix a gcc warning (Kees) - Cleanups * tag 'ovl-update-v2-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: Annotate struct ovl_entry with __counted_by() ovl: Replace offsetof() with struct_size() in ovl_stack_free() ovl: Replace offsetof() with struct_size() in ovl_cache_entry_new() ovl: Check for NULL d_inode() in ovl_dentry_upper() ovl: Use str_on_off() helper in ovl_show_options() ovl: don't require "metacopy=on" for "verity" ovl: relax redirect/metacopy requirements for lower -> data redirect ovl: make redirect/metacopy rejection consistent ovl: Fix nested backing file paths
6 daysceph: fix variable dereferenced before check in ceph_umount_begin()Viacheslav Dubeyko1-2/+1
smatch warnings: fs/ceph/super.c:1042 ceph_umount_begin() warn: variable dereferenced before check 'fsc' (see line 1041) vim +/fsc +1042 fs/ceph/super.c void ceph_umount_begin(struct super_block *sb) { struct ceph_fs_client *fsc = ceph_sb_to_fs_client(sb); doutc(fsc->client, "starting forced umount\n"); ^^^^^^^^^^^ Dereferenced if (!fsc) ^^^^ Checked too late. return; fsc->mount_state = CEPH_MOUNT_SHUTDOWN; __ceph_umount_begin(fsc); } The VFS guarantees that the superblock is still alive when it calls into ceph via ->umount_begin(). Finally, we don't need to check the fsc and it should be valid. This patch simply removes the fsc check. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202503280852.YDB3pxUY-lkp@intel.com/ Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
7 dayscifs: update internal version numberSteve French1-2/+2
to 2.55 Signed-off-by: Steve French <stfrench@microsoft.com>
7 daysMerge tag '6.16-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds10-93/+53
Pull smb server updates from Steve French: "Four smb3 server fixes: - Fix for special character handling when mounting with "posix" - Fix for mounts from Mac for fs that don't provide unique inode numbers - Two cleanup patches (e.g. for crypto calls)" * tag '6.16-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: allow a filename to contain special characters on SMB3.1.1 posix extension ksmbd: provide zero as a unique ID to the Mac client ksmbd: remove unnecessary softdep on crc32 ksmbd: use SHA-256 library API instead of crypto_shash API
7 daysMerge tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefsLinus Torvalds98-1757/+2323
Pull more bcachefs updates from Kent Overstreet: "More bcachefs updates: - More stack usage improvements (~600 bytes) - Define CLASS()es for some commonly used types, and convert most rcu_read_lock() uses to the new lock guards - New introspection: - Superblock error counters are now available in sysfs: previously, they were only visible with 'show-super', which doesn't provide a live view - New tracepoint, error_throw(), which is called any time we return an error and start to unwind - Repair - check_fix_ptrs() can now repair btree node roots - We can now repair when we've somehow ended up with the journal using a superblock bucket - Revert some leftovers from the aborted directory i_size feature, and add repair code: some userspace programs (e.g. sshfs) were getting confused It seems in 6.15 there's a bug where i_nlink on the vfs inode has been getting incorrectly set to 0, with some unfortunate results; list_journal analysis showed bch2_inode_rm() being called (by bch2_evict_inode()) when it clearly should not have been. - bch2_inode_rm() now runs "should we be deleting this inode?" checks that were previously only run when deleting unlinked inodes in recovery - check_subvol() was treating a dangling subvol (pointing to a missing root inode) like a dangling dirent, and deleting it. This was the really unfortunate one: check_subvol() will now recreate the root inode if necessary This took longer to debug than it should have, and we lost several filesystems unnecessarily, because users have been ignoring the release notes and blindly running 'fsck -y'. Debugging required reconstructing what happened through analyzing the journal, when ideally someone would have noticed 'hey, fsck is asking me if I want to repair this: it usually doesn't, maybe I should run this in dry run mode and check what's going on?' As a reminder, fsck errors are being marked as autofix once we've verified, in real world usage, that they're working correctly; blindly running 'fsck -y' on an experimental filesystem is playing with fire Up to this incident we've had an excellent track record of not losing data, so let's try to learn from this one This is a community effort, I wouldn't be able to get this done without the help of all the people QAing and providing excellent bug reports and feedback based on real world usage. But please don't ignore advice and expect me to pick up the pieces If an error isn't marked as autofix, and it /is/ happening in the wild, that's also something I need to know about so we can check it out and add it to the autofix list if repair looks good. I haven't been getting those reports, and I should be; since we don't have any sort of telemetry yet I am absolutely dependent on user reports Now I'll be spending the weekend working on new repair code to see if I can get a filesystem back for a user who didn't have backups" * tag 'bcachefs-2025-06-04' of git://evilpiepirate.org/bcachefs: (69 commits) bcachefs: add cond_resched() to handle_overwrites() bcachefs: Make journal read log message a bit quieter bcachefs: Fix subvol to missing root repair bcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm() bcachefs: delete dead code from may_delete_deleted_inode() bcachefs: Add flags to subvolume_to_text() bcachefs: Fix oops in btree_node_seq_matches() bcachefs: Fix dirent_casefold_mismatch repair bcachefs: Fix bch2_fsck_rename_dirent() for casefold bcachefs: Redo bch2_dirent_init_name() bcachefs: Fix -Wc23-extensions in bch2_check_dirents() bcachefs: Run check_dirents second time if required bcachefs: Run snapshot deletion out of system_long_wq bcachefs: Make check_key_has_snapshot safer bcachefs: BCH_RECOVERY_PASS_NO_RATELIMIT bcachefs: bch2_require_recovery_pass() bcachefs: bch_err_throw() bcachefs: Repair code for directory i_size bcachefs: Kill un-reverted directory i_size code bcachefs: Delete redundant fsck_err() ...
8 daysbcachefs: add cond_resched() to handle_overwrites()Kent Overstreet1-0/+2
Fix soft lockup warnings in btree nodes can. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Make journal read log message a bit quieterKent Overstreet1-5/+5
Users seem to be assuming that the 'dropped unflushed entries' message at the end of journal read indicates some sort of problem, when it does not - we expect there to be entries in the journal that weren't commited, it's purely informational so that we can correlate journal sequence numbers elsewhere when debugging. Shorten the log message a bit to hopefully make this clearer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Fix subvol to missing root repairKent Overstreet1-4/+14
We had a bug where the root inode of a subvolume was erronously deleted: bch2_evict_inode() called bch2_inode_rm(), meaning the VFS inode's i_nlink was somehow set to 0 when it shouldn't have - the inode in the btree indicated it clearly was not unlinked. This has been addressed with additional safety checks in bch2_inode_rm() - pulling in the safety checks we already were doing when deleting unlinked inodes in recovery - but the really disastrous bug was in check_subvols(), which on finding a dangling subvol (subvol with a missing root inode) would delete the subvolume. I assume this bug dates from early check_directory_structure() code, which originally handled subvolumes and normal paths - the idea being that still live contents of the subvolume would get reattached somewhere. But that's incorrect, and disastrously so; deleting a subvolume triggers deleting the snapshot ID it points to, deleting the entire contents. The correct way to repair is to recreate the root inode if it's missing; then any contents will get reattached under that subvolume's lost+found. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Run may_delete_deleted_inode() checks in bch2_inode_rm()Kent Overstreet4-19/+62
We had a bug where bch2_evict_inode() incorrectly called bch2_inode_rm() - the journal clearly showed the inode was not unlinked. We've got checks that we use in recovery when cleaning up deleted inodes, lift them to bch2_inode_rm() as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: delete dead code from may_delete_deleted_inode()Kent Overstreet1-12/+3
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Add flags to subvolume_to_text()Kent Overstreet1-0/+7
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Fix oops in btree_node_seq_matches()Kent Overstreet1-2/+21
btree_update_nodes_written() needs to wait on in-flight writes to old nodes before marking them as freed. But it has no reason to pin those old nodes in memory, so some trickyness ensues. The update we're completing deleted references to those nodes from the btree, so we know if they've been evicted they can't be pulled back in. We just have to check if the nodes we have pointers to are still those old nodes, and haven't been reused. To do that we check the node's "sequence number" (actually a random 64 bit cookie), but that lives in the node's data buffer. 'struct btree' can't be freed until filesystem shutdown (as they're quite small), but the data buffers can be freed or swapped around. Commit 1f88c3567495, which was fixing a kmsan warning, assumed that we could safely do this locklessly with just a READ_ONCE() - if we've got a non-null ptr it would be safe to read from. But that's not true if the data buffer is a vmalloc allocation, so we need to restore the locking that commit deleted (or alternatively RCU free those data buffers, but there's no other reason for that). Fixes: 1f88c3567495 ("bcachefs: Fix a KMSAN splat in btree_update_nodes_written()") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Fix dirent_casefold_mismatch repairKent Overstreet5-118/+162
Instead of simply recreating a mis-casefolded dirent, use the str_hash repair code, which will rename it if necessary - the dirent might have been created again with the correct casefolding. Factor out out bch2_str_hash_repair key() from __bch2_str_hash_check_key() for the new path to use, and export bch2_dirent_create_key() as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Fix bch2_fsck_rename_dirent() for casefoldKent Overstreet3-13/+25
bch2_fsck_renamed_dirent was creating bch_dirent keys open-coded - but we need to use the appropriate helper, if the directory is casefolded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Redo bch2_dirent_init_name()Kent Overstreet2-70/+66
Redo (and simplify somewhat) how casefolded and non casefolded dirents are initialized, and export this to be used by fsck_rename_dirent(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysbcachefs: Fix -Wc23-extensions in bch2_check_dirents()Nathan Chancellor1-1/+2
Clang warns (or errors with CONFIG_WERROR=y): fs/bcachefs/fsck.c:2325:2: error: label followed by a declaration is a C23 extension [-Werror,-Wc23-extensions] 2325 | int ret = bch2_trans_run(c, | ^ On clang-17 and older, this is an unconditional error: fs/bcachefs/fsck.c:2325:2: error: expected expression 2325 | int ret = bch2_trans_run(c, | ^ Move the declaration of ret to the top of the function to resolve both ways this issue manifests. Fixes: c72def523799 ("bcachefs: Run check_dirents second time if required") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
8 daysfs/resctrl: Restore the rdt_last_cmd_clear() calls after acquiring ↵Zeng Heng1-0/+4
rdtgroup_mutex A lockdep fix removed two rdt_last_cmd_clear() calls that were used to clear the last_cmd_status buffer but called without holding the required rdtgroup_mutex. The impacted resctrl commands are writing to the cpus or cpus_list files and creating a new monitor or control group. With stale data in the last_cmd_status buffer the impacted resctrl commands report the stale error on success, or append its own failure message to the stale error on failure. Consequently, restore the rdt_last_cmd_clear() calls after acquiring rdtgroup_mutex. Fixes: c8eafe149530 ("x86/resctrl: Fix potential lockdep warning") Signed-off-by: Zeng Heng <zengheng4@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/all/20250603125828.1590067-1-zengheng4@huawei.com
9 dayscifs: do not disable interface polling on failureShyam Prasad N2-9/+6
When a server has multichannel enabled, we keep polling the server for interfaces periodically. However, when this query fails, we disable the polling. This can be problematic as it takes away the chance for the server to start advertizing again. This change reschedules the delayed work, even if the current call failed. That way, multichannel sessions can recover. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayscifs: serialize other channels when query server interfaces is pendingShyam Prasad N2-6/+19
Today, during smb2_reconnect, session_mutex is released as soon as the tcon is reconnected and is in a good state. However, in case multichannel is enabled, there is also a query of server interfaces that follows. We've seen that this query can race with reconnects of other channels, causing them to step on each other with reconnects. This change extends the hold of session_mutex till after the query of server interfaces is complete. In order to avoid recursive smb2_reconnect checks during query ioctl, this change also introduces a session flag for sessions where such a query is in progress. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayscifs: deal with the channel loading lag while picking channelsShyam Prasad N1-7/+7
Our current approach to select a channel for sending requests is this: 1. iterate all channels to find the min and max queue depth 2. if min and max are not the same, pick the channel with min depth 3. if min and max are same, round robin, as all channels are equally loaded The problem with this approach is that there's a lag between selecting a channel and sending the request (that increases the queue depth on the channel). While these numbers will eventually catch up, there could be a skew in the channel usage, depending on the application's I/O parallelism and the server's speed of handling requests. With sufficient parallelism, this lag can artificially increase the queue depth, thereby impacting the performance negatively. This change will change the step 1 above to start the iteration from the last selected channel. This is to reduce the skew in channel usage even in the presence of this lag. Fixes: ea90708d3cf3 ("cifs: use the least loaded channel for sending requests") Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: client: make use of common smbdirect_socket_parametersStefan Metzmacher4-59/+77
Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: smbdirect: introduce smbdirect_socket_parametersStefan Metzmacher3-0/+23
This is the next step in the direction of a common smbdirect layer. Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: client: make use of common smbdirect_socketStefan Metzmacher3-126/+146
This is the next step in the direction of a common smbdirect layer. Currently only structures are shared, but that will change over time until everything is shared. Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: smbdirect: add smbdirect_socket.hStefan Metzmacher1-0/+41
This abstracts the common smbdirect layer. Currently with just a few things in it, but that will change over time until everything is in common. Will be used in client and server in the next commits Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: client: make use of common smbdirect.hStefan Metzmacher2-15/+9
Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: smbdirect: add smbdirect.h with public structuresStefan Metzmacher1-0/+17
Will be used in client and server in the next commits. Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> CC: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: client: make use of common smbdirect_pdu.hStefan Metzmacher2-62/+19
Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 dayssmb: smbdirect: add smbdirect_pdu.h with protocol definitionsStefan Metzmacher1-0/+55
This is just a start moving into a common smbdirect layer. It will be used in the next commits... Cc: Steve French <smfrench@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: Long Li <longli@microsoft.com> Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Meetakshi Setiya <meetakshisetiyaoss@gmail.com> Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
9 daysMerge tag 'nfs-for-6.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds23-172/+468
Pull NFS clent updates from Anna Schumaker: "New Features: - Implement the Sunrpc rfc2203 rpcsec_gss sequence number cache - Add support for FALLOC_FL_ZERO_RANGE on NFS v4.2 - Add a localio sysfs attribute Stable Fixes: - Fix double-unlock bug in nfs_return_empty_folio() - Don't check for OPEN feature support in v4.1 - Always probe for LOCALIO support asynchronously - Prevent hang on NFS mounts with xprtsec=[m]tls Other Bugfixes: - xattr handlers should check for absent nfs filehandles - Fix setattr caching of TIME_[MODIFY|ACCESS]_SET when timestamps are delegated - Fix listxattr to return selinux security labels - Connect to NFSv3 DS using TLS if MDS connection uses TLS - Clear SB_RDONLY before getting a superblock, and ignore when remounting - Fix incorrect handling of NFS error codes in nfs4_do_mkdir() - Various nfs_localio fixes from Neil Brown that include fixing an rcu compilation error found by older gcc versions. - Update stats on flexfiles pNFS DSes when receiving NFS4ERR_DELAY Cleanups: - Add a refcount tracker for struct net in the nfs_client - Allow FREE_STATEID to clean up delegations - Always set NLINK even if the server doesn't support it - Cleanups to the NFS folio writeback code - Remove dead code from xs_tcp_tls_setup_socket()" * tag 'nfs-for-6.16-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (30 commits) flexfiles/pNFS: update stats on NFS4ERR_DELAY for v4.1 DSes nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointer nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh() nfs_localio: duplicate nfs_close_local_fh() nfs_localio: simplify interface to nfsd for getting nfsd_file nfs_localio: always hold nfsd net ref with nfsd_file ref nfs_localio: use cmpxchg() to install new nfs_file_localio SUNRPC: Remove dead code from xs_tcp_tls_setup_socket() SUNRPC: Prevent hang on NFS mount with xprtsec=[m]tls nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir() nfs: ignore SB_RDONLY when remounting nfs nfs: clear SB_RDONLY before getting superblock NFS: always probe for LOCALIO support asynchronously pnfs/flexfiles: connect to NFSv3 DS using TLS if MDS connection uses TLS NFS: add localio to sysfs nfs: use writeback_iter directly nfs: refactor nfs_do_writepage nfs: don't return AOP_WRITEPAGE_ACTIVATE from nfs_do_writepage nfs: fold nfs_page_async_flush into nfs_do_writepage NFSv4: Always set NLINK even if the server doesn't support it ...
9 daysMerge tag 'v6.16-rc-part1-smb-client-fixes' of ↵Linus Torvalds13-97/+130
git://git.samba.org/sfrench/cifs-2.6 Pull smb client updates from Steve French: - multichannel fixes (mostly reconnect related), and clarification of locking documentation - automount null pointer check fix - fixes to add support for ParentLeaseKey - minor cleanup - smb1/cifs fixes * tag 'v6.16-rc-part1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update the lock ordering comments with new mutex cifs: dns resolution is needed only for primary channel cifs: update dstaddr whenever channel iface is updated cifs: reset connections for all channels when reconnect requested smb: client: use ParentLeaseKey in cifs_do_create smb: client: use ParentLeaseKey in open_cached_dir smb: client: add ParentLeaseKey support cifs: Fix cifs_query_path_info() for Windows NT servers cifs: Fix validation of SMB1 query reparse point response cifs: Correctly set SMB1 SessionKey field in Session Setup Request cifs: Fix encoding of SMB1 Session Setup NTLMSSP Request in non-UNICODE mode smb: client: add NULL check in automount_fullpath smb: client: Remove an unused function and variable
10 daysMerge tag 'mm-stable-2025-06-01-14-06' of ↵Linus Torvalds1-18/+13
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "zram: support algorithm-specific parameters" from Sergey Senozhatsky adds infrastructure for passing algorithm-specific parameters into zram. A single parameter `winbits' is implemented at this time. - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg charging nmi-safe, which is required by BFP, which can operate in NMI context. - "Some random fixes and cleanup to shmem" from Kemeng Shi implements small fixes and cleanups in the shmem code. - "Skip mm selftests instead when kernel features are not present" from Zi Yan fixes some issues in the MM selftest code. - "mm/damon: build-enable essential DAMON components by default" from SeongJae Park reworks DAMON Kconfig to make it easier to enable CONFIG_DAMON. - "sched/numa: add statistics of numa balance task migration" from Libo Chen adds more info into sysfs and procfs files to improve visibility into the NUMA balancer's task migration activity. - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown provides various updates to some of the MM selftests to make them play better with the overall containing framework. * tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits) mm/khugepaged: clean up refcount check using folio_expected_ref_count() selftests/mm: fix test result reporting in gup_longterm selftests/mm: report unique test names for each cow test selftests/mm: add helper for logging test start and results selftests/mm: use standard ksft_finished() in cow and gup_longterm selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled sched/numa: add statistics of numa balance task sched/numa: fix task swap by skipping kernel threads tools/testing: check correct variable in open_procmap() tools/testing/vma: add missing function stub mm/gup: update comment explaining why gup_fast() disables IRQs selftests/mm: two fixes for the pfnmap test mm/khugepaged: fix race with folio split/free using temporary reference mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order mmu_notifiers: remove leftover stub macros selftests/mm: deduplicate test names in madv_populate kcov: rust: add flags for KCOV with Rust mm: rust: make CONFIG_MMU ifdefs more narrow mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables() mm/damon/Kconfig: enable CONFIG_DAMON by default ...
10 daysMerge tag 'gfs2-for-6.16-fix' of ↵Linus Torvalds2-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: - Fix a NULL pointer dereference reported by syzbot * tag 'gfs2-for-6.16-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Don't clear sb->s_fs_info in gfs2_sys_fs_add
10 daysMerge tag 'fuse-update-6.16' of ↵Linus Torvalds8-496/+306
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Remove tmp page copying in writeback path (Joanne). This removes ~300 lines and with that a lot of complexity related to avoiding reclaim related deadlock. The old mechanism is replaced with a mapping flag that tells the MM not to block reclaim waiting for writeback to complete. The MM parts have been reviewed/acked by respective maintainers. - Convert more code to handle large folios (Joanne). This still just adds the code to deal with large folios and does not enable them yet. - Allow invalidating all cached lookups atomically (Luis Henriques). This feature is useful for CernVMFS, which currently does this iteratively. - Align write prefaulting in fuse with generic one (Dave Hansen) - Fix race causing invalid data to be cached when setting attributes on different nodes of a distributed fs (Guang Yuan Wu) - Update documentation for passthrough (Chen Linxuan) - Add fdinfo about the device number associated with an opened /dev/fuse instance (Chen Linxuan) - Increase readdir buffer size (Miklos). This depends on a patch to VFS readdir code that was already merged through Christians tree. - Optimize io-uring request expiration (Joanne) - Misc cleanups * tag 'fuse-update-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) fuse: increase readdir buffer size readdir: supply dir_context.count as readdir buffer size hint fuse: don't allow signals to interrupt getdents copying fuse: support large folios for writeback fuse: support large folios for readahead fuse: support large folios for queued writes fuse: support large folios for stores fuse: support large folios for symlinks fuse: support large folios for folio reads fuse: support large folios for writethrough writes fuse: refactor fuse_fill_write_pages() fuse: support large folios for retrieves fuse: support copying large folios fs: fuse: add dev id to /dev/fuse fdinfo docs: filesystems: add fuse-passthrough.rst MAINTAINERS: update filter of FUSE documentation fuse: fix race between concurrent setattrs from multiple nodes fuse: remove tmp folio for writebacks and internal rb tree mm: skip folio reclaim in legacy memcg contexts for deadlockable mappings fuse: optimize over-io-uring request expiration check ...
10 dayscifs: update the lock ordering comments with new mutexShyam Prasad N1-5/+8
The lock ordering rules listed as comments in cifsglob.h were missing some lock details and also the fid_lock. Updated those notes in this commit. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
10 daysMerge tag 'vfs-6.16-rc1.netfs' of ↵Linus Torvalds26-399/+449
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull netfs updates from Christian Brauner: - The main API document has been extensively updated/rewritten - Fix an oops in write-retry due to mis-resetting the I/O iterator - Fix the recording of transferred bytes for short DIO reads - Fix a request's work item to not require a reference, thereby avoiding the need to get rid of it in BH/IRQ context - Fix waiting and waking to be consistent about the waitqueue used - Remove NETFS_SREQ_SEEK_DATA_READ, NETFS_INVALID_WRITE, NETFS_ICTX_WRITETHROUGH, NETFS_READ_HOLE_CLEAR, NETFS_RREQ_DONT_UNLOCK_FOLIOS, and NETFS_RREQ_BLOCKED - Reorder structs to eliminate holes - Remove netfs_io_request::ractl - Only provide proc_link field if CONFIG_PROC_FS=y - Remove folio_queue::marks3 - Fix undifferentiation of DIO reads from unbuffered reads * tag 'vfs-6.16-rc1.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix undifferentiation of DIO reads from unbuffered reads netfs: Fix wait/wake to be consistent about the waitqueue used netfs: Fix the request's work item to not require a ref netfs: Fix setting of transferred bytes with short DIO reads netfs: Fix oops in write-retry from mis-resetting the subreq iterator fs/netfs: remove unused flag NETFS_RREQ_BLOCKED fs/netfs: remove unused flag NETFS_RREQ_DONT_UNLOCK_FOLIOS folio_queue: remove unused field `marks3` fs/netfs: declare field `proc_link` only if CONFIG_PROC_FS=y fs/netfs: remove `netfs_io_request.ractl` fs/netfs: reorder struct fields to eliminate holes fs/netfs: remove unused enum choice NETFS_READ_HOLE_CLEAR fs/netfs: remove unused flag NETFS_ICTX_WRITETHROUGH fs/netfs: remove unused source NETFS_INVALID_WRITE fs/netfs: remove unused flag NETFS_SREQ_SEEK_DATA_READ
10 daysMerge tag 'vfs-6.16-rc2.fixes' of ↵Linus Torvalds3-3/+23
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know how to decode a connected non-dir dentry fail the request - Use repr(transparent) to ensure identical layout between the C and Rust implementation of struct file - Add a missing xas_pause() into the dax code employing wait_entry_unlocked_exclusive() - Fix FOP_DONTCACHE which we disabled for v6.15. A folio could get redirtied and/or scheduled for writeback after the initial dropbehind test. Change the test accordingly to handle these cases so we can re-enable FOP_DONTCACHE again * tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exportfs: require ->fh_to_parent() to encode connectable file handles rust: file: improve safety comments rust: file: mark `LocalFile` as `repr(transparent)` fs/dax: Fix "don't skip locked entries when scanning entries" iomap: don't lose folio dropbehind state for overwrites mm/filemap: unify dropbehind flag testing and clearing mm/filemap: unify read/write dropbehind naming Revert "Disable FOP_DONTCACHE for now due to bugs" mm/filemap: use filemap_end_dropbehind() for read invalidation mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
10 dayscifs: dns resolution is needed only for primary channelShyam Prasad N1-1/+2
When calling cifs_reconnect, before the connection to the server is reestablished, the code today does a DNS resolution and updates server->dstaddr. However, this is not necessary for secondary channels. Secondary channels use the interface list returned by the server to decide which address to connect to. And that happens after tcon is reconnected and server interfaces are requested. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayscifs: update dstaddr whenever channel iface is updatedShyam Prasad N1-0/+4
When the server interface info changes (more common in clustered servers like Azure Files), the per-channel iface gets updated. However, this did not update the corresponding dstaddr. As a result these channels will still connect (or try connecting) to older addresses. Fixes: b54034a73baf ("cifs: during reconnect, update interface if necessary") Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayscifs: reset connections for all channels when reconnect requestedShyam Prasad N1-0/+7
cifs_reconnect can be called with a flag to mark the session as needing reconnect too. When this is done, we expect the connections of all channels to be reconnected too, which is not happening today. Without doing this, we have seen bad things happen when primary and secondary channels are connected to different servers (in case of cloud services like Azure Files SMB). This change would force all connections to reconnect as well, not just the sessions and tcons. Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
10 daysbcachefs: Run check_dirents second time if requiredKent Overstreet4-27/+49
If we move a key backwards, we'll need a second pass to run the rest of the fsck checks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: Run snapshot deletion out of system_long_wqKent Overstreet1-1/+1
We don't want this running out of the same workqueue, and blocking, writes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: Make check_key_has_snapshot saferKent Overstreet2-22/+37
Snapshot deletion v2 added sentinal values for deleted snapshots, so "key for deleted snapshot" - i.e. snapshot deletion missed something - is safe to repair automatically. But if we find a key for a missing snapshot we have no idea what happened, and we shouldn't delete it unless we're very sure that everything else is consistent. So hook it up to the new bch2_require_recovery_pass(), we'll now only delete if snapshots and subvolumes have recenlty been checked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: BCH_RECOVERY_PASS_NO_RATELIMITKent Overstreet5-13/+53
Add a superblock flag to temporarily disable ratelimiting for a recovery pass. This will be used to make check_key_has_snapshot safer: we don't want to delete a key for a missing snapshot unless we know that the snapshots and subvolumes btrees are consistent, i.e. check_snapshots and check_subvols have run recently. Changing those btrees - creating/deleting a subvolume or snapshot - will set the "disable ratelimit" flag, i.e. ensuring that those passes run if check_key_has_snapshot discovers an error. We're only disabling ratelimiting in the snapshot/subvol delete paths, we're not so concerned about the create paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: bch2_require_recovery_pass()Kent Overstreet3-2/+37
Add a helper for requiring that a recovery pass has already run: either run it directly, if we're still in recovery, or if we're not in recovery check if it has run recently and schedule it if it hasn't. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: bch_err_throw()Kent Overstreet60-405/+459
Add a tracepoint for any time we return an error and unwind. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: Repair code for directory i_sizeKent Overstreet2-1/+10
We had a bug due due to an incomplete revert of the patch implementing directory i_size (summing up the size of the dirents), leading to completely screwy i_size values that underflow. Most userspace programs don't seem to care (e.g. du ignores it), but it turns out this broke sshfs, so needs to be repaired. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: Kill un-reverted directory i_size codeKent Overstreet3-14/+6
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: Delete redundant fsck_err()Kent Overstreet2-18/+0
'inode_has_wrong_backpointer'; we have more specific errors for every case afterwards. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 daysbcachefs: Convert BUG() to errorKent Overstreet1-1/+2
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
10 dayssmb: client: use ParentLeaseKey in cifs_do_createHenrique Carvalho1-0/+23
Implement ParentLeaseKey logic in cifs_do_create() by looking up the parent cfid, copying its lease key into the fid struct, and setting the appropriate lease flag. Fixes: f047390a097e ("CIFS: Add create lease v2 context for SMB3") Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayssmb: client: use ParentLeaseKey in open_cached_dirHenrique Carvalho1-1/+23
Implement ParentLeaseKey logic in open_cached_dir() by looking up the parent cfid, copying its lease key into the fid struct, and setting the appropriate lease flag. Fixes: f047390a097e ("CIFS: Add create lease v2 context for SMB3") Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayssmb: client: add ParentLeaseKey supportHenrique Carvalho3-6/+18
According to MS-SMB2 3.2.4.3.8, when opening a file the client must lookup its parent directory, copy that entry’s LeaseKey into ParentLeaseKey, and set SMB2_LEASE_FLAG_PARENT_LEASE_KEY_SET. Extend lease context functions to carry a parent_lease_key and lease_flags and to add them to the lease context buffer accordingly in smb3_create_lease_buf. Also add a parent_lease_key field to struct cifs_fid and lease_flags to cifs_open_parms. Only applies to the SMB 3.x dialect family. Fixes: f047390a097e ("CIFS: Add create lease v2 context for SMB3") Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayscifs: Fix cifs_query_path_info() for Windows NT serversPali Rohár1-0/+8
For TRANS2 QUERY_PATH_INFO request when the path does not exist, the Windows NT SMB server returns error response STATUS_OBJECT_NAME_NOT_FOUND or ERRDOS/ERRbadfile without the SMBFLG_RESPONSE flag set. Similarly it returns STATUS_DELETE_PENDING when the file is being deleted. And looks like that any error response from TRANS2 QUERY_PATH_INFO does not have SMBFLG_RESPONSE flag set. So relax check in check_smb_hdr() for detecting if the packet is response for this special case. This change fixes stat() operation against Windows NT SMB servers and also all operations which depends on -ENOENT result from stat like creat() or mkdir(). Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayscifs: Fix validation of SMB1 query reparse point responsePali Rohár1-2/+18
Validate the SMB1 query reparse point response per [MS-CIFS] section 2.2.7.2 NT_TRANSACT_IOCTL. NT_TRANSACT_IOCTL response contains one word long setup data after which is ByteCount member. So check that SetupCount is 1 before trying to read and use ByteCount member. Output setup data contains ReturnedDataLen member which is the output length of executed IOCTL command by remote system. So check that output was not truncated before transferring over network. Change MaxSetupCount of NT_TRANSACT_IOCTL request from 4 to 1 as io_rsp structure already expects one word long output setup data. This should prevent server sending incompatible structure (in case it would be extended in future, which is unlikely). Change MaxParameterCount of NT_TRANSACT_IOCTL request from 2 to 0 as NT IOCTL does not have any documented output parameters and this function does not parse any output parameters at all. Fixes: ed3e0a149b58 ("smb: client: implement ->query_reparse_point() for SMB1") Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayscifs: Correctly set SMB1 SessionKey field in Session Setup RequestPali Rohár4-3/+6
[MS-CIFS] specification in section 2.2.4.53.1 where is described SMB_COM_SESSION_SETUP_ANDX Request, for SessionKey field says: The client MUST set this field to be equal to the SessionKey field in the SMB_COM_NEGOTIATE Response for this SMB connection. Linux SMB client currently set this field to zero. This is working fine against Windows NT SMB servers thanks to [MS-CIFS] product behavior <94>: Windows NT Server ignores the client's SessionKey. For compatibility with [MS-CIFS], set this SessionKey field in Session Setup Request to value retrieved from Negotiate response. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
10 dayscifs: Fix encoding of SMB1 Session Setup NTLMSSP Request in non-UNICODE modePali Rohár1-10/+10
SMB1 Session Setup NTLMSSP Request in non-UNICODE mode is similar to UNICODE mode, just strings are encoded in ASCII and not in UTF-16. With this change it is possible to setup SMB1 session with NTLM authentication in non-UNICODE mode with Windows SMB server. This change fixes mounting SMB1 servers with -o nounicode mount option together with -o sec=ntlmssp mount option (which is the default sec=). Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
11 dayssmb: client: add NULL check in automount_fullpathRuben Devos1-0/+3
page is checked for null in __build_path_from_dentry_optional_prefix when tcon->origin_fullpath is not set. However, the check is missing when it is set. Add a check to prevent a potential NULL pointer dereference. Signed-off-by: Ruben Devos <devosruben6@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
11 daysceph: set superblock s_magic for IMA fsmagic matchingDennis Marttinen1-0/+1
The CephFS kernel driver forgets to set the filesystem magic signature in its superblock. As a result, IMA policy rules based on fsmagic matching do not apply as intended. This causes a major performance regression in Talos Linux [1] when mounting CephFS volumes, such as when deploying Rook Ceph [2]. Talos Linux ships a hardened kernel with the following IMA policy (irrelevant lines omitted): [...] dont_measure fsmagic=0xc36400 # CEPH_SUPER_MAGIC [...] measure func=FILE_CHECK mask=^MAY_READ euid=0 measure func=FILE_CHECK mask=^MAY_READ uid=0 [...] Currently, IMA compares 0xc36400 == 0x0 for CephFS files, resulting in all files opened with O_RDONLY or O_RDWR getting measured with SHA512 on every open(2): 10 69990c87e8af323d47e2d6ae4... ima-ng sha512:<hash> /data/cephfs/test-file Since O_WRONLY is rare, this results in an order of magnitude lower performance than expected for practically all file operations. Properly setting CEPH_SUPER_MAGIC in the CephFS superblock resolves the regression. Tests performed on a 3x replicated Ceph v19.3.0 cluster across three i5-7200U nodes each equipped with one Micron 7400 MAX M.2 disk (BlueStore) and Gigabit ethernet, on Talos Linux v1.10.2: FS-Mark 3.3 Test: 500 Files, Empty Files/s > Higher Is Better 6.12.27-talos . 16.6 |==== +twelho patch . 208.4 |==================================================== FS-Mark 3.3 Test: 500 Files, 1KB Size Files/s > Higher Is Better 6.12.27-talos . 15.6 |======= +twelho patch . 118.6 |==================================================== FS-Mark 3.3 Test: 500 Files, 32 Sub Dirs, 1MB Size Files/s > Higher Is Better 6.12.27-talos . 12.7 |=============== +twelho patch . 44.7 |===================================================== IO500 [3] 2fcd6d6 results (benchmarks within variance omitted): | IO500 benchmark | 6.12.27-talos | +twelho patch | Speedup | |-------------------|----------------|----------------|-----------| | mdtest-easy-write | 0.018524 kIOPS | 1.135027 kIOPS | 6027.33 % | | mdtest-hard-write | 0.018498 kIOPS | 0.973312 kIOPS | 5161.71 % | | ior-easy-read | 0.064727 GiB/s | 0.155324 GiB/s | 139.97 % | | mdtest-hard-read | 0.018246 kIOPS | 0.780800 kIOPS | 4179.29 % | This applies outside of synthetic benchmarks as well, for example, the time to rsync a 55 MiB directory with ~12k of mostly small files drops from an unusable 10m5s to a reasonable 26s (23x the throughput). [1]: https://www.talos.dev/ [2]: https://www.talos.dev/v1.10/kubernetes-guides/configuration/ceph-with-rook/ [3]: https://github.com/IO500/io500 Cc: stable@vger.kernel.org Signed-off-by: Dennis Marttinen <twelho@welho.tech> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
11 daysceph: cleanup hardcoded constants of file handle sizeViacheslav Dubeyko1-8/+13
The ceph/export.c contains very confusing logic of file handle size calculation based on hardcoded values. This patch makes the cleanup of this logic by means of introduction the named constants. Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
11 daysceph: fix possible integer overflow in ceph_zero_objects()Dmitry Kandybka1-1/+1
In 'ceph_zero_objects', promote 'object_size' to 'u64' to avoid possible integer overflow. Compile tested only. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Dmitry Kandybka <d.kandybka@gmail.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
11 daysceph: avoid kernel BUG for encrypted inode with unaligned file sizeViacheslav Dubeyko1-0/+9
The generic/397 test hits a BUG_ON for the case of encrypted inode with unaligned file size (for example, 33K or 1K): [ 877.737811] run fstests generic/397 at 2025-01-03 12:34:40 [ 877.875761] libceph: mon0 (2)127.0.0.1:40674 session established [ 877.876130] libceph: client4614 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 877.991965] libceph: mon0 (2)127.0.0.1:40674 session established [ 877.992334] libceph: client4617 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.017234] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.017594] libceph: client4620 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.031394] xfs_io (pid 18988) is setting deprecated v1 encryption policy; recommend upgrading to v2. [ 878.054528] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.054892] libceph: client4623 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.070287] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.070704] libceph: client4626 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.264586] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.265258] libceph: client4629 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.374578] -----------[ cut here ]------------ [ 878.374586] kernel BUG at net/ceph/messenger.c:1070! [ 878.375150] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 878.378145] CPU: 2 UID: 0 PID: 4759 Comm: kworker/2:9 Not tainted 6.13.0-rc5+ #1 [ 878.378969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 878.380167] Workqueue: ceph-msgr ceph_con_workfn [ 878.381639] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50 [ 878.382152] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 [ 878.383928] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287 [ 878.384447] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000 [ 878.385129] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378 [ 878.385839] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000 [ 878.386539] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030 [ 878.387203] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001 [ 878.387877] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000 [ 878.388663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 878.389212] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0 [ 878.389921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 878.390620] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 878.391307] PKRU: 55555554 [ 878.391567] Call Trace: [ 878.391807] <TASK> [ 878.392021] ? show_regs+0x71/0x90 [ 878.392391] ? die+0x38/0xa0 [ 878.392667] ? do_trap+0xdb/0x100 [ 878.392981] ? do_error_trap+0x75/0xb0 [ 878.393372] ? ceph_msg_data_cursor_init+0x42/0x50 [ 878.393842] ? exc_invalid_op+0x53/0x80 [ 878.394232] ? ceph_msg_data_cursor_init+0x42/0x50 [ 878.394694] ? asm_exc_invalid_op+0x1b/0x20 [ 878.395099] ? ceph_msg_data_cursor_init+0x42/0x50 [ 878.395583] ? ceph_con_v2_try_read+0xd16/0x2220 [ 878.396027] ? _raw_spin_unlock+0xe/0x40 [ 878.396428] ? raw_spin_rq_unlock+0x10/0x40 [ 878.396842] ? finish_task_switch.isra.0+0x97/0x310 [ 878.397338] ? __schedule+0x44b/0x16b0 [ 878.397738] ceph_con_workfn+0x326/0x750 [ 878.398121] process_one_work+0x188/0x3d0 [ 878.398522] ? __pfx_worker_thread+0x10/0x10 [ 878.398929] worker_thread+0x2b5/0x3c0 [ 878.399310] ? __pfx_worker_thread+0x10/0x10 [ 878.399727] kthread+0xe1/0x120 [ 878.400031] ? __pfx_kthread+0x10/0x10 [ 878.400431] ret_from_fork+0x43/0x70 [ 878.400771] ? __pfx_kthread+0x10/0x10 [ 878.401127] ret_from_fork_asm+0x1a/0x30 [ 878.401543] </TASK> [ 878.401760] Modules linked in: hctr2 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 essiv authenc mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel joydev crypto_simd cryptd rapl input_leds psmouse sch_fq_codel serio_raw bochs i2c_piix4 floppy qemu_fw_cfg i2c_smbus mac_hid pata_acpi msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables [ 878.407319] ---[ end trace 0000000000000000 ]--- [ 878.407775] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50 [ 878.408317] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 [ 878.410087] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287 [ 878.410609] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000 [ 878.411318] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378 [ 878.412014] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000 [ 878.412735] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030 [ 878.413438] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001 [ 878.414121] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000 [ 878.414935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 878.415516] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0 [ 878.416211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 878.416907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 878.417630] PKRU: 55555554 (gdb) l *ceph_msg_data_cursor_init+0x42 0xffffffff823b45a2 is in ceph_msg_data_cursor_init (net/ceph/messenger.c:1070). 1065 1066 void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor, 1067 struct ceph_msg *msg, size_t length) 1068 { 1069 BUG_ON(!length); 1070 BUG_ON(length > msg->data_length); 1071 BUG_ON(!msg->num_data_items); 1072 1073 cursor->total_resid = length; 1074 cursor->data = msg->data; The issue takes place because of this: [ 202.628853] libceph: net/ceph/messenger_v2.c:2034 prepare_sparse_read_data(): msg->data_length 33792, msg->sparse_read_total 36864 1070 BUG_ON(length > msg->data_length); The generic/397 test (xfstests) executes such steps: (1) create encrypted files and directories; (2) access the created files and folders with encryption key; (3) access the created files and folders without encryption key. The issue takes place in this portion of code: if (IS_ENCRYPTED(inode)) { struct page **pages; size_t page_off; err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off); if (err < 0) { doutc(cl, "%llx.%llx failed to allocate pages, %d\n", ceph_vinop(inode), err); goto out; } /* should always give us a page-aligned read */ WARN_ON_ONCE(page_off); len = err; err = 0; osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false, false); The reason of the issue is that subreq->io_iter.count keeps unaligned value of length: [ 347.751182] lib/iov_iter.c:1185 __iov_iter_get_pages_alloc(): maxsize 36864, maxpages 4294967295, start 18446659367320516064 [ 347.752808] lib/iov_iter.c:1196 __iov_iter_get_pages_alloc(): maxsize 33792, maxpages 4294967295, start 18446659367320516064 [ 347.754394] lib/iov_iter.c:1015 iter_folioq_get_pages(): maxsize 33792, maxpages 4294967295, extracted 0, _start_offset 18446659367320516064 This patch simply assigns the aligned value to subreq->io_iter.count before calling iov_iter_get_pages_alloc2(). [ idryomov: tag the comment with FIXME to make it clear that it's only a workaround for netfslib not coexisting with fscrypt nicely (this is also noted in another pre-existing comment) ] Cc: David Howells <dhowells@redhat.com> Cc: stable@vger.kernel.org Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
11 daysntfs3: use folios more in ntfs_compress_write()Matthew Wilcox (Oracle)1-18/+13
Remove the local 'page' variable and do everything in terms of folios. Removes the last user of copy_page_from_iter_atomic() and a hidden call to compound_head() in ClearPageDirty(). Link: https://lkml.kernel.org/r/20250514170607.3000994-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 daysbcachefs: Add better logging to fsck_rename_dirent()Kent Overstreet1-3/+8
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Replace rcu_read_lock() with guardsKent Overstreet41-525/+344
The new guard(), scoped_guard() allow for more natural code. Some of the uses with creative flow control have been left. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: CLASS(btree_trans)Kent Overstreet1-10/+21
Allow btree_trans to be used with CLASS(). Automatic cleanup, instead of manually calling bch2_trans_put(). We don't use DEFINE_CLASS because using a static inline for the constructor breaks bch2_trans_get()'s use of __func__, so we have to open code it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysMerge tag 'mm-nonmm-stable-2025-05-31-15-28' of ↵Linus Torvalds19-43/+83
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...
11 daysbcachefs: CLASS(darray)Kent Overstreet1-0/+28
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: CLASS(printbuf)Kent Overstreet1-0/+8
Add a DEFINE_CLASS() for printbufs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: sysfs trigger_journal_commitKent Overstreet1-0/+5
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: sysfs trigger_emergency_read_onlyKent Overstreet1-0/+13
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: darray_find(), darray_find_p()Kent Overstreet5-33/+36
New helpers to avoid open coded loops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Journal keys are retained until shutdown, or journal replay finishesKent Overstreet1-4/+0
If we don't finish journal replay we need to keep journal keys around until the filesystem shuts down - otherwise e.g. -o norecovery, various tools (dump, list) break, and eventually we'll be doing journal replay in the background. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Improve error printing in btree_node_check_topology()Kent Overstreet1-36/+35
We had a bug report where the errors from btree_node_check_topology() don't seem to be getting printed; log_fsck_err() does some fancy ratelimiting-type stuff that we don't want here. Instead, just use bch2_count_fsck_err(); this is simpler, and modelled after how we're currently handling bucket ref update errors in buckets.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: bch2_readdir() now calls str_hash_check_key()Kent Overstreet3-4/+10
More self healing code: readdir will now notice if there are dirents hashed incorrectly, and it'll repair them if errors=fix_safe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: bch2_str_hash_check_key() may now be called without snapshots_seenKent Overstreet3-6/+28
We don't track snapshot overwrites outside of fsck, so for this to be called at runtime outside of fsck we need to create it on demand, when we have repair to do. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: __bch2_insert_snapshot_whiteouts() refactoringKent Overstreet2-52/+30
Now uses bch2_get_snapshot_overwrites(), and much shorter. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: bch2_get_snapshot_overwrites()Kent Overstreet3-1/+60
New helper for getting a list of snapshot IDs that have overwritten a given key. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: bch2_dev_journal_bucket_delete()Kent Overstreet4-5/+82
Recover from "journal and btree in same bucket". Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Runtime self healing for keys for deleted snapshotsKent Overstreet1-7/+18
If snapshot deletion incorrectly missing some keys and leaves keys for deleted snapshots, that causes a bit of a problem for data move - we can't move an extent for a nonexistent snapshot, because the extent might have to be fragmented, and maintaining correct visibility in child snapshots doesn't work if it doesn't have a snapshot. Previously we'd just skip these keys, but it turns out that causes copygc to spin. So we need runtime self healing, i.e. calling check_key_has_snapshot() from the data move path. Snapshot deletion v2 included sentinal values for deleted snapshot nodes, so this is quite safe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Don't unlock trans before data_update_init()Kent Overstreet2-7/+9
data_update_init() does need to do btree operations, delay doing the unlock-before-io. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
11 daysbcachefs: Use bch2_err_matches() for BCH_ERR_fsck_(fix|ignore)Kent Overstreet4-29/+36
We'll be adding subtypes of these errors, and new error code tracing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
12 daysMerge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds6-179/+181
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal Clapinski provides some tuning of the (sadly, hard-coded (more sadly, not auto-tuned)) thresholds for our invokation of proactive compaction. In a simple test case, the reduction of a guest VM's memory consumption was dramatic. - "Minor cleanups and improvements to swap freeing code" from Kemeng Shi provides some code cleaups and a small efficiency improvement to this part of our swap handling code. - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin adds the ability for a ptracer to modify syscalls arguments. At this time we can alter only "system call information that are used by strace system call tampering, namely, syscall number, syscall arguments, and syscall return value. This series should have been incorporated into mm.git's "non-MM" branch, but I goofed. - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl against /proc/pid/pagemap. This permits CRIU to more efficiently get at the info about guard regions. - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan implements that fix. No runtime effect is expected because validate_page_before_insert() happens to fix up this error. - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David Hildenbrand basically brings uprobe text poking into the current decade. Remove a bunch of hand-rolled implementation in favor of using more current facilities. - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman Khandual provides enhancements and generalizations to the pte dumping code. This might be needed when 128-bit Page Table Descriptors are enabled for ARM. - "Always call constructor for kernel page tables" from Kevin Brodsky ensures that the ctor/dtor is always called for kernel pgtables, as it already is for user pgtables. This permits the addition of more functionality such as "insert hooks to protect page tables". This change does result in various architectures performing unnecesary work, but this is fixed up where it is anticipated to occur. - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice Ryhl adds plumbing to permit Rust access to core MM structures. - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo Stoakes takes advantage of some VMA merging opportunities which we've been missing for 15 years. - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from SeongJae Park optimizes process_madvise()'s TLB flushing. Instead of flushing each address range in the provided iovec, we batch the flushing across all the iovec entries. The syscall's cost was approximately halved with a microbenchmark which was designed to load this particular operation. - "Track node vacancy to reduce worst case allocation counts" from Sidhartha Kumar makes the maple tree smarter about its node preallocation. stress-ng mmap performance increased by single-digit percentages and the amount of unnecessarily preallocated memory was dramaticelly reduced. - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes a few unnecessary things which Baoquan noted when reading the code. - ""Enhance sysfs handling for memory hotplug in weighted interleave" from Rakie Kim "enhances the weighted interleave policy in the memory management subsystem by improving sysfs handling, fixing memory leaks, and introducing dynamic sysfs updates for memory hotplug support". Fixes things on error paths which we are unlikely to hit. - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory" from SeongJae Park introduces new DAMOS quota goal metrics which eliminate the manual tuning which is required when utilizing DAMON for memory tiering. - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He provides cleanups and small efficiency improvements which Baoquan found via code inspection. - "vmscan: enforce mems_effective during demotion" from Gregory Price changes reclaim to respect cpuset.mems_effective during demotion when possible. because presently, reclaim explicitly ignores cpuset.mems_effective when demoting, which may cause the cpuset settings to violated. This is useful for isolating workloads on a multi-tenant system from certain classes of memory more consistently. - "Clean up split_huge_pmd_locked() and remove unnecessary folio pointers" from Gavin Guo provides minor cleanups and efficiency gains in in the huge page splitting and migrating code. - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache for `struct mem_cgroup', yielding improved memory utilization. - "add max arg to swappiness in memory.reclaim and lru_gen" from Zhongkun He adds a new "max" argument to the "swappiness=" argument for memory.reclaim MGLRU's lru_gen. This directs proactive reclaim to reclaim from only anon folios rather than file-backed folios. - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the first step on the path to permitting the kernel to maintain existing VMs while replacing the host kernel via file-based kexec. At this time only memblock's reserve_mem is preserved. - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides and uses a smarter way of looping over a pfn range. By skipping ranges of invalid pfns. - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning when a task is pinned a single NUMA mode. Dramatic performance benefits were seen in some real world cases. - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank Garg addresses a warning which occurs during memory compaction when using JFS. - "move all VMA allocation, freeing and duplication logic to mm" from Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more appropriate mm/vma.c. - "mm, swap: clean up swap cache mapping helper" from Kairui Song provides code consolidation and cleanups related to the folio_index() function. - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that. - "memcg: Fix test_memcg_min/low test failures" from Waiman Long addresses some bogus failures which are being reported by the test_memcontrol selftest. - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo Stoakes commences the deprecation of file_operations.mmap() in favor of the new file_operations.mmap_prepare(). The latter is more restrictive and prevents drivers from messing with things in ways which, amongst other problems, may defeat VMA merging. - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples the per-cpu memcg charge cache from the objcg's one. This is a step along the way to making memcg and objcg charging NMI-safe, which is a BPF requirement. - "mm/damon: minor fixups and improvements for code, tests, and documents" from SeongJae Park is yet another batch of miscellaneous DAMON changes. Fix and improve minor problems in code, tests and documents. - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg stats to be irq safe. Another step along the way to making memcg charging and stats updates NMI-safe, a BPF requirement. - "Let unmap_hugepage_range() and several related functions take folio instead of page" from Fan Ni provides folio conversions in the hugetlb code. * tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits) mm: pcp: increase pcp->free_count threshold to trigger free_high mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range() mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page mm/hugetlb: pass folio instead of page to unmap_ref_private() memcg: objcg stock trylock without irq disabling memcg: no stock lock for cpu hot-unplug memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs memcg: make count_memcg_events re-entrant safe against irqs memcg: make mod_memcg_state re-entrant safe against irqs memcg: move preempt disable to callers of memcg_rstat_updated memcg: memcg_rstat_updated re-entrant safe against irqs mm: khugepaged: decouple SHMEM and file folios' collapse selftests/eventfd: correct test name and improve messages alloc_tag: check mem_profiling_support in alloc_tag_init Docs/damon: update titles and brief introductions to explain DAMOS selftests/damon/_damon_sysfs: read tried regions directories in order mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject() mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat() mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs ...
13 daysMerge tag 'pull-automount' of ↵Linus Torvalds6-35/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull automount updates from Al Viro: "Automount wart removal A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attched mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount. Should've done it that way from the very beginning... This is a flagday change, thankfully there are very few instances. vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives" * tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill vfs_submount() saner calling conventions for ->d_automount()
13 daysMerge tag 'pull-ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-177/+139
Pull UFS updates from Al Viro: "The bulk of this is Eric's conversion of UFS to new mount API, with a bit of cleanups from me. I hoped to get stricter sanity checks on superblock flags into that pile, but... next cycle, hopefully" * tag 'pull-ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ufs: convert ufs to the new mount API ufs: reject multiple conflicting -o ufstype=... on mount ufs: split ->s_mount_opt - don't mix flavour and on-error
13 daysMerge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds3-20/+4
Pull mount propagation fix from Al Viro: "6.15 allowed mount propagation to destinations in detached trees; unfortunately, that breaks existing userland, so the old behaviour needs to be restored. It's not exactly a revert - the original behaviour had a bug, where existence of detached tree might disrupt propagation between locations not in detached trees. Thankfully, userland did not depend upon that bug, so we want to keep the fix" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Don't propagate mounts into detached trees
13 daysgfs2: Don't clear sb->s_fs_info in gfs2_sys_fs_addAndrew Price2-2/+3
When gfs2_sys_fs_add() fails, it sets sb->s_fs_info to NULL on its error path (see commit 0d515210b696 ("GFS2: Add kobject release method")). The intention seems to be to prevent dereferencing sb->s_fs_info once the object pointed to has been deallocated, but that would be better achieved by setting the pointer to NULL in free_sbd(). As a consequence, when the call to gfs2_sys_fs_add() fails in gfs2_fill_super(), sdp = GFS2_SB(inode) will evaluate to NULL in iput() -> gfs2_drop_inode(), and accessing sdp->sd_flags will be a NULL pointer dereference. Fix that by only setting sb->s_fs_info to NULL when actually freeing the object pointed to in free_sbd(). Fixes: ae9f3bd8259a ("gfs2: replace sd_aspace with sd_inode") Reported-by: syzbot+b12826218502df019f9d@syzkaller.appspotmail.com Signed-off-by: Andrew Price <anprice@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
13 daysMerge tag 'f2fs-for-6.16-rc1' of ↵Linus Torvalds23-1729/+1960
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, Matthew converted most of page operations to using folio. Beyond the work, we've applied some performance tunings such as GC and linear lookup, in addition to enhancing fault injection and sanity checks. Enhancements: - large number of folio conversions - add a control to turn on/off the linear lookup for performance - tune GC logics for zoned block device - improve fault injection and sanity checks Bug fixes: - handle error cases of memory donation - fix to correct check conditions in f2fs_cross_rename - fix to skip f2fs_balance_fs() if checkpoint is disabled - don't over-report free space or inodes in statvfs - prevent the current section from being selected as a victim during GC - fix to calculate first_zoned_segno correctly - fix to avoid inconsistence between SIT and SSA for zoned block device As usual, there are several debugging patches and clean-ups as well" * tag 'f2fs-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (195 commits) f2fs: fix to correct check conditions in f2fs_cross_rename f2fs: use d_inode(dentry) cleanup dentry->d_inode f2fs: fix to skip f2fs_balance_fs() if checkpoint is disabled f2fs: clean up to check bi_status w/ BLK_STS_OK f2fs: introduce is_{meta,node}_folio f2fs: add ckpt_valid_blocks to the section entry f2fs: add a method for calculating the remaining blocks in the current segment in LFS mode. f2fs: introduce FAULT_VMALLOC f2fs: use vmalloc instead of kvmalloc in .init_{,de}compress_ctx f2fs: add f2fs_bug_on() in f2fs_quota_read() f2fs: add f2fs_bug_on() to detect potential bug f2fs: remove unused sbi argument from checksum functions f2fs: fix 32-bits hexademical number in fault injection doc f2fs: don't over-report free space or inodes in statvfs f2fs: return bool from __write_node_folio f2fs: simplify return value handling in f2fs_fsync_node_pages f2fs: always unlock the page in f2fs_write_single_data_page f2fs: remove wbc->for_reclaim handling f2fs: return bool from __f2fs_write_meta_folio f2fs: fix to return correct error number in f2fs_sync_node_pages() ...
13 daysbcachefs: Mark bch_errcode helpers __attribute__((const))Kent Overstreet2-3/+7
These don't access global memory or defer pointer arguments - this enables CSE optimizations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Add missing printbuf_reset() in bch2_check_dirent_inode_dirent()Kent Overstreet1-0/+1
We were accidentally including the contents from the previous fsck_err(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: sysfs/errorsKent Overstreet3-0/+29
Make the superblock error counters available in sysfs; the only other way they can be seen is 'show-super', but we don't write the superblock every time the error count gets incremented. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: bch2_check_fix_ptrs() can now repair btree rootsKent Overstreet2-22/+54
This is straightforward enough: check_fix_ptrs() currently only runs before we go RW, so updating the btree root pointer in c->btree_roots suffices - it'll be written out in the first journal write we do. For that, do_bch2_trans_commit_to_journal_replay() now handles JSET_ENTRY_btree_root entries. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Include b->ob.nr in cached_btree_node_to_text()Kent Overstreet1-0/+2
We have a bug report that looks like we might be leaking open buckets - let's check if they got left attached to the cached btree node. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Move devs_sorted to alloc_requestKent Overstreet3-21/+24
More stack usage work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: reduce stack usage in alloc_sectors_start()Kent Overstreet1-2/+2
with typical config options, variables in different inline functions aren't sharing stack space - and these are slowpaths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: bch2_alloc_v4_to_text()Kent Overstreet2-5/+18
Specialize the .to_text() for alloc_v4, to avoid the temporary on the stack for conversion from old versions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Tweak bch2_data_update_init() for stack usageKent Overstreet1-44/+67
- Separate out a slowpath for bkey_nocow_lock() - Don't call bch2_bkey_ptrs_c() or loop over pointers more than necessary Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: kill replicas_sectors arg to __trigger_extent()Kent Overstreet1-13/+10
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Don't stack allocate bch_writepage_stateKent Overstreet1-19/+11
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: factor out break_cycle_fail()Kent Overstreet1-25/+28
More stack usage work. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: btree_node_missing_err()Kent Overstreet1-12/+18
Factor out an error path for a small stack usage improvement. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Kill bkey_buf in btree_path_down()Kent Overstreet2-22/+18
Allocate some (smaller) temporary storage in btree_trans for this - btree_path_down() is in our max-stack call stack. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Add missing error logging in delete_dead_inodes()Kent Overstreet1-0/+1
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Fix misaligned bucket check in journal space calculationsKent Overstreet1-9/+10
Fix an assertion pop in the tiering_misaligned test: rounding down to bucket size at the end of the journal space calculations leaves cur_entry_sectors == 0, which is incorrect with !cur_entry_err. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Fix incorrect multiple dev check in journal write pathKent Overstreet1-1/+11
It's uncomon to have multiple devices with journalling only on a subset, but can be specified with the 'data_allowed' option. We need to know if we're doing data/metadata writes to multiple devices, as that requires issuing flushes before the journal writes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Catch data_update_done events in trace_io_move_start_failKent Overstreet1-3/+3
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: io_move_evacuate_bucket tracepoint, counterKent Overstreet3-0/+24
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: trace_io_move_predKent Overstreet2-26/+65
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Fix infinite loop in journal_entry_btree_keys_to_text()Kent Overstreet1-0/+4
Fix an infinite loop when bkey_i->k.u64s is 0. This only happens in userspace, where 'bcachefs list_journal' can print the entire contents of the journal, and non-dirty entries aren't validated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Journal read error message improvementsKent Overstreet2-89/+140
- Don't print a checksum error when we first read a journal entry: we print a checksum error later if we'll be using the journal entry. - Continuing with the theme of of improving error messages and grouping errors into a single log message per error, print a single 'checksum error' message per journal entry, and use bch2_journal_ptr_to_text() to print out where on the device it was. - Factor out checksum error messages and checking for missing journal entries into helpers, bch2_journal_read() has gotten obnoxiously big. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysfs/dax: Fix "don't skip locked entries when scanning entries"Alistair Popple1-1/+1
Commit 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning entries") introduced a new function, wait_entry_unlocked_exclusive(), which waits for the current entry to become unlocked without advancing the XArray iterator state. Waiting for the entry to become unlocked requires dropping the XArray lock. This requires calling xas_pause() prior to dropping the lock which leaves the xas in a suitable state for the next iteration. However this has the side-effect of advancing the xas state to the next index. Normally this isn't an issue because xas_for_each() contains code to detect this state and thus avoid advancing the index a second time on the next loop iteration. However both callers of and wait_entry_unlocked_exclusive() itself subsequently use the xas state to reload the entry. As xas_pause() updated the state to the next index this will cause the current entry which is being waited on to be skipped. This caused the following warning to fire intermittently when running xftest generic/068 on an XFS filesystem with FS DAX enabled: [ 35.067397] ------------[ cut here ]------------ [ 35.068229] WARNING: CPU: 21 PID: 1640 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0xd8/0x1e0 [ 35.069717] Modules linked in: nd_pmem dax_pmem nd_btt nd_e820 libnvdimm [ 35.071006] CPU: 21 UID: 0 PID: 1640 Comm: fstest Not tainted 6.15.0-rc7+ #77 PREEMPT(voluntary) [ 35.072613] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/204 [ 35.074845] RIP: 0010:truncate_folio_batch_exceptionals+0xd8/0x1e0 [ 35.075962] Code: a1 00 00 00 f6 47 0d 20 0f 84 97 00 00 00 4c 63 e8 41 39 c4 7f 0b eb 61 49 83 c5 01 45 39 ec 7e 58 42 f68 [ 35.079522] RSP: 0018:ffffb04e426c7850 EFLAGS: 00010202 [ 35.080359] RAX: 0000000000000000 RBX: ffff9d21e3481908 RCX: ffffb04e426c77f4 [ 35.081477] RDX: ffffb04e426c79e8 RSI: ffffb04e426c79e0 RDI: ffff9d21e34816e8 [ 35.082590] RBP: ffffb04e426c79e0 R08: 0000000000000001 R09: 0000000000000003 [ 35.083733] R10: 0000000000000000 R11: 822b53c0f7a49868 R12: 000000000000001f [ 35.084850] R13: 0000000000000000 R14: ffffb04e426c78e8 R15: fffffffffffffffe [ 35.085953] FS: 00007f9134c87740(0000) GS:ffff9d22abba0000(0000) knlGS:0000000000000000 [ 35.087346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 35.088244] CR2: 00007f9134c86000 CR3: 000000040afff000 CR4: 00000000000006f0 [ 35.089354] Call Trace: [ 35.089749] <TASK> [ 35.090168] truncate_inode_pages_range+0xfc/0x4d0 [ 35.091078] truncate_pagecache+0x47/0x60 [ 35.091735] xfs_setattr_size+0xc7/0x3e0 [ 35.092648] xfs_vn_setattr+0x1ea/0x270 [ 35.093437] notify_change+0x1f4/0x510 [ 35.094219] ? do_truncate+0x97/0xe0 [ 35.094879] do_truncate+0x97/0xe0 [ 35.095640] path_openat+0xabd/0xca0 [ 35.096278] do_filp_open+0xd7/0x190 [ 35.096860] do_sys_openat2+0x8a/0xe0 [ 35.097459] __x64_sys_openat+0x6d/0xa0 [ 35.098076] do_syscall_64+0xbb/0x1d0 [ 35.098647] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 35.099444] RIP: 0033:0x7f9134d81fc1 [ 35.100033] Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 2a 26 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff5 [ 35.102993] RSP: 002b:00007ffcd41e0d10 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 [ 35.104263] RAX: ffffffffffffffda RBX: 0000000000000242 RCX: 00007f9134d81fc1 [ 35.105452] RDX: 0000000000000242 RSI: 00007ffcd41e1200 RDI: 00000000ffffff9c [ 35.106663] RBP: 00007ffcd41e1200 R08: 0000000000000000 R09: 0000000000000064 [ 35.107923] R10: 00000000000001a4 R11: 0000000000000202 R12: 0000000000000066 [ 35.109112] R13: 0000000000100000 R14: 0000000000100000 R15: 0000000000000400 [ 35.110357] </TASK> [ 35.110769] irq event stamp: 8415587 [ 35.111486] hardirqs last enabled at (8415599): [<ffffffff8d74b562>] __up_console_sem+0x52/0x60 [ 35.113067] hardirqs last disabled at (8415610): [<ffffffff8d74b547>] __up_console_sem+0x37/0x60 [ 35.114575] softirqs last enabled at (8415300): [<ffffffff8d6ac625>] handle_softirqs+0x315/0x3f0 [ 35.115933] softirqs last disabled at (8415291): [<ffffffff8d6ac811>] __irq_exit_rcu+0xa1/0xc0 [ 35.117316] ---[ end trace 0000000000000000 ]--- Fix this by using xas_reset() instead, which is equivalent in implementation to xas_pause() but does not advance the XArray state. Fixes: 6be3e21d25ca ("fs/dax: don't skip locked entries when scanning entries") Signed-off-by: Alistair Popple <apopple@nvidia.com> Link: https://lore.kernel.org/20250523043749.1460780-1-apopple@nvidia.com Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: "Matthew Wilcow (Oracle)" <willy@infradead.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
14 daysMerge tag 'fs_for_v6.16-rc1' of ↵Linus Torvalds6-43/+66
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2 and isofs updates from Jan Kara: - isofs fix of handling of particularly formatted Rock Ridge timestamps - Add deprecation notice about support of DAX in ext2 filesystem driver * tag 'fs_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Deprecate DAX isofs: fix Y2038 and Y2156 issues in Rock Ridge TF entry
14 daysMerge tag 'fsnotify_for_v6.16-rc1' of ↵Linus Torvalds3-27/+35
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "Two fanotify cleanups and support for watching namespace-owned filesystems by namespace admins (most useful for being able to watch for new mounts / unmounts happening within a user namespace)" * tag 'fsnotify_for_v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fanotify: support watching filesystems and mounts inside userns fanotify: remove redundant permission checks fanotify: Drop use of flex array in fanotify_fh
14 daysMerge tag 'driver-core-6.16-rc1' of ↵Linus Torvalds4-23/+35
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Greg KH: "Here are the driver core / kernfs changes for 6.16-rc1. Not a huge number of changes this development cycle, here's the summary of what is included in here: - kernfs locking tweaks, pushing some global locks down into a per-fs image lock - rust driver core and pci device bindings added for new features. - sysfs const work for bin_attributes. The final churn of switching away from and removing the transitional struct members, "read_new", "write_new" and "bin_attrs_new" will come after the merge window to avoid unnecesary merge conflicts. - auxbus device creation helpers added - fauxbus fix for creating sysfs files after the probe completed properly - other tiny updates for driver core things. All of these have been in linux-next for over a week with no reported issues" * tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: kernfs: Relax constraint in draining guard Documentation: embargoed-hardware-issues.rst: Remove myself drivers: hv: fix up const issue with vmbus_chan_bin_attrs firmware_loader: use SHA-256 library API instead of crypto_shash API docs: debugfs: do not recommend debugfs_remove_recursive PM: wakeup: Do not expose 4 device wakeup source APIs kernfs: switch global kernfs_rename_lock to per-fs lock kernfs: switch global kernfs_idr_lock to per-fs lock driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create() sysfs: constify attribute_group::bin_attrs sysfs: constify bin_attribute argument of bin_attribute::read/write() software node: Correct a OOB check in software_node_get_reference_args() devres: simplify devm_kstrdup() using devm_kmemdup() platform: replace magic number with macro PLATFORM_DEVID_NONE component: do not try to unbind unbound components driver core: auxiliary bus: add device creation helpers driver core: faux: Add sysfs groups after probing
2025-05-29fuse: increase readdir buffer sizeMiklos Szeredi1-19/+14
Increase the buffer size to the count requested by userspace. This improves performance. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
2025-05-29readdir: supply dir_context.count as readdir buffer size hintMiklos Szeredi3-16/+26
This is a preparation for large readdir buffers in fuse. Simply setting the fuse buffer size to the userspace buffer size should work, the record sizes are similar (fuse's is slightly larger than libc's, so no overflow should ever happen). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jaco Kroon <jaco@uls.co.za>
2025-05-29fuse: don't allow signals to interrupt getdents copyingMiklos Szeredi2-5/+17
When getting the directory contents, the entries are first fetched to a kernel buffer, then they are copied to userspace with dir_emit(). This second phase is non-blocking as long as the userspace buffer is not paged out, making it interruptible makes zero sense. Overload d_type as flags, since it only uses 4 bits from 32. Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for writebackJoanne Koong1-4/+8
Add support for folios larger than one page size for writeback. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for readaheadJoanne Koong1-9/+29
Add support for folios larger than one page size for readahead. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for queued writesJoanne Koong1-4/+7
Add support for folios larger than one page size for queued writes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for storesJoanne Koong1-7/+12
Add support for folios larger than one page size for stores. Also change variable naming from "this_num" to "nr_bytes". Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for symlinksJoanne Koong1-4/+4
Support large folios for symlinks and change the name from fuse_getlink_page() to fuse_getlink_folio(). Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for folio readsJoanne Koong1-1/+1
Add support for folios larger than one page size for folio reads into the page cache. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for writethrough writesJoanne Koong1-5/+10
Add support for folios larger than one page size for writethrough writes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: refactor fuse_fill_write_pages()Joanne Koong1-12/+10
Refactor the logic in fuse_fill_write_pages() for copying out write data. This will make the future change for supporting large folios for writes easier. No functional changes. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support large folios for retrievesJoanne Koong1-10/+15
Add support for folios larger than one page size for retrieves. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-29fuse: support copying large foliosJoanne Koong2-50/+49
Currently, all folios associated with fuse are one page size. As part of the work to enable large folios, this commit adds support for copying to/from folios larger than one page size. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-05-28Merge tag 'net-next-6.16' of ↵Linus Torvalds8-4/+428
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-28flexfiles/pNFS: update stats on NFS4ERR_DELAY for v4.1 DSesTigran Mkrtchyan1-0/+2
On NFS4ERR_DELAY nfs slient updates its stats, but misses for flexfiles v4.1 DSes. Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: change nfsd_file_put_local() to take a pointer to __rcu pointerNeilBrown4-24/+24
Instead of calling xchg() and unrcu_pointer() before nfsd_file_put_local(), we now pass pointer to the __rcu pointer and call xchg() and unrcu_pointer() inside that function. Where unrcu_pointer() is currently called the internals of "struct nfsd_file" are not known and that causes older compilers such as gcc-8 to complain. In some cases we have a __kernel (aka normal) pointer not an __rcu pointer so we need to cast it to __rcu first. This is strictly a weakening so no information is lost. Somewhat surprisingly, this cast is accepted by gcc-8. This has the pleasing result that the cmpxchg() which sets ro_file and rw_file, and also the xchg() which clears them, are both now in the nfsd code. Reported-by: Pali Rohár <pali@kernel.org> Reported-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: protect race between nfs_uuid_put() and nfs_close_local_fh()NeilBrown1-25/+56
nfs_uuid_put() and nfs_close_local_fh() can race if a "struct nfs_file_localio" is released at the same time that nfsd calls nfs_localio_invalidate_clients(). It is important that neither of these functions completes after the other has started looking at a given nfs_file_localio and before it finishes. If nfs_uuid_put() exits while nfs_close_local_fh() is closing ro_file and rw_file it could return to __nfd_file_cache_purge() while some files are still referenced so the purge may not succeed. If nfs_close_local_fh() exits while nfsd_uuid_put() is still closing the files then the "struct nfs_file_localio" could be freed while nfsd_uuid_put() is still looking at it. This side is currently handled by copying the pointers out of ro_file and rw_file before deleting from the list in nfsd_uuid. We need to preserve this while ensuring that nfsd_uuid_put() does wait for nfs_close_local_fh(). This patch use nfl->uuid and nfl->list to provide the required interlock. nfs_uuid_put() removes the nfs_file_localio from the list, then drops locks and puts the two files, then reclaims the spinlock and sets ->nfs_uuid to NULL. nfs_close_local_fh() operates in the reverse order, setting ->nfs_uuid to NULL, then closing the files, then unlinking from the list. If nfs_uuid_put() finds that ->nfs_uuid is already NULL, it waits for the nfs_file_localio to be removed from the list. If nfs_close_local_fh() find that it has already been unlinked it waits for ->nfs_uuid to become NULL. This ensure that one of the two tries to close the files, but they each waits for the other. As nfs_uuid_put() is making the list empty, change from a list_for_each_safe loop to a while that always takes the first entry. This makes the intent more clear. Also don't move the list to a temporary local list as this would defeat the guarantees required for the interlock. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: duplicate nfs_close_local_fh()NeilBrown1-1/+20
nfs_close_local_fh() is called from two different places for quite different use case. It is called from nfs_uuid_put() when the nfs_uuid is being detached - possibly because the nfs server is not longer serving that filesystem. In this case there will always be an nfs_uuid and so rcu_read_lock() is not needed. It is also called when the nfs_file_localio is no longer needed. In this case there may not be an active nfs_uuid. These two can race, and handling the race properly while avoiding excessive locking will require different handling on each side. This patch prepares the way by opencoding nfs_close_local_fh() into nfs_uuid_put(), then simplifying the code there as befits the context. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: simplify interface to nfsd for getting nfsd_fileNeilBrown3-44/+55
The nfsd_localio_operations structure contains nfsd_file_get() to get a reference to an nfsd_file. This is only used in one place, where nfsd_open_local_fh() is also used. This patch combines the two, calling nfsd_open_local_fh() passing a pointer to where the nfsd_file pointer might be stored. If there is a pointer there an nfsd_file_get() can get a reference, that reference is returned. If not a new nfsd_file is acquired, stored at the pointer, and returned. When we store a reference we also increase the refcount on the net, as that refcount is decrements when we clear the stored pointer. We now get an extra reference *before* storing the new nfsd_file at the given location. This avoids possible races with the nfsd_file being freed before the final reference can be taken. This patch moves the rcu_dereference() needed after fetching from ro_file or rw_file into the nfsd code where the 'struct nfs_file' is fully defined. This avoids an error reported by older versions of gcc such as gcc-8 which complain about rcu_dereference() use in contexts where the structure (which will supposedly be accessed) is not fully defined. Reported-by: Pali Rohár <pali@kernel.org> Reported-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr> Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: always hold nfsd net ref with nfsd_file refNeilBrown5-7/+33
Having separate nfsd_file_put and nfsd_file_put_local in struct nfsd_localio_operations doesn't make much sense. The difference is that nfsd_file_put doesn't drop a reference to the nfs_net which is what keeps nfsd from shutting down. Currently, if nfsd tries to shutdown it will invalidate the files stored in the list from the nfs_uuid and this will drop all references to the nfsd net that the client holds. But the client could still hold some references to nfsd_files for active IO. So nfsd might think is has completely shut down local IO, but hasn't and has no way to wait for those active IO requests to complete. So this patch changes nfsd_file_get to nfsd_file_get_local and has it increase the ref count on the nfsd net and it replaces all calls to ->nfsd_put_file to ->nfsd_put_file_local. It also changes ->nfsd_open_local_fh to return with the refcount on the net elevated precisely when a valid nfsd_file is returned. This means that whenever the client holds a valid nfsd_file, there will be an associated count on the nfsd net, and so the count can only reach zero when all nfsd_files have been returned. nfs_local_file_put() is changed to call nfs_to_nfsd_file_put_local() instead of replacing calls to one with calls to the other because this will help a later patch which changes nfs_to_nfsd_file_put_local() to take an __rcu pointer while nfs_local_file_put() doesn't. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs_localio: use cmpxchg() to install new nfs_file_localioNeilBrown2-30/+20
Rather than using nfs_uuid.lock to protect installing a new ro_file or rw_file, change to use cmpxchg(). Removing the file already uses xchg() so this improves symmetry and also makes the code a little simpler. Also remove the optimisation of not taking the lock, and not removing the nfs_file_localio from the linked list, when both ->ro_file and ->rw_file are already NULL. Given that ->nfs_uuid was not NULL, it is extremely unlikely that neither ->ro_file or ->rw_file is NULL so this optimisation can be of little value and it complicates understanding of the code - why can the list_del_init() be skipped? Finally, move the assignment of NULL to ->nfs_uuid until after the last action on the nfs_file_localio (the list_del_init). As soon as this is NULL a racing nfs_close_local_fh() can bypass all the locking and go on to free the nfs_file_localio, so we must be certain to be finished with it first. Fixes: 86e00412254a ("nfs: cache all open LOCALIO nfsd_file(s) in client") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: fix incorrect handling of large-number NFS errors in nfs4_do_mkdir()NeilBrown1-12/+20
A recent commit introduced nfs4_do_mkdir() which reports an error from nfs4_call_sync() by returning it with ERR_PTR(). This is a problem as nfs4_call_sync() can return negative NFS-specific errors with values larger than MAX_ERRNO (4095). One example is NFS4ERR_DELAY which has value 10008. This "pointer" gets to PTR_ERR_OR_ZERO() in nfs4_proc_mkdir() which chooses ZERO because it isn't in the range of value errors. Ultimately the pointer is dereferenced. This patch changes nfs4_do_mkdir() to report the dentry pointer and status separately - pointer as a return value, status in an "int *" parameter. The same separation is used for _nfs4_proc_mkdir() and the two are combined only in nfs4_proc_mkdir() after the status has passed through nfs4_handle_exception(), which ensures the error code does not exceed MAX_ERRNO. It also fixes a problem in the even when nfs4_handle_exception() updated the error value, the original 'alias' was still returned. Reported-by: Anna Schumaker <anna@kernel.org> Fixes: 8376583b84a1 ("nfs: change mkdir inode_operation to return alternate dentry if needed.") Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: ignore SB_RDONLY when remounting nfsLi Lingfeng1-0/+10
In some scenarios, when mounting NFS, more than one superblock may be created. The final superblock used is the last one created, but only the first superblock carries the ro flag passed from user space. If a ro flag is added to the superblock via remount, it will trigger the issue described in Link[1]. Link[2] attempted to address this by marking the superblock as ro during the initial mount. However, this introduced a new problem in scenarios where multiple mount points share the same superblock: [root@a ~]# mount /dev/sdb /mnt/sdb [root@a ~]# echo "/mnt/sdb *(rw,no_root_squash)" > /etc/exports [root@a ~]# echo "/mnt/sdb/test_dir2 *(ro,no_root_squash)" >> /etc/exports [root@a ~]# systemctl restart nfs-server [root@a ~]# mount -t nfs -o rw 127.0.0.1:/mnt/sdb/test_dir1 /mnt/test_mp1 [root@a ~]# mount | grep nfs4 127.0.0.1:/mnt/sdb/test_dir1 on /mnt/test_mp1 type nfs4 (rw,relatime,... [root@a ~]# mount -t nfs -o ro 127.0.0.1:/mnt/sdb/test_dir2 /mnt/test_mp2 [root@a ~]# mount | grep nfs4 127.0.0.1:/mnt/sdb/test_dir1 on /mnt/test_mp1 type nfs4 (ro,relatime,... 127.0.0.1:/mnt/sdb/test_dir2 on /mnt/test_mp2 type nfs4 (ro,relatime,... [root@a ~]# When mounting the second NFS, the shared superblock is marked as ro, causing the previous NFS mount to become read-only. To resolve both issues, the ro flag is no longer applied to the superblock during remount. Instead, the ro flag on the mount is used to control whether the mount point is read-only. Fixes: 281cad46b34d ("NFS: Create a submount rpc_op") Link[1]: https://lore.kernel.org/all/20240604112636.236517-3-lilingfeng@huaweicloud.com/ Link[2]: https://lore.kernel.org/all/20241130035818.1459775-1-lilingfeng3@huawei.com/ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: clear SB_RDONLY before getting superblockLi Lingfeng1-0/+9
As described in the link, commit 52cb7f8f1778 ("nfs: ignore SB_RDONLY when mounting nfs") removed the check for the ro flag when determining whether to share the superblock, which caused issues when mounting different subdirectories under the same export directory via NFSv3. However, this change did not affect NFSv4. For NFSv3: 1) A single superblock is created for the initial mount. 2) When mounted read-only, this superblock carries the SB_RDONLY flag. 3) Before commit 52cb7f8f1778 ("nfs: ignore SB_RDONLY when mounting nfs"): Subsequent rw mounts would not share the existing ro superblock due to flag mismatch, creating a new superblock without SB_RDONLY. After the commit: The SB_RDONLY flag is ignored during superblock comparison, and this leads to sharing the existing superblock even for rw mounts. Ultimately results in write operations being rejected at the VFS layer. For NFSv4: 1) Multiple superblocks are created and the last one will be kept. 2) The actually used superblock for ro mounts doesn't carry SB_RDONLY flag. Therefore, commit 52cb7f8f1778 doesn't affect NFSv4 mounts. Clear SB_RDONLY before getting superblock when NFS_MOUNT_UNSHARED is not set to fix it. Fixes: 52cb7f8f1778 ("nfs: ignore SB_RDONLY when mounting nfs") Closes: https://lore.kernel.org/all/12d7ea53-1202-4e21-a7ef-431c94758ce5@app.fastmail.com/T/ Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFS: always probe for LOCALIO support asynchronouslyMike Snitzer4-5/+6
It was reported that NFS client mounts of AWS Elastic File System (EFS) volumes is slow, this is because the AWS firewall disallows LOCALIO (because it doesn't consider the use of NFS_LOCALIO_PROGRAM valid), see: https://bugzilla.redhat.com/show_bug.cgi?id=2335129 Switch to performing the LOCALIO probe asynchronously to address the potential for the NFS LOCALIO protocol being disallowed and/or slowed by the remote server's response. While at it, fix nfs_local_probe_async() to always take/put a reference on the nfs_client that is using the LOCALIO protocol. Also, unexport the nfs_local_probe() symbol and make it private to fs/nfs/localio.c This change has the side-effect of initially issuing reads, writes and commits over the wire via SUNRPC until the LOCALIO probe completes. Suggested-by: Jeff Layton <jlayton@kernel.org> # to always probe async Fixes: 76d4cb6345da ("nfs: probe for LOCALIO when v4 client reconnects to server") Cc: stable@vger.kernel.org # 6.14+ Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28pnfs/flexfiles: connect to NFSv3 DS using TLS if MDS connection uses TLSMike Snitzer1-1/+10
Implementation follows bones of the pattern that was established in commit a35518cae4b325 ("NFSv4.1/pnfs: fix NFS with TLS in pnfs"). Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFS: add localio to sysfsMike Snitzer1-0/+28
The Linux NFS client and server added support for LOCALIO in Linux v6.12. It is useful to know if a client and server negotiated LOCALIO be used, so expose it through the 'localio' attribute. Suggested-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: use writeback_iter directlyChristoph Hellwig1-12/+6
Stop using write_cache_pages and use writeback_iter directly. This removes an indirect call per written folio and makes the code easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: refactor nfs_do_writepageChristoph Hellwig1-10/+9
Use early returns wherever possible to simplify the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: don't return AOP_WRITEPAGE_ACTIVATE from nfs_do_writepageChristoph Hellwig1-4/+1
nfs_do_writepage is a successful return that requires the caller to unlock the folio. Using it here requires special casing both in nfs_do_writepage and nfs_writepages_callback and leaves a land mine in nfs_wb_folio in case it ever set the flag. Remove it and just unconditionally unlock in nfs_writepages_callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28nfs: fold nfs_page_async_flush into nfs_do_writepageChristoph Hellwig1-10/+4
Fold nfs_page_async_flush into its only caller to clean up the code a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFSv4: Always set NLINK even if the server doesn't support itHan Young1-0/+2
fattr4_numlinks is a recommended attribute, so the client should emulate it even if the server doesn't support it. In decode_attr_nlink function in nfs4xdr.c, nlink is initialized to 1. However, this default value isn't set to the inode due to the check in nfs_fhget. So if the server doesn't support numlinks, inode's nlink will be zero, the mount will fail with error "Stale file handle". Set the nlink to 1 if the server doesn't support it. Signed-off-by: Han Young <hanyang.tony@bytedance.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFSv4: Allow FREE_STATEID to clean up delegationsBenjamin Coddington3-15/+25
The NFS client's list of delegations can grow quite large (well beyond the delegation watermark) if the server is revoking or there are repeated events that expire state. Once this happens, the revoked delegations can cause a performance problem for subsequent walks of the servers->delegations list when the client tries to test and free state. If we can determine that the FREE_STATEID operation has completed without error, we can prune the delegation from the list. Since the NFS client combines TEST_STATEID with FREE_STATEID in its minor version operations, there isn't an easy way to communicate success of FREE_STATEID. Rather than re-arrange quite a number of calling paths to break out the separate procedures, let's signal the success of FREE_STATEID by setting the stateid's type. Set NFS4_FREED_STATEID_TYPE for stateids that have been successfully discarded from the server, and use that type to signal that the delegation can be cleaned up. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFSv4: Don't check for OPEN feature support in v4.1Scott Mayhew1-2/+3
fattr4_open_arguments is a v4.2 recommended attribute, so we shouldn't be sending it to v4.1 servers. Fixes: cb78f9b7d0c0 ("nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Cc: stable@vger.kernel.org # 6.11+ Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFSv4.2: fix listxattr to return selinux security labelOlga Kornievskaia1-2/+10
Currently, when NFS is queried for all the labels present on the file via a command example "getfattr -d -m . /mnt/testfile", it does not return the security label. Yet when asked specifically for the label (getfattr -n security.selinux) it will be returned. Include the security label when all attributes are queried. Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFSv4.2: fix setattr caching of TIME_[MODIFY|ACCESS]_SET when timestamps are ↵Sagi Grimberg2-8/+49
delegated nfs_setattr will flush all pending writes before updating a file time attributes. However when the client holds delegated timestamps, it can update its timestamps locally as it is the authority for the file times attributes. The client will later set the file attributes by adding a setattr to the delegreturn compound updating the server time attributes. Fix nfs_setattr to avoid flushing pending writes when the file time attributes are delegated and the mtime/atime are set to a fixed timestamp (ATTR_[MODIFY|ACCESS]_SET. Also, when sending the setattr procedure over the wire, we need to clear the correct attribute bits from the bitmask. I was able to measure a noticable speedup when measuring untar performance. Test: $ time tar xzf ~/dir.tgz Baseline: 1m13.072s Patched: 0m49.038s Which is more than 30% latency improvement. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28NFS: Add support for fallocate(FALLOC_FL_ZERO_RANGE)Anna Schumaker6-3/+103
This implements a suggestion from Trond that we can mimic FALLOC_FL_ZERO_RANGE by sending a compound that first does a DEALLOCATE to punch a hole in a file, and then an ALLOCATE to fill the hole with zeroes. There might technically be a race here, but once the DEALLOCATE finishes any reads from the region would return zeroes anyway, so I don't expect it to cause problems. Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28fs/nfs/read: fix double-unlock bug in nfs_return_empty_folio()Max Kellermann1-1/+2
Sometimes, when a file was read while it was being truncated by another NFS client, the kernel could deadlock because folio_unlock() was called twice, and the second call would XOR back the `PG_locked` flag. Most of the time (depending on the timing of the truncation), nobody notices the problem because folio_unlock() gets called three times, which flips `PG_locked` back off: 1. vfs_read, nfs_read_folio, ... nfs_read_add_folio, nfs_return_empty_folio 2. vfs_read, nfs_read_folio, ... netfs_read_collection, netfs_unlock_abandoned_read_pages 3. vfs_read, ... nfs_do_read_folio, nfs_read_add_folio, nfs_return_empty_folio The problem is that nfs_read_add_folio() is not supposed to unlock the folio if fscache is enabled, and a nfs_netfs_folio_unlock() check is missing in nfs_return_empty_folio(). Rarely this leads to a warning in netfs_read_collection(): ------------[ cut here ]------------ R=0000031c: folio 10 is not locked WARNING: CPU: 0 PID: 29 at fs/netfs/read_collect.c:133 netfs_read_collection+0x7c0/0xf00 [...] Workqueue: events_unbound netfs_read_collection_worker RIP: 0010:netfs_read_collection+0x7c0/0xf00 [...] Call Trace: <TASK> netfs_read_collection_worker+0x67/0x80 process_one_work+0x12e/0x2c0 worker_thread+0x295/0x3a0 Most of the time, however, processes just get stuck forever in folio_wait_bit_common(), waiting for `PG_locked` to disappear, which never happens because nobody is really holding the folio lock. Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when fscache is enabled") Cc: stable@vger.kernel.org Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Reviewed-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-05-28Merge tag 'jfs-6.16' of github.com:kleikamp/linux-shaggyLinus Torvalds3-5/+22
Pull jfs updates from David Kleikamp: "A few small fixes for jfs" * tag 'jfs-6.16' of github.com:kleikamp/linux-shaggy: jfs: fix array-index-out-of-bounds read in add_missing_indices jfs: Fix null-ptr-deref in jfs_ioc_trim jfs: validate AG parameters in dbMount() to prevent crashes
2025-05-28Merge tag 'dlm-6.16' of ↵Linus Torvalds3-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: "This fixes delays when shutting down SCTP connections, and updates dlm Kconfig for SCTP" * tag 'dlm-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: drop SCTP Kconfig dependency dlm: reject SCTP configuration if not enabled dlm: use SHUT_RDWR for SCTP shutdown dlm: mask sk_shutdown value
2025-05-28Merge tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds21-260/+703
Pull nfsd updates from Chuck Lever: "The marquee feature for this release is that the limit on the maximum rsize and wsize has been raised to 4MB. The default remains at 1MB, but risk-seeking administrators now have the ability to try larger I/O sizes with NFS clients that support them. Eventually the default setting will be increased when we have confidence that this change will not have negative impact. With v6.16, NFSD now has its own debugfs file system where we can add experimental features and make them available outside of our development community without impacting production deployments. The first experimental setting added is one that makes all NFS READ operations use vfs_iter_read() instead of the NFSD splice actor. The plan is to eventually retire the splice actor, as that will enable a number of new capabilities such as the use of struct bio_vec from the top to the bottom of the NFSD stack. Jeff Layton contributed a number of observability improvements. The use of dprintk() in a number of high-traffic code paths has been replaced with static trace points. This release sees the continuation of efforts to harden the NFSv4.2 COPY operation. Soon, the restriction on async COPY operations can be lifted. Many thanks to the contributors, reviewers, testers, and bug reporters who participated during the v6.16 development cycle" * tag 'nfsd-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (60 commits) xdrgen: Fix code generated for counted arrays SUNRPC: Bump the maximum payload size for the server NFSD: Add a "default" block size NFSD: Remove NFSSVC_MAXBLKSIZE_V2 macro NFSD: Remove NFSD_BUFSIZE sunrpc: Remove the RPCSVC_MAXPAGES macro svcrdma: Adjust the number of entries in svc_rdma_send_ctxt::sc_pages svcrdma: Adjust the number of entries in svc_rdma_recv_ctxt::rc_pages sunrpc: Adjust size of socket's receive page array dynamically SUNRPC: Remove svc_rqst :: rq_vec SUNRPC: Remove svc_fill_write_vector() NFSD: Use rqstp->rq_bvec in nfsd_iter_write() SUNRPC: Export xdr_buf_to_bvec() NFSD: De-duplicate the svc_fill_write_vector() call sites NFSD: Use rqstp->rq_bvec in nfsd_iter_read() sunrpc: Replace the rq_bvec array with dynamically-allocated memory sunrpc: Replace the rq_pages array with dynamically-allocated memory sunrpc: Remove backchannel check in svc_init_buffer() sunrpc: Add a helper to derive maxpages from sv_max_mesg svcrdma: Reduce the number of rdma_rw contexts per-QP ...
2025-05-28Merge tag 'ext4_for_linus-6.16-rc1' of ↵Linus Torvalds24-471/+1062
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "New ext4 features and performance improvements: - Fast commit performance improvements - Multi-fsblock atomic write support for bigalloc file systems - Large folio support for regular files This last can result in really stupendous performance for the right workloads. For example, see [1] where the Kernel Test Robot reported over 37% improvement on a large sequential I/O workload. There are also the usual bug fixes and cleanups. Of note are cleanups of the extent status tree to fix potential races that could result in the extent status tree getting corrupted under heavy simultaneous allocation and deallocation to a single file" Link: https://lore.kernel.org/all/202505161418.ec0d753f-lkp@intel.com/ [1] * tag 'ext4_for_linus-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (52 commits) ext4: Add a WARN_ON_ONCE for querying LAST_IN_LEAF instead ext4: Simplify flags in ext4_map_query_blocks() ext4: Rename and document EXT4_EX_FILTER to EXT4_EX_QUERY_FILTER ext4: Simplify last in leaf check in ext4_map_query_blocks ext4: Unwritten to written conversion requires EXT4_EX_NOCACHE ext4: only dirty folios when data journaling regular files ext4: Add atomic block write documentation ext4: Enable support for ext4 multi-fsblock atomic write using bigalloc ext4: Add multi-fsblock atomic write support with bigalloc ext4: Add support for EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS ext4: Make ext4_meta_trans_blocks() non-static for later use ext4: Check if inode uses extents in ext4_inode_can_atomic_write() ext4: Document an edge case for overwrites jbd2: remove journal_t argument from jbd2_superblock_csum() jbd2: remove journal_t argument from jbd2_chksum() ext4: remove sb argument from ext4_superblock_csum() ext4: remove sbi argument from ext4_chksum() ext4: enable large folio for regular file ext4: make online defragmentation support large folios ext4: make the writeback path support large folios ...
2025-05-28Merge tag 'ntfs3_for_6.16' of ↵Linus Torvalds8-257/+28
https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs updates from Konstantin Komarov: "Added: - missing direct_IO in ntfs_aops_cmpr - handling of hdr_first_de() return value Fixed: - handling of InitializeFileRecordSegment operation. Removed: - ability to change compression on mounted volume - redundant NULL check" * tag 'ntfs3_for_6.16' of https://github.com/Paragon-Software-Group/linux-ntfs3: fs/ntfs3: remove ability to change compression on mounted volume fs/ntfs3: Fix handling of InitializeFileRecordSegment fs/ntfs3: Add missing direct_IO in ntfs_aops_cmpr fs/ntfs3: handle hdr_first_de() return value fs/ntfs3: Drop redundant NULL check
2025-05-28Merge tag 'for-linus-6.16-ofs1' of ↵Linus Torvalds3-98/+102
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs update from Mike Marshall: "Convert to use the new mount API. Code from Eric Sandeen at redhat that converts orangefs over to the new mount API" * tag 'for-linus-6.16-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: Convert to use the new mount API
2025-05-28Merge tag 'exfat-for-6.16-rc1' of ↵Linus Torvalds2-23/+8
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat updates from Namjae Jeon: - Fix xfstests generic/482 test failure - Fix double free in delayed_free * tag 'exfat-for-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: exfat: do not clear volume dirty flag during sync exfat: fix double free in delayed_free
2025-05-28Merge tag 'for-6.16-tag' of ↵Linus Torvalds1-1/+0
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fixup to the xarray conversion sent in the main 6.16 batch. It was not included because it would cause rebase/refresh of like 80 patches, right before sending the early pull request last week. It's fixing a bug when zoned mode is enabled on btrfs so it's not affecting most people" * tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-05-28smb: client: Remove an unused function and variableDr. David Alan Gilbert3-69/+0
SMB2_QFS_info() has been unused since 2018's commit 730928c8f4be ("cifs: update smb2_queryfs() to use compounding") sign_CIFS_PDUs has been unused since 2009's commit 2edd6c5b0517 ("[CIFS] NTLMSSP support moving into new file, old dead code removed") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-28f2fs: fix to correct check conditions in f2fs_cross_renameZhiguo Niu1-1/+1
Should be "old_dir" here. Fixes: 5c57132eaf52 ("f2fs: support project quota") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-28f2fs: use d_inode(dentry) cleanup dentry->d_inodeZhiguo Niu2-6/+6
no logic changes. Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-28f2fs: fix to skip f2fs_balance_fs() if checkpoint is disabledChao Yu1-1/+1
Syzbot reports a f2fs bug as below: INFO: task syz-executor328:5856 blocked for more than 144 seconds. Not tainted 6.15.0-rc6-syzkaller-00208-g3c21441eeffc #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor328 state:D stack:24392 pid:5856 tgid:5832 ppid:5826 task_flags:0x400040 flags:0x00004006 Call Trace: <TASK> context_switch kernel/sched/core.c:5382 [inline] __schedule+0x168f/0x4c70 kernel/sched/core.c:6767 __schedule_loop kernel/sched/core.c:6845 [inline] schedule+0x165/0x360 kernel/sched/core.c:6860 io_schedule+0x81/0xe0 kernel/sched/core.c:7742 f2fs_balance_fs+0x4b4/0x780 fs/f2fs/segment.c:444 f2fs_map_blocks+0x3af1/0x43b0 fs/f2fs/data.c:1791 f2fs_expand_inode_data+0x653/0xaf0 fs/f2fs/file.c:1872 f2fs_fallocate+0x4f5/0x990 fs/f2fs/file.c:1975 vfs_fallocate+0x6a0/0x830 fs/open.c:338 ioctl_preallocate fs/ioctl.c:290 [inline] file_ioctl fs/ioctl.c:-1 [inline] do_vfs_ioctl+0x1b8f/0x1eb0 fs/ioctl.c:885 __do_sys_ioctl fs/ioctl.c:904 [inline] __se_sys_ioctl+0x82/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The root cause is after commit 84b5bb8bf0f6 ("f2fs: modify f2fs_is_checkpoint_ready logic to allow more data to be written with the CP disable"), we will get chance to allow f2fs_is_checkpoint_ready() to return true once below conditions are all true: 1. checkpoint is disabled 2. there are not enough free segments 3. there are enough free blocks Then it will cause f2fs_balance_fs() to trigger foreground GC. void f2fs_balance_fs(struct f2fs_sb_info *sbi, bool need) ... if (!f2fs_is_checkpoint_ready(sbi)) return; And the testcase mounts f2fs image w/ gc_merge,checkpoint=disable, so deadloop will happen through below race condition: - f2fs_do_shutdown - vfs_fallocate - gc_thread_func - file_start_write - __sb_start_write(SB_FREEZE_WRITE) - f2fs_fallocate - f2fs_expand_inode_data - f2fs_map_blocks - f2fs_balance_fs - prepare_to_wait - wake_up(gc_wait_queue_head) - io_schedule - bdev_freeze - freeze_super - sb->s_writers.frozen = SB_FREEZE_WRITE; - sb_wait_write(sb, SB_FREEZE_WRITE); - if (sbi->sb->s_writers.frozen >= SB_FREEZE_WRITE) continue; : cause deadloop This patch fix to add check condition in f2fs_balance_fs(), so that if checkpoint is disabled, we will just skip trigger foreground GC to avoid such deadloop issue. Meanwhile let's remove f2fs_is_checkpoint_ready() check condition in f2fs_balance_fs(), since it's redundant, due to the main logic in the function is to check: a) whether checkpoint is disabled b) there is enough free segments f2fs_balance_fs() still has all logics after f2fs_is_checkpoint_ready()'s removal. Reported-by: syzbot+aa5bb5f6860e08a60450@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/682d743a.a00a0220.29bc26.0289.GAE@google.com Fixes: 84b5bb8bf0f6 ("f2fs: modify f2fs_is_checkpoint_ready logic to allow more data to be written with the CP disable") Cc: Qi Han <hanqi@vivo.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-28f2fs: clean up to check bi_status w/ BLK_STS_OKChao Yu2-4/+4
Check bi_status w/ BLK_STS_OK instead of 0 for cleanup. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-28f2fs: introduce is_{meta,node}_folioChao Yu5-15/+24
Just cleanup, no changes. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-28f2fs: add ckpt_valid_blocks to the section entryyohan.joung2-18/+85
when performing buffered writes in a large section, overhead is incurred due to the iteration through ckpt_valid_blocks within the section. when SEGS_PER_SEC is 128, this overhead accounts for 20% within the f2fs_write_single_data_page routine. as the size of the section increases, the overhead also grows. to handle this problem ckpt_valid_blocks is added within the section entries. Test insmod null_blk.ko nr_devices=1 completion_nsec=1 submit_queues=8 hw_queue_depth=64 max_sectors=512 bs=4096 memory_backed=1 make_f2fs /dev/block/nullb0 make_f2fs -s 128 /dev/block/nullb0 fio --bs=512k --size=1536M --rw=write --name=1 --filename=/mnt/test_dir/seq_write --ioengine=io_uring --iodepth=64 --end_fsync=1 before SEGS_PER_SEC 1 2556MiB/s SEGS_PER_SEC 128 2145MiB/s after SEGS_PER_SEC 1 2556MiB/s SEGS_PER_SEC 128 2556MiB/s Signed-off-by: yohan.joung <yohan.joung@sk.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-28f2fs: add a method for calculating the remaining blocks in the current ↵yohan.joung1-4/+19
segment in LFS mode. In LFS mode, the previous segment cannot use invalid blocks, so the remaining blocks from the next_blkoff of the current segment to the end of the section are calculated. Signed-off-by: yohan.joung <yohan.joung@sk.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-28iomap: don't lose folio dropbehind state for overwritesJens Axboe2-2/+22
DONTCACHE I/O must have the completion punted to a workqueue, just like what is done for unwritten extents, as the completion needs task context to perform the invalidation of the folio(s). However, if writeback is started off filemap_fdatawrite_range() off generic_sync() and it's an overwrite, then the DONTCACHE marking gets lost as iomap_add_to_ioend() don't look at the folio being added and no further state is passed down to help it know that this is a dropbehind/DONTCACHE write. Check if the folio being added is marked as dropbehind, and set IOMAP_IOEND_DONTCACHE if that is the case. Then XFS can factor this into the decision making of completion context in xfs_submit_ioend(). Additionally include this ioend flag in the NOMERGE flags, to avoid mixing it with unrelated IO. Since this is the 3rd flag that will cause XFS to punt the completion to a workqueue, add a helper so that each one of them can get appropriately commented. This fixes extra page cache being instantiated when the write performed is an overwrite, rather than newly instantiated blocks. Fixes: b2cd5ae693a3 ("iomap: make buffered writes work with RWF_DONTCACHE") Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/5153f6e8-274d-4546-bf55-30a5018e0d03@kernel.dk Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-27squashfs: add optional full compressed block cachingChanho Min2-0/+49
The commit 93e72b3c612adcaca1 ("squashfs: migrate from ll_rw_block usage to BIO") removed caching of compressed blocks in SquashFS, causing fio performance regression in workloads with repeated file reads. Without caching, every read triggers disk I/O, severely impacting performance in tools like fio. This patch introduces a new CONFIG_SQUASHFS_COMP_CACHE_FULL Kconfig option to enable caching of all compressed blocks, restoring performance to pre-BIO migration levels. When enabled, all pages in a BIO are cached in the page cache, reducing disk I/O for repeated reads. The fio test results with this patch confirm the performance restoration: For example, fio tests (iodepth=1, numjobs=1, ioengine=psync) show a notable performance restoration: Disable CONFIG_SQUASHFS_COMP_CACHE_FULL: IOPS=815, BW=102MiB/s (107MB/s)(6113MiB/60001msec) Enable CONFIG_SQUASHFS_COMP_CACHE_FULL: IOPS=2223, BW=278MiB/s (291MB/s)(16.3GiB/59999msec) The tradeoff is increased memory usage due to caching all compressed blocks. The CONFIG_SQUASHFS_COMP_CACHE_FULL option allows users to enable this feature selectively, balancing performance and memory usage for workloads with frequent repeated reads. Link: https://lkml.kernel.org/r/20250521072559.2389-1-chanho.min@lge.com Signed-off-by: Chanho Min <chanho.min@lge.com> Reviewed-by Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-27crash_dump, nvme: select CONFIGFS_FS as built-inArnd Bergmann1-1/+0
Configfs can be configured as a loadable module, which causes a link-time failure for dm-crypt crash dump support: crash_dump_dm_crypt.c:(.text+0x3a4): undefined reference to `config_item_init_type_name' aarch64-linux-ld: kernel/crash_dump_dm_crypt.o: in function `configfs_dmcrypt_keys_init': crash_dump_dm_crypt.c:(.init.text+0x90): undefined reference to `config_group_init' aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xb4): undefined reference to `configfs_register_subsystem' aarch64-linux-ld: crash_dump_dm_crypt.c:(.init.text+0xd8): undefined reference to `configfs_unregister_subsystem' This could be avoided with a dependency on CONFIGFS_FS=y, but the dependency has an additional problem of causing Kconfig dependency loops since most other uses select the symbol. Using a simple 'select CONFIGFS_FS' here in turn fails with CONFIG_DM_CRYPT=m, because that still only causes configfs to be a loadable module. The only version I found that fixes this reliably uses an additional Kconfig symbol to ensure the 'select' actually turns on configfs as builtin, with two additional changes to avoid dependency loops with nvme and sysfs. There is no compile-time dependency between configfs and sysfs, so selecting configfs from a driver with sysfs disabled does not cause link failures, only the default /sys/kernel/config mount point will not be created. Link: https://lkml.kernel.org/r/20250521160359.2132363-1-arnd@kernel.org Fixes: 6b23858fd63b ("crash_dump: make dm crypt keys persist for the kdump kernel") Fixes: 1fb470408497 ("nvme-loop: add configfs dependency") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Breno Leitao <leitao@debian.org> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Coiby Xu <coxu@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-27f2fs: introduce FAULT_VMALLOCChao Yu3-5/+11
Introduce a new fault type FAULT_VMALLOC to simulate no memory error in f2fs_vmalloc(). Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-27f2fs: use vmalloc instead of kvmalloc in .init_{,de}compress_ctxChao Yu2-13/+15
.init_{,de}compress_ctx uses kvmalloc() to alloc memory, it will try to allocate physically continuous page first, it may cause more memory allocation pressure, let's use vmalloc instead to mitigate it. [Test] cd /data/local/tmp touch file f2fs_io setflags compression file f2fs_io getflags file for i in $(seq 1 10); do sync; echo 3 > /proc/sys/vm/drop_caches;\ time f2fs_io write 512 0 4096 zero osync file; truncate -s 0 file;\ done [Result] Before After Delta 21.243 21.694 -2.12% For compression, we recommend to use ioctl to compress file data in background for workaround. For decompression, only zstd will be affected. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-27f2fs: add f2fs_bug_on() in f2fs_quota_read()Chao Yu1-0/+6
mapping_read_folio_gfp() will return a folio, it should always be uptodate, let's check folio uptodate status to detect any potenial bug. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-27f2fs: add f2fs_bug_on() to detect potential bugChao Yu2-3/+8
Add f2fs_bug_on() to check whether memory preallocation will fail or not after radix_tree_preload(GFP_NOFS | __GFP_NOFAIL). Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-27f2fs: remove unused sbi argument from checksum functionsEric Biggers5-35/+24
Since __f2fs_crc32() now calls crc32() directly, it no longer uses its sbi argument. Remove that, and simplify its callers accordingly. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-27Merge tag 'x86_cache_for_v6.16_rc1' of ↵Linus Torvalds10-0/+7554
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: "Carve out the resctrl filesystem-related code into fs/resctrl/ so that multiple architectures can share the fs API for manipulating their respective hw resource control implementation. This is the second step in the work towards sharing the resctrl filesystem interface, the next one being plugging ARM's MPAM into the aforementioned fs API" * tag 'x86_cache_for_v6.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) MAINTAINERS: Add reviewers for fs/resctrl x86,fs/resctrl: Move the resctrl filesystem code to live in /fs/resctrl x86/resctrl: Always initialise rid field in rdt_resources_all[] x86/resctrl: Relax some asm #includes x86/resctrl: Prefer alloc(sizeof(*foo)) idiom in rdt_init_fs_context() x86/resctrl: Squelch whitespace anomalies in resctrl core code x86/resctrl: Move pseudo lock prototypes to include/linux/resctrl.h x86/resctrl: Fix types in resctrl_arch_mon_ctx_{alloc,free}() stubs x86/resctrl: Move enum resctrl_event_id to resctrl.h x86/resctrl: Move the filesystem bits to headers visible to fs/resctrl fs/resctrl: Add boiler plate for external resctrl code x86/resctrl: Add 'resctrl' to the title of the resctrl documentation x86/resctrl: Split trace.h x86/resctrl: Expand the width of domid by replacing mon_data_bits x86/resctrl: Add end-marker to the resctrl_event_id enum x86/resctrl: Move is_mba_sc() out of core.c x86/resctrl: Drop __init/__exit on assorted symbols x86/resctrl: Resctrl_exit() teardown resctrl but leave the mount point x86/resctrl: Check all domains are offline in resctrl_exit() x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_" ...
2025-05-27Merge tag 'timers-cleanups-2025-05-25' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "Another set of timer API cleanups: - Convert init_timer*(), try_to_del_timer_sync() and destroy_timer_on_stack() over to the canonical timer_*() namespace convention. There is another large conversion pending, which has not been included because it would have caused a gazillion of merge conflicts in next. The conversion scripts will be run towards the end of the merge window and a pull request sent once all conflict dependencies have been merged" * tag 'timers-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack() treewide, timers: Rename try_to_del_timer_sync() as timer_delete_sync_try() timers: Rename init_timers() as timers_init() timers: Rename NEXT_TIMER_MAX_DELTA as TIMER_NEXT_MAX_DELTA timers: Rename __init_timer_on_stack() as __timer_init_on_stack() timers: Rename __init_timer() as __timer_init() timers: Rename init_timer_on_stack_key() as timer_init_key_on_stack() timers: Rename init_timer_key() as timer_init_key()
2025-05-27ksmbd: allow a filename to contain special characters on SMB3.1.1 posix ↵Namjae Jeon1-26/+27
extension If client send SMB2_CREATE_POSIX_CONTEXT to ksmbd, Allow a filename to contain special characters. Reported-by: Philipp Kerling <pkerling@casix.org> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-27ksmbd: provide zero as a unique ID to the Mac clientNamjae Jeon3-2/+21
The Mac SMB client code seems to expect the on-disk file identifier to have the semantics of HFS+ Catalog Node Identifier (CNID). ksmbd provides the inode number as a unique ID to the client, but in the case of subvolumes of btrfs, there are cases where different files have the same inode number, so the mac smb client treats it as an error. There is a report that a similar problem occurs when the share is ZFS. Returning UniqueId of zero will make the Mac client to stop using and trusting the file id returned from the server. Reported-by: Justin Turner Arthur <justinarthur@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-27btrfs: don't drop a reference if btrfs_check_write_meta_pointer() failsJosef Bacik1-1/+0
In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for eb is dropped and code continues but the references get dropped by releasing the batch. Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/linux-btrfs/202505191521.435b97ac-lkp@intel.com/ Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-27bcachefs: Don't rewind to run a recovery pass we already ranKent Overstreet1-1/+3
Fix a small regression from the "run recovery passes" rewrite, which enabled async recovery passes. This fixes getting stuck in a loop in recovery. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-27bcachefs: Move unicode message to after the startup messageKent Overstreet1-4/+6
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-27bcachefs: Fix missing commit in check_direntsKent Overstreet1-1/+2
Other repair code seems to be doing commits themselves, but check_key_has_snapshot() does not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-27bcachefs: Fix lost rebalance wakeupsKent Overstreet4-8/+13
Fix a missing wakeup in 'bcachefs set-file-option' -> xattr option update -> inode_write this was missing because the wakeup needs to happen after transaction commit. Also, add a 'kick' counter, to make sure we don't miss a wakeup that occured right after we finished checking the rebalance_work btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-27bcachefs: bch2_kthread_io_clock_wait_once()Kent Overstreet2-29/+19
Add a version of bch2_kthread_io_clock_wait() that only schedules once - behaving more like schedule_timeout(). This will be used for fixing rebalance wakeups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-27bcachefs: Ensure we print output of run_recovery_pass if it errorsKent Overstreet1-6/+15
Also, don't error out in bucket_ref_update_err(): we don't want to return -BCH_ERR_cannot_rewind_recovery if it's not an insert, if it's an overwrite we continue. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>