aboutsummaryrefslogtreecommitdiffstats
path: root/fs/super.c
AgeCommit message (Collapse)AuthorFilesLines
2026-06-15Merge tag 'pull-dcache' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: - d_alloc_parallel() API change (Neil's with my changes) - NORCU fixes - Reorganization and simplification of dentry eviction logic - Simplifying rcu_read_lock() scopes in fs/dcache.c - Secondary roots work - getting rid of NFS fake root dentries and dealing with remaining shrink_dcache_for_umount() and shrink_dentry_list() races - making cursors NORCU (surprisingly easy) * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits) make cursors NORCU nfs: get rid of fake root dentries wind ->s_roots via ->d_sib instead of ->d_hash shrink_dentry_tree(): unify the calls of shrink_dentry_list() shrinking rcu_read_lock() scope in d_alloc_parallel() d_walk(): shrink rcu_read_lock() scope document dentry_kill() adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill() Document rcu_read_lock() use in select_collect2() Shift rcu_read_{,un}lock() inside fast_dput() simplify safety for lock_for_kill() slowpath fold lock_for_kill() and __dentry_kill() into common helper fold lock_for_kill() into shrink_kill() shrink_dentry_list(): start with removing from shrink list d_prune_aliases(): make sure to skip NORCU aliases kill d_dispose_if_unused() make to_shrink_list() return whether it has moved dentry to list select_collect(): ignore dentries on shrink lists if they have positive refcounts find_acceptable_alias(): skip NORCU aliases with zero refcount fix a race between d_find_any_alias() and final dput() of NORCU dentries ...
2026-06-15Merge tag 'vfs-7.2-rc1.misc' of ↵Linus Torvalds1-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Reduce pipe->mutex contention by pre-allocating pages outside the lock in anon_pipe_write(). anon_pipe_write() called alloc_page() once per page while holding pipe->mutex. The allocation can sleep doing direct reclaim and runs memcg charging, which extends the critical section and stalls any concurrent reader on the same mutex. Now up to 8 pages are pre-allocated before the mutex is taken, leftovers are recycled into the per-pipe tmp_page[] cache before unlock, and any remainder is released after unlock, keeping the allocator out of the critical section on both sides. On a writers x readers sweep with 64KB writes against a 1 MB pipe throughput improves 6-28% and average write latency drops 5-22%; under memory pressure - when the cost of holding the mutex across reclaim is highest - throughput improves 21-48% and latency drops 17-33%. The microbenchmark is added to selftests. - uaccess/sockptr: fix the ignored_trailing logic in copy_struct_to_user() to behave as documented and the usize check in copy_struct_from_sockptr() for user pointers, and add copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr() helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC). - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real inode backing a dentry via d_real_inode(). On overlayfs the inode attached to the dentry doesn't carry the underlying device information; this is used by the filesystem restriction BPF program that was merged into systemd. - docs: add guidelines for submitting new filesystems, motivated by the maintenance burden abandoned and untestable filesystems impose on VFS developers, blocking infrastructure work like folio conversions and iomap migration. Fixes: - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() and drop the now-redundant assignments in callers. This began as a one-line dma-buf fix for a path_noexec() warning; a pseudo filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo() callers were audited: the only visible effect is on dma-buf where SB_I_NOEXEC silences the warning. - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs, qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a device with a sector size > PAGE_SIZE crashed roughly half of them; the rest had the same missing error handling pattern. Plus a follow-up releasing the superblock buffer_head when setting the minix v3 block size fails. - mount: honour SB_NOUSER in the new mount API. - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by switching the process-group paths of send_sigio() and send_sigurg() from read_lock(&tasklist_lock) to RCU, matching the single-PID path. - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing delegated NFS mounts (fsopen() in a container with the mount performed by a privileged daemon) that broke when non-init s_user_ns was tied to FS_USERNS_MOUNT. - selftests/namespaces: fix a hang in nsid_test where an unreaped grandchild kept the TAP pipe write-end open, a waitpid(-1) race in listns_efault_test, and a false FAIL on kernels without listns() where the tests should SKIP. - filelock: fix the break_lease() stub signature for CONFIG_FILE_LOCKING=n. - init/initramfs_test: wait for the async initramfs unpacking before running; the test and do_populate_rootfs() share the parser state. - fs/coredump: reduce redundant log noise in validate_coredump_safety(). - iomap: pass the correct length to fserror_report_io() in __iomap_write_begin(). - backing-file: fix the backing_file_open() kerneldoc. Cleanups: - initramfs: refactor the cpio hex header parsing to use hex2bin() instead of the hand-rolled simple_strntoul() which is reverted, and extend the initramfs KUnit tests to cover header fields with 0x prefixes. - Replace __get_free_pages() and friends with kmalloc()/kzalloc() across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2, isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the do_mounts init code - part of the larger work of replacing page allocator calls with kmalloc(). - Use clear_and_wake_up_bit() in unlock_buffer() and journal_end_buffer_io_sync() instead of open-coding the sequence. - Drop unused VFS exports: unexport drop_super_exclusive(), remove start_removing_user_path_at(), and fold __start_removing_path() into start_removing_path(). - fs/read_write: narrow the __kernel_write() export with EXPORT_SYMBOL_FOR_MODULES(). - vfs: uapi: retire octal and hex constants in favor of (1 << n) for the O_ flags. Finding a free bit for a new flag across the architectures was needlessly hard with the mixed bases. - dcache: add extra sanity checks of dead dentries in dentry_free() via a new DENTRY_WARN_ONCE() that also prints d_flags. - iov_iter: use kmemdup_array() in dup_iter() to harden the allocation against multiplication overflow. - fs/pipe: write to ->poll_usage only once. - vfs: remove an always-taken if-branch in find_next_fd(). - dcache: use kmalloc_flex() for struct external_name in __d_alloc(). - namei: use QSTR() instead of QSTR_INIT() in path_pts(). - sync_file_range: delete dead S_ISLNK code. - Comment fixes: retire a stale comment in fget_task_next() and fix assorted spelling mistakes" * tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits) backing-file: fix backing_file_open() kerneldoc parameter iomap: pass the correct len to fserror_report_io in __iomap_write_begin vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags bpf: add bpf_real_inode() kfunc fs/read_write: Do not export __kernel_write() to the entire world libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() mount: honour SB_NOUSER in the new mount API fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling selftests/pipe: add pipe_bench microbenchmark fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write fs: retire stale comment in fget_task_next() fs: fix spelling mistakes in comment bfs: replace get_zeroed_page() with kzalloc() binfmt_misc: replace __get_free_page() with kmalloc() configfs: replace __get_free_pages() with kzalloc() fs/namespace: use __getname() to allocate mntpath buffer fs/select: replace __get_free_page() with kmalloc() ...
2026-06-09vfs: add FS_USERNS_DELEGATABLE flag and set it for NFSJeff Layton1-5/+6
Commit e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT") prevents the mount of any filesystem inside a container that doesn't have FS_USERNS_MOUNT set. This broke NFS mounts in our containerized environment. We have a daemon somewhat like systemd-mountfsd running in the init_ns. A process does a fsopen() inside the container and passes it to the daemon via unix socket. The daemon then vets that the request is for an allowed NFS server and performs the mount. This now fails because the fc->user_ns is set to the value in the container and NFS doesn't set FS_USERNS_MOUNT. We don't want to add FS_USERNS_MOUNT to NFS since that would allow the container to mount any NFS server (even malicious ones). Add a new FS_USERNS_DELEGATABLE flag, and enable it on NFS. Fixes: e1c5ae59c0f2 ("fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT") Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260129-twmount-v1-1-4874ed2a15c4@kernel.org Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@futurfusion.io> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05wind ->s_roots via ->d_sib instead of ->d_hashAl Viro1-0/+1
shrink_dcache_for_umount() is supposed to handle the possibility of some of the dentries to be evicted being in other threads shrink lists; it either kills them, leaving an empty husk to be freed by the owner of shrink list whenever it gets around to that, or it waits for the eviction in progress to get completed. That relies upon dentry remaining attached to the tree until the eviction reaches dentry_unlist() and its ->d_sib gets removed from the list. Unfortunately, the secondary roots are linked via ->d_hash, rather than ->d_sib and they become removed from that list before their inode references are dropped. If shrink_dentry_list() from another thread ends up evicting one of the secondary roots and gets to that point in dentry_kill() when shrink_dcache_for_umount() is looking for secondary roots, the latter will *not* notice anything, possibly leading to warnings about busy inodes at umount time and all kinds of breakage after that. Moreover, shrink_dcache_for_umount() walks the list of secondary roots with no protection whatsoever, so it might end up calling dget() on a dentry that already passed through lockref_mark_dead(&dentry->d_lockref); ending up with corrupted refcount and possible UAF. AFAICS, the most straightforward way to deal with that would be to have secondary roots linked via ->d_sib rather than ->d_hash; then they would remain on the list until killed, and we could use d_add_waiter() machinery to wait for eviction in progress. Changes: * secondary roots look the same as ->s_root from d_unhashed() and d_unlinked() POV now. * secondary roots are represented as "no parent, but on ->d_sib" instead of "no parent, but on ->d_hash". * since ->d_sib is a plain hlist, we protect it with per-superblock spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for non-root dentries it would be protected by ->d_lock of parent). * __d_obtain_alias() uses ->d_sib for linkage when allocating a secondary root. * d_splice_alias_ops() detects splicing of a secondary root and removes it from the list before calling __d_move(). * dentry_unlist() detects eviction of a secondary root and removes it from the list; no need to play the games for d_walk() sake, since the latter is not going to look for the next sibling of those anyway. * ___d_drop() doesn't care about ->s_roots anymore. * shrink_dcache_for_umount() uses proper locking for access to the list of secondary roots and if it runs into one that is in the middle of eviction waits for that to finish. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-06-03fs: retire sget()Christian Brauner1-66/+5
sget() and sget_fc() have lived side by side as near-duplicate find-or-create-and-publish helpers for the legacy and fs_context mount APIs. The three remaining in-tree callers (CIFS plus the ext4 extents and mballoc KUnit tests) have all been moved to sget_fc(). Nothing calls sget() anymore. Delete sget() from fs/super.c and the prototype in <linux/fs.h>. Update the two comments that referred to "sget()" or "sget{_fc}()" to just say "sget_fc()". This removes ~60 lines of code that only existed to be kept in lockstep with sget_fc() on every superblock publish-path change. Link: https://patch.msgid.link/20260529-work-sget-v2-4-57bbe08604e4@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-21fs: unexport drop_super_exclusiveChristoph Hellwig1-1/+0
drop_super_exclusive is only used by the built-in quota code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260511072239.2456725-2-hch@lst.de Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds1-1/+1
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-3/+3
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-12Merge tag 'fsnotify_for_v6.20-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "A set of fixes to shutdown fsnotify subsystem before invalidating dcache thus addressing some nasty possible races" * tag 'fsnotify_for_v6.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: Shutdown fsnotify before destroying sb's dcache fsnotify: Use connector list for destroying inode marks fsnotify: Track inode connectors for a superblock
2026-01-23fsnotify: Shutdown fsnotify before destroying sb's dcacheJan Kara1-2/+2
Currently fsnotify_sb_delete() was called after we have evicted superblock's dcache and inode cache. This was done mainly so that we iterate as few inodes as possible when removing inode marks. However, as Jakub reported, this is problematic because for some filesystems encoding of file handles uses sb->s_root which gets cleared as part of dcache eviction. And either delayed fsnotify events or reading fdinfo for fsnotify group with marks on fs being unmounted may trigger encoding of file handles during unmount. So move shutdown of fsnotify subsystem before shrinking of dcache. Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxgXvwumYvJm3cLDFfx-TsU3g5-yVsTiG=6i8KS48dn0mQ@mail.gmail.com/ Reported-by: Jakub Acs <acsjakub@amazon.de> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz>
2026-01-13fs: report filesystem and file I/O errors to fsnotifyDarrick J. Wong1-0/+3
Create some wrapper code around struct super_block so that filesystems have a standard way to queue filesystem metadata and file I/O error reports to have them sent to fsnotify. If a filesystem wants to provide an error number, it must supply only negative error numbers. These are stored internally as negative numbers, but they are converted to positive error numbers before being passed to fanotify, per the fanotify(7) manpage. Implementations of super_operations::report_error are passed the raw internal event data. Note that we have to play some shenanigans with mempools and queue_work so that the error handling doesn't happen outside of process context, and the event handler functions (both ->report_error and fsnotify) can handle file I/O error messages without having to worry about whatever locks might be held. This asynchronicity requires that unmount wait for pending events to clear. Add a new callback to the superblock operations structure so that filesystem drivers can themselves respond to file I/O errors if they so desire. This will be used for an upcoming self-healing patchset for XFS. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://patch.msgid.link/176826402610.3490369.4378391061533403171.stgit@frogsfrogsfrogs Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-05Merge tag 'vfs-6.19-rc1.fixes' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a type conversion bug in the ipc subsystem - Fix per-dentry timeout warning in autofs - Drop the fd conversion from sockets - Move assert from iput_not_last() to iput() - Fix reversed check in filesystems_freeze_callback() - Use proper uapi types for new struct delegation definitions * tag 'vfs-6.19-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: use UAPI types for new struct delegation definition mqueue: correct the type of ro to int Revert "net/socket: convert sock_map_fd() to FD_ADD()" autofs: fix per-dentry timeout warning fs: assert on I_FREEING not being set in iput() and iput_not_last() fs: PM: Fix reverse check in filesystems_freeze_callback()
2025-12-05Merge tag 'pull-persistency' of ↵Linus Torvalds1-8/+0
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull persistent dentry infrastructure and conversion from Al Viro: "Some filesystems use a kinda-sorta controlled dentry refcount leak to pin dentries of created objects in dcache (and undo it when removing those). A reference is grabbed and not released, but it's not actually _stored_ anywhere. That works, but it's hard to follow and verify; among other things, we have no way to tell _which_ of the increments is intended to be an unpaired one. Worse, on removal we need to decide whether the reference had already been dropped, which can be non-trivial if that removal is on umount and we need to figure out if this dentry is pinned due to e.g. unlink() not done. Usually that is handled by using kill_litter_super() as ->kill_sb(), but there are open-coded special cases of the same (consider e.g. /proc/self). Things get simpler if we introduce a new dentry flag (DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set claims responsibility for +1 in refcount. The end result this series is aiming for: - get these unbalanced dget() and dput() replaced with new primitives that would, in addition to adjusting refcount, set and clear persistency flag. - instead of having kill_litter_super() mess with removing the remaining "leaked" references (e.g. for all tmpfs files that hadn't been removed prior to umount), have the regular shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries, dropping the corresponding reference if it had been set. After that kill_litter_super() becomes an equivalent of kill_anon_super(). Doing that in a single step is not feasible - it would affect too many places in too many filesystems. It has to be split into a series. This work has really started early in 2024; quite a few preliminary pieces have already gone into mainline. This chunk is finally getting to the meat of that stuff - infrastructure and most of the conversions to it. Some pieces are still sitting in the local branches, but the bulk of that stuff is here" * tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) d_make_discardable(): warn if given a non-persistent dentry kill securityfs_recursive_remove() convert securityfs get rid of kill_litter_super() convert rust_binderfs convert nfsctl convert rpc_pipefs convert hypfs hypfs: swich hypfs_create_u64() to returning int hypfs: switch hypfs_create_str() to returning int hypfs: don't pin dentries twice convert gadgetfs gadgetfs: switch to simple_remove_by_name() convert functionfs functionfs: switch to simple_remove_by_name() functionfs: fix the open/removal races functionfs: need to cancel ->reset_work in ->kill_sb() functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}() functionfs: don't abuse ffs_data_closed() on fs shutdown convert selinuxfs ...
2025-12-03fs: PM: Fix reverse check in filesystems_freeze_callback()Rafael J. Wysocki1-1/+1
The freeze_all_ptr check in filesystems_freeze_callback() introduced by commit a3f8f8662771 ("power: always freeze efivarfs") is reverse which quite confusingly causes all file systems to be frozen when filesystem_freeze_enabled is false. On my systems it causes the WARN_ON_ONCE() in __set_task_frozen() to trigger, most likely due to an attempt to freeze a file system that is not ready for that. Add a logical negation to the check in question to reverse it as appropriate. Fixes: a3f8f8662771 ("power: always freeze efivarfs") Cc: 6.18+ <stable@vger.kernel.org> # 6.18+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/12788397.O9o76ZdvQC@rafael.j.wysocki Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-12-01Merge tag 'vfs-6.19-rc1.writeback' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull writeback updates from Christian Brauner: "Features: - Allow file systems to increase the minimum writeback chunk size. The relatively low minimal writeback size of 4MiB means that written back inodes on rotational media are switched a lot. Besides introducing additional seeks, this also can lead to extreme file fragmentation on zoned devices when a lot of files are cached relative to the available writeback bandwidth. This adds a superblock field that allows the file system to override the default size, and sets it to the zone size for zoned XFS. - Add logging for slow writeback when it exceeds sysctl_hung_task_timeout_secs. This helps identify tasks waiting for a long time and pinpoint potential issues. Recording the starting jiffies is also useful when debugging a crashed vmcore. - Wake up waiting tasks when finishing the writeback of a chunk Cleanups: - filemap_* writeback interface cleanups. Adding filemap_fdatawrite_wbc ended up being a mistake, as all but the original btrfs caller should be using better high level interfaces instead. This series removes all these low-level interfaces, switches btrfs to a more specific interface, and cleans up other too low-level interfaces. With this the writeback_control that is passed to the writeback code is only initialized in three places. - Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and filemap_fdatawrite_wbc - Add filemap_flush_nr helper for btrfs - Push struct writeback_control into start_delalloc_inodes in btrfs - Rename filemap_fdatawrite_range_kick to filemap_flush_range - Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm - Make wbc_to_tag() inline and use it in fs" * tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Make wbc_to_tag() inline and use it in fs. xfs: set s_min_writeback_pages for zoned file systems writeback: allow the file system to override MIN_WRITEBACK_PAGES writeback: cleanup writeback_chunk_size mm: rename filemap_fdatawrite_range_kick to filemap_flush_range mm: remove __filemap_fdatawrite_range mm: remove filemap_fdatawrite_wbc mm: remove __filemap_fdatawrite mm,btrfs: add a filemap_flush_nr helper btrfs: push struct writeback_control into start_delalloc_inodes btrfs: use the local tmp_inode variable in start_delalloc_inodes ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers 9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs) writeback: Wake up waiting tasks when finishing the writeback of a chunk.
2025-11-17get rid of kill_litter_super()Al Viro1-8/+0
Not used anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-11-12power: always freeze efivarfsChristian Brauner1-3/+10
The efivarfs filesystems must always be frozen and thawed to resync variable state. Make it so. Link: https://patch.msgid.link/20251105-vorbild-zutreffen-fe00d1dd98db@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-29writeback: allow the file system to override MIN_WRITEBACK_PAGESChristoph Hellwig1-0/+1
The relatively low minimal writeback size of 4MiB means that written back inodes on rotational media are switched a lot. Besides introducing additional seeks, this also can lead to extreme file fragmentation on zoned devices when a lot of files are cached relative to the available writeback bandwidth. Add a superblock field that allows the file system to override the default size. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251017034611.651385-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-03Merge tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-2/+1
Pull vfs mount updates from Al Viro: "Several piles this cycle, this mount-related one being the largest and trickiest: - saner handling of guards in fs/namespace.c, getting rid of needlessly strong locking in some of the users - lock_mount() calling conventions change - have it set the environment for attaching to given location, storing the results in caller-supplied object, without altering the passed struct path. Make unlock_mount() called as __cleanup for those objects. It's not exactly guard(), but similar to it - MNT_WRITE_HOLD done right. mnt_hold_writers() does *not* mess with ->mnt_flags anymore, so insertion of a new mount into ->s_mounts of underlying superblock does not, in itself, expose ->mnt_flags of that mount to concurrent modifications - getting rid of pathological cases when umount() spends quadratic time removing the victims from propagation graph - part of that had been dealt with last cycle, this should finish it - a bunch of stuff constified - assorted cleanups * tag 'pull-mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) constify {__,}mnt_is_readonly() WRITE_HOLD machinery: no need for to bump mount_lock seqcount struct mount: relocate MNT_WRITE_HOLD bit preparations to taking MNT_WRITE_HOLD out of ->mnt_flags setup_mnt(): primitive for connecting a mount to filesystem simplify the callers of mnt_unhold_writers() copy_mnt_ns(): use guards copy_mnt_ns(): use the regular mechanism for freeing empty mnt_ns on failure open_detached_copy(): separate creation of namespace into helper open_detached_copy(): don't bother with mount_lock_hash() path_has_submounts(): use guard(mount_locked_reader) fs/namespace.c: sanitize descriptions for {__,}lookup_mnt() ecryptfs: get rid of pointless mount references in ecryptfs dentries umount_tree(): take all victims out of propagation graph at once do_mount(): use __free(path_put) do_move_mount_old(): use __free(path_put) constify can_move_mount_beneath() arguments path_umount(): constify struct path argument may_copy_tree(), __do_loopback(): constify struct path argument path_mount(): constify struct path argument ...
2025-09-29Merge tag 'vfs-6.18-rc1.workqueue' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29Merge tag 'vfs-6.18-rc1.mount' of ↵Linus Torvalds1-63/+0
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: "This contains some work around mount api handling: - Output the warning message for mnt_too_revealing() triggered during fsmount() to the fscontext log. This makes it possible for the mount tool to output appropriate warnings on the command line. For example, with the newest fsopen()-based mount(8) from util-linux, the error messages now look like: # mount -t proc proc /tmp mount: /tmp: fsmount() failed: VFS: Mount too revealing. dmesg(1) may have more information after failed mount system call. - Do not consume fscontext log entries when returning -EMSGSIZE Userspace generally expects APIs that return -EMSGSIZE to allow for them to adjust their buffer size and retry the operation. However, the fscontext log would previously clear the message even in the -EMSGSIZE case. Given that it is very cheap for us to check whether the buffer is too small before we remove the message from the ring buffer, let's just do that instead. - Drop an unused argument from do_remount()" * tag 'vfs-6.18-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: fs/namespace.c: remove ms_flags argument from do_remount selftests/filesystems: add basic fscontext log tests fscontext: do not consume log entries when returning -EMSGSIZE vfs: output mount_too_revealing() errors to fscontext docs/vfs: Remove mentions to the old mount API helpers fscontext: add custom-prefix log helpers fs: Remove mount_bdev fs: Remove mount_nodev
2025-09-19fs: WQ_PERCPU added to alloc_workqueue usersMarco Crivellari1-1/+2
Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This lack of consistentcy cannot be addressed without refactoring the API. alloc_workqueue() treats all queues as per-CPU by default, while unbound workqueues must opt-in via WQ_UNBOUND. This default is suboptimal: most workloads benefit from unbound queues, allowing the scheduler to place worker threads where they’re needed and reducing noise when CPUs are isolated. This patch adds a new WQ_PERCPU flag to all the fs subsystem users to explicitly request the use of the per-CPU behavior. Both flags coexist for one release cycle to allow callers to transition their calls. Once migration is complete, WQ_UNBOUND can be removed and unbound will become the implicit default. With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND), any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND must now use WQ_PERCPU. All existing users have been updated accordingly. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Link: https://lore.kernel.org/20250916082906.77439-4-marco.crivellari@suse.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-17preparations to taking MNT_WRITE_HOLD out of ->mnt_flagsAl Viro1-2/+1
We have an unpleasant wart in accessibility rules for struct mount. There are per-superblock lists of mounts, used by sb_prepare_remount_readonly() to check if any of those is currently claimed for write access and to block further attempts to get write access on those until we are done. As soon as it is attached to a filesystem, mount becomes reachable via that list. Only sb_prepare_remount_readonly() traverses it and it only accesses a few members of struct mount. Unfortunately, ->mnt_flags is one of those and it is modified - MNT_WRITE_HOLD set and then cleared. It is done under mount_lock, so from the locking rules POV everything's fine. However, it has easily overlooked implications - once mount has been attached to a filesystem, it has to be treated as globally visible. In particular, initializing ->mnt_flags *must* be done either prior to that point or under mount_lock. All other members are still private at that point. Life gets simpler if we move that bit (and that's *all* that can get touched by access via this list) out of ->mnt_flags. It's not even hard to do - currently the list is implemented as list_head one, anchored in super_block->s_mounts and linked via mount->mnt_instance. As the first step, switch it to hlist-like open-coded structure - address of the first mount in the set is stored in ->s_mounts and ->mnt_instance replaced with ->mnt_next_for_sb and ->mnt_pprev_for_sb - the former either NULL or pointing to the next mount in set, the latter - address of either ->s_mounts or ->mnt_next_for_sb in the previous element of the set. In the next commit we'll steal the LSB of ->mnt_pprev_for_sb as replacement for MNT_WRITE_HOLD. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-08-25fs: Use try_cmpxchg() in sb_init_done_wq()Uros Bizjak1-3/+5
Use !try_cmpxchg() instead of cmpxchg(*ptr, old, new) != old. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after CMPXCHG. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/20250811132326.620521-1-ubizjak@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-11fs: Remove mount_bdevPedro Falcato1-43/+0
mount_bdev has no in-tree users ever since f2fs adopted the new mount API. Remove it. Signed-off-by: Pedro Falcato <pfalcato@suse.de> Link: https://lore.kernel.org/20250723132156.225410-3-pfalcato@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-08-11fs: Remove mount_nodevPedro Falcato1-20/+0
mount_nodev has had no in-tree users since cc0876f817d6 ("vfs: Convert devpts to use the new mount API"). Remove it. Signed-off-by: Pedro Falcato <pfalcato@suse.de> Link: https://lore.kernel.org/20250723132156.225410-2-pfalcato@suse.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-28Merge tag 'vfs-6.17-rc1.super' of ↵Linus Torvalds1-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock callback update from Christian Brauner: "Currently all filesystems which implement super_operations::shutdown() can not afford losing a device. Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the involved filesystem. But it will no longer be the case, as multi-device filesystems like btrfs can handle certain device loss without the need to shutdown the whole filesystem. To allow those multi-device filesystems to be integrated to use fs_holder_ops: - Add a new super_operations::remove_bdev() callback - Try ->remove_bdev() callback first inside fs_bdev_mark_dead(). If the callback returned 0, meaning the fs can handling the device loss, then exit without doing anything else. If there is no such callback or the callback returned non-zero value, continue to shutdown the filesystem as usual. This means the new remove_bdev() should only do the check on whether the operation can continue, and if so do the fs specific handlings. The shutdown handling should still be handled by the existing ->shutdown() callback. For all existing filesystems with shutdown callback, there is no change to the code nor behavior. Btrfs is going to implement both the ->remove_bdev() and ->shutdown() callbacks soon" * tag 'vfs-6.17-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add a new remove_bdev() callback
2025-07-15fs: add a new remove_bdev() callbackQu Wenruo1-0/+11
Currently all filesystems which implement super_operations::shutdown() can not afford losing a device. Thus fs_bdev_mark_dead() will just call the ->shutdown() callback for the involved filesystem. But it will no longer be the case, as multi-device filesystems like btrfs and bcachefs can handle certain device loss without the need to shutdown the whole filesystem. To allow those multi-device filesystems to be integrated to use fs_holder_ops: - Add a new super_operations::remove_bdev() callback - Try ->remove_bdev() callback first inside fs_bdev_mark_dead() If the callback returned 0, meaning the fs can handling the device loss, then exit without doing anything else. If there is no such callback or the callback returned non-zero value, continue to shutdown the filesystem as usual. This means the new remove_bdev() should only do the check on whether the operation can continue, and if so do the fs specific handlings. The shutdown handling should still be handled by the existing ->shutdown() callback. For all existing filesystems with shutdown callback, there is no change to the code nor behavior. Btrfs is going to implement both the ->remove_bdev() and ->shutdown() callbacks soon. Signed-off-by: Qu Wenruo <wqu@suse.com> Link: https://lore.kernel.org/09909fcff7f2763cc037fec97ac2482bdc0a12cb.1752470276.git.wqu@suse.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-12fs: unlock the superblock during iterate_supers_typeDarrick J. Wong1-1/+3
This function takes super_lock in shared mode, so it should release the same lock. Cc: stable@vger.kernel.org # v6.16-rc1 Fixes: af7551cf13cf7f ("super: remove pointless s_root checks") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Link: https://lore.kernel.org/20250611164044.GF6138@frogsfrogsfrogs Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-30Merge tag 'pull-automount' of ↵Linus Torvalds1-8/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull automount updates from Al Viro: "Automount wart removal A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attched mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount. Should've done it that way from the very beginning... This is a flagday change, thankfully there are very few instances. vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives" * tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kill vfs_submount() saner calling conventions for ->d_automount()
2025-05-26Merge tag 'vfs-6.16-rc1.super' of ↵Linus Torvalds1-90/+224
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs freezing updates from Christian Brauner: "This contains various filesystem freezing related work for this cycle: - Allow the power subsystem to support filesystem freeze for suspend and hibernate. Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. - Allow efivars to support freeze and thaw Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care" * tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: fix freezing filesystem during resize kernfs: add warning about implementing freeze/thaw efivarfs: support freeze/thaw power: freeze filesystems during suspend/resume libfs: export find_next_child() super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks fs: allow all writers to be frozen locking/percpu-rwsem: add freezable alternative to down_read
2025-05-09super: add filesystem freezing helpers for suspend and hibernateChristian Brauner1-13/+158
Allow the power subsystem to support filesystem freeze for suspend and hibernate. For some kernel subsystems it is paramount that they are guaranteed that they are the owner of the freeze to avoid any risk of deadlocks. This is the case for the power subsystem. Enable it to recognize whether it did actually freeze the filesystem. If userspace has 10 filesystems and suspend/hibernate manges to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. It can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime it still cannot tell which filesystems the power subsystem actually managed to get a freeze reference count on that needs to be dropped during thaw. There's various ways out of this ugliness. For example, record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing or make sure that the power subsystem just actually exclusively freezes things it can freeze and marking such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. If hibernation races with filesystem freezing (e.g. DM reconfiguration), then hibernation need not freeze a filesystem because it's already frozen but userspace may thaw the filesystem before hibernation actually happens. If the race happens the other way around, DM reconfiguration may unexpectedly fail with EBUSY. So allow FREEZE_EXCL to nest with other holders. An exclusive freezer cannot be undone by any of the other concurrent freezers. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-06kill vfs_submount()Al Viro1-8/+1
The last remaining user of vfs_submount() (tracefs) is easy to convert to fs_context_for_submount(); do that and bury that thing, along with SB_SUBMOUNT Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-04-29fs: remove useless plus one in super_cache_scan()Jinliang Zheng1-1/+1
After commit 475d0db742e3 ("fs: Fix theoretical division by 0 in super_cache_scan()."), there's no need to plus one to prevent division by zero. Remove it to simplify the code. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Link: https://lore.kernel.org/20250428135050.267297-1-alexjlzheng@tencent.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07super: use common iterator (Part 2)Christian Brauner1-9/+39
Use a common iterator for all callbacks. We could go for something even more elaborate (advance step-by-step similar to iov_iter) but I really don't think this is warranted. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-5-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07super: use a common iterator (Part 1)Christian Brauner1-54/+13
Use a common iterator for all callbacks. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-4-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07super: skip dying superblocks earlyChristian Brauner1-0/+6
Make all iterators uniform by performing an early check whether the superblock is dying. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-3-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07super: simplify user_get_super()Christian Brauner1-13/+14
Make it easier to read and remove one level of identation. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-2-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07super: remove pointless s_root checksChristian Brauner1-13/+6
The locking guarantees that the superblock is alive and sb->s_root is still set. Remove the pointless check. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-1-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-24Merge tag 'vfs-6.15-rc1.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Add CONFIG_DEBUG_VFS infrastucture: - Catch invalid modes in open - Use the new debug macros in inode_set_cached_link() - Use debug-only asserts around fd allocation and install - Place f_ref to 3rd cache line in struct file to resolve false sharing Cleanups: - Start using anon_inode_getfile_fmode() helper in various places - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by f_pos_lock - Add unlikely() to kcmp() - Remove legacy ->remount_fs method from ecryptfs after port to the new mount api - Remove invalidate_inodes() in favour of evict_inodes() - Simplify ep_busy_loopER by removing unused argument - Avoid mmap sem relocks when coredumping with many missing pages - Inline getname() - Inline new_inode_pseudo() and de-staticize alloc_inode() - Dodge an atomic in putname if ref == 1 - Consistently deref the files table with rcu_dereference_raw() - Dedup handling of struct filename init and refcounts bumps - Use wq_has_sleeper() in end_dir_add() - Drop the lock trip around I_NEW wake up in evict() - Load the ->i_sb pointer once in inode_sb_list_{add,del} - Predict not reaching the limit in alloc_empty_file() - Tidy up do_sys_openat2() with likely/unlikely - Call inode_sb_list_add() outside of inode hash lock - Sort out fd allocation vs dup2 race commentary - Turn page_offset() into a wrapper around folio_pos() - Remove locking in exportfs around ->get_parent() call - try_lookup_one_len() does not need any locks in autofs - Fix return type of several functions from long to int in open - Fix return type of several functions from long to int in ioctls Fixes: - Fix watch queue accounting mismatch" * tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: sort out fd allocation vs dup2 race commentary, take 2 fs: call inode_sb_list_add() outside of inode hash lock fs: tidy up do_sys_openat2() with likely/unlikely fs: predict not reaching the limit in alloc_empty_file() fs: load the ->i_sb pointer once in inode_sb_list_{add,del} fs: drop the lock trip around I_NEW wake up in evict() fs: use wq_has_sleeper() in end_dir_add() VFS/autofs: try_lookup_one_len() does not need any locks fs: dedup handling of struct filename init and refcounts bumps fs: consistently deref the files table with rcu_dereference_raw() exportfs: remove locking around ->get_parent() call. fs: use debug-only asserts around fd allocation and install fs: dodge an atomic in putname if ref == 1 vfs: Remove invalidate_inodes() ecryptfs: remove NULL remount_fs from super_operations watch_queue: fix pipe accounting mismatch fs: place f_ref to 3rd cache line in struct file to resolve false sharing epoll: simplify ep_busy_loop by removing always 0 argument fs: Turn page_offset() into a wrapper around folio_pos() kcmp: improve performance adding an unlikely hint to task comparisons ...
2025-03-08vfs: Remove invalidate_inodes()Jan Kara1-1/+1
The function can be replaced by evict_inodes. The only difference is that evict_inodes() skips the inodes with positive refcount without touching ->i_lock, but they are equivalent as evict_inodes() repeats the refcount check after having grabbed ->i_lock. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250307144318.28120-2-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06vfs: remove some unused old mount api codeEric Sandeen1-55/+0
Remove reconfigure_single, mount_single, and compare_single now that no users remain. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Link: https://lore.kernel.org/r/20250205213931.74614-5-sandeen@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-25lib/list_debug.c: add object information in case of invalid objectManinder Singh1-1/+1
As of now during link list corruption it prints about cluprit address and its wrong value, but sometime it is not enough to catch the actual issue point. If it prints allocation and free path of that corrupted node, it will be a lot easier to find and fix the issues. Adding the same information when data mismatch is found in link list debug data: [ 14.243055] slab kmalloc-32 start ffff0000cda19320 data offset 32 pointer offset 8 size 32 allocated at add_to_list+0x28/0xb0 [ 14.245259] __kmalloc_cache_noprof+0x1c4/0x358 [ 14.245572] add_to_list+0x28/0xb0 ... [ 14.248632] do_el0_svc_compat+0x1c/0x34 [ 14.249018] el0_svc_compat+0x2c/0x80 [ 14.249244] Free path: [ 14.249410] kfree+0x24c/0x2f0 [ 14.249724] do_force_corruption+0xbc/0x100 ... [ 14.252266] el0_svc_common.constprop.0+0x40/0xe0 [ 14.252540] do_el0_svc_compat+0x1c/0x34 [ 14.252763] el0_svc_compat+0x2c/0x80 [ 14.253071] ------------[ cut here ]------------ [ 14.253303] list_del corruption. next->prev should be ffff0000cda192a8, but was 6b6b6b6b6b6b6b6b. (next=ffff0000cda19348) [ 14.254255] WARNING: CPU: 3 PID: 84 at lib/list_debug.c:65 __list_del_entry_valid_or_report+0x158/0x164 Moved prototype of mem_dump_obj() to bug.h, as mm.h can not be included in bug.h. Link: https://lkml.kernel.org/r/20241230101043.53773-1-maninder1.s@samsung.com Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Rohit Thapliyal <r.thapliyal@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-21fs/super.c: introduce get_tree_bdev_flags()Gao Xiang1-6/+20
As Allison reported [1], currently get_tree_bdev() will store "Can't lookup blockdev" error message. Although it makes sense for pure bdev-based fses, this message may mislead users who try to use EROFS file-backed mounts since get_tree_nodev() is used as a fallback then. Add get_tree_bdev_flags() to specify extensible flags [2] and GET_TREE_BDEV_QUIET_LOOKUP to silence "Can't lookup blockdev" message since it's misleading to EROFS file-backed mounts now. [1] https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com [2] https://lore.kernel.org/r/ZwUkJEtwIpUA4qMz@infradead.org Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20241009033151.2334888-1-hsiangkao@linux.alibaba.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-16Merge tag 'vfs-6.12.misc' of ↵Linus Torvalds1-2/+2
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual pile of misc updates: Features: - Add F_CREATED_QUERY fcntl() that allows userspace to query whether a file was actually created. Often userspace wants to know whether an O_CREATE request did actually create a file without using O_EXCL. The current logic is that to first attempts to open the file without O_CREAT | O_EXCL and if ENOENT is returned userspace tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. All available workarounds are really nasty (fanotify, bpf lsm etc) so add a simple fcntl(). - Try an opportunistic lookup for O_CREAT. Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. This was likely done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists (see related F_CREATED_QUERY patch motivation above). The series contained in the pr rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether and it can stay in rcuwalk mode for the last step_into. - Expose the 64 bit mount id via name_to_handle_at() Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call - Add a per dentry expire timeout to autofs There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount) To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Fixes: - Fix missing fput for FSCONFIG_SET_FD in autofs - Use param->file for FSCONFIG_SET_FD in coda - Delete the 'fs/netfs' proc subtreee when netfs module exits - Make sure that struct uid_gid_map fits into a single cacheline - Don't flush in-flight wb switches for superblocks without cgroup writeback - Correcting the idmapping mount example in the idmapping documentation - Fix a race between evice_inodes() and find_inode() and iput() - Refine the show_inode_state() macro definition in writeback code - Prevent dump_mapping() from accessing invalid dentry.d_name.name - Show actual source for debugfs in /proc/mounts - Annotate data-race of busy_poll_usecs in eventpoll - Don't WARN for racy path_noexec check in exec code - Handle OOM on mnt_warn_timestamp_expiry() - Fix some spelling in the iomap design documentation - Fix typo in procfs comment - Fix typo in fs/namespace.c comment Cleanups: - Add the VFS git tree to the MAINTAINERS file - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode bit in struct file bringing us to 5 free f_mode bits - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify the wait mechanism - Remove the unused path_put_init() helper - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi specific - Replace the unsigned long i_state member with a u32 i_state member in struct inode freeing up 4 bytes in struct inode. Instead of using the bit based wait apis we're now using the var event apis and using the individual bytes of the i_state member to wait on state changes - Explain how per-syscall AT_* flags should be allocated - Use in_group_or_capable() helper to simplify the posix acl mode update code - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code - Removed comment about d_rcu_to_refcount() as that function doesn't exist anymore - Add kernel documentation for lookup_fast() - Don't re-zero evenpoll fields - Remove outdated comment after close_fd() - Fix imprecise wording in comment about the pipe filesystem - Drop GFP_NOFAIL mode from alloc_page_buffers - Missing blank line warnings and struct declaration improved in file_table - Annotate struct poll_list with __counted_by() - Remove the unused read parameter in percpu-rwsem - Remove linux/prefetch.h include from direct-io code - Use kmemdup_array instead of kmemdup for multiple allocation in mnt_idmapping code - Remove unused mnt_cursor_del() declaration Performance tweaks: - Dodge smp_mb in break_lease and break_deleg in the common case - Only read fops once in fops_{get,put}() - Use RCU in ilookup() - Elide smp_mb in iversion handling in the common case - Drop one lock trip in evict()" * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits) uidgid: make sure we fit into one cacheline proc: Fix typo in the comment fs/pipe: Correct imprecise wording in comment fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated fs: drop GFP_NOFAIL mode from alloc_page_buffers writeback: Refine the show_inode_state() macro definition fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation netfs: Delete subtree of 'fs/netfs' when netfs module exits fs: use LIST_HEAD() to simplify code inode: make i_state a u32 inode: port __I_LRU_ISOLATING to var event vfs: fix race between evice_inodes() and find_inode()&iput() inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers MAINTAINERS: add the VFS git tree fs: s/__u32/u32/ for s_fsnotify_mask ...
2024-08-22fs/super.c: improve get_tree() error messageKent Overstreet1-2/+2
seeing an odd bug where we fail to correctly return an error from .get_tree(): https://syzkaller.appspot.com/bug?extid=c0360e8367d6d8d04a66 we need to be able to distinguish between accidently returning a positive error (as implied by the log) and no error. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-08-19percpu-rwsem: remove the unused parameter 'read'Wang Long1-1/+1
In the function percpu_rwsem_release, the parameter `read` is unused, so remove it. Signed-off-by: Wang Long <w@laoqinren.net> Link: https://lore.kernel.org/r/20240802091901.2546797-1-w@laoqinren.net Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-19fs: don't flush in-flight wb switches for superblocks without cgroup writebackHaifeng Xu1-1/+1
When deactivating any type of superblock, it had to wait for the in-flight wb switches to be completed. wb switches are executed in inode_switch_wbs_work_fn() which needs to acquire the wb_switch_rwsem and races against sync_inodes_sb(). If there are too much dirty data in the superblock, the waiting time may increase significantly. For superblocks without cgroup writeback such as tmpfs, they have nothing to do with the wb swithes, so the flushing can be avoided. Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com> Link: https://lore.kernel.org/r/20240726030525.180330-1-haifeng.xu@shopee.com Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-27fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNTSeth Forshee (DigitalOcean)1-0/+11
Christian noticed that it is possible for a privileged user to mount most filesystems with a non-initial user namespace in sb->s_user_ns. When fsopen() is called in a non-init namespace the caller's namespace is recorded in fs_context->user_ns. If the returned file descriptor is then passed to a process priviliged in init_user_ns, that process can call fsconfig(fd_fs, FSCONFIG_CMD_CREATE), creating a new superblock with sb->s_user_ns set to the namespace of the process which called fsopen(). This is problematic. We cannot assume that any filesystem which does not set FS_USERNS_MOUNT has been written with a non-initial s_user_ns in mind, increasing the risk for bugs and security issues. Prevent this by returning EPERM from sget_fc() when FS_USERNS_MOUNT is not set for the filesystem and a non-initial user namespace will be used. sget() does not need to be updated as it always uses the user namespace of the current context, or the initial user namespace if SB_SUBMOUNT is set. Fixes: cb50b348c71f ("convenience helpers: vfs_get_super() and sget_fc()") Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/20240724-s_user_ns-fix-v1-1-895d07c94701@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-06-18fs: don't misleadingly warn during thaw operationsChristian Brauner1-1/+10
The block device may have been frozen before it was claimed by a filesystem. Concurrently another process might try to mount that frozen block device and has temporarily claimed the block device for that purpose causing a concurrent fs_bdev_thaw() to end up here. The mounter is already about to abort mounting because they still saw an elevanted bdev->bd_fsfreeze_count so get_bdev_super() will return NULL in that case. For example, P1 calls dm_suspend() which calls into bdev_freeze() before the block device has been claimed by the filesystem. This brings bdev->bd_fsfreeze_count to 1 and no call into fs_bdev_freeze() is required. Now P2 tries to mount that frozen block device. It claims it and checks bdev->bd_fsfreeze_count. As it's elevated it aborts mounting. In the meantime P3 called dm_resume(). P3 sees that the block device is already claimed by a filesystem and calls into fs_bdev_thaw(). P3 takes a passive reference and realizes that the filesystem isn't ready yet. P3 puts itself to sleep to wait for the filesystem to become ready. P2 now puts the last active reference to the filesystem and marks it as dying. P3 gets woken, sees that the filesystem is dying and get_bdev_super() fails. Fixes: 49ef8832fb1a ("bdev: implement freeze and thaw holder operations") Cc: <stable@vger.kernel.org> Reported-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20240611085210.GA1838544@mit.edu Link: https://lore.kernel.org/r/20240613-lackmantel-einsehen-90f0d727358d@brauner Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-05-20Merge tag 'fsnotify_for_v6.10-rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: - reduce overhead of fsnotify infrastructure when no permission events are in use - a few small cleanups * tag 'fsnotify_for_v6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix UAF from FS_ERROR event on a shutting down filesystem fsnotify: optimize the case of no permission event watchers fsnotify: use an enum for group priority constants fsnotify: move s_fsnotify_connectors into fsnotify_sb_info fsnotify: lazy attach fsnotify_sb_info state to sb fsnotify: create helper fsnotify_update_sb_watchers() fsnotify: pass object pointer and type to fsnotify mark helpers fanotify: merge two checks regarding add of ignore mark fsnotify: create a wrapper fsnotify_find_inode_mark() fsnotify: create helpers to get sb and connp from object fsnotify: rename fsnotify_{get,put}_sb_connectors() fsnotify: Avoid -Wflex-array-member-not-at-end warning fanotify: remove unneeded sub-zero check for unsigned value
2024-04-17fsnotify: fix UAF from FS_ERROR event on a shutting down filesystemAmir Goldstein1-0/+1
Protect against use after free when filesystem calls fsnotify_sb_error() during fs shutdown. Move freeing of sb->s_fsnotify_info to destroy_super_work(), because it may be accessed from fs shutdown context. Reported-by: syzbot+5e3f9b2a67b45f16d4e6@syzkaller.appspotmail.com Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/linux-fsdevel/20240416173211.4lnmgctyo4jn5fha@quack3/ Fixes: 07a3b8d0bf72 ("fsnotify: lazy attach fsnotify_sb_info state to sb") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20240416181452.567070-1-amir73il@gmail.com>
2024-03-27fs,block: yield devices earlyChristian Brauner1-21/+3
Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-18fs,block: get holder during claimChristian Brauner1-0/+18
Now that we open block devices as files we need to deal with the realities that closing is a deferred operation. An operation on the block device such as e.g., freeze, thaw, or removal that runs concurrently with umount, tries to acquire a stable reference on the holder. The holder might already be gone though. Make that reliable by grabbing a passive reference to the holder during bdev_open() and releasing it during bdev_release(). Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reported-by: Christoph Hellwig <hch@infradead.org> Link: https://lore.kernel.org/r/ZfEQQ9jZZVes0WCZ@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@infradead.org> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reported-by: https://lore.kernel.org/r/CAHj4cs8tbDwKRwfS1=DmooP73ysM__xAb2PQc6XsAmWR+VuYmg@mail.gmail.com Link: https://lore.kernel.org/r/20240315-freibad-annehmbar-ca68c375af91@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds1-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-02-25bdev: open block device as filesChristian Brauner1-9/+9
Add two new helpers to allow opening block devices as files. This is not the final infrastructure. This still opens the block device before opening a struct a file. Until we have removed all references to struct bdev_handle we can't switch the order: * Introduce blk_to_file_flags() to translate from block specific to flags usable to pen a new file. * Introduce bdev_file_open_by_{dev,path}(). * Introduce temporary sb_bdev_handle() helper to retrieve a struct bdev_handle from a block device file and update places that directly reference struct bdev_handle to rely on it. * Don't count block device openes against the number of open files. A bdev_file_open_by_{dev,path}() file is never installed into any file descriptor table. One idea that came to mind was to use kernel_tmpfile_open() which would require us to pass a path and it would then call do_dentry_open() going through the regular fops->open::blkdev_open() path. But then we're back to the problem of routing block specific flags such as BLK_OPEN_RESTRICT_WRITES through the open path and would have to waste FMODE_* flags every time we add a new one. With this we can avoid using a flag bit and we have more leeway in how we open block devices from bdev_open_by_{dev,path}(). Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25fs/super.c: don't drop ->s_user_ns until we free struct super_block itselfAl Viro1-9/+4
Avoids fun races in RCU pathwalk... Same goes for freeing LSM shite hanging off super_block's arse. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-01-10Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds1-6/+6
Pull fscrypt updates from Eric Biggers: "Adjust the timing of the fscrypt keyring destruction, to prepare for btrfs's fscrypt support. Also document that CephFS supports fscrypt now" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fs: move fscrypt keyring destruction to after ->put_super f2fs: move release of block devices to after kill_block_super() fscrypt: document that CephFS supports fscrypt now fscrypt: update comment for do_remove_key() fscrypt.rst: update definition of struct fscrypt_context_v2
2024-01-08Merge tag 'vfs-6.8.super' of ↵Linus Torvalds1-223/+275
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs super updates from Christian Brauner: "This contains the super work for this cycle including the long-awaited series by Jan to make it possible to prevent writing to mounted block devices: - Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add tracking of writers to each block device and a kernel cmdline argument which controls whether other writeable opens to block devices open with BLK_OPEN_RESTRICT_WRITES flag are allowed. Note that this effectively only prevents modification of the particular block device's page cache by other writers. The actual device content can still be modified by other means - e.g. by issuing direct scsi commands, by doing writes through devices lower in the storage stack (e.g. in case loop devices, DM, or MD are involved) etc. But blocking direct modifications of the block device page cache is enough to give filesystems a chance to perform data validation when loading data from the underlying storage and thus prevent kernel crashes. Syzbot can use this cmdline argument option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this option for hardening. We expect that this will be interesting to quite a few workloads. Btrfs is currently opted out of this because they still haven't merged patches we require for this to work from three kernel releases ago. - Reimplement block device freezing and thawing as holder operations on the block device. This allows us to extend block device freezing to all devices associated with a superblock and not just the main device. It also allows us to remove get_active_super() and thus another function that scans the global list of superblocks. Freezing via additional block devices only works if the filesystem chooses to use @fs_holder_ops for these additional devices as well. That currently only includes ext4 and xfs. Earlier releases switched get_tree_bdev() and mount_bdev() to use @fs_holder_ops. The remaining nilfs2 open-coded version of mount_bdev() has been converted to rely on @fs_holder_ops as well. So block device freezing for the main block device will continue to work as before. There should be no regressions in functionality. The only special case is btrfs where block device freezing for the main block device never worked because sb->s_bdev isn't set. Block device freezing for btrfs can be fixed once they can switch to @fs_holder_ops but that can happen whenever they're ready" * tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) block: Fix a memory leak in bdev_open_by_dev() super: don't bother with WARN_ON_ONCE() super: massage wait event mechanism ext4: Block writes to journal device xfs: Block writes to log device fs: Block writes to mounted block devices btrfs: Do not restrict writes to btrfs devices block: Add config option to not allow writing to mounted devices block: Remove blkdev_get_by_*() functions bcachefs: Convert to bdev_open_by_path() fs: handle freezing from multiple devices fs: remove dead check nilfs2: simplify device handling fs: streamline thaw_super_locked ext4: simplify device handling xfs: simplify device handling fs: simplify setup_bdev_super() calls blkdev: comment fs_holder_ops porting: document block device freeze and thaw changes fs: remove unused helper ...
2023-12-27fs: move fscrypt keyring destruction to after ->put_superJosef Bacik1-6/+6
btrfs has a variety of asynchronous things we do with inodes that can potentially last until ->put_super, when we shut everything down and clean up all of our async work. Due to this we need to move fscrypt_destroy_keyring() to after ->put_super, otherwise we get warnings about still having active references on the master key. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Neal Gompa <neal@gompa.dev> Link: https://lore.kernel.org/r/20231227171429.9223-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-12-12fs: super: use GFP_KERNEL instead of GFP_USER for super block allocationAlexander Mikhalitsyn1-1/+1
There is no reason to use a GFP_USER flag for struct super_block allocation in the alloc_super(). Instead, let's use GFP_KERNEL for that. >From the memory management perspective, the only difference between GFP_USER and GFP_KERNEL is that GFP_USER allocations are tied to a cpuset, while GFP_KERNEL ones are not. There is no real issue and this is not a candidate to go to the stable, but let's fix it for a consistency sake. Cc: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: <linux-fsdevel@vger.kernel.org> Cc: <linux-kernel@vger.kernel.org> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20231208151022.156273-1-aleksandr.mikhalitsyn@canonical.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28super: don't bother with WARN_ON_ONCE()Christian Brauner1-4/+1
We hold our own active reference and we've checked it above. Link: https://lore.kernel.org/r/20231127-vfs-super-massage-wait-v1-2-9ab277bfd01a@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-28super: massage wait event mechanismChristian Brauner1-37/+14
We're currently using two separate helpers wait_born() and wait_dead() when we can just all do it in a single helper super_load_flags(). We're also acquiring the lock before we check whether this superblock is even a viable candidate. If it's already dying we don't even need to bother with the lock. Link: https://lore.kernel.org/r/20231127-vfs-super-massage-wait-v1-1-9ab277bfd01a@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: handle freezing from multiple devicesChristian Brauner1-32/+112
Before [1] freezing a filesystems through the block layer only worked for the main block device as the owning superblock of additional block devices could not be found. Any filesystem that made use of multiple block devices would only be freezable via it's main block device. For example, consider xfs over device mapper with /dev/dm-0 as main block device and /dev/dm-1 as external log device. Two freeze requests before [1]: (1) dmsetup suspend /dev/dm-0 on the main block device bdev_freeze(dm-0) -> dm-0->bd_fsfreeze_count++ -> freeze_super(xfs-sb) The owning superblock is found and the filesystem gets frozen. Returns 0. (2) dmsetup suspend /dev/dm-1 on the log device bdev_freeze(dm-1) -> dm-1->bd_fsfreeze_count++ The owning superblock isn't found and only the block device freeze count is incremented. Returns 0. Two freeze requests after [1]: (1') dmsetup suspend /dev/dm-0 on the main block device bdev_freeze(dm-0) -> dm-0->bd_fsfreeze_count++ -> freeze_super(xfs-sb) The owning superblock is found and the filesystem gets frozen. Returns 0. (2') dmsetup suspend /dev/dm-1 on the log device bdev_freeze(dm-0) -> dm-0->bd_fsfreeze_count++ -> freeze_super(xfs-sb) The owning superblock is found and the filesystem gets frozen. Returns -EBUSY. When (2') is called we initiate a freeze from another block device of the same superblock. So we increment the bd_fsfreeze_count for that additional block device. But we now also find the owning superblock for additional block devices and call freeze_super() again which reports -EBUSY. This can be reproduced through xfstests via: mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p4 /dev/nvme1n1p3 mkfs.xfs -f -m crc=1,reflink=1,rmapbt=1, -i sparse=1 -lsize=1g,logdev=/dev/nvme1n1p6 /dev/nvme1n1p5 FSTYP=xfs export TEST_DEV=/dev/nvme1n1p3 export TEST_DIR=/mnt/test export TEST_LOGDEV=/dev/nvme1n1p4 export SCRATCH_DEV=/dev/nvme1n1p5 export SCRATCH_MNT=/mnt/scratch export SCRATCH_LOGDEV=/dev/nvme1n1p6 export USE_EXTERNAL=yes sudo ./check generic/311 Current semantics allow two concurrent freezers: one initiated from userspace via FREEZE_HOLDER_USERSPACE and one initiated from the kernel via FREEZE_HOLDER_KERNEL. If there are multiple concurrent freeze requests from either FREEZE_HOLDER_USERSPACE or FREEZE_HOLDER_KERNEL -EBUSY is returned. We need to preserve these semantics because as they are uapi via FIFREEZE and FITHAW ioctl()s. IOW, freezes don't nest for FIFREEZE and FITHAW. Other kernels consumers rely on non-nesting freezes as well. With freezes initiated from the block layer freezes need to nest if the same superblock is frozen via multiple devices. So we need to start counting the number of freeze requests. If FREEZE_MAY_NEST is passed alongside FREEZE_HOLDER_KERNEL or FREEZE_HOLDER_USERSPACE we allow the caller to nest freeze calls. To accommodate the old semantics we split the freeze counter into two counting kernel initiated and userspace initiated freezes separately. We can then also stop recording FREEZE_HOLDER_* in struct sb_writers. We also simplify freezing by making all concurrent freezers share a single active superblock reference count instead of having separate references for kernel and userspace. I don't see why we would need two active reference counts. Neither FREEZE_HOLDER_KERNEL nor FREEZE_HOLDER_USERSPACE can put the active reference as long as they are concurrent freezers anwyay. That was already true before we allowed nesting freezes. Survives various fstests runs with different options including the reproducer, online scrub, and online repair, fsfreze, and so on. Also survives blktests. Link: https://lore.kernel.org/linux-block/87bkccnwxc.fsf@debian-BULLSEYE-live-builder-AMD64 Link: https://lore.kernel.org/r/20231104-vfs-multi-device-freeze-v2-2-5b5b69626eac@kernel.org Fixes: 288d8706abfc ("bdev: implement freeze and thaw holder operations") [1] # no backport needed Tested-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: remove dead checkChristian Brauner1-5/+0
Above we call super_lock_excl() which waits until the superblock is SB_BORN and since SB_BORN is never unset once set this check can never fire. Plus, we also hold an active reference at this point already so this superblock can't even be shutdown. Link: https://lore.kernel.org/r/20231104-vfs-multi-device-freeze-v2-1-5b5b69626eac@kernel.org Tested-by: Chandan Babu R <chandanbabu@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: streamline thaw_super_lockedChristoph Hellwig1-23/+20
Add a new out_unlock label to share code that just releases s_umount and returns an error, and rename and reuse the out label that deactivates the sb for one more case. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231027064001.GA9469@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: simplify setup_bdev_super() callsChristian Brauner1-16/+0
There's no need to drop s_umount anymore now that we removed all sources where s_umount is taken beneath open_mutex or bd_holder_lock. Link: https://lore.kernel.org/r/20231024-vfs-super-rework-v1-1-37a8aa697148@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: remove unused helperChristian Brauner1-40/+14
The grab_super() helper is now only used by grab_super_dead(). Merge the two helpers into one. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-8-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: remove get_active_super()Christian Brauner1-28/+0
This function is now unused so remove it. One less function that uses the global superblock list. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-6-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18bdev: implement freeze and thaw holder operationsChristian Brauner1-20/+79
The old method of implementing block device freeze and thaw operations required us to rely on get_active_super() to walk the list of all superblocks on the system to find any superblock that might use the block device. This is wasteful and not very pleasant overall. Now that we can finally go straight from block device to owning superblock things become way simpler. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-5-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18bdev: rename freeze and thaw helpersChristian Brauner1-2/+2
We have bdev_mark_dead() etc and we're going to move block device freezing to holder ops in the next patch. Make the naming consistent: * freeze_bdev() -> bdev_freeze() * thaw_bdev() -> bdev_thaw() Also document the return code. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-2-599c19f4faac@kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-18fs: massage locking helpersChristian Brauner1-44/+61
Multiple people have balked at the the fact that super_lock{_shared,_excluse}() return booleans and even if they return false hold s_umount. So let's change them to only hold s_umount when true is returned and change the code accordingly. Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-1-599c19f4faac@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-07Merge tag 'ovl-update-6.7' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs updates from Amir Goldstein: - Overlayfs aio cleanups and fixes Cleanups and minor fixes in preparation for factoring out of read/write passthrough code. - Overlayfs lock ordering changes Hold mnt_writers only throughout copy up instead of a long lived elevated refcount. - Add support for nesting overlayfs private xattrs There are cases where you want to use an overlayfs mount as a lowerdir for another overlayfs mount. For example, if the system rootfs is on overlayfs due to composefs, or to make it volatile (via tmpfs), then you cannot currently store a lowerdir on the rootfs, because the inner overlayfs will eat all the whiteouts and overlay xattrs. This means you can't e.g. store on the rootfs a prepared container image for use with overlayfs. This adds support for nesting of overlayfs mounts by escaping the problematic features and unescaping them when exposing to the overlayfs user. - Add new mount options for appending lowerdirs * tag 'ovl-update-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: add support for appending lowerdirs one by one ovl: refactor layer parsing helpers ovl: store and show the user provided lowerdir mount option ovl: remove unused code in lowerdir param parsing ovl: Add documentation on nesting of overlayfs mounts ovl: Add an alternative type of whiteout ovl: Support escaped overlay.* xattrs ovl: Add OVL_XATTR_TRUSTED/USER_PREFIX_LEN macros ovl: Move xattr support to new xattrs.c file ovl: do not encode lower fh with upper sb_writers held ovl: do not open/llseek lower file with upper sb_writers held ovl: reorder ovl_want_write() after ovl_inode_lock() ovl: split ovl_want_write() into two helpers ovl: add helper ovl_file_modified() ovl: protect copying of realinode attributes to ovl inode ovl: punt write aio completion to workqueue ovl: propagate IOCB_APPEND flag on writes to realfile ovl: use simpler function to convert iocb to rw flags
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds1-16/+19
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-10-31ovl: punt write aio completion to workqueueAmir Goldstein1-0/+1
We want to protect concurrent updates of ovl inode size and mtime (i.e. ovl_copyattr()) from aio completion context. Punt write aio completion to a workqueue so that we can protect ovl_copyattr() with a spinlock. Export sb_init_dio_done_wq(), so that overlayfs can use its own dio workqueue to punt aio completions. Suggested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/8620dfd3-372d-4ae0-aa3f-2fe97dda1bca@kernel.dk/ Signed-off-by: Amir Goldstein <amir73il@gmail.com>
2023-10-28fs: assert that open_mutex isn't held over holder opsChristian Brauner1-0/+1
With recent block level changes we should never be in a situation where we hold disk->open_mutex when calling into these helpers. So assert that in the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231017184823.1383356-6-hch@lst.de Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lockJan Kara1-18/+32
The implementation of bdev holder operations such as fs_bdev_mark_dead() and fs_bdev_sync() grab sb->s_umount semaphore under bdev->bd_holder_lock. This is problematic because it leads to disk->open_mutex -> sb->s_umount lock ordering which is counterintuitive (usually we grab higher level (e.g. filesystem) locks first and lower level (e.g. block layer) locks later) and indeed makes lockdep complain about possible locking cycles whenever we open a block device while holding sb->s_umount semaphore. Implement a function bdev_super_lock_shared() which safely transitions from holding bdev->bd_holder_lock to holding sb->s_umount on alive superblock without introducing the problematic lock dependency. We use this function fs_bdev_sync() and fs_bdev_mark_dead(). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231018152924.3858-1-jack@suse.cz Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-28fs: Convert to bdev_open_by_dev()Jan Kara1-6/+9
Convert mount code to use bdev_open_by_dev() and propagate the handle around to bdev_release(). Acked-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-19-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-04mm: shrinker: convert shrinker_rwsem to mutexQi Zheng1-1/+1
Now there are no readers of shrinker_rwsem, so we can simply replace it with mutex lock. [akpm@linux-foundation.org: update the fix to alloc_shrinker_info()] Link: https://lkml.kernel.org/r/20230911094444.68966-46-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Sterba <dsterba@suse.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04fs: super: dynamically allocate the s_shrinkQi Zheng1-15/+18
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the s_shrink, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct super_block. Link: https://lkml.kernel.org/r/20230911094444.68966-39-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-31fs: export sget_dev()Christian Brauner1-19/+45
They will be used for mtd devices as well. Acked-by: Richard Weinberger <richard@nod.at> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230829-vfs-super-mtd-v1-1-fecb572e5df3@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-29Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds1-1/+3
Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
2023-08-29Merge tag 'v6.6-vfs.super.fixes' of ↵Linus Torvalds1-20/+31
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock fixes from Christian Brauner: "Two follow-up fixes for the super work this cycle: - Move a misplaced lockep assertion before we potentially free the object containing the lock. - Ensure that filesystems which match superblocks in sget{_fc}() based on sb->s_fs_info are guaranteed to see a valid sb->s_fs_info as long as a superblock still appears on the filesystem type's superblock list. What we want as a proper solution for next cycle is to split sb->free_sb() out of sb->kill_sb() so that we can simply call kill_super_notify() after sb->kill_sb() but before sb->free_sb(). Currently, this is lumped together in sb->kill_sb()" * tag 'v6.6-vfs.super.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: super: ensure valid info super: move lockdep assert
2023-08-29super: ensure valid infoChristian Brauner1-19/+30
For keyed filesystems that recycle superblocks based on s_fs_info or information contained therein s_fs_info must be kept as long as the superblock is on the filesystem type super list. This isn't guaranteed as s_fs_info will be freed latest in sb->kill_sb(). The fix is simply to perform notification and list removal in kill_anon_super(). Any filesystem needs to free s_fs_info after they call the kill_*() helpers. If they don't they risk use-after-free right now so fixing it here is guaranteed that s_fs_info remain valid. For block backed filesystems notifying in pass sb->kill_sb() in deactivate_locked_super() remains unproblematic and is required because multiple other block devices can be shut down after kill_block_super() has been called from a filesystem's sb->kill_sb() handler. For example, ext4 and xfs close additional devices. Block based filesystems don't depend on s_fs_info (btrfs does use s_fs_info but also uses kill_anon_super() and not kill_block_super().). Sorry for that braino. Goal should be to unify this behavior during this cycle obviously. But let's please do a simple bugfix now. Fixes: 2c18a63b760a ("super: wait until we passed kill super") Fixes: syzbot+5b64180f8d9e39d3f061@syzkaller.appspotmail.com Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: syzbot+5b64180f8d9e39d3f061@syzkaller.appspotmail.com Message-Id: <20230828-vfs-super-fixes-v1-2-b37a4a04a88f@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-29super: move lockdep assertChristian Brauner1-1/+1
Fix braino and move the lockdep assertion after put_super() otherwise we risk a use-after-free. Fixes: 2c18a63b760a ("super: wait until we passed kill super") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230828-vfs-super-fixes-v1-1-b37a4a04a88f@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-28Merge tag 'v6.6-vfs.super' of ↵Linus Torvalds1-240/+525
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are rougly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() if fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...
2023-08-23Merge tag 'vfs-6.6-merge-2' of ↵Christian Brauner1-18/+102
ssh://gitolite.kernel.org/pub/scm/fs/xfs/xfs-linux Pull filesystem freezing updates from Darrick Wong: New code for 6.6: * Allow the kernel to initiate a freeze of a filesystem. The kernel and userspace can both hold a freeze on a filesystem at the same time; the freeze is not lifted until /both/ holders lift it. This will enable us to fix a longstanding bug in XFS online fsck. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Message-Id: <20230822182604.GB11286@frogsfrogsfrogs> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-22super: use higher-level helper for {freeze,thaw}Christian Brauner1-3/+12
It's not necessary to use low-level locking helpers here. Use the higher-level locking helpers and log if the superblock is dying. Since the caller is assumed to already hold an active reference it isn't possible to observe a dying superblock. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21super: wait until we passed kill superChristian Brauner1-7/+64
Recent rework moved block device closing out of sb->put_super() and into sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and blkdev_put() can end up taking s_umount again. That means we need to move the removal of the superblock from @fs_supers out of generic_shutdown_super() and into deactivate_locked_super() to ensure that concurrent mounters don't fail to open block devices that are still in use because blkdev_put() in sb->kill_sb() hasn't been called yet. We can now do this as we can make iterators through @fs_super and @super_blocks wait without holding s_umount. Concurrent mounts will wait until a dying superblock is fully dead so until sb->kill_sb() has been called and SB_DEAD been set. Concurrent iterators can already discard any SB_DYING superblock. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-4-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21super: wait for nascent superblocksChristian Brauner1-51/+153
Recent patches experiment with making it possible to allocate a new superblock before opening the relevant block device. Naturally this has intricate side-effects that we get to learn about while developing this. Superblock allocators such as sget{_fc}() return with s_umount of the new superblock held and lock ordering currently requires that block level locks such as bdev_lock and open_mutex rank above s_umount. Before aca740cecbe5 ("fs: open block device after superblock creation") ordering was guaranteed to be correct as block devices were opened prior to superblock allocation and thus s_umount wasn't held. But now s_umount must be dropped before opening block devices to avoid locking violations. This has consequences. The main one being that iterators over @super_blocks and @fs_supers that grab a temporary reference to the superblock can now also grab s_umount before the caller has managed to open block devices and called fill_super(). So whereas before such iterators or concurrent mounts would have simply slept on s_umount until SB_BORN was set or the superblock was discard due to initalization failure they can now needlessly spin through sget{_fc}(). If the caller is sleeping on bdev_lock or open_mutex one caller waiting on SB_BORN will always spin somewhere and potentially this can go on for quite a while. It should be possible to drop s_umount while allowing iterators to wait on a nascent superblock to either be born or discarded. This patch implements a wait_var_event() mechanism allowing iterators to sleep until they are woken when the superblock is born or discarded. This also allows us to avoid relooping through @fs_supers and @super_blocks if a superblock isn't yet born or dying. Link: aca740cecbe5 ("fs: open block device after superblock creation") Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-3-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21super: make locking naming consistentChristian Brauner1-14/+14
Make the naming consistent with the earlier introduced super_lock_{read,write}() helpers. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-2-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21super: use locking helpersChristian Brauner1-48/+78
Replace the open-coded {down,up}_{read,write}() calls with simple wrappers. Follow-up patches will benefit from this as well. Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-1-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21fs: simplify invalidate_inodesChristoph Hellwig1-1/+1
kill_dirty has always been true for a long time, so hard code it and remove the unused return value. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-18-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21fs: remove get_superChristoph Hellwig1-37/+0
get_super is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-17-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21block: call into the file system for ioctl BLKFLSBUFChristoph Hellwig1-0/+13
BLKFLSBUF is a historic ioctl that is called on a file handle to a block device and syncs either the file system mounted on that block device if there is one, or otherwise the just the data on the block device. Replace the get_super based syncing with a holder operation to remove the last usage of get_super, and to also support syncing the file system if the block device is not the main block device stored in s_dev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-16-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21block: call into the file system for bdev_mark_deadChristoph Hellwig1-2/+6
Combine the newly merged bdev_mark_dead helper with the existing mark_dead holder operation so that all operations that invalidate a device that is dead or being removed now go through the holder ops. This allows file systems to explicitly shutdown either ASAP (for a surprise removal) or after writing back data (for an orderly removal), and do so not only for the main device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-15-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21block: consolidate __invalidate_device and fsync_bdevChristoph Hellwig1-2/+2
We currently have two interfaces that take a block_devices and the find a mounted file systems to flush or invaldidate data on it. Both are a bit problematic because they only work for the "main" block devices that is used as s_dev for the super_block, and because they don't call into the file system at all. Merge the two into a new bdev_mark_dead helper that does both the syncing and invalidation and which is properly documented. This is in preparation of merging the functionality into the ->mark_dead holder operation so that it will work on additional block devices used by a file systems and give us a single entry point for invalidation of dead devices or media. Note that a single standalone fsync_bdev call for an obscure ioctl remains for now, but that one will also be deal with in a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-14-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14fs: add FSCONFIG_CMD_CREATE_EXCLChristian Brauner1-9/+27
Summary ======= This introduces FSCONFIG_CMD_CREATE_EXCL which will allows userspace to implement something like mount -t ext4 --exclusive /dev/sda /B which fails if a superblock for the requested filesystem does already exist: Before this patch ----------------- $ sudo ./move-mount -f xfs -o source=/dev/sda4 /A Requesting filesystem type xfs Mount options requested: source=/dev/sda4 Attaching mount at /A Moving single attached mount Setting key(source) with val(/dev/sda4) $ sudo ./move-mount -f xfs -o source=/dev/sda4 /B Requesting filesystem type xfs Mount options requested: source=/dev/sda4 Attaching mount at /B Moving single attached mount Setting key(source) with val(/dev/sda4) After this patch with --exclusive as a switch for FSCONFIG_CMD_CREATE_EXCL -------------------------------------------------------------------------- $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /A Requesting filesystem type xfs Request exclusive superblock creation Mount options requested: source=/dev/sda4 Attaching mount at /A Moving single attached mount Setting key(source) with val(/dev/sda4) $ sudo ./move-mount -f xfs --exclusive -o source=/dev/sda4 /B Requesting filesystem type xfs Request exclusive superblock creation Mount options requested: source=/dev/sda4 Attaching mount at /B Moving single attached mount Setting key(source) with val(/dev/sda4) Device or resource busy | move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed Details ======= As mentioned on the list (cf. [1]-[3]) mount requests like mount -t ext4 /dev/sda /A are ambigous for userspace. Either a new superblock has been created and mounted or an existing superblock has been reused and a bind-mount has been created. This becomes clear in the following example where two processes create the same mount for the same block device: P1 P2 fd_fs = fsopen("ext4"); fd_fs = fsopen("ext4"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "dax", "always"); fsconfig(fd_fs, FSCONFIG_SET_STRING, "resuid", "1000"); // wins and creates superblock fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) // finds compatible superblock of P1 // spins until P1 sets SB_BORN and grabs a reference fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...) fd_mnt1 = fsmount(fd_fs); fd_mnt2 = fsmount(fd_fs); move_mount(fd_mnt1, "/A") move_mount(fd_mnt2, "/B") Not just does P2 get a bind-mount but the mount options that P2 requestes are silently ignored. The VFS itself doesn't, can't and shouldn't enforce filesystem specific mount option compatibility. It only enforces incompatibility for read-only <-> read-write transitions: mount -t ext4 /dev/sda /A mount -t ext4 -o ro /dev/sda /B The read-only request will fail with EBUSY as the VFS can't just silently transition a superblock from read-write to read-only or vica versa without risking security issues. To userspace this silent superblock reuse can become a security issue in because there is currently no straightforward way for userspace to know that they did indeed manage to create a new superblock and didn't just reuse an existing one. This adds a new FSCONFIG_CMD_CREATE_EXCL command to fsconfig() that returns EBUSY if an existing superblock would be reused. Userspace that needs to be sure that it did create a new superblock with the requested mount options can request superblock creation using this command. If the command succeeds they can be sure that they did create a new superblock with the requested mount options. This requires the new mount api. With the old mount api it would be necessary to plumb this through every legacy filesystem's file_system_type->mount() method. If they want this feature they are most welcome to switch to the new mount api. Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on each high-level superblock creation helper: (1) get_tree_nodev() Always allocate new superblock. Hence, FSCONFIG_CMD_CREATE and FSCONFIG_CMD_CREATE_EXCL are equivalent. The binderfs or overlayfs filesystems are examples. (4) get_tree_keyed() Finds an existing superblock based on sb->s_fs_info. Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY. The mqueue or nfsd filesystems are examples. (2) get_tree_bdev() This effectively works like get_tree_keyed(). The ext4 or xfs filesystems are examples. (3) get_tree_single() Only one superblock of this filesystem type can ever exist. Hence, FSCONFIG_CMD_CREATE would reuse an existing superblock whereas FSCONFIG_CMD_CREATE_EXCL would reject it with EBUSY. The securityfs or configfs filesystems are examples. Note that some single-instance filesystems never destroy the superblock once it has been created during the first mount. For example, if securityfs has been mounted at least onces then the created superblock will never be destroyed again as long as there is still an LSM making use it. Consequently, even if securityfs is unmounted and the superblock seemingly destroyed it really isn't which means that FSCONFIG_CMD_CREATE_EXCL will continue rejecting reusing an existing superblock. This is acceptable thugh since special purpose filesystems such as this shouldn't have a need to use FSCONFIG_CMD_CREATE_EXCL anyway and if they do it's probably to make sure that mount options aren't ignored. Following is an analysis of the effect of FSCONFIG_CMD_CREATE_EXCL on filesystems that make use of the low-level sget_fc() helper directly. They're all effectively variants on get_tree_keyed(), get_tree_bdev(), or get_tree_nodev(): (5) mtd_get_sb() Similar logic to get_tree_keyed(). (6) afs_get_tree() Similar logic to get_tree_keyed(). (7) ceph_get_tree() Similar logic to get_tree_keyed(). Already explicitly allows forcing the allocation of a new superblock via CEPH_OPT_NOSHARE. This turns it into get_tree_nodev(). (8) fuse_get_tree_submount() Similar logic to get_tree_nodev(). (9) fuse_get_tree() Forces reuse of existing FUSE superblock. Forces reuse of existing superblock if passed in file refers to an existing FUSE connection. If FSCONFIG_CMD_CREATE_EXCL is specified together with an fd referring to an existing FUSE connections this would cause the superblock reusal to fail. If reusing is the intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified. (10) fuse_get_tree() -> get_tree_nodev() Same logic as in get_tree_nodev(). (11) fuse_get_tree() -> get_tree_bdev() Same logic as in get_tree_bdev(). (12) virtio_fs_get_tree() Same logic as get_tree_keyed(). (13) gfs2_meta_get_tree() Forces reuse of existing gfs2 superblock. Mounting gfs2meta enforces that a gf2s superblock must already exist. If not, it will error out. Consequently, mounting gfs2meta with FSCONFIG_CMD_CREATE_EXCL would always fail. If reusing is the intent then FSCONFIG_CMD_CREATE_EXCL shouldn't be specified. (14) kernfs_get_tree() Similar logic to get_tree_keyed(). (15) nfs_get_tree_common() Similar logic to get_tree_keyed(). Already explicitly allows forcing the allocation of a new superblock via NFS_MOUNT_UNSHARED. This effectively turns it into get_tree_nodev(). Link: [1] https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner Link: [2] https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner Link: [3] https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-4-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-14super: remove get_tree_single_reconf()Christian Brauner1-23/+5
The get_tree_single_reconf() helper isn't used anywhere. Remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Message-Id: <20230802-vfs-super-exclusive-v2-1-95dc4e41b870@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: export fs_holder_opsChristoph Hellwig1-1/+2
Export fs_holder_ops so that file systems that open additional block devices can use it as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230802154131.2221419-9-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: stop using get_super in fs_mark_deadChristoph Hellwig1-4/+26
fs_mark_dead currently uses get_super to find the superblock for the block device that is going away. This means it is limited to the main device stored in sb->s_dev, leading to a lot of code duplication for file systems that can use multiple block devices. Now that the holder for all block devices used by file systems is set to the super_block, we can instead look at that holder and then check if the file system is born and active, so do that instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230802154131.2221419-8-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-11fs: use the super_block as holder when mounting file systemsChristoph Hellwig1-4/+4
The file system type is not a very useful holder as it doesn't allow us to go back to the actual file system instance. Pass the super_block instead which is useful when passed back to the file system driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230802154131.2221419-7-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10fs: export setup_bdev_superChristoph Hellwig1-1/+2
We'll want to use setup_bdev_super instead of duplicating it in nilfs2. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230802154131.2221419-2-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-10fs: open block device after superblock creationJan Kara1-93/+95
Currently get_tree_bdev and mount_bdev open the block device before committing to allocating a super block. That creates problems for restricting the number of writers to a device, and also leads to a unusual and not very helpful holder (the fs_type). Reorganize the super block code to first look whether the superblock for a particular device does already exist and open the block device only if it doesn't. [hch: port to before the bdev_handle changes, duplicate the bdev read-only check from blkdev_get_by_path, extend the fsfree_mutex coverage to protect against freezes, fix an open bdev leak when the bdev is frozen, use the bdev local variable more, rename the s variable to sb to be more descriptive] [brauner: remove references to mounts as they're mostly irrelevant] [brauner & hch: fold fixes for romfs and cramfs for syzbot+2faac0423fdc9692822b@syzkaller.appspotmail.com] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230724175145.201318-1-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-09fs, block: remove bdev->bd_superChristoph Hellwig1-3/+0
bdev->bd_super is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230807112625.652089-5-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-02fs: remove emergency_thaw_bdevChristoph Hellwig1-1/+3
Fold emergency_thaw_bdev into it's only caller, to prepare for buffer.c to be built only when buffer_head support is enabled. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230801172201.1923299-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17fs: wait for partially frozen filesystemsDarrick J. Wong1-2/+32
Jan Kara suggested that when one thread is in the middle of freezing a filesystem, another thread trying to freeze the same fs but with a different freeze_holder should wait until the freezer reaches either end state (UNFROZEN or COMPLETE) instead of returning EBUSY immediately. Neither caller can do anything sensible with this race other than retry but they cannot really distinguish EBUSY as in "some other holder of the same type has the sb already frozen" from "freezing raced with holder of a different type". Plumb in the extra code needed to wait for the fs freezer to reach an end state and try the freeze again. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz>
2023-07-17fs: distinguish between user initiated freeze and kernel initiated freezeDarrick J. Wong1-9/+70
Userspace can freeze a filesystem using the FIFREEZE ioctl or by suspending the block device; this state persists until userspace thaws the filesystem with the FITHAW ioctl or resuming the block device. Since commit 18e9e5104fcd ("Introduce freeze_super and thaw_super for the fsfreeze ioctl") we only allow the first freeze command to succeed. The kernel may decide that it is necessary to freeze a filesystem for its own internal purposes, such as suspends in progress, filesystem fsck activities, or quiescing a device prior to removal. Userspace thaw commands must never break a kernel freeze, and kernel thaw commands shouldn't undo userspace's freeze command. Introduce a couple of freeze holder flags and wire it into the sb_writers state. One kernel and one userspace freeze are allowed to coexist at the same time; the filesystem will not thaw until both are lifted. I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing behaviors. Cc: mcgrof@kernel.org Cc: jack@suse.cz Cc: hch@infradead.org Cc: ruansy.fnst@fujitsu.com Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
2023-06-29Merge tag 'fs_for_v6.5-rc1' of ↵Linus Torvalds1-4/+0
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull misc filesystem updates from Jan Kara: - Rewrite kmap_local() handling in ext2 - Convert ext2 direct IO path to iomap (with some infrastructure tweaks associated with that) - Convert two boilerplate licenses in udf to SPDX identifiers - Other small udf, ext2, and quota fixes and cleanups * tag 'fs_for_v6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix uninitialized array access for some pathnames ext2: Drop fragment support quota: fix warning in dqgrab() quota: Properly disable quotas when add_dquot_ref() fails fs: udf: udftime: Replace LGPL boilerplate with SPDX identifier fs: udf: Replace GPL 2.0 boilerplate license notice with SPDX identifier fs: Drop wait_unfrozen wait queue ext2_find_entry()/ext2_dotdot(): callers don't need page_addr anymore ext2_{set_link,delete_entry}(): don't bother with page_addr ext2_put_page(): accept any pointer within the page ext2_get_page(): saner type ext2: use offset_in_page() instead of open-coding it as subtraction ext2_rename(): set_link and delete_entry may fail ext2: Add direct-io trace points ext2: Move direct-io to use iomap ext2: Use generic_buffers_fsync() implementation ext4: Use generic_buffers_fsync_noflush() implementation fs/buffer.c: Add generic_buffers_fsync*() implementation ext2/dax: Fix ext2_setsize when len is page aligned
2023-06-26Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linuxLinus Torvalds1-21/+27
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...
2023-06-26Merge tag 'v6.5/vfs.misc' of ↵Linus Torvalds1-9/+13
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Miscellaneous features, cleanups, and fixes for vfs and individual fs Features: - Use mode 0600 for file created by cachefilesd so it can be run by unprivileged users. This aligns them with directories which are already created with mode 0700 by cachefilesd - Reorder a few members in struct file to prevent some false sharing scenarios - Indicate that an eventfd is used a semaphore in the eventfd's fdinfo procfs file - Add a missing uapi header for eventfd exposing relevant uapi defines - Let the VFS protect transitions of a superblock from read-only to read-write in addition to the protection it already provides for transitions from read-write to read-only. Protecting read-only to read-write transitions allows filesystems such as ext4 to perform internal writes, keeping writers away until the transition is completed Cleanups: - Arnd removed the architecture specific arch_report_meminfo() prototypes and added a generic one into procfs.h. Note, we got a report about a warning in amdpgpu codepaths that suggested this was bisectable to this change but we concluded it was a false positive - Remove unused parameters from split_fs_names() - Rename put_and_unmap_page() to unmap_and_put_page() to let the name reflect the order of the cleanup operation that has to unmap before the actual put - Unexport buffer_check_dirty_writeback() as it is not used outside of block device aops - Stop allocating aio rings from highmem - Protecting read-{only,write} transitions in the VFS used open-coded barriers in various places. Replace them with proper little helpers and document both the helpers and all barrier interactions involved when transitioning between read-{only,write} states - Use flexible array members in old readdir codepaths Fixes: - Use the correct type __poll_t for epoll and eventfd - Replace all deprecated strlcpy() invocations, whose return value isn't checked with an equivalent strscpy() call - Fix some kernel-doc warnings in fs/open.c - Reduce the stack usage in jffs2's xattr codepaths finally getting rid of this: fs/jffs2/xattr.c:887:1: error: the frame size of 1088 bytes is larger than 1024 bytes [-Werror=frame-larger-than=] royally annoying compilation warning - Use __FMODE_NONOTIFY instead of FMODE_NONOTIFY where an int and not fmode_t is required to avoid fmode_t to integer degradation warnings - Create coredumps with O_WRONLY instead of O_RDWR. There's a long explanation in that commit how O_RDWR is actually a bug which we found out with the help of Linus and git archeology - Fix "no previous prototype" warnings in the pipe codepaths - Add overflow calculations for remap_verify_area() as a signed addition overflow could be triggered in xfstests - Fix a null pointer dereference in sysv - Use an unsigned variable for length calculations in jfs avoiding compilation warnings with gcc 13 - Fix a dangling pipe pointer in the watch queue codepath - The legacy mount option parser provided as a fallback by the VFS for filesystems not yet converted to the new mount api did prefix the generated mount option string with a leading ',' causing issues for some filesystems - Fix a repeated word in a comment in fs.h - autofs: Update the ctime when mtime is updated as mandated by POSIX" * tag 'v6.5/vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits) readdir: Replace one-element arrays with flexible-array members fs: Provide helpers for manipulating sb->s_readonly_remount fs: Protect reconfiguration of sb read-write from racing writes eventfd: add a uapi header for eventfd userspace APIs autofs: set ctime as well when mtime changes on a dir eventfd: show the EFD_SEMAPHORE flag in fdinfo fs/aio: Stop allocating aio rings from HIGHMEM fs: Fix comment typo fs: unexport buffer_check_dirty_writeback fs: avoid empty option when generating legacy mount string watch_queue: prevent dangling pipe pointer fs.h: Optimize file struct to prevent false sharing highmem: Rename put_and_unmap_page() to unmap_and_put_page() cachefiles: Allow the cache to be non-root init: remove unused names parameter in split_fs_names() jfs: Use unsigned variable for length calculations fs/sysv: Null check to prevent null-ptr-deref bug fs: use UB-safe check for signed addition overflow in remap_verify_area procfs: consolidate arch_report_meminfo declaration fs: pipe: reveal missing function protoypes ...
2023-06-20fs: Provide helpers for manipulating sb->s_readonly_remountJan Kara1-11/+6
Provide helpers to set and clear sb->s_readonly_remount including appropriate memory barriers. Also use this opportunity to document what the barriers pair with and why they are needed. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Message-Id: <20230620112832.5158-1-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19Revert "mm: shrinkers: convert shrinker_rwsem to mutex"Qi Zheng1-1/+1
Patch series "revert shrinker_srcu related changes". This patch (of 7): This reverts commit cf2e309ebca7bb0916771839f9b580b06c778530. Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 ("mm: vmscan: make global slab shrink lockless"). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That's why unregister_shrinker() has become slower. After discussion, we will try to use the refcount+RCU method [2] proposed by Dave Chinner to continue to re-implement the lockless slab shrink. So revert the shrinker_mutex back to shrinker_rwsem first. [1]. https://lore.kernel.org/lkml/202305230837.db2c233f-yujie.liu@intel.com/ [2]. https://lore.kernel.org/lkml/ZIJhou1d55d4H1s0@dread.disaster.area/ Link: https://lkml.kernel.org/r/20230609081518.3039120-1-qi.zheng@linux.dev Link: https://lkml.kernel.org/r/20230609081518.3039120-2-qi.zheng@linux.dev Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/oe-lkp/202305230837.db2c233f-yujie.liu@intel.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yujie Liu <yujie.liu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-15fs: Protect reconfiguration of sb read-write from racing writesJan Kara1-1/+10
The reconfigure / remount code takes a lot of effort to protect filesystem's reconfiguration code from racing writes on remounting read-only. However during remounting read-only filesystem to read-write mode userspace writes can start immediately once we clear SB_RDONLY flag. This is inconvenient for example for ext4 because we need to do some writes to the filesystem (such as preparation of quota files) before we can take userspace writes so we are clearing SB_RDONLY flag before we are fully ready to accept userpace writes and syzbot has found a way to exploit this [1]. Also as far as I'm reading the code the filesystem remount code was protected from racing writes in the legacy mount path by the mount's MNT_READONLY flag so this is relatively new problem. It is actually fairly easy to protect remount read-write from racing writes using sb->s_readonly_remount flag so let's just do that instead of having to workaround these races in the filesystem code. [1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/ Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230615113848.8439-1-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-12fs: remove sb->s_modeChristoph Hellwig1-2/+0
There is no real need to store the open mode in the super_block now. It is only used by f2fs, which can easily recalculate it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: add a sb_open_mode helperChristoph Hellwig1-11/+4
Add a helper to return the open flags for blkdev_get_by* for passed in super block flags instead of open coding the logic in many places. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230608110258.189493-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12block: use the holder as indication for exclusive opensChristoph Hellwig1-11/+9
The current interface for exclusive opens is rather confusing as it requires both the FMODE_EXCL flag and a holder. Remove the need to pass FMODE_EXCL and just key off the exclusive open off a non-NULL holder. For blkdev_put this requires adding the holder argument, which provides better debug checking that only the holder actually releases the hold, but at the same time allows removing the now superfluous mode argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Christian Brauner <brauner@kernel.org> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Jack Wang <jinpu.wang@ionos.com> [rnbd] Link: https://lore.kernel.org/r/20230608110258.189493-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05fs: add a method to shut down the file systemChristoph Hellwig1-2/+19
Add a new ->shutdown super operation that can be used to tell the file system to shut down, and call it from newly created holder ops when the block device under a file system shuts down. This only covers the main block device for "simple" file systems using get_tree_bdev / mount_bdev. File systems their own get_tree method or opening additional devices will need to set up their own blk_holder_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-05block: introduce holder opsChristoph Hellwig1-2/+2
Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and installed in the block_device for exclusive claims. It will be used to allow the block layer to call back into the user of the block device for thing like notification of a removed device or a device resize. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-30fs: Drop wait_unfrozen wait queueJan Kara1-4/+0
wait_unfrozen waitqueue is used only in quota code to wait for filesystem to become unfrozen. In that place we can just use sb_start_write() - sb_end_write() pair to achieve the same. So just remove the waitqueue. Reviewed-by: Christian Brauner <brauner@kernel.org> Message-Id: <20230525141710.7595-1-jack@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>
2023-05-15vfs: Replace all non-returning strlcpy with strscpyAzeem Shaikh1-2/+2
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Message-Id: <20230510221119.3508930-1-azeemshaikh38@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-03-28mm: shrinkers: convert shrinker_rwsem to mutexQi Zheng1-1/+1
Now there are no readers of shrinker_rwsem, so we can simply replace it with mutex lock. Link: https://lkml.kernel.org/r/20230313112819.38938-9-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill Tkhai <tkhai@ya.ru> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christian König <christian.koenig@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sultan Alsawaf <sultan@kerneltoast.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-14fscrypt: destroy keyring after security_sb_delete()Eric Biggers1-3/+12
fscrypt_destroy_keyring() must be called after all potentially-encrypted inodes were evicted; otherwise it cannot safely destroy the keyring. Since inodes that are in-use by the Landlock LSM don't get evicted until security_sb_delete(), this means that fscrypt_destroy_keyring() must be called *after* security_sb_delete(). This fixes a WARN_ON followed by a NULL dereference, only possible if Landlock was being used on encrypted files. Fixes: d7e7b9af104c ("fscrypt: stop using keyrings subsystem for fscrypt_master_key") Cc: stable@vger.kernel.org Reported-by: syzbot+93e495f6a4f748827c88@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/00000000000044651705f6ca1e30@google.com Reviewed-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20230313221231.272498-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2023-02-20Merge tag 'for-6.3/dio-2023-02-16' of git://git.kernel.dk/linuxLinus Torvalds1-0/+24
Pull legacy dio update from Jens Axboe: "We only have a few file systems that use the old dio code, make them select it rather than build it unconditionally" * tag 'for-6.3/dio-2023-02-16' of git://git.kernel.dk/linux: fs: build the legacy direct I/O code conditionally fs: move sb_init_dio_done_wq out of direct-io.c
2023-02-20Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linuxLinus Torvalds1-1/+0
Pull fscrypt updates from Eric Biggers: "Simplify the implementation of the test_dummy_encryption mount option by adding the 'test dummy key' on-demand" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux: fscrypt: clean up fscrypt_add_test_dummy_key() fs/super.c: stop calling fscrypt_destroy_keyring() from __put_super() f2fs: stop calling fscrypt_add_test_dummy_key() ext4: stop calling fscrypt_add_test_dummy_key() fscrypt: add the test dummy encryption key on-demand
2023-02-07fs/super.c: stop calling fscrypt_destroy_keyring() from __put_super()Eric Biggers1-1/+0
Now that the key associated with the "test_dummy_operation" mount option is added on-demand when it's needed, rather than immediately when the filesystem is mounted, fscrypt_destroy_keyring() no longer needs to be called from __put_super() to avoid a memory leak on mount failure. Remove this call, which was causing confusion because it appeared to be a sleep-in-atomic bug (though it wasn't, for a somewhat-subtle reason). Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230208062107.199831-5-ebiggers@kernel.org
2023-01-27fs: Use CHECK_DATA_CORRUPTION() when kernel bugs are detectedJann Horn1-4/+17
Currently, filp_close() and generic_shutdown_super() use printk() to log messages when bugs are detected. This is problematic because infrastructure like syzkaller has no idea that this message indicates a bug. In addition, some people explicitly want their kernels to BUG() when kernel data corruption has been detected (CONFIG_BUG_ON_DATA_CORRUPTION). And finally, when generic_shutdown_super() detects remaining inodes on a system without CONFIG_BUG_ON_DATA_CORRUPTION, it would be nice if later accesses to a busy inode would at least crash somewhat cleanly rather than walking through freed memory. To address all three, use CHECK_DATA_CORRUPTION() when kernel bugs are detected. Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
2023-01-26fs: move sb_init_dio_done_wq out of direct-io.cChristoph Hellwig1-0/+24
sb_init_dio_done_wq is also used by the iomap code, so move it to super.c in preparation for building direct-io.c conditionally. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230125065839.191256-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-12Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-51/+9
Pull misc vfs updates from Al Viro: "misc pile" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: sysv: Fix sysv_nblocks() returns wrong value get rid of INT_LIMIT, use type_max() instead btrfs: replace INT_LIMIT(loff_t) with OFFSET_MAX fs: simplify vfs_get_super fs: drop useless condition from inode_needs_update_time
2022-11-25fs: simplify vfs_get_superChristoph Hellwig1-51/+9
Remove the pointless keying argument and associated enum and pass the fill_super callback and a "bool reconf" instead. Also mark the function static given that there are no users outside of super.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-10-19fscrypt: fix keyring memory leak on mount failureEric Biggers1-1/+2
Commit d7e7b9af104c ("fscrypt: stop using keyrings subsystem for fscrypt_master_key") moved the keyring destruction from __put_super() to generic_shutdown_super() so that the filesystem's block device(s) are still available. Unfortunately, this causes a memory leak in the case where a mount is attempted with the test_dummy_encryption mount option, but the mount fails after the option has already been processed. To fix this, attempt the keyring destruction in both places. Reported-by: syzbot+104c2a89561289cec13e@syzkaller.appspotmail.com Fixes: d7e7b9af104c ("fscrypt: stop using keyrings subsystem for fscrypt_master_key") Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20221011213838.209879-1-ebiggers@kernel.org
2022-09-21fscrypt: stop using keyrings subsystem for fscrypt_master_keyEric Biggers1-1/+1
The approach of fs/crypto/ internally managing the fscrypt_master_key structs as the payloads of "struct key" objects contained in a "struct key" keyring has outlived its usefulness. The original idea was to simplify the code by reusing code from the keyrings subsystem. However, several issues have arisen that can't easily be resolved: - When a master key struct is destroyed, blk_crypto_evict_key() must be called on any per-mode keys embedded in it. (This started being the case when inline encryption support was added.) Yet, the keyrings subsystem can arbitrarily delay the destruction of keys, even past the time the filesystem was unmounted. Therefore, currently there is no easy way to call blk_crypto_evict_key() when a master key is destroyed. Currently, this is worked around by holding an extra reference to the filesystem's request_queue(s). But it was overlooked that the request_queue reference is *not* guaranteed to pin the corresponding blk_crypto_profile too; for device-mapper devices that support inline crypto, it doesn't. This can cause a use-after-free. - When the last inode that was using an incompletely-removed master key is evicted, the master key removal is completed by removing the key struct from the keyring. Currently this is done via key_invalidate(). Yet, key_invalidate() takes the key semaphore. This can deadlock when called from the shrinker, since in fscrypt_ioctl_add_key(), memory is allocated with GFP_KERNEL under the same semaphore. - More generally, the fact that the keyrings subsystem can arbitrarily delay the destruction of keys (via garbage collection delay, or via random processes getting temporary key references) is undesirable, as it means we can't strictly guarantee that all secrets are ever wiped. - Doing the master key lookups via the keyrings subsystem results in the key_permission LSM hook being called. fscrypt doesn't want this, as all access control for encrypted files is designed to happen via the files themselves, like any other files. The workaround which SELinux users are using is to change their SELinux policy to grant key search access to all domains. This works, but it is an odd extra step that shouldn't really have to be done. The fix for all these issues is to change the implementation to what I should have done originally: don't use the keyrings subsystem to keep track of the filesystem's fscrypt_master_key structs. Instead, just store them in a regular kernel data structure, and rework the reference counting, locking, and lifetime accordingly. Retain support for RCU-mode key lookups by using a hash table. Replace fscrypt_sb_free() with fscrypt_sb_delete(), which releases the keys synchronously and runs a bit earlier during unmount, so that block devices are still available. A side effect of this patch is that neither the master keys themselves nor the filesystem keyrings will be listed in /proc/keys anymore. ("Master key users" and the master key users keyrings will still be listed.) However, this was mostly an implementation detail, and it was intended just for debugging purposes. I don't know of anyone using it. This patch does *not* change how "master key users" (->mk_users) works; that still uses the keyrings subsystem. That is still needed for key quotas, and changing that isn't necessary to solve the issues listed above. If we decide to change that too, it would be a separate patch. I've marked this as fixing the original commit that added the fscrypt keyring, but as noted above the most important issue that this patch fixes wasn't introduced until the addition of inline encryption support. Fixes: 22d94f493bfb ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220901193208.138056-2-ebiggers@kernel.org
2022-08-08Merge tag 'fuse-update-6.0' of ↵Linus Torvalds1-2/+31
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix an issue with reusing the bdi in case of block based filesystems - Allow root (in init namespace) to access fuse filesystems in user namespaces if expicitly enabled with a module param - Misc fixes * tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: retire block-device-based superblock on force unmount vfs: function to prevent re-use of block-device-based superblocks virtio_fs: Modify format for virtio_fs_direct_access virtiofs: delete unused parameter for virtio_fs_cleanup_vqs fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other fuse: Remove the control interface for virtio-fs fuse: ioctl: translate ENOSYS fuse: limit nsec fuse: avoid unnecessary spinlock bump fuse: fix deadlock between atomic O_TRUNC and page invalidation fuse: write inode in fuse_release()
2022-07-27vfs: function to prevent re-use of block-device-based superblocksDaniil Lunev1-2/+31
The function is to be called from filesystem-specific code to mark a superblock to be ignored by superblock test and thus never re-used. The function also unregisters bdi if the bdi is per-superblock to avoid collision if a new superblock is created to represent the filesystem. generic_shutdown_super() skips unregistering bdi for a retired superlock as it assumes retire function has already done it. This patch adds the functionality only for the block-device-based supers, since the primary use case of the feature is to gracefully handle force unmount of external devices, mounted with FUSE. This can be further extended to cover all superblocks, if the need arises. Signed-off-by: Daniil Lunev <dlunev@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2022-07-03mm: shrinkers: provide shrinkers with namesRoman Gushchin1-1/+5
Currently shrinkers are anonymous objects. For debugging purposes they can be identified by count/scan function names, but it's not always useful: e.g. for superblock's shrinkers it's nice to have at least an idea of to which superblock the shrinker belongs. This commit adds names to shrinkers. register_shrinker() and prealloc_shrinker() functions are extended to take a format and arguments to master a name. In some cases it's not possible to determine a good name at the time when a shrinker is allocated. For such cases shrinker_debugfs_rename() is provided. The expected format is: <subsystem>-<shrinker_type>[:<instance>]-<id> For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair. After this change the shrinker debugfs directory looks like: $ cd /sys/kernel/debug/shrinker/ $ ls dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 sb-dax-11 sb-proc-45 sb-tmpfs-35 sb-debugfs-7 sb-proc-46 sb-tmpfs-40 [roman.gushchin@linux.dev: fix build warnings] Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Dave Chinner <dchinner@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-17block: add a bdev_stable_writes helperChristoph Hellwig1-1/+1
Add a helper to check the stable writes flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-30vfs: make freeze_super abort when sync_filesystem returns errorDarrick J. Wong1-7/+12
If we fail to synchronize the filesystem while preparing to freeze the fs, abort the freeze. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org>
2022-01-22mm: remove cleancacheChristoph Hellwig1-3/+0
Patch series "remove Xen tmem leftovers". Since the removal of the Xen tmem driver in 2019, the cleancache hooks are entirely unused, as are large parts of frontswap. This series against linux-next (with the folio changes included) removes cleancaches, and cuts down frontswap to the bits actually used by zswap. This patch (of 13): The cleancache subsystem is unused since the removal of Xen tmem driver in commit 814bbf49dcd0 ("xen: remove tmem driver"). [akpm@linux-foundation.org: remove now-unreachable code] Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-17devtmpfs regression fix: reconfigure on each mountNeilBrown1-2/+2
Prior to Linux v5.4 devtmpfs used mount_single() which treats the given mount options as "remount" options, so it updates the configuration of the single super_block on each mount. Since that was changed, the mount options used for devtmpfs are ignored. This is a regression which affect systemd - which mounts devtmpfs with "-o mode=755,size=4m,nr_inodes=1m". This patch restores the "remount" effect by calling reconfigure_single() Fixes: d401727ea0d7 ("devtmpfs: don't mix {ramfs,shmem}_fill_super() with mount_single()") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06fs: explicitly unregister per-superblock BDIsChristoph Hellwig1-0/+3
Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi associated with them, and unregister it when the superblock is shut down. Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Richard Weinberger <richard@nod.at> Cc: Vignesh Raghavendra <vigneshr@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-08-09block: remove the bd_bdi in struct block_deviceChristoph Hellwig1-1/+1
Just retrieve the bdi from the disk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210809141744.1203023-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-01block: move bd_mutex to struct gendiskChristoph Hellwig1-4/+4
Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-04-22fs,security: Add sb_delete hookMickaël Salaün1-0/+1
The sb_delete security hook is called when shutting down a superblock, which may be useful to release kernel objects tied to the superblock's lifetime (e.g. inodes). This new hook is needed by Landlock to release (ephemerally) tagged struct inodes. This comes from the unprivileged nature of Landlock described in the next commit. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: James Morris <jmorris@namei.org> Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Reviewed-by: Jann Horn <jannh@google.com> Acked-by: Serge Hallyn <serge@hallyn.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210422154123.13086-7-mic@digikod.net Signed-off-by: James Morris <jamorris@linux.microsoft.com>
2021-02-22Merge tag 'docs-5.12' of git://git.lwn.net/linuxLinus Torvalds1-6/+6
Pull documentation updates from Jonathan Corbet: "It has been a relatively quiet cycle in docsland. - As promised, the minimum Sphinx version to build the docs is now 1.7, and we have dropped support for Python 2 entirely. That allowed the removal of a bunch of compatibility code. - A set of treewide warning fixups from Mauro that I applied after it became clear nobody else was going to deal with them. - The automarkup mechanism can now create cross-references from relative paths to RST files. - More translations, typo fixes, and warning fixes" * tag 'docs-5.12' of git://git.lwn.net/linux: (75 commits) docs: kernel-hacking: be more civil docs: Remove the Microsoft rhetoric Documentation/admin-guide: kernel-parameters: Update nohlt section doc/admin-guide: fix spelling mistake: "perfomance" -> "performance" docs: Document cross-referencing using relative path docs: Enable usage of relative paths to docs on automarkup docs: thermal: fix spelling mistakes Documentation: admin-guide: Update kvm/xen config option docs: Make syscalls' helpers naming consistent coding-style.rst: Avoid comma statements Documentation: /proc/loadavg: add 3 more field descriptions Documentation/submitting-patches: Add blurb about backtraces in commit messages Docs: drop Python 2 support Move our minimum Sphinx version to 1.7 Documentation: input: define ABS_PRESSURE/ABS_MT_PRESSURE resolution as grams scripts/kernel-doc: add internal hyperlink to DOC: sections Update Documentation/admin-guide/sysctl/fs.rst docs: Update DTB format references docs: zh_CN: add iio index.rst translation docs/zh_CN: add iio ep93xx_adc.rst translation ...
2021-01-24block: remove the NULL bdev check in bdev_read_onlyChristoph Hellwig1-1/+2
Only a single caller can end up in bdev_read_only, so move the check there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-01-21fs: fix kernel-doc markupsMauro Carvalho Chehab1-6/+6
Two markups are at the wrong place. Kernel-doc only support having the comment just before the identifier. Also, some identifiers have different names between their prototypes and the kernel-doc markup. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Link: https://lore.kernel.org/r/96b1e1b388600ab092331f6c4e88ff8e8779ce6c.1610610937.git.mchehab+huawei@kernel.org Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-12-01block: remove i_bdevChristoph Hellwig1-25/+19
Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01fs: remove get_super_thawed and get_super_exclusive_thawedChristoph Hellwig1-49/+2
Just open code the wait in the only caller of both functions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-10vfs: move __sb_{start,end}_write* to fs.hDarrick J. Wong1-30/+0
Now that we've straightened out the callers, move these three functions to fs.h since they're fairly trivial. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz>
2020-11-10vfs: separate __sb_start_write into blocking and non-blocking helpersDarrick J. Wong1-6/+12
Break this function into two helpers so that it's obvious that the trylock versions return a value that must be checked, and the blocking versions don't require that. While we're at it, clean up the return type mismatch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-11-10vfs: remove lockdep bogosity in __sb_start_writeDarrick J. Wong1-29/+4
__sb_start_write has some weird looking lockdep code that claims to exist to handle nested freeze locking requests from xfs. The code as written seems broken -- if we think we hold a read lock on any of the higher freeze levels (e.g. we hold SB_FREEZE_WRITE and are trying to lock SB_FREEZE_PAGEFAULT), it converts a blocking lock attempt into a trylock. However, it's not correct to downgrade a blocking lock attempt to a trylock unless the downgrading code or the callers are prepared to deal with that situation. Neither __sb_start_write nor its callers handle this at all. For example: sb_start_pagefault ignores the return value completely, with the result that if xfs_filemap_fault loses a race with a different thread trying to fsfreeze, it will proceed without pagefault freeze protection (thereby breaking locking rules) and then unlocks the pagefault freeze lock that it doesn't own on its way out (thereby corrupting the lock state), which leads to a system hang shortly afterwards. Normally, this won't happen because our ownership of a read lock on a higher freeze protection level blocks fsfreeze from grabbing a write lock on that higher level. *However*, if lockdep is offline, lock_is_held_type unconditionally returns 1, which means that percpu_rwsem_is_held returns 1, which means that __sb_start_write unconditionally converts blocking freeze lock attempts into trylocks, even when we *don't* hold anything that would block a fsfreeze. Apparently this all held together until 5.10-rc1, when bugs in lockdep caused lockdep to shut itself off early in an fstests run, and once fstests gets to the "race writes with freezer" tests, kaboom. This might explain the long trail of vanishingly infrequent livelocks in fstests after lockdep goes offline that I've never been able to diagnose. We could fix it by spinning on the trylock if wait==true, but AFAICT the locking works fine if lockdep is not built at all (and I didn't see any complaints running fstests overnight), so remove this snippet entirely. NOTE: Commit f4b554af9931 in 2015 created the current weird logic (which used to exist in a different form in commit 5accdf82ba25c from 2012) in __sb_start_write. XFS solved this whole problem in the late 2.6 era by creating a variant of transactions (XFS_TRANS_NO_WRITECOUNT) that don't grab intwrite freeze protection, thus making lockdep's solution unnecessary. The commit claims that Dave Chinner explained that the trylock hack + comment could be removed, but nobody ever did. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz>
2020-09-24bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flagChristoph Hellwig1-0/+2
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the backing_dev_info shared between the block drivers and the writeback code. To help untangling the dependency replace it with a queue flag and a superblock flag derived from it. This also helps with the case of e.g. a file system requiring stable writes due to its own checksumming, but not forcing it on other users of the block device like the swap code. One downside is that we an't support the stable_pages_required bdi attribute in sysfs anymore. It is replaced with a queue attribute which also is writable for easier testing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-10Merge branch 'work.misc' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "A couple of trivial patches that fell through the cracks last cycle" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: fix indentation in deactivate_super() vfs: Remove duplicated d_mountpoint check in __is_local_mountpoint
2020-06-02Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-blockLinus Torvalds1-3/+1
Pull block updates from Jens Axboe: "Core block changes that have been queued up for this release: - Remove dead blk-throttle and blk-wbt code (Guoqing) - Include pid in blktrace note traces (Jan) - Don't spew I/O errors on wouldblock termination (me) - Zone append addition (Johannes, Keith, Damien) - IO accounting improvements (Konstantin, Christoph) - blk-mq hardware map update improvements (Ming) - Scheduler dispatch improvement (Salman) - Inline block encryption support (Satya) - Request map fixes and improvements (Weiping) - blk-iocost tweaks (Tejun) - Fix for timeout failing with error injection (Keith) - Queue re-run fixes (Douglas) - CPU hotplug improvements (Christoph) - Queue entry/exit improvements (Christoph) - Move DMA drain handling to the few drivers that use it (Christoph) - Partition handling cleanups (Christoph)" * tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits) block: mark bio_wouldblock_error() bio with BIO_QUIET blk-wbt: rename __wbt_update_limits to wbt_update_limits blk-wbt: remove wbt_update_limits blk-throttle: remove tg_drain_bios blk-throttle: remove blk_throtl_drain null_blk: force complete for timeout request blk-mq: drain I/O when all CPUs in a hctx are offline blk-mq: add blk_mq_all_tag_iter blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx blk-mq: use BLK_MQ_NO_TAG in more places blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG blk-mq: move more request initialization to blk_mq_rq_ctx_init blk-mq: simplify the blk_mq_get_request calling convention blk-mq: remove the bio argument to ->prepare_request nvme: force complete cancelled requests blk-mq: blk-mq: provide forced completion method block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds block: blk-crypto-fallback: remove redundant initialization of variable err block: reduce part_stat_lock() scope block: use __this_cpu_add() instead of access by smp_processor_id() ...
2020-05-29fs: fix indentation in deactivate_super()Yufen Yu1-1/+1
Fix the breaked indent in deactive_super(). Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-09bdi: remove the name field in struct backing_dev_infoChristoph Hellwig1-2/+0
The name is only printed for a not registered bdi in writeback. Use the device name there as is more useful anyway for the unlike case that the warning triggers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09bdi: simplify bdi_allocChristoph Hellwig1-1/+1
Merge the _node vs normal version and drop the superflous gfp_t argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-28Fix use after free in get_tree_bdev()David Howells1-1/+1
Commit 6fcf0c72e4b9, a fix to get_tree_bdev() put a missing blkdev_put() in the wrong place, before a warnf() that displays the bdev under consideration rather after it. This results in a silent lockup in printk("%pg") called via warnf() from get_tree_bdev() under some circumstances when there's a race with the blockdev being frozen. This can be caused by xfstests/tests/generic/085 in combination with Lukas Czerner's ext4 mount API conversion patchset. It looks like it ought to occur with other users of get_tree_bdev() such as XFS, but apparently doesn't. Fix this by switching the order of the lines. Fixes: 6fcf0c72e4b9 ("vfs: add missing blkdev_put() in get_tree_bdev()") Reported-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ian Kent <raven@themaw.net> cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-18fs: call fsnotify_sb_delete after evict_inodesEric Sandeen1-1/+3
When a filesystem is unmounted, we currently call fsnotify_sb_delete() before evict_inodes(), which means that fsnotify_unmount_inodes() must iterate over all inodes on the superblock looking for any inodes with watches. This is inefficient and can lead to livelocks as it iterates over many unwatched inodes. At this point, SB_ACTIVE is gone and dropping refcount to zero kicks the inode out out immediately, so anything processed by fsnotify_sb_delete / fsnotify_unmount_inodes gets evicted in that loop. After that, the call to evict_inodes will evict everything else with a zero refcount. This should speed things up overall, and avoid livelocks in fsnotify_unmount_inodes(). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-10-10Merge branch 'work.mount3' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mount fixes from Al Viro: "A couple of regressions from the mount series" * 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: add missing blkdev_put() in get_tree_bdev() shmem: fix LSM options parsing
2019-10-09vfs: add missing blkdev_put() in get_tree_bdev()Ian Kent1-1/+4
Is there are a couple of missing blkdev_put() in get_tree_bdev()? Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-25Merge tag 'fuse-update-5.4' of ↵Linus Torvalds1-5/+0
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Continue separating the transport (user/kernel communication) and the filesystem layers of fuse. Getting rid of most layering violations will allow for easier cleanup and optimization later on. - Prepare for the addition of the virtio-fs filesystem. The actual filesystem will be introduced by a separate pull request. - Convert to new mount API. - Various fixes, optimizations and cleanups. * tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits) fuse: Make fuse_args_to_req static fuse: fix memleak in cuse_channel_open fuse: fix beyond-end-of-page access in fuse_parse_cache() fuse: unexport fuse_put_request fuse: kmemcg account fs data fuse: on 64-bit store time in d_fsdata directly fuse: fix missing unlock_page in fuse_writepage() fuse: reserve byteswapped init opcodes fuse: allow skipping control interface and forced unmount fuse: dissociate DESTROY from fuseblk fuse: delete dentry if timeout is zero fuse: separate fuse device allocation and installation in fuse_conn fuse: add fuse_iqueue_ops callbacks fuse: extract fuse_fill_super_common() fuse: export fuse_dequeue_forget() function fuse: export fuse_get_unique() fuse: export fuse_send_init_request() fuse: export fuse_len_args() fuse: export fuse_end_request() fuse: fix request limit ...
2019-09-19Merge branch 'work.mount2' of ↵Linus Torvalds1-7/+28
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc mount API conversions from Al Viro: "Conversions to new API for shmem and friends and for mount_mtd()-using filesystems. As for the rest of the mount API conversions in -next, some of them belong in the individual trees (e.g. binderfs one should definitely go through android folks, after getting redone on top of their changes). I'm going to drop those and send the rest (trivial ones + stuff ACKed by maintainers) in a separate series - by that point they are independent from each other. Some stuff has already migrated into individual trees (NFS conversion, for example, or FUSE stuff, etc.); those presumably will go through the regular merges from corresponding trees." * 'work.mount2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: Make fs_parse() handle fs_param_is_fd-type params better vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API shmem_parse_one(): switch to use of fs_parse() shmem_parse_options(): take handling a single option into a helper shmem_parse_options(): don't bother with mpol in separate variable shmem_parse_options(): use a separate structure to keep the results make shmem_fill_super() static make ramfs_fill_super() static devtmpfs: don't mix {ramfs,shmem}_fill_super() with mount_single() vfs: Convert squashfs to use the new mount API mtd: Kill mount_mtd() vfs: Convert jffs2 to use the new mount API vfs: Convert cramfs to use the new mount API vfs: Convert romfs to use the new mount API vfs: Add a single-or-reconfig keying to vfs_get_super()
2019-09-19Merge tag 'y2038-vfs' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Pull y2038 vfs updates from Arnd Bergmann: "Add inode timestamp clamping. This series from Deepa Dinamani adds a per-superblock minimum/maximum timestamp limit for a file system, and clamps timestamps as they are written, to avoid random behavior from integer overflow as well as having different time stamps on disk vs in memory. At mount time, a warning is now printed for any file system that can represent current timestamps but not future timestamps more than 30 years into the future, similar to the arbitrary 30 year limit that was added to settimeofday(). This was picked as a compromise to warn users to migrate to other file systems (e.g. ext4 instead of ext3) when they need the file system to survive beyond 2038 (or similar limits in other file systems), but not get in the way of normal usage" * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: ext4: Reduce ext4 timestamp warnings isofs: Initialize filesystem timestamp ranges pstore: fs superblock limits fs: omfs: Initialize filesystem timestamp ranges fs: hpfs: Initialize filesystem timestamp ranges fs: ceph: Initialize filesystem timestamp ranges fs: sysv: Initialize filesystem timestamp ranges fs: affs: Initialize filesystem timestamp ranges fs: fat: Initialize filesystem timestamp ranges fs: cifs: Initialize filesystem timestamp ranges fs: nfs: Initialize filesystem timestamp ranges ext4: Initialize timestamps limits 9p: Fill min and max timestamps in sb fs: Fill in max and min timestamps in superblock utimes: Clamp the timestamps before update mount: Add mount warning for impending timestamp expiry timestamp_truncate: Replace users of timespec64_trunc vfs: Add timestamp_truncate() api vfs: Add file timestamp range support
2019-09-18Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds1-0/+2
Pull fscrypt updates from Eric Biggers: "This is a large update to fs/crypto/ which includes: - Add ioctls that add/remove encryption keys to/from a filesystem-level keyring. These fix user-reported issues where e.g. an encrypted home directory can break NetworkManager, sshd, Docker, etc. because they don't get access to the needed keyring. These ioctls also provide a way to lock encrypted directories that doesn't use the vm.drop_caches sysctl, so is faster, more reliable, and doesn't always need root. - Add a new encryption policy version ("v2") which switches to a more standard, secure, and flexible key derivation function, and starts verifying that the correct key was supplied before using it. The key derivation improvement is needed for its own sake as well as for ongoing feature work for which the current way is too inflexible. Work is in progress to update both Android and the 'fscrypt' userspace tool to use both these features. (Working patches are available and just need to be reviewed+merged.) Chrome OS will likely use them too. This has also been tested on ext4, f2fs, and ubifs with xfstests -- both the existing encryption tests, and the new tests for this. This has also been in linux-next since Aug 16 with no reported issues. I'm also using an fscrypt v2-encrypted home directory on my personal desktop" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: (27 commits) ext4 crypto: fix to check feature status before get policy fscrypt: document the new ioctls and policy version ubifs: wire up new fscrypt ioctls f2fs: wire up new fscrypt ioctls ext4: wire up new fscrypt ioctls fscrypt: require that key be added when setting a v2 encryption policy fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl fscrypt: allow unprivileged users to add/remove keys for v2 policies fscrypt: v2 encryption policy support fscrypt: add an HKDF-SHA512 implementation fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl fscrypt: rename keyinfo.c to keysetup.c fscrypt: move v1 policy key setup to keysetup_v1.c fscrypt: refactor key setup code in preparation for v2 policies fscrypt: rename fscrypt_master_key to fscrypt_direct_key fscrypt: add ->ci_inode to fscrypt_info fscrypt: use FSCRYPT_* definitions, not FS_* fscrypt: use FSCRYPT_ prefix for uapi constants ...
2019-09-06vfs: subtype handling moved to fuseDavid Howells1-5/+0
The unused vfs code can be removed. Don't pass empty subtype (same as if ->parse callback isn't called). The bits that are left involve determining whether it's permitted to split the filesystem type string passed in to mount(2). Consequently, this means that we cannot get rid of the FS_HAS_SUBTYPE flag unless we define that a type string with a dot in it always indicates a subtype specification. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-05vfs: Add a single-or-reconfig keying to vfs_get_super()David Howells1-7/+28
Add an additional keying mode to vfs_get_super() to indicate that only a single superblock should exist in the system, and that, if it does, further mounts should invoke reconfiguration upon it. This allows mount_single() to be replaced. [Fix by Eric Biggers folded in] Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05vfs: Create fs_context-aware mount_bdev() replacementDavid Howells1-0/+94
Create a function, get_tree_bdev(), that is fs_context-aware and a ->get_tree() counterpart of mount_bdev(). It caches the block device pointer in the fs_context struct so that this information can be passed into sget_fc()'s test and set functions. Signed-off-by: David Howells <dhowells@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: linux-block@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-05new helper: get_tree_keyed()Al Viro1-0/+10
For vfs_get_keyed_super users. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-30vfs: Add file timestamp range supportDeepa Dinamani1-0/+2
Add fields to the superblock to track the min and max timestamps supported by filesystems. Initially, when a superblock is allocated, initialize it to the max and min values the fields can hold. Individual filesystems override these to match their actual limits. Pseudo filesystems are assumed to always support the min and max allowable values for the fields. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Jeff Layton <jlayton@kernel.org>
2019-08-12fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctlEric Biggers1-0/+2
Add a new fscrypt ioctl, FS_IOC_ADD_ENCRYPTION_KEY. This ioctl adds an encryption key to the filesystem's fscrypt keyring ->s_master_keys, making any files encrypted with that key appear "unlocked". Why we need this ~~~~~~~~~~~~~~~~ The main problem is that the "locked/unlocked" (ciphertext/plaintext) status of encrypted files is global, but the fscrypt keys are not. fscrypt only looks for keys in the keyring(s) the process accessing the filesystem is subscribed to: the thread keyring, process keyring, and session keyring, where the session keyring may contain the user keyring. Therefore, userspace has to put fscrypt keys in the keyrings for individual users or sessions. But this means that when a process with a different keyring tries to access encrypted files, whether they appear "unlocked" or not is nondeterministic. This is because it depends on whether the files are currently present in the inode cache. Fixing this by consistently providing each process its own view of the filesystem depending on whether it has the key or not isn't feasible due to how the VFS caches work. Furthermore, while sometimes users expect this behavior, it is misguided for two reasons. First, it would be an OS-level access control mechanism largely redundant with existing access control mechanisms such as UNIX file permissions, ACLs, LSMs, etc. Encryption is actually for protecting the data at rest. Second, almost all users of fscrypt actually do need the keys to be global. The largest users of fscrypt, Android and Chromium OS, achieve this by having PID 1 create a "session keyring" that is inherited by every process. This works, but it isn't scalable because it prevents session keyrings from being used for any other purpose. On general-purpose Linux distros, the 'fscrypt' userspace tool [1] can't similarly abuse the session keyring, so to make 'sudo' work on all systems it has to link all the user keyrings into root's user keyring [2]. This is ugly and raises security concerns. Moreover it can't make the keys available to system services, such as sshd trying to access the user's '~/.ssh' directory (see [3], [4]) or NetworkManager trying to read certificates from the user's home directory (see [5]); or to Docker containers (see [6], [7]). By having an API to add a key to the *filesystem* we'll be able to fix the above bugs, remove userspace workarounds, and clearly express the intended semantics: the locked/unlocked status of an encrypted directory is global, and encryption is orthogonal to OS-level access control. Why not use the add_key() syscall ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We use an ioctl for this API rather than the existing add_key() system call because the ioctl gives us the flexibility needed to implement fscrypt-specific semantics that will be introduced in later patches: - Supporting key removal with the semantics such that the secret is removed immediately and any unused inodes using the key are evicted; also, the eviction of any in-use inodes can be retried. - Calculating a key-dependent cryptographic identifier and returning it to userspace. - Allowing keys to be added and removed by non-root users, but only keys for v2 encryption policies; and to prevent denial-of-service attacks, users can only remove keys they themselves have added, and a key is only really removed after all users who added it have removed it. Trying to shoehorn these semantics into the keyrings syscalls would be very difficult, whereas the ioctls make things much easier. However, to reuse code the implementation still uses the keyrings service internally. Thus we get lockless RCU-mode key lookups without having to re-implement it, and the keys automatically show up in /proc/keys for debugging purposes. References: [1] https://github.com/google/fscrypt [2] https://goo.gl/55cCrI#heading=h.vf09isp98isb [3] https://github.com/google/fscrypt/issues/111#issuecomment-444347939 [4] https://github.com/google/fscrypt/issues/116 [5] https://bugs.launchpad.net/ubuntu/+source/fscrypt/+bug/1770715 [6] https://github.com/google/fscrypt/issues/128 [7] https://askubuntu.com/questions/1130306/cannot-run-docker-on-an-encrypted-filesystem Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Eric Biggers <ebiggers@google.com>
2019-07-31Unbreak mount_capable()Al Viro1-4/+1
In "consolidate the capability checks in sget_{fc,userns}())" the wrong argument had been passed to mount_capable() by sget_fc(). That mistake had been further obscured later, when switching mount_capable() to fs_context has moved the calculation of bogus argument from sget_fc() to mount_capable() itself. It should've been fc->user_ns all along. Screwed-up-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: Christian Brauner <christian@brauner.io> Tested-by: Christian Brauner <christian@brauner.io> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04convenience helper: get_tree_single()Al Viro1-0/+8
counterpart of mount_single(); switch fusectl to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-07-04convenience helper get_tree_nodev()Al Viro1-0/+8
counterpart of mount_nodev(). Switch hugetlb and pseudo to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25vfs: Kill sget_userns()David Howells1-38/+16
Kill sget_userns(), folding it into sget() as that's the only remaining user. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org
2019-05-25vfs: Provide sb->s_iflags settings in fs_context structDavid Howells1-0/+1
Provide a field in the fs_context struct through which bits in the sb->s_iflags superblock field can be set. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org
2019-05-25move mount_capable() further outAl Viro1-6/+0
Call graph of vfs_get_tree(): vfs_fsconfig_locked() # neither kernmount, nor submount do_new_mount() # neither kernmount, nor submount fc_mount() afs_mntpt_do_automount() # submount mount_one_hugetlbfs() # kernmount pid_ns_prepare_proc() # kernmount mq_create_mount() # kernmount vfs_kern_mount() simple_pin_fs() # kernmount vfs_submount() # submount kern_mount() # kernmount init_mount_tree() btrfs_mount() nfs_do_root_mount() The first two need the check (unconditionally). init_mount_tree() is setting rootfs up; any capability checks make zero sense for that one. And btrfs_mount()/ nfs_do_root_mount() have the checks already done in their callers. IOW, we can shift mount_capable() handling into the two callers - one in the normal case of mount(2), another - in fsconfig(2) handling of FSCONFIG_CMD_CREATE. I.e. the syscalls that set a new filesystem up. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25move mount_capable() calls to vfs_get_tree()Al Viro1-6/+6
sget_fc() is called only from ->get_tree() instances and the only instance not calling it is legacy_get_tree(), which calls mount_capable() directly. In all sget_fc() callers the checks could be moved to the very beginning of ->get_tree() - ->user_ns is not changed in between. So lifting the checks to the only caller of ->get_tree() is OK. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25switch mount_capable() to fs_contextAl Viro1-4/+7
now both callers of mount_capable() have access to fs_context; the only difference is that for sget_fc() we have the possibility of fc->global being true, while for legacy_get_tree() it's guaranteed to be impossible. Unify to more generic variant... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25move the capability checks from sget_userns() to legacy_get_tree()Al Viro1-4/+0
1) all call chains leading to sget_userns() pass through ->mount() instances. 2) none of ->mount() instances is ever called directly - the only call site is legacy_get_tree() 3) all remaining ->mount() instances end up calling sget_userns() IOW, we might as well do the capability checks just before calling ->mount(). As for the arguments passed to mount_capable(), in case of call chains to sget_userns() going through sget(), we either don't call mount_capable() at all, or pass current_user_ns() to it. The call chains going through mount_pseudo_xattr() don't call mount_capable() at all (SB_KERNMOUNT in flags on those). That could've been split into smaller steps (lifting the checks into sget(), then callers of sget(), then all the way to the entries of every ->mount() out there, then to the sole caller), but that would be too much churn for little benefit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25vfs: Kill mount_ns()David Howells1-38/+0
Kill mount_ns() as it has been replaced by vfs_get_super() in the new mount API. Signed-off-by: David Howells <dhowells@redhat.com> cc: linux-fsdevel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25consolidate the capability checks in sget_{fc,userns}()Al Viro1-18/+14
... into a common helper - mount_capable(type, userns) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-25start massaging the checks in sget_...(): move to sget_userns()Al Viro1-10/+4
there are 3 remaining callers of sget_userns() - sget(), mount_ns() and mount_pseudo_xattr(). Extra check in sget() is conditional upon mount being neither KERNMOUNT nor SUBMOUNT, the identical one in mount_ns() - upon being not KERNMOUNT; mount_pseudo_xattr() has no such checks at all. However, mount_ns() is never used with SUBMOUNT and mount_pseudo_xattr() is used only for KERNMOUNT, so both would be fine with the same logics as currently done in sget(), allowing to consolidate the entire thing in sget_userns() itself. That's not where these checks will end up in the long run, though - the whole reason why they'd been done so deep in the bowels of mount(2) was that there had been no way for a filesystem to specify which userns to look at until it has entered ->mount(). Now there is a place where filesystem could override the defaults - ->init_fs_context(). Which allows to pull the checks out into the callers of vfs_get_tree(). That'll take quite a bit of massage, but that mess is possible to tease apart. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-04-28[fix] get rid of checking for absent device name in vfs_get_tree()Al Viro1-5/+0
It has no business being there, it's checked by relevant ->get_tree() as it is *and* it returns the wrong error for no reason whatsoever. Fixes: f3a09c92018a "introduce fs_context methods" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28vfs: Add some logging to the core users of the fs_context logDavid Howells1-1/+3
Add some logging to the core users of the fs_context log so that information can be extracted from them as to the reason for failure. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-02-28convenience helpers: vfs_get_super() and sget_fc()Al Viro1-0/+171
the former is an analogue of mount_{single,nodev} for use in ->get_tree() instances, the latter - analogue of sget() for the same. These are fairly similar to the originals, but the callback signature for sget_fc() is different from sget() ones, so getting bits and pieces shared would be too convoluted; we might get around to that later, but for now let's just remember to keep them in sync. They do live next to each other, and changes in either won't be hard to spot. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30introduce fs_context methodsAl Viro1-8/+28
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30convert do_remount_sb() to fs_contextDavid Howells1-34/+73
Replace do_remount_sb() with a function, reconfigure_super(), that's fs_context aware. The fs_context is expected to be parameterised already and have ->root pointing to the superblock to be reconfigured. A legacy wrapper is provided that is intended to be called from the fs_context ops when those appear, but for now is called directly from reconfigure_super(). This wrapper invokes the ->remount_fs() superblock op for the moment. It is intended that the remount_fs() op will be phased out. The fs_context->purpose is set to FS_CONTEXT_FOR_RECONFIGURE to indicate that the context is being used for reconfiguration. do_umount_root() is provided to consolidate remount-to-R/O for umount and emergency remount by creating a context and invoking reconfiguration. do_remount(), do_umount() and do_emergency_remount_callback() are switched to use the new process. [AV -- fold UMOUNT and EMERGENCY_REMOUNT in; fixes the umount / bug, gets rid of pointless complexity] [AV -- set ->net_ns in all cases; nfs remount will need that] [AV -- shift security_sb_remount() call into reconfigure_super(); the callers that didn't do security_sb_remount() have NULL fc->security anyway, so it's a no-op for them] Signed-off-by: David Howells <dhowells@redhat.com> Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30vfs_get_tree(): evict the call of security_sb_kern_mount()Al Viro1-12/+3
Right now vfs_get_tree() calls security_sb_kern_mount() (i.e. mount MAC) unless it gets MS_KERNMOUNT or MS_SUBMOUNT in flags. Doing it that way is both clumsy and imprecise. Consider the callers' tree of vfs_get_tree(): vfs_get_tree() <- do_new_mount() <- vfs_kern_mount() <- simple_pin_fs() <- vfs_submount() <- kern_mount_data() <- init_mount_tree() <- btrfs_mount() <- vfs_get_tree() <- nfs_do_root_mount() <- nfs4_try_mount() <- nfs_fs_mount() <- vfs_get_tree() <- nfs4_referral_mount() do_new_mount() always does need MAC (we are guaranteed that neither MS_KERNMOUNT nor MS_SUBMOUNT will be passed there). simple_pin_fs(), vfs_submount() and kern_mount_data() pass explicit flags inhibiting that check. So does nfs4_referral_mount() (the flags there are ulimately coming from vfs_submount()). init_mount_tree() is called too early for anything LSM-related; it doesn't matter whether we attempt those checks, they'll do nothing. Finally, in case of btrfs_mount() and nfs_fs_mount(), doing MAC is pointless - either the caller will do it, or the flags are such that we wouldn't have done it either. In other words, the one and only case when we want that check done is when we are called from do_new_mount(), and there we want it unconditionally. So let's simply move it there. The superblock is still locked, so nobody is going to get access to it (via ustat(2), etc.) until we get a chance to apply the checks - we are free to move them to any point up to where we drop ->s_umount (in do_new_mount_fc()). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30teach vfs_get_tree() to handle subtype, switch do_new_mount() to itAl Viro1-0/+5
Roll the handling of subtypes into do_new_mount() and vfs_get_tree(). The former determines any subtype string and hangs it off the fs_context; the latter applies it. Make do_new_mount() create, parameterise and commit an fs_context and create a mount for itself rather than calling vfs_kern_mount(). [AV -- missing kstrdup()] [AV -- ... and no kstrdup() if we get to setting ->s_submount - we simply transfer it from fc, leaving NULL behind] [AV -- constify ->s_submount, while we are at it] Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-01-30vfs: Introduce fs_context, switch vfs_kern_mount() to it.David Howells1-27/+23
Introduce a filesystem context concept to be used during superblock creation for mount and superblock reconfiguration for remount. This is allocated at the beginning of the mount procedure and into it is placed: (1) Filesystem type. (2) Namespaces. (3) Source/Device names (there may be multiple). (4) Superblock flags (SB_*). (5) Security details. (6) Filesystem-specific data, as set by the mount options. Accessor functions are then provided to set up a context, parameterise it from monolithic mount data (the data page passed to mount(2)) and tear it down again. A legacy wrapper is provided that implements what will be the basic operations, wrapping access to filesystems that aren't yet aware of the fs_context. Finally, vfs_kern_mount() is changed to make use of the fs_context and mount_fs() is replaced by vfs_get_tree(), called from vfs_kern_mount(). [AV -- add missing kstrdup()] [AV -- put_cred() can be unconditional - fc->cred can't be NULL] [AV -- take legacy_validate() contents into legacy_parse_monolithic()] [AV -- merge KERNEL_MOUNT and USER_MOUNT] [AV -- don't unlock superblock on success return from vfs_get_tree()] [AV -- kill 'reference' argument of init_fs_context()] Signed-off-by: David Howells <dhowells@redhat.com> Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNTAl Viro1-1/+1
Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21LSM: hide struct security_mnt_opts from any generic codeAl Viro1-7/+5
Keep void * instead, allocate on demand (in parse_str_opts, at the moment). Eventually both selinux and smack will be better off with private structures with several strings in those, rather than this "counter and two pointers to dynamically allocated arrays" ugliness. This commit allows to do that at leisure, without disrupting anything outside of given module. Changes: * instead of struct security_mnt_opt use an opaque pointer initialized to NULL. * security_sb_eat_lsm_opts(), security_sb_parse_opts_str() and security_free_mnt_opts() take it as var argument (i.e. as void **); call sites are unchanged. * security_sb_set_mnt_opts() and security_sb_remount() take it by value (i.e. as void *). * new method: ->sb_free_mnt_opts(). Takes void *, does whatever freeing that needs to be done. * ->sb_set_mnt_opts() and ->sb_remount() might get NULL as mnt_opts argument, meaning "empty". Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount()Al Viro1-1/+7
... leaving the "is it kernel-internal" logics in the caller. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21new helper: security_sb_eat_lsm_opts()Al Viro1-12/+1
combination of alloc_secdata(), security_sb_copy_data(), security_sb_parse_opt_str() and free_secdata(). Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-21LSM: lift parsing LSM options into the caller of ->sb_kern_mount()Al Viro1-8/+16
This paves the way for retaining the LSM options from a common filesystem mount context during a mount parameter parsing phase to be instituted prior to actual mount/reconfiguration actions. Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-12-20vfs: Suppress MS_* flag defs within the kernel unless explicitly enabledDavid Howells1-0/+1
Only the mount namespace code that implements mount(2) should be using the MS_* flags. Suppress them inside the kernel unless uapi/linux/mount.h is included. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Howells <dhowells@redhat.com>
2018-09-03fsnotify: add super block object typeAmir Goldstein1-1/+1
Add the infrastructure to attach a mark to a super_block struct and detach all attached marks when super block is destroyed. This is going to be used by fanotify backend to setup super block marks. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-26Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-daxLinus Torvalds1-42/+23
Pull IDA updates from Matthew Wilcox: "A better IDA API: id = ida_alloc(ida, GFP_xxx); ida_free(ida, id); rather than the cumbersome ida_simple_get(), ida_simple_remove(). The new IDA API is similar to ida_simple_get() but better named. The internal restructuring of the IDA code removes the bitmap preallocation nonsense. I hope the net -200 lines of code is convincing" * 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits) ida: Change ida_get_new_above to return the id ida: Remove old API test_ida: check_ida_destroy and check_ida_alloc test_ida: Convert check_ida_conv to new API test_ida: Move ida_check_max test_ida: Move ida_check_leaf idr-test: Convert ida_check_nomem to new API ida: Start new test_ida module target/iscsi: Allocate session IDs from an IDA iscsi target: fix session creation failure handling drm/vmwgfx: Convert to new IDA API dmaengine: Convert to new IDA API ppc: Convert vas ID allocation to new IDA API media: Convert entity ID allocation to new IDA API ppc: Convert mmu context allocation to new IDA API Convert net_namespace to new IDA API cb710: Convert to new IDA API rsxx: Convert to new IDA API osd: Convert to new IDA API sd: Convert to new IDA API ...
2018-08-21fs: Convert unnamed_dev_ida to new APIMatthew Wilcox1-42/+23
The new API is much easier for this user. Also add kerneldoc for get_anon_bdev(). Signed-off-by: Matthew Wilcox <willy@infradead.org>