path: root/fs
Age  Commit message  Author  Files  Lines
2 daysMerge branch 'for-next' of ↵Mark Brown1-22/+0
https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git
2 daysMerge branch 'driver-core-next' of ↵Mark Brown2-7/+13
https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git
2 daysMerge branch 'master' of ↵Mark Brown6-76/+121
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git # Conflicts: # drivers/cpufreq/Kconfig.x86 # drivers/cpufreq/Makefile
2 daysMerge branch 'next' of ↵Mark Brown2-9/+9
https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git
2 daysMerge branch 'for-next' of ↵Mark Brown1-11/+4
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
2 daysMerge branch 'for-next' of ↵Mark Brown2-3/+3
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
2 daysMerge branch 'docs-next' of git://git.lwn.net/linux.gitMark Brown1-91/+1
2 daysMerge branch 'fs-next' of linux-nextMark Brown273-6141/+10431
# Conflicts: # fs/btrfs/defrag.c
2 daysMerge branch 'mm-nonmm-unstable' of ↵Mark Brown2-0/+39
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
2 daysMerge branch 'mm-unstable' of ↵Mark Brown6-2304/+156
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
2 daysMerge branch 'mm-nonmm-stable' of ↵Mark Brown13-110/+323
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
3 daysnext-20260522/vfs-braunerMark Brown48-744/+1388
# Conflicts: # fs/fuse/dev.c
3 daysMerge branch 'for-next' of https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.gitMark Brown6-43/+50
3 daysMerge branch '9p-next' of https://github.com/martinetd/linuxMark Brown10-60/+272
3 daysMerge branch 'master' of ↵Mark Brown13-297/+482
https://github.com/Paragon-Software-Group/linux-ntfs3.git
3 daysMerge branch 'ntfs-next' of ↵Mark Brown15-532/+196
https://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/ntfs.git
3 daysMerge branch 'nfsd-next' of ↵Mark Brown67-270/+2141
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux # Conflicts: # fs/exfat/file.c
3 daysMerge branch 'ksmbd-for-next' of https://github.com/smfrench/smb3-kernel.gitMark Brown3-7/+18
3 daysMerge branch 'for-next' of ↵Mark Brown5-15/+57
https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
3 daysMerge branch 'for-next' of ↵Mark Brown22-1897/+2309
https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git # Conflicts: # fs/fuse/dev.c
3 daysMerge branch 'dev' of ↵Mark Brown17-48/+309
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
3 daysMerge branch 'for_next' of ↵Mark Brown4-14/+12
https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git
3 daysMerge branch 'dev' of ↵Mark Brown16-509/+757
https://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat.git
3 daysMerge branch 'dev' of ↵Mark Brown1-3/+3
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git
3 daysMerge branch 'next' of ↵Mark Brown3-13/+24
https://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm.git
3 daysMerge branch 'next' of ↵Mark Brown1-11/+5
https://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs.git
3 daysMerge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6.gitMark Brown8-178/+510
3 daysMerge branch 'for-next' of ↵Mark Brown48-1531/+1926
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
3 daysMerge branch 'fixes' of ↵Mark Brown1-1/+1
https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
3 daysMerge branch 'next-fixes' of ↵Mark Brown8-2/+56
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
3 daysMerge branch 'vfs.fixes' of ↵Mark Brown2-5/+15
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
3 daysgfs2: page poisoning fixAndreas Gruenbacher3-0/+30
Processes can write to the last page of a file using mmap, and when the file size is not a multiple of the page size, this can be used to write beyond the end of the file. This is sometimes referred to as page poisoning, and it is not a problem in itself because the data beyond eof will be ignored. However, we currently fail to clear out any space beyond the end of the file that we skip over when the file size is increased, so that "poison" can end up getting exposed. Fix that. Fixes xfstest generic/363. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
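The fix described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the gfs2 code: the file is modelled as a flat byte buffer, and `SKETCH_PAGE_SIZE` and `zero_beyond_old_eof()` are hypothetical names. When the size grows, the bytes between the old end of file and the end of the page that held it (capped at the new size) are cleared so that stale mmap-written data cannot become visible.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical page size for the model; not taken from the commit. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Clear the stale bytes that follow the old EOF inside its page when the
 * file grows from old_size to new_size. Returns the number of bytes zeroed.
 */
static size_t zero_beyond_old_eof(unsigned char *buf, size_t old_size,
                                  size_t new_size)
{
    size_t page_end, end;

    if (new_size <= old_size)
        return 0;                       /* not growing */
    if (old_size % SKETCH_PAGE_SIZE == 0)
        return 0;                       /* old EOF was page aligned */

    page_end = (old_size / SKETCH_PAGE_SIZE + 1) * SKETCH_PAGE_SIZE;
    end = new_size < page_end ? new_size : page_end;
    memset(buf + old_size, 0, end - old_size);
    return end - old_size;
}
```

Only the partial page needs attention in this model, because pages wholly beyond the old EOF were never mapped as file data.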
3 daysgfs2: Remove unused fallocate_chunk argumentAndreas Gruenbacher1-3/+2
The mode argument of fallocate_chunk() became unused in commit 1885867b84d5 ("GFS2: Update i_size properly on fallocate"), so remove it. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
3 daysMerge branch into tip/master: 'x86/cache'Ingo Molnar3-22/+50
# New commits in x86/cache:
    1cfa74c683ea ("fs/resctrl: Document tasks file behaviour for task id 0 and idle tasks")
    9a1646211f8c ("fs/resctrl: Document that automatic counter assignment is best effort")
    3aec86e4ea01 ("fs/resctrl: Continue counter allocation after failure")
    ee3d4c81d89c ("fs/resctrl: Add monitor property 'mbm_cntr_assign_fixed'")
    f52abe650241 ("fs/resctrl: Disallow the software controller when MBM counters are assignable")
    94a1206522d1 ("x86,fs/resctrl: Create 'event_filter' files read only if they're not configurable")
    7625632fed43 ("fs/resctrl: Tidy up the error path in resctrl_mkdir_event_configs()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
3 daysMerge branch into tip/master: 'irq/core'Ingo Molnar2-5/+3
# New commits in irq/core:
    171cc0d9eed1 ("genirq/proc: Speed up /proc/interrupts iteration")
    61b51a167c52 ("genirq/proc: Runtime size the chip name")
    7603e0575d8a ("genirq: Expose irq_find_desc_at_or_after() in core code")
    1d9c4745bfb6 ("genirq: Add rcuref count to struct irq_desc")
    34594da7650d ("genirq/proc: Increase default interrupt number precision to four")
    2d62735f1d4a ("genirq: Calculate precision only when required")
    4892e5e71ec9 ("genirq: Cache the condition for /proc/interrupts exposure")
    3ba92f6a2820 ("genirq/manage: Make NMI cleanup RT safe")
    b99dc723b12e ("genirq: Expose nr_irqs in core code")
    cca5e6fa791b ("scripts/gdb: Update x86 interrupts to the array based storage")
    d6b70b16b4e7 ("x86/irq: Move IOAPIC misrouted and PIC/APIC error counts into irq_stats")
    8713f2e596a1 ("x86/irq: Suppress unlikely interrupt stats by default")
    2b57c69917ee ("x86/irq: Make irqstats array based")
    0179464391af ("genirq/proc: Utilize irq_desc::tot_count to avoid evaluation")
    95c33a64f203 ("genirq/proc: Avoid formatting zero counts in /proc/interrupts")
    115bbf0c1b60 ("x86/irq: Optimize interrupts decimals printing")
    c2c7983c93f5 ("genirq/proc: Size interrupt directory names for 10-digit interrupt numbers")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
3 daysMerge branch into tip/master: 'timers/merge'Ingo Molnar1-49/+68
# New commits in timers/merge:
    3eb4923e6851 ("clocksource: Add devm_clocksource_register_*() helpers")
    c8d32a0389fb ("timers: Fix flseep() typo in kernel-doc comment")
    5d330d652d7a ("hrtimer: Fix the bogus return type of __hrtimer_start_range_ns()")
    3af1f49f415d ("hrtimer: Return ktime_t from hrtimer_get_next_event()/hrtimer_next_event_without()")
    33d4bfc49613 ("clocksource: Clean up clocksource_update_freq() functions")
    ed3b3c497668 ("alarmtimer: Remove stale return description from alarm_handle_timer()")
    b00385b8d081 ("selftests/posix_timers: Use CLOCK_THREAD_CPUTIME_ID for ITIMER_PROF measurements")
    cab0cd0130eb ("scripts/timers: Add timer_migration_tree.py")
    5a7dfbcbbdb6 ("timers/migration: Handle capacity in connect tracepoints")
    098cbaad8e57 ("timers/migration: Split per-capacity hierarchies")
    3ba25488380f ("timers/migration: Track CPUs in a hierarchy")
    ff65875f80d1 ("timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness")
    ed78a7019419 ("alarmtimer: Remove unused interfaces")
    12e4311aa5b2 ("netfilter: xt_IDLETIMER: Switch to alarm_start_timer()")
    9fa2e38ab749 ("power: supply: charger-manager: Switch to alarm_start_timer()")
    7dda99952ced ("fs/timerfd: Use the new alarm/hrtimer functions")
    f4b58f61da79 ("alarmtimer: Convert posix timer functions to alarm_start_timer()")
    183d00b72713 ("alarmtimer: Provide alarm_start_timer()")
    acc071343d29 ("posix-timers: Switch to hrtimer_start_expires_user()")
    cfb7fe3fdd4c ("posix-timers: Handle the timer_[re]arm() return value")
    6fdb2677a594 ("posix-timers: Expand timer_[re]arm() callbacks with a boolean return value")
    b40c927345a9 ("hrtimer: Use hrtimer_start_expires_user() for hrtimer sleepers")
    bd5956166d20 ("hrtimer: Provide hrtimer_start_range_ns_user()")
    68ed094971b0 ("clocksource/drivers/timer-of: Make the code compatible with modules")
    2423405880c2 ("clocksource/drivers/mmio: Make the code compatible with modules")
    fed9f727cc3f ("clocksource/drivers/sun5i: Handle error returns from devm_reset_control_get_optional_exclusive()")
    045a9dac7eb7 ("clocksource/drivers/timer-rtl-otto: Make rttm_cs variable static")
    b385caf91868 ("dt-bindings: timer: fsl,imxgpt: add compatible string fsl,imx25-epit")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
3 daysocfs2: reject oversized group bitmap descriptorsZhang Cen1-0/+22
ocfs2_validate_gd_parent() only bounds bg_bits against the parent allocator's chain geometry. A malicious descriptor can still claim a bg_size/bg_bits pair that exceeds the bitmap bytes that physically fit in the group descriptor block, so later bitmap scans and bit updates can run past bg_bitmap. Add a physical-cap check based on ocfs2_group_bitmap_size() for the parent allocator type and reject descriptors whose bg_size or bg_bits exceed that capacity. Keep the existing chain geometry check so both the on-disk bitmap layout and the allocator metadata must agree before the descriptor is used. Validation reproduced this kernel report: KASAN use-after-free in _find_next_bit+0x7f/0xc0 Read of size 8 Call trace: dump_stack_lvl+0x66/0xa0 (?:?) print_report+0xd0/0x630 (?:?) _find_next_bit+0x7f/0xc0 (?:?) srso_alias_return_thunk+0x5/0xfbef5 (?:?) __virt_addr_valid+0x188/0x2f0 (?:?) kasan_report+0xe4/0x120 (?:?) ocfs2_find_max_contig_free_bits+0x35/0x70 (fs/ocfs2/suballoc.c:1375) ocfs2_block_group_set_bits+0x472/0x4b0 (fs/ocfs2/suballoc.c:1457) ocfs2_cluster_group_search+0x16b/0x440 (fs/ocfs2/suballoc.c:86) ocfs2_bg_discontig_fix_result+0x1ef/0x230 (fs/ocfs2/suballoc.c:1786) ocfs2_search_chain+0x8f8/0x10a0 (fs/ocfs2/suballoc.c:1886) get_page_from_freelist+0x70e/0x2370 (?:?) lock_release+0xc6/0x290 (?:?) do_raw_spin_unlock+0x9a/0x100 (?:?) kasan_unpoison+0x27/0x60 (?:?) __bfs+0x147/0x240 (?:?) get_page_from_freelist+0x83d/0x2370 (?:?) ocfs2_claim_suballoc_bits+0x38c/0xe70 (fs/ocfs2/suballoc.c:96) sched_domains_numa_masks_clear+0x70/0xd0 (?:?) check_irq_usage+0xe8/0xb70 (?:?) __ocfs2_claim_clusters+0x18d/0x4c0 (fs/ocfs2/suballoc.c:2497) check_path+0x24/0x50 (?:?) rcu_is_watching+0x20/0x50 (?:?) check_prev_add+0xfd/0xd00 (?:?) ocfs2_add_clusters_in_btree+0x17d/0x810 (fs/ocfs2/suballoc.c:?) __folio_batch_add_and_move+0x1f5/0x3d0 (?:?) ocfs2_add_inode_data+0xd9/0x120 (fs/ocfs2/suballoc.c:?) filemap_add_folio+0x105/0x1f0 (?:?) 
ocfs2_write_begin_nolock+0x29f7/0x2f80 (fs/ocfs2/suballoc.c:3043) ocfs2_read_inode_block+0xb5/0x110 (fs/ocfs2/suballoc.c:?) down_write+0xf5/0x180 (?:?) ocfs2_write_begin+0x180/0x240 (fs/ocfs2/suballoc.c:?) __mark_inode_dirty+0x758/0x9a0 (?:?) inode_to_bdi+0x41/0x90 (?:?) balance_dirty_pages_ratelimited_flags+0xf8/0x1d0 (?:?) generic_perform_write+0x252/0x440 (?:?) mnt_put_write_access_file+0x16/0x70 (?:?) file_update_time_flags+0xe4/0x200 (?:?) ocfs2_file_write_iter+0x80a/0x1320 (fs/ocfs2/suballoc.c:?) lock_acquire+0x184/0x2f0 (?:?) ksys_write+0xd2/0x170 (?:?) apparmor_file_permission+0xf5/0x310 (?:?) read_zero+0x8d/0x140 (?:?) lock_is_held_type+0x8f/0x100 (?:?) Link: https://lore.kernel.org/20260524111248.1429884-1-rollkingzzc@gmail.com Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem") Assisted-by: Codex:gpt-5.5 Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
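The physical-cap check described in this commit can be sketched as a standalone predicate. This is an illustration, not the kernel code: `struct sketch_group_desc` and `sketch_gd_fits_block()` are hypothetical, and `max_bitmap_bytes` merely stands in for whatever ocfs2_group_bitmap_size() would return for the parent allocator type.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a group descriptor (host byte order). */
struct sketch_group_desc {
    uint16_t bg_size;   /* bitmap size in bytes claimed by the descriptor */
    uint16_t bg_bits;   /* number of bits claimed by the descriptor */
};

/*
 * Return true only if the claimed bitmap fits in the bytes that are
 * physically available in the descriptor block. max_bitmap_bytes is an
 * assumed input that models the allocator-specific capacity.
 */
static bool sketch_gd_fits_block(const struct sketch_group_desc *gd,
                                 uint32_t max_bitmap_bytes)
{
    if (gd->bg_size > max_bitmap_bytes)
        return false;
    if ((uint32_t)gd->bg_bits > 8u * max_bitmap_bytes)
        return false;
    return true;
}
```

As the commit notes, a check of this kind complements the existing chain geometry check rather than replacing it, so a descriptor would have to pass both before its bitmap is scanned.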
3 daysocfs2: rebase copied fsdlm LVB pointers in locking_stateZhang Cen1-0/+17
The locking_state debugfs iterator snapshots struct ocfs2_lock_res by value under ocfs2_dlm_tracking_lock and later formats that copy in ocfs2_dlm_seq_show(). That is fine for the inline fields, but the userspace fsdlm stack stores the LVB through lksb_fsdlm.sb_lvbptr. Once the iterator drops the tracking lock, a copied non-NULL sb_lvbptr still points into the original lockres owner, so teardown can free that container before the debugfs dump walks the raw LVB bytes.

Rebase the copied sb_lvbptr to the copied l_lksb before dumping the raw LVB. The seq snapshot already carries the inline LVB storage reserved in struct ocfs2_dlm_lksb, so the debugfs reader can dump the copied bytes without borrowing the original lockres lifetime.

The buggy scenario involves two paths, with each column showing the order within that path:

  locking_state reader:              lockres teardown:
  1. ocfs2_dlm_seq_start()/next()    1. file release or another owner
     copies struct ocfs2_lock_res       teardown reaches
  2. ocfs2_dlm_seq_show() formats       ocfs2_lock_res_free()
     the copied row                  2. the lockres is removed from the
  3. ocfs2_dlm_lvb() follows the        tracking list
     copied sb_lvbptr                3. the owner frees the original
                                        lockres container

Validation reproduced this kernel report:

  KASAN slab-use-after-free in ocfs2_dlm_seq_show+0x1bd/0x430
  RIP: 0033:0x7f8ec4b1e29d
  The buggy address belongs to the object at ffff88810a1e0800 which belongs to the cache kmalloc-1k of size 1024
  The buggy address is located 368 bytes inside of freed 1024-byte region [ffff88810a1e0800, ffff88810a1e0c00)
  Read of size 1

  Call trace:
    dump_stack_lvl+0x66/0xa0
    print_report+0xce/0x630
    ocfs2_dlm_seq_show+0x1bd/0x430 (fs/ocfs2/dlmglue.c:3137)
    srso_alias_return_thunk+0x5/0xfbef5
    __virt_addr_valid+0x19f/0x330
    kasan_report+0xe0/0x110
    seq_read_iter+0x29d/0x790
    seq_read+0x20a/0x280
    find_held_lock+0x2b/0x80
    rcu_read_unlock+0x18/0x70
    full_proxy_read+0x9e/0xd0
    vfs_read+0x12c/0x590
    ksys_read+0xd2/0x170
    do_user_addr_fault+0x65a/0x890
    do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Allocated by task stack:
    kasan_save_stack+0x33/0x60
    kasan_save_track+0x14/0x30
    __kasan_kmalloc+0xaa/0xb0
    ocfs2_file_open+0x13e/0x300
    do_dentry_open+0x233/0x7f0
    vfs_open+0x5a/0x1b0
    path_openat+0x66d/0x1540
    do_file_open+0x186/0x2b0
    do_sys_openat2+0xce/0x150
    __x64_sys_openat+0xd0/0x140
    do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Freed by task stack:
    kasan_save_stack+0x33/0x60
    kasan_save_track+0x14/0x30
    kasan_save_free_info+0x3b/0x60
    __kasan_slab_free+0x5f/0x80
    kfree+0x313/0x590
    ocfs2_file_release+0x138/0x260
    __fput+0x1df/0x4b0
    fput_close_sync+0xd2/0x170
    __x64_sys_close+0x55/0x90
    do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
    entry_SYSCALL_64_after_hwframe+0x77/0x7f

Link: https://lore.kernel.org/20260525041726.4112882-1-rollkingzzc@gmail.com
Fixes: cf4d8d75d8ab ("ocfs2: add fsdlm to stackglue")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
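The pointer rebasing described in this commit follows a general pattern: a by-value copy of a struct that contains a pointer into its own storage must have that pointer redirected to the copy. The sketch below models this in userspace. All `sketch_*` names and the buffer length are hypothetical, and it assumes the source pointer, when non-NULL, refers to the inline buffer of its own struct.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_LVB_LEN 64   /* hypothetical inline LVB size */

struct sketch_lksb {
    char *sb_lvbptr;            /* may point at lvb[] below */
    char lvb[SKETCH_LVB_LEN];   /* inline storage that travels with the copy */
};

struct sketch_lockres {
    int l_level;
    struct sketch_lksb l_lksb;
};

/*
 * Snapshot src into dst by value, then rebase the copied pointer so that
 * it refers to dst's own inline bytes rather than to src's storage.
 */
static void sketch_snapshot(struct sketch_lockres *dst,
                            const struct sketch_lockres *src)
{
    *dst = *src;
    if (dst->l_lksb.sb_lvbptr)
        dst->l_lksb.sb_lvbptr = dst->l_lksb.lvb;
}
```

After the rebase, a reader can format the snapshot even if the original object has been freed, because every byte it touches lives inside the copy.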
3 daysfs/btrfs: remove a comment referring to READ_ONLY_THP_FOR_FSZi Yan1-3/+0
READ_ONLY_THP_FOR_FS is no longer present, remove related comment. Link: https://lore.kernel.org/20260517135416.1434539-11-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysfs: remove nr_thps from struct address_spaceZi Yan1-3/+0
filemap_nr_thps*() are removed, the related field, address_space->nr_thps, is no longer needed. Remove it. This shrinks struct address_space by 8 bytes on 64-bit systems which may increase the number of inodes we can cache. Link: https://lore.kernel.org/20260517135416.1434539-8-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: fs: remove filemap_nr_thps*() functions and their usersZi Yan1-27/+0
They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without large folio support, so that read-only THPs created in these FSes are not seen by the FSes when the underlying fd becomes writable. Now read-only PMD THPs only appear in a FS with large folio support and the supported orders include PMD_ORDER. READ_ONLY_THP_FOR_FS was using mapping->nr_thps, inode->i_writecount, and smp_mb() to prevent writes to a read-only THP and collapsing writable folios into a THP. In collapse_file(), mapping->nr_thps is increased, then smp_mb(), and if inode->i_writecount > 0, collapse is stopped, while do_dentry_open() first increases inode->i_writecount, then a full memory fence, and if mapping->nr_thps > 0, all read-only THPs are truncated. Now this mechanism can be removed along with READ_ONLY_THP_FOR_FS code, since a dirty folio check has been added after try_to_unmap() in collapse_file() to prevent dirty folios from being collapsed as clean. Link: https://lore.kernel.org/20260517135416.1434539-7-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
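The mechanism being removed here is a store-then-fence-then-load handshake between two counters. A userspace model with C11 atomics may make the ordering easier to follow. It is a sketch only: the `sketch_*` names are hypothetical, `atomic_thread_fence(memory_order_seq_cst)` stands in for smp_mb(), and undoing the increment on the failed collapse path is an assumption of this model.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int sketch_nr_thps;      /* models mapping->nr_thps */
static atomic_int sketch_writecount;   /* models inode->i_writecount */

/* Collapse side: announce the THP first, then look for writers. */
static bool sketch_try_collapse(void)
{
    atomic_fetch_add_explicit(&sketch_nr_thps, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&sketch_writecount, memory_order_relaxed) > 0) {
        /* A writer exists: stop the collapse (undo is assumed here). */
        atomic_fetch_sub_explicit(&sketch_nr_thps, 1, memory_order_relaxed);
        return false;
    }
    return true;
}

/* Open-for-write side: announce the writer first, then look for THPs. */
static bool sketch_open_for_write(void)
{
    atomic_fetch_add_explicit(&sketch_writecount, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    /* true means existing read-only THPs would have to be truncated */
    return atomic_load_explicit(&sketch_nr_thps, memory_order_relaxed) > 0;
}
```

Because each side publishes its own counter before the full fence and reads the other counter after it, at least one side is guaranteed to observe the other, which is the property the old code relied on.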
3 daysfs/proc/task_mmu: read proc/pid/{smaps|numa_maps} under per-vma lockSuren Baghdasaryan1-39/+156
Patch series "use vma locks for proc/pid/{smaps|numa_maps} reads", v2. Use per-vma locks when reading /proc/pid/smaps and /proc/pid/numa_maps similar to /proc/pid/maps to reduce contention on central mmap_lock. One major difference between maps and smaps/numa_maps reading is that the latter executes page table walk which can't be done under RCU due to a possibility of sleeping. Therefore we drop RCU read lock before this walk while keeping the VMA locked. After the walk we retake RCU read lock, reset VMA iterator and proceed with the next VMA. The last two patches extend /proc/pid/maps test to cover /proc/pid/smaps reading during concurrent address space modification. This patch (of 3): proc/pid/{smaps|numa_maps} can be read using the combination of RCU and VMA read locks, similar to proc/pid/maps. RCU is required to safely traverse the VMA tree and VMA lock stabilizes the VMA being processed and the pagetable walk. Link: https://lore.kernel.org/20260426062718.1238437-1-surenb@google.com Link: https://lore.kernel.org/20260426062718.1238437-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <liam@infradead.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysuserfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.cMike Rapoport (Microsoft)2-2234/+0
Patch series "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c", v3. These patches merge fs/userfaultfd.c into mm/userfaultfd.c and make functions used only inside mm/userfaultfd.c static. This patch (of 2): Historically, the userfaultfd implementation has been split between fs/userfaultfd.c and mm/userfaultfd.c. The mm/ part implemented memory management operations, while the fs/ part implemented file descriptor handling and called into the mm/ part for the actual memory management work. This separation is quite artificial, and fs/userfaultfd.c does not seem to belong in fs/ because it is only a user of VFS APIs; as with other such users, for example memfd and secretmem, the file descriptor handling could live in mm/ as well. "Append" fs/userfaultfd.c to mm/userfaultfd.c and update fs/Makefile and MAINTAINERS accordingly. No intended functional changes. Link: https://lore.kernel.org/20260523173759.3964908-1-rppt@kernel.org Link: https://lore.kernel.org/20260523173759.3964908-2-rppt@kernel.org Assisted-by: Copilot:claude-opus-4-6 Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christian Brauner (Amutable) <brauner@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysuserfaultfd: ensure mremap_userfaultfd_fail() releases mmap_changingMike Rapoport (Microsoft)1-0/+2
Sashiko says: mremap_userfaultfd_prep() increments ctx->mmap_changing to stall concurrent operations, but mremap_userfaultfd_fail() does not decrement it before dropping the context reference. If an mremap operation fails, ctx->mmap_changing remains elevated. This causes subsequent userfaultfd operations such as UFFDIO_COPY to fail with -EAGAIN. Decrement ctx->mmap_changing in mremap_userfaultfd_fail(). Link: https://sashiko.dev/#/patchset/20260430113512.115938-1-rppt@kernel.org Link: https://lore.kernel.org/20260513081416.495963-1-rppt@kernel.org Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
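The imbalance fixed by this commit can be modelled with two counters in userspace. This is a sketch with hypothetical `sketch_*` names, not the kernel functions: the prepare step raises both a reference count and a "changing" counter, so the failure path has to lower both, otherwise later operations keep seeing the stall condition.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical, reduced model of a userfaultfd context. */
struct sketch_uffd_ctx {
    atomic_int mmap_changing;
    atomic_int refcount;
};

/* Prepare step: pin the context and stall concurrent operations. */
static void sketch_mremap_prep(struct sketch_uffd_ctx *ctx)
{
    atomic_fetch_add(&ctx->refcount, 1);
    atomic_fetch_add(&ctx->mmap_changing, 1);
}

/* Failure path: undo both effects of the prepare step. */
static void sketch_mremap_fail(struct sketch_uffd_ctx *ctx)
{
    atomic_fetch_sub(&ctx->mmap_changing, 1);   /* the step that was missing */
    atomic_fetch_sub(&ctx->refcount, 1);
}

/* A copy-style operation backs off while a mapping change is in flight. */
static int sketch_uffdio_copy(struct sketch_uffd_ctx *ctx)
{
    if (atomic_load(&ctx->mmap_changing) != 0)
        return -EAGAIN;
    return 0;
}
```

Without the decrement in the failure path, `sketch_uffdio_copy()` in this model would return -EAGAIN indefinitely after a single failed remap, which mirrors the symptom described in the commit message.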
3 daysocfs2: reject non-inline dinodes with i_size and zero i_clustersMichael Bommarito1-0/+60
On a volume mounted without OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC, a non-inline regular file with non-zero i_size and zero i_clusters is structurally malformed: the extent map declares no allocated clusters yet the size header claims content exists. Keep rejecting that shape, but express it through a shared predicate so the same invariant is available to normal inode reads and online filecheck. The same zero-cluster shape is also malformed for non-inline directories. ocfs2 directory growth allocates backing storage before advancing i_size, and ocfs2_dir_foreach_blk_el() later walks until ctx->pos reaches i_size_read(inode). A forged directory dinode with a huge i_size and no clusters would repeatedly fail on holes while advancing through the claimed size. Sparse regular files remain exempt: on sparse-alloc volumes, truncate can legitimately grow i_size without allocating clusters. System inodes and inline-data dinodes also retain their separate storage rules. Mirror the check in ocfs2_filecheck_validate_inode_block() as well. filecheck reports through its own error namespace, so malformed size/cluster state is logged as a filecheck invalid-inode result rather than via ocfs2_error(), but it must not proceed into ocfs2_populate_inode(). Link: https://lore.kernel.org/20260519110404.1803902-4-michael.bommarito@gmail.com Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://sashiko.dev/#/patchset/20260517111015.3187935-1-michael.bommarito%40gmail.com Assisted-by: Claude:claude-opus-4-7 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
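The shared predicate described above can be sketched as a standalone function. This is a hedged model rather than the ocfs2 helper: `struct sketch_dinode`, its boolean flags, the `SKETCH_S_*` constants (the conventional octal file-type values) and the `sparse_alloc` parameter are all assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional POSIX file-type bits, spelled out to stay self-contained. */
#define SKETCH_S_IFMT  0170000u
#define SKETCH_S_IFDIR 0040000u
#define SKETCH_S_IFREG 0100000u

/* Hypothetical, reduced view of an on-disk inode (host byte order). */
struct sketch_dinode {
    uint16_t i_mode;
    uint64_t i_size;
    uint32_t i_clusters;
    bool inline_data;     /* inline-data dinodes follow separate rules */
    bool system_inode;    /* system inodes follow separate rules */
};

/*
 * Return true when the dinode claims content (i_size != 0) but owns no
 * clusters in a case where that shape is malformed: always for
 * directories, and for regular files only on non-sparse volumes.
 */
static bool sketch_size_without_clusters(const struct sketch_dinode *di,
                                         bool sparse_alloc)
{
    uint32_t type = di->i_mode & SKETCH_S_IFMT;

    if (di->system_inode || di->inline_data)
        return false;
    if (di->i_size == 0 || di->i_clusters != 0)
        return false;
    if (type == SKETCH_S_IFDIR)
        return true;
    if (type == SKETCH_S_IFREG)
        return !sparse_alloc;
    return false;
}
```

Keeping the rule in one predicate is what lets two callers with different reporting conventions, as the commit describes for the normal read path and filecheck, agree on what counts as malformed.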
3 daysocfs2: reject dinodes whose i_rdev disagrees with the file typeMichael Bommarito1-0/+55
id1.dev1.i_rdev is the device-number arm of the ocfs2_dinode id1 union. It is only meaningful for character and block device inodes. For any other user-visible file type the on-disk value must be zero. ocfs2_populate_inode() currently copies id1.dev1.i_rdev into inode->i_rdev before the S_IFMT switch decides whether the inode is a special file. A non-device inode with a non-zero i_rdev can therefore publish stale or attacker-controlled device state into the in-core inode. System inodes legitimately use other arms of the same union, so keep the cross-check restricted to non-system inodes. Factor that predicate into a helper and use it in both the normal validator and online filecheck path; filecheck reports the malformed dinode through OCFS2_FILECHECK_ERR_INVALIDINO instead of ocfs2_error(). Link: https://lore.kernel.org/20260519110404.1803902-3-michael.bommarito@gmail.com Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Assisted-by: Claude:claude-opus-4-7 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
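The cross-check described in this commit reduces to a small predicate, sketched here in userspace. The function name, the `system_inode` flag and the `SKETCH_S_*` constants (conventional octal file-type values) are hypothetical; the real helper operates on the on-disk dinode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional POSIX file-type bits, spelled out to stay self-contained. */
#define SKETCH_S_IFMT  0170000u
#define SKETCH_S_IFCHR 0020000u
#define SKETCH_S_IFBLK 0060000u

/*
 * Return true when a non-system, non-device inode carries a non-zero
 * device number, i.e. when i_rdev disagrees with the file type.
 */
static bool sketch_rdev_mismatch(uint16_t i_mode, uint64_t i_rdev,
                                 bool system_inode)
{
    uint32_t type = i_mode & SKETCH_S_IFMT;

    if (system_inode)
        return false;   /* other arms of the union may be in use */
    if (type == SKETCH_S_IFCHR || type == SKETCH_S_IFBLK)
        return false;   /* a device number is meaningful here */
    return i_rdev != 0;
}
```

A validator built this way would reject the dinode before the device number is copied anywhere, which is the ordering concern the commit message raises about ocfs2_populate_inode().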
3 daysocfs2: reject dinodes with non-canonical i_mode typeMichael Bommarito1-2/+34
Patch series "ocfs2: harden inode validators against forged metadata", v2. This series adds three structural checks to OCFS2 dinode validation so malformed on-disk fields are rejected before ocfs2_populate_inode() copies them into the in-core inode. The checks cover: - i_mode values whose type bits do not name a canonical POSIX file type; - non-device dinodes whose id1.dev1.i_rdev field is non-zero; and - non-inline dinodes that claim non-zero i_size while i_clusters is zero, covering directories unconditionally and regular files on non-sparse volumes. The normal read path reports these through ocfs2_error(), matching the existing suballoc-slot, inline-data, chain-list, and refcount checks. The online filecheck path uses the same structural predicates but keeps its own reporting contract, returning OCFS2_FILECHECK_ERR_INVALIDINO instead of calling ocfs2_error(). This patch (of 3): ocfs2_validate_inode_block() currently accepts any non-zero i_mode value. ocfs2_populate_inode() then copies that mode verbatim into inode->i_mode and dispatches on i_mode & S_IFMT to the file/dir/symlink/special_file iops; an unrecognised type falls through to ocfs2_special_file_iops and init_special_inode(). Reject dinodes whose type bits do not name one of the seven canonical POSIX file types. Use fs_umode_to_ftype(), the same generic file-type conversion helper OCFS2 already uses for directory entries, so the accepted inode type set matches the kernel file-type vocabulary instead of open-coding a local switch. Apply the same structural check to the online filecheck read path. filecheck keeps its own error namespace, so it reports malformed i_mode through the filecheck logger and OCFS2_FILECHECK_ERR_INVALIDINO instead of calling ocfs2_error(), but it must not allow a malformed dinode to proceed into ocfs2_populate_inode(). 
Link: https://lore.kernel.org/20260519110404.1803902-1-michael.bommarito@gmail.com Link: https://lore.kernel.org/20260519110404.1803902-2-michael.bommarito@gmail.com Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.") Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Link: https://sashiko.dev/#/patchset/20260517111015.3187935-1-michael.bommarito%40gmail.com Assisted-by: Claude:claude-opus-4-7 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
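The type check described in this patch can be illustrated with a userspace predicate. Note the deliberate difference from the commit: the kernel change uses fs_umode_to_ftype() instead of a local switch, whereas this sketch open-codes the seven types because that helper is not available outside the kernel. The function name and the `SKETCH_S_*` constants (conventional octal values) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional POSIX file-type bits, spelled out to stay self-contained. */
#define SKETCH_S_IFMT   0170000u
#define SKETCH_S_IFIFO  0010000u
#define SKETCH_S_IFCHR  0020000u
#define SKETCH_S_IFDIR  0040000u
#define SKETCH_S_IFBLK  0060000u
#define SKETCH_S_IFREG  0100000u
#define SKETCH_S_IFLNK  0120000u
#define SKETCH_S_IFSOCK 0140000u

/* Return true if the type bits name one of the seven POSIX file types. */
static bool sketch_mode_type_is_canonical(uint16_t i_mode)
{
    switch (i_mode & SKETCH_S_IFMT) {
    case SKETCH_S_IFIFO:
    case SKETCH_S_IFCHR:
    case SKETCH_S_IFDIR:
    case SKETCH_S_IFBLK:
    case SKETCH_S_IFREG:
    case SKETCH_S_IFLNK:
    case SKETCH_S_IFSOCK:
        return true;
    default:
        return false;
    }
}
```

A mode with permission bits but no type bits, or with an unassigned type value, fails this test, so it would never reach the dispatch on `i_mode & S_IFMT` that the commit message describes.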
3 daysocfs2: kill osb->system_file_mutex lockTetsuo Handa3-10/+3
Commit 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding mutex lock") tried to avoid a refcount leak caused by allowing multiple threads to call igrab(inode). But the addition of osb->system_file_mutex complicated the locking dependencies and causes lockdep to warn about a possible AB-BA deadlock. Since _ocfs2_get_system_file_inode() returns the same inode for the same input arguments, we don't need to serialize _ocfs2_get_system_file_inode(). What we need to ensure is that igrab(inode) is called only once. Therefore, replace osb->system_file_mutex with cmpxchg()-based locking. Link: https://lore.kernel.org/fea8d1fd-afb0-4302-a560-c202e2ef7afd@I-love.SAKURA.ne.jp Fixes: 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding mutex lock") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
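The idea of taking the extra reference exactly once without a mutex can be sketched with a C11 compare-and-exchange. This is a userspace model under assumptions, not the ocfs2 change itself: `struct sketch_inode` and `sketch_cache_inode()` are hypothetical, and the atomic increment merely stands in for igrab().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical inode with a plain reference counter. */
struct sketch_inode {
    atomic_int refcount;
};

/*
 * Publish inode into a cache slot. Every caller computes the same inode
 * for the same slot, so callers need not be serialised; only the caller
 * whose compare-and-exchange installs the pointer takes the reference
 * that the cache itself holds.
 */
static void sketch_cache_inode(_Atomic(struct sketch_inode *) *slot,
                               struct sketch_inode *inode)
{
    struct sketch_inode *expected = NULL;

    if (atomic_compare_exchange_strong(slot, &expected, inode))
        atomic_fetch_add(&inode->refcount, 1);   /* stands in for igrab() */
}
```

Callers that lose the race find the slot already populated and take no extra reference, so the count cannot leak regardless of how many threads look up the same system inode concurrently.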
3 daysraid6: improve the public interfaceChristoph Hellwig1-4/+4
Stop directly calling into function pointers from users of the RAID6 PQ API, and instead provide exported functions with proper documentation and, where applicable, asserts for the API guarantees. Link: https://lore.kernel.org/20260518051804.462141-8-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64 Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Mason <clm@fb.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Sterba <dsterba@suse.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Li Nan <linan122@huawei.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Song Liu <song@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: WANG Xuerui <kernel@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysocfs2: validate inline xattr header before reflinking inline xattrsZhengYuan Huang1-6/+13
[BUG] A corrupt inline xattr header can make ocfs2_reflink_xattr_inline() lock, copy, and reflink xattr state from an unchecked ibody xattr header. [CAUSE] The inline reflink path still trusted di->i_xattr_inline_size to compute header_off, xh, and new_xh before handing the source header to the reflink allocator and copy logic. [FIX] Validate the source inode's inline xattr header with the shared helper first, then derive the reflink copy offsets from the validated inline size/header. This keeps the reflink path from traversing corrupt ibody xattr geometry. Link: https://lore.kernel.org/20260508085914.61647-6-gality369@gmail.com Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Zixuan Fu <r33s3n6@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysocfs2: validate inline xattr header before inline refcount attachZhengYuan Huang1-3/+6
[BUG] A corrupt inline xattr header can make ocfs2_xattr_inline_attach_refcount() feed an unchecked header into the refcount-attachment walk for inline xattr values. [CAUSE] The inline refcount-attach path still derived the header directly from di->i_xattr_inline_size and then passed it to code that iterates xh_count and xattr entries. [FIX] Use the shared ibody header helper before attaching refcounts to inline xattr values so corrupt header geometry is rejected with -EFSCORRUPTED instead of being traversed. Link: https://lore.kernel.org/20260508085914.61647-5-gality369@gmail.com Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Zixuan Fu <r33s3n6@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysocfs2: validate inline xattr header before ibody removeZhengYuan Huang1-3/+3
[BUG] A corrupt inline xattr header can make ocfs2_xattr_ibody_remove() pass an unchecked header into ocfs2_remove_value_outside() during inode xattr teardown. [CAUSE] ocfs2_xattr_ibody_remove() still rebuilt the ibody xattr header directly from di->i_xattr_inline_size and then handed it to code that iterates xh_count and entry geometry. [FIX] Validate the inline xattr header with the shared helper before handing it to the outside-value removal path, and propagate -EFSCORRUPTED on bad metadata instead of traversing the unchecked header. Link: https://lore.kernel.org/20260508085914.61647-4-gality369@gmail.com Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Zixuan Fu <r33s3n6@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysocfs2: validate inline xattr header before checking outside valuesZhengYuan Huang1-3/+4
[BUG] A corrupt inline xattr header can make ocfs2_has_inline_xattr_value_outside() walk xh_count from an unchecked header while refcount-tree teardown decides whether inline xattrs still point outside the inode body. [CAUSE] ocfs2_has_inline_xattr_value_outside() still computed the inline header directly from di->i_xattr_inline_size and immediately iterated xh_count. That is the same unchecked metadata boundary as the ibody lookup bug. [FIX] Reuse the shared inline-header helper before iterating xh_count. Because this helper returns a boolean-style answer to its caller, treat a corrupt header conservatively as "has outside values" instead of walking it. Link: https://lore.kernel.org/20260508085914.61647-3-gality369@gmail.com Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Zixuan Fu <r33s3n6@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysocfs2: validate inline xattr header before ibody lookupsZhengYuan Huang1-35/+47
Patch series "ocfs2: validate inline xattr header consumers". Corrupt i_xattr_inline_size can move the computed inode-body xattr header outside the dinode block. Several OCFS2 paths then trust xh_count or xattr entry geometry from that unchecked header. The reported KASAN splat hits the ibody lookup path: BUG: KASAN: use-after-free in ocfs2_xattr_find_entry+0x37b/0x3a0 ocfs2_xattr_ibody_get() ocfs2_xattr_get_nolock() ocfs2_calc_xattr_init() The same unchecked header derivation also exists in the outside-value probe, ibody remove, inline refcount attach, and inline reflink paths. This series factors the existing ibody list validation into a shared helper and then converts the remaining inline-header consumers one at a time. Patch layout: 1. validate ibody get/find and reuse the helper in ibody list 2. validate the outside-value probe 3. validate ibody remove 4. validate inline refcount attach 5. validate inline reflink This patch (of 5): [BUG] mknodat() can read past the end of a dinode block when ACL inheritance walks a corrupted inode-body xattr header. Another report shows the same unchecked lookup later faulting in the VFS open path after create returns a garbage status. KASAN: use-after-free in ocfs2_xattr_find_entry+0x37b/0x3a0 fs/ocfs2/xattr.c:1078 Read of size 2 at addr ffff88801c520300 by task syz.0.10/360 Trace: ... ocfs2_xattr_find_entry+0x37b/0x3a0 fs/ocfs2/xattr.c:1078 ocfs2_xattr_ibody_get fs/ocfs2/xattr.c:1178 [inline] ocfs2_xattr_get_nolock+0x2ee/0x1110 fs/ocfs2/xattr.c:1309 ocfs2_calc_xattr_init+0x716/0xac0 fs/ocfs2/xattr.c:628 ocfs2_mknod+0x935/0x2400 fs/ocfs2/namei.c:333 ocfs2_create+0x158/0x390 fs/ocfs2/namei.c:676 vfs_create fs/namei.c:3493 [inline] vfs_create+0x445/0x6f0 fs/namei.c:3477 do_mknodat+0x2d8/0x5e0 fs/namei.c:4372 __do_sys_mknodat fs/namei.c:4400 [inline] __se_sys_mknodat fs/namei.c:4397 [inline] __x64_sys_mknodat+0xb6/0xf0 fs/namei.c:4397 ... 
Another report: BUG: unable to handle page fault for address: fffffbfff3e40ec0 RIP: 0010:__d_entry_type include/linux/dcache.h:414 [inline] RIP: 0010:d_can_lookup include/linux/dcache.h:429 [inline] RIP: 0010:d_is_dir include/linux/dcache.h:439 [inline] RIP: 0010:path_openat+0xe2f/0x2ce0 fs/namei.c:4134 Trace: ... do_filp_open+0x1f6/0x430 fs/namei.c:4161 do_sys_openat2+0x117/0x1c0 fs/open.c:1437 __x64_sys_openat+0x15b/0x220 fs/open.c:1463 ... [CAUSE] ocfs2_xattr_ibody_list() already validates the inline xattr size and entry count, but ocfs2_xattr_ibody_get() and ocfs2_xattr_ibody_find() still derive the inline header directly from di->i_xattr_inline_size and then trust xh_count. A corrupted inline size or entry count can therefore move the computed header outside the dinode block before get/find start walking it. That can either make ocfs2_xattr_find_entry() dereference xs->header->xh_count outside the block or make ocfs2_xattr_get_nolock() bubble a garbage status back through ocfs2_calc_xattr_init() into the create/open path. [FIX] Factor the existing ibody header geometry checks into a shared helper. Use it in ocfs2_xattr_ibody_get() and ocfs2_xattr_ibody_find(), and have ocfs2_xattr_ibody_list() reuse the same helper instead of open-coding the validation. Reject corrupt ibody metadata with -EFSCORRUPTED before the lookup path can walk bogus xattr geometry or return a garbage status. 
Link: https://lore.kernel.org/20260508085914.61647-1-gality369@gmail.com Link: https://lore.kernel.org/20260508085914.61647-2-gality369@gmail.com Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: Zixuan Fu <r33s3n6@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
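The kind of geometry check that the series factors into a shared helper can be sketched as follows. This is an illustration under stated assumptions, not the OCFS2 helper: the function name, the 8-byte header size, the 16-byte entry size, the `body_start` parameter and the little-endian count at the start of the header are all hypothetical, and `SKETCH_EFSCORRUPTED` is a stand-in for the kernel's -EFSCORRUPTED.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EFSCORRUPTED 117	/* stand-in value; EUCLEAN on Linux */
#define SKETCH_HDR_SIZE 8u	/* assumed size of the xattr header */
#define SKETCH_ENTRY_SIZE 16u	/* assumed size of one xattr entry */

/*
 * Validate that an inline xattr area of "inline_size" bytes, placed at the
 * end of a block of "block_size" bytes, stays behind "body_start" and that
 * the entry count stored in its header fits inside the area. On success the
 * header offset is stored in *hdr_off_out and 0 is returned.
 */
static int sketch_validate_inline_header(const uint8_t *block, size_t block_size,
					 size_t body_start, size_t inline_size,
					 size_t *hdr_off_out)
{
	size_t hdr_off, count;

	assert(block != NULL && hdr_off_out != NULL);

	/* The inline area must hold a header and must not overlap the body. */
	if (body_start > block_size ||
	    inline_size < SKETCH_HDR_SIZE ||
	    inline_size > block_size - body_start)
		return -SKETCH_EFSCORRUPTED;

	hdr_off = block_size - inline_size;

	/* Only now is it safe to read the count: the header is in the block. */
	count = (size_t)block[hdr_off] | ((size_t)block[hdr_off + 1] << 8);
	if (count > (inline_size - SKETCH_HDR_SIZE) / SKETCH_ENTRY_SIZE)
		return -SKETCH_EFSCORRUPTED;

	*hdr_off_out = hdr_off;
	return 0;
}
```

The ordering is the point of the sketch: the size is checked before the header offset is derived, and the count is checked before any entry would be walked.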
3 daysocfs2: don't BUG_ON an invalid journal dinodeZhengYuan Huang1-5/+2
[BUG] A fuzzed OCFS2 image can corrupt the current slot journal dinode while mount is still in progress. The mount path first reports the invalid journal block and then crashes in shutdown: kernel BUG at fs/ocfs2/journal.c:1034! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034 Call Trace: ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116 ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline] ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083 get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698 get_tree_bdev+0x24/0x40 fs/super.c:1721 ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184 vfs_get_tree+0x9a/0x370 fs/super.c:1758 fc_mount fs/namespace.c:1199 [inline] do_new_mount_fc fs/namespace.c:3642 [inline] do_new_mount fs/namespace.c:3718 [inline] path_mount+0x5b8/0x1ea0 fs/namespace.c:4028 do_mount fs/namespace.c:4041 [inline] __do_sys_mount fs/namespace.c:4229 [inline] __se_sys_mount fs/namespace.c:4206 [inline] __x64_sys_mount+0x282/0x320 fs/namespace.c:4206 ... [CAUSE] ocfs2_journal_toggle_dirty() used to return -EIO when journal->j_bh no longer contained a valid dinode, because the startup and shutdown paths already handled that failure. Commit 10995aa2451a ("ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks.") changed the check to a BUG_ON() under the assumption that the journal dinode had already been validated. That turns an unexpected invalid journal dinode during mount teardown into a kernel crash instead of a normal mount failure. [FIX] Replace the BUG_ON() with WARN_ON() and return -EIO. This keeps the invariant warning for debugging, but restores the original behavior of failing startup or shutdown cleanly instead of panicking the kernel. 
Link: https://lore.kernel.org/20260512024115.4036371-1-gality369@gmail.com Fixes: 10995aa2451a ("ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks.") Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysocfs2: reject inconsistent inode size before truncateZhengYuan Huang1-9/+14
[BUG] openat(..., O_WRONLY|O_CREAT|O_TRUNC) can hit: kernel BUG at fs/ocfs2/file.c:454! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI RIP: 0010:ocfs2_truncate_file+0x1204/0x13c0 fs/ocfs2/file.c:454 Call Trace: ocfs2_setattr+0xa6d/0x1fd0 fs/ocfs2/file.c:1212 notify_change+0x4b5/0x1030 fs/attr.c:546 do_truncate+0x1d2/0x230 fs/open.c:68 handle_truncate fs/namei.c:3596 [inline] do_open fs/namei.c:3979 [inline] path_openat+0x260f/0x2ce0 fs/namei.c:4134 do_filp_open+0x1f6/0x430 fs/namei.c:4161 do_sys_openat2+0x117/0x1c0 fs/open.c:1437 do_sys_open fs/open.c:1452 [inline] __do_sys_openat fs/open.c:1468 [inline] __se_sys_openat fs/open.c:1463 [inline] __x64_sys_openat+0x15b/0x220 fs/open.c:1463 ... [CAUSE] ocfs2_truncate_file() treats di_bh->i_size matching inode->i_size as an internal code invariant and BUGs if it is broken. That assumption is too strong for corrupted metadata. The dinode block can still be structurally valid enough to pass ocfs2_read_inode_block() while no longer matching an already-instantiated VFS inode. On local mounts, ocfs2_inode_lock_update() skips refresh entirely, so truncate can observe the mismatch directly and crash instead of rejecting the corruption. [FIX] Turn the BUG_ON into normal OCFS2 corruption handling. If truncate sees di_bh->i_size disagree with inode->i_size, report it with ocfs2_error() and abort before touching truncate state. This keeps the fix at the first boundary that actually requires the sizes to match and avoids widening checks into hotter generic inode-lock paths. Link: https://lore.kernel.org/20260512021601.3936417-1-gality369@gmail.com Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daystreewide: fix indentation and whitespace in Kconfig filesAnand Moon1-3/+3
Clean up inconsistent indentation (mixing tabs and spaces) and remove extraneous whitespace in several Kconfig files across the tree. This is a purely cosmetic change to improve readability. Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Link: https://lore.kernel.org/20260407053945.14116-1-linux.amoon@gmail.com Signed-off-by: Anand Moon <linux.amoon@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> [fs] Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> [mm] Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> [mm] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 dayskunit: fat: test cluster and directory i_pos layout helpersAdi Nata1-0/+33
Add KUnit coverage for fat_clus_to_blknr() and fat_get_blknr_offset() using stub msdos_sb_info values so cluster-to-sector and i_pos split math stays correct. Link: https://lore.kernel.org/20260405011920.28622-1-adinata.softwareengineer@gmail.com Signed-off-by: Adi Nata <adinata.softwareengineer@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
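The arithmetic these tests guard can be sketched in plain C. This is an illustration modeled on the usual FAT layout math, not the KUnit test or the kernel helpers themselves: `sketch_fat_sb` is a hypothetical stand-in for the stubbed msdos_sb_info fields, and the assumption that data clusters are numbered from 2 is stated in the code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FAT_START_ENT 2	/* assumption: first data cluster is number 2 */

/* Hypothetical subset of the superblock info the helpers depend on. */
struct sketch_fat_sb {
	unsigned int sec_per_clus;	 /* sectors per cluster */
	uint64_t data_start;		 /* first sector of the data area */
	unsigned int dir_per_block;	 /* directory entries per block */
	unsigned int dir_per_block_bits; /* log2(dir_per_block) */
};

/* Cluster number to first block (sector) number of that cluster. */
static uint64_t sketch_clus_to_blknr(const struct sketch_fat_sb *sbi, uint32_t clus)
{
	return ((uint64_t)clus - SKETCH_FAT_START_ENT) * sbi->sec_per_clus
		+ sbi->data_start;
}

/* Split an i_pos value into a block number and an entry index in that block. */
static void sketch_get_blknr_offset(const struct sketch_fat_sb *sbi, uint64_t i_pos,
				    uint64_t *blknr, unsigned int *offset)
{
	assert(sbi->dir_per_block == 1u << sbi->dir_per_block_bits);
	*blknr = i_pos >> sbi->dir_per_block_bits;
	*offset = (unsigned int)(i_pos & (sbi->dir_per_block - 1));
}
```

With 8 sectors per cluster and a data area starting at sector 1000, cluster 5 maps to sector (5 - 2) * 8 + 1000 = 1024.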
3 daysocfs2: use kzalloc for quota recovery bitmap allocationTristan Madani1-1/+1
ocfs2 quota recovery allocates a bitmap buffer with kmalloc and does not fully initialize it. This can lead to use of uninitialized bits during quota recovery from a corrupted filesystem image. Use kzalloc instead to ensure the bitmap is zero-initialized. Link: https://lore.kernel.org/20260418131048.1052507-1-tristmd@gmail.com Reported-by: syzbot+7ea0b96c4ddb49fd1a70@syzkaller.appspotmail.com Signed-off-by: Tristan Madani <tristan@talencesecurity.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysproc: use strnlen() for name validation in __proc_createThorsten Blum1-3/+7
Replace strlen(fn) with strnlen(fn, NAME_MAX + 1) when validating the final path component in __proc_create(). This preserves the existing name limit while bounding the length scan to one byte past the maximum name length. Handle empty names separately, and treat names longer than NAME_MAX as too long. Link: https://lore.kernel.org/20260421122648.56723-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: wangzijie <wangzijie1@honor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
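The bounded scan described above can be sketched in user space. This is a minimal illustration, not the __proc_create() code: `SKETCH_NAME_MAX` mirrors the usual NAME_MAX of 255, and the enum return values are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for strnlen() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NAME_MAX 255	/* assumed to match NAME_MAX */

enum sketch_name_status {
	SKETCH_NAME_OK,
	SKETCH_NAME_EMPTY,
	SKETCH_NAME_TOO_LONG,
};

/*
 * Classify a name while scanning at most SKETCH_NAME_MAX + 1 bytes.
 * A result of SKETCH_NAME_MAX + 1 from strnlen() means "longer than the
 * limit" without having to walk the rest of the string.
 */
static enum sketch_name_status sketch_check_name(const char *fn)
{
	size_t len;

	assert(fn != NULL);
	len = strnlen(fn, SKETCH_NAME_MAX + 1);
	if (len == 0)
		return SKETCH_NAME_EMPTY;
	if (len > SKETCH_NAME_MAX)
		return SKETCH_NAME_TOO_LONG;
	return SKETCH_NAME_OK;
}
```

Scanning one byte past the limit is what lets the check distinguish a name of exactly NAME_MAX bytes from one that is too long.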
3 daysproc: rewrite next_tgid()Alexey Dobriyan1-24/+32
* Deduplicate the "iter.tgid += 1" line. Right now it is done once inside next_tgid() itself and a second time inside the "for" loop.
* Deduplicate the next_tgid() call itself with a different loop style:
    auto it = make_tgid_iter();
    while (next_tgid(&it)) { ... }
  gcc seems to inline it twice:
    $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
    add/remove: 0/1 grow/shrink: 1/0 up/down: 100/-245 (-145)
    Function old new delta
    proc_pid_readdir 531 631 +100
    next_tgid 245 - -245
* Make tgid_iter.pid_ns const; it never changes during readdir anyway.
[akpm@linux-foundation.org: remove newline] Link: https://lore.kernel.org/20260422191745.435556-2-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysproc: add tgid_iter.pid_ns memberAlexey Dobriyan1-6/+9
next_tgid() accepts the pid namespace as an argument, but it never changes during readdir (which would be an unthinkable thing to do anyway). Move it inside the iterator type and hide it from direct usage. Link: https://lore.kernel.org/20260422191745.435556-1-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days9p: Enable symlink caching in page cacheRemi Pommarel3-9/+87
Currently, when cache=loose is enabled, file reads are cached in the page cache, but symlink reads are not. This patch allows the results of p9_client_readlink() to be stored in the page cache, eliminating the need for repeated 9P transactions on subsequent symlink accesses. This change improves performance for workloads that involve frequent symlink resolution. Signed-off-by: Remi Pommarel <repk@triplefau.lt> Message-ID: <982462d17c0c0d2856763266a25eb04d080c1dbb.1779355927.git.repk@triplefau.lt> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
3 days9p: Set default negative dentry retention time for cache=looseRemi Pommarel1-0/+10
For cache=loose mounts, set the default negative dentry cache retention time to 24 hours. Signed-off-by: Remi Pommarel <repk@triplefau.lt> Message-ID: <b5beca3e70890ab8a4f0b9e99bd69cb97f5cb9eb.1779355927.git.repk@triplefau.lt> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
3 days9p: Add mount option for negative dentry cache retentionRemi Pommarel2-11/+28
Introduce a new mount option, negtimeout, for v9fs that allows users to specify how long negative dentries are retained in the cache. The retention time can be set in milliseconds (e.g. negtimeout=10000 for a 10-second retention time) or to a negative value (e.g. negtimeout=-1) to keep negative entries until the buffer cache management removes them. For consistency reasons, this option should only be used in exclusive or read-only mount scenarios, aligning with the cache=loose usage. Signed-off-by: Remi Pommarel <repk@triplefau.lt> Message-ID: <b2d66500aa5a2f6540347c4aa46a4be10dd01bc6.1779355927.git.repk@triplefau.lt> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
3 days9p: Cache negative dentries for lookup performanceRemi Pommarel7-24/+126
Not caching negative dentries can result in poor performance for workloads that repeatedly look up non-existent paths. Each such lookup triggers a full 9P transaction with the server, adding unnecessary overhead. A typical example is source compilation, where multiple cc1 processes are spawned and repeatedly search for the same missing header files over and over again. This change enables caching of negative dentries, so that lookups for known non-existent paths do not require a full 9P transaction. The cached negative dentries are retained for a configurable duration (expressed in milliseconds), as specified by the ndentry_timeout field in struct v9fs_session_info. If set to -1, negative dentries are cached indefinitely. This optimization reduces lookup overhead and improves performance for workloads involving frequent access to non-existent paths. Signed-off-by: Remi Pommarel <repk@triplefau.lt> Message-ID: <e542317dd03bbadb5249abd3ea6aecfdca692c19.1779355927.git.repk@triplefau.lt> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
3 days9p: avoid returning ERR_PTR(0) from mkdir operationsHongling Zeng2-15/+8
When mkdir succeeds, v9fs_vfs_mkdir_dotl() and v9fs_vfs_mkdir() return ERR_PTR(0), which is incorrect. They should return NULL for success and ERR_PTR() only with negative error codes for failure. Return NULL instead of passing err to ERR_PTR() when err is zero. This fixes the following smatch warnings: fs/9p/vfs_inode_dotl.c:420 v9fs_vfs_mkdir_dotl() warn: passing zero to 'ERR_PTR' fs/9p/vfs_inode.c:695 v9fs_vfs_mkdir() warn: passing zero to 'ERR_PTR' The v9fs_vfs_mkdir() code was further simplified because v9fs_create() can never return NULL, so we do not need to check for fid being set separately, and the error path can be a simple return immediately after v9fs_create() failure. There is no intended functional change. Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *") Suggested-by: David Laight <david.laight.linux@gmail.com> Acked-by: Christian Schoenebeck <linux_oss@crudebyte.com> Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn> Message-ID: <20260520022650.14217-1-zenghongling@kylinos.cn> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
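Why ERR_PTR(0) draws a warning even though nothing changes at run time can be shown with a user-space model of the error-pointer convention. This is a sketch under assumptions: the `sketch_*` helpers are hypothetical re-creations of the ERR_PTR()/IS_ERR() idea (errors encoded in the top 4095 addresses), not the kernel macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095	/* assumed size of the error-pointer range */

/* Encode a negative errno value as a pointer. */
static void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True when the pointer lies in the range reserved for encoded errors. */
static int sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/*
 * The pattern the commit moves to: encode only real (negative) errors and
 * spell out success as a plain NULL.
 */
static void *sketch_mkdir_result(int err)
{
	assert(err <= 0);
	if (err)
		return sketch_err_ptr(err);
	return NULL;
}
```

In this model an encoded zero is the same bit pattern as NULL and is not classified as an error, which is consistent with the commit's "no intended functional change"; the explicit NULL just states the success case directly.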
3 daysAutomated merge of 'dev' into 'next'Paul Moore2-9/+9
* dev:
  crypto: pkcs7: export verify_pkcs7_message_sig() as EXPORT_SYMBOL_GPL
  ipe: restore the kdoc comments for evaluate_property()
  hornet: depend on CONFIG_SECURITY and CONFIG_BPF_SYSCALL
  ipe: Add BPF program load policy enforcement via Hornet integration
  selftests/hornet: Add a selftest for the Hornet LSM
  hornet: Add a light skeleton data extractor scripts
  hornet: Introduce gen_sig
  lsm: introduce the Hornet LSM
  lsm: add additional enum values for bpf integrity checks
  lsm: framework for BPF integrity verification
  crypto: pkcs7: add tests for pkcs7_get_authattr
  crypto: pkcs7: add ability to extract signed attributes by OID
  crypto: pkcs7: add flag for validated trust on a signed info block
  security,fs,nfs,net: update security_inode_listsecurity() interface
4 daysNFSD: Increase the default max_block_size to 4MBChuck Lever1-3/+2
Commit 8a81f16de64f ("NFSD: Add a "default" block size") introduced NFSSVC_DEFBLKSIZE at 1MB, well below the 4MB NFSSVC_MAXBLKSIZE ceiling, with the stated intent that a later change would raise the default. Raising the default reduces per-RPC overhead on fast networks by amortizing header processing and scheduling costs across larger payloads. The halving loop in nfsd_get_default_max_blksize() constrains the returned value to 1/4096 of available RAM, so the new 4MB default takes effect only on systems with at least 16GB of RAM. Smaller machines continue to receive the same computed value as before. Administrators can still override the computed value through /proc/fs/nfsd/max_block_size. On systems where the new default takes effect, svc_sock_setbufsize() sizes each service socket's send and receive buffers as nreqs * max_mesg * 2. Quadrupling max_mesg therefore quadruples the per-socket buffer reservation at a fixed thread count, which operators tuning large thread pools should account for. Note well: Your NFS client implementation must support large read and write size settings to benefit from this change. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
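The "1/4096 of available RAM" constraint can be illustrated with a small halving loop. This is a sketch and not nfsd_get_default_max_blksize(): the function name is hypothetical, the 8KB floor is an assumption added so the loop terminates sensibly, and the sketch only demonstrates the 16GB threshold mentioned above rather than the exact values smaller machines receive.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DEFBLKSIZE (4u * 1024 * 1024)	/* the new 4MB default */
#define SKETCH_MIN_BLKSIZE (8u * 1024)		/* assumed lower bound */

/*
 * Start from the default and halve it until it no longer exceeds
 * 1/4096 of the given RAM size (or the assumed floor is reached).
 */
static uint32_t sketch_default_max_blksize(uint64_t ram_bytes)
{
	uint64_t target = ram_bytes >> 12;	/* 1/4096 of RAM */
	uint32_t ret = SKETCH_DEFBLKSIZE;

	while (ret > target && ret >= 2 * SKETCH_MIN_BLKSIZE)
		ret /= 2;

	assert(ret >= SKETCH_MIN_BLKSIZE);
	return ret;
}
```

With 16GB of RAM the target is 16GB / 4096 = 4MB, so the loop body never runs and the full 4MB default is returned.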
4 daysNFSD: Close cached file handles when revoking export stateChuck Lever3-2/+50
When NFSD_CMD_UNLOCK_EXPORT revokes NFSv4 state for an export path, GC-managed nfsd_file entries for files under that path may remain in the file cache. These cached handles hold the underlying filesystem busy, preventing a subsequent unmount. Add nfsd_file_close_export(), which walks the nfsd_file hash table and closes GC-eligible entries whose underlying file resides on the same filesystem and is a descendant of the export path. Because nfsd_file entries do not carry an export reference, the ancestry check uses is_subdir() on the file's dentry. False positives -- closing a cached handle that did not originate from the target export -- are harmless; the handle is simply reopened on the next access. The handler calls nfsd_file_close_export() before revoking NFSv4 state, mirroring the order used by NFSD_CMD_UNLOCK_FILESYSTEM (which cancels copies and releases NLM locks before revoking state). Both calls run under nfsd_mutex. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Add NFSD_CMD_UNLOCK_EXPORT netlink commandChuck Lever6-0/+149
When a filesystem is exported to NFS clients, NFSv4 state (opens, locks, delegations, layouts) holds references that prevent the underlying filesystem from being unmounted. NFSD_CMD_UNLOCK_FILESYSTEM addresses this at superblock granularity, but administrators unexporting a single path on a shared filesystem (e.g., one of several exports on the same device) need finer control. Add NFSD_CMD_UNLOCK_EXPORT, which revokes NFSv4 state acquired through exports of a specific path. Matching is by path identity (dentry + vfsmount) via the sc_export field on each nfs4_stid, so multiple svc_export objects for the same path -- one per auth_domain -- are handled correctly without requiring the caller to name a specific client. The command takes a single "path" attribute. Userspace (exportfs -u) sends this after removing the last client for a given path, enabling the underlying filesystem to be unmounted. When multiple clients share an export path, individual unexports do not trigger state revocation; only the final one does. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Track svc_export in nfs4_stidChuck Lever3-3/+42
Add an sc_export field to struct nfs4_stid so that each stateid records the export under which it was acquired. The export reference is taken via exp_get() at stateid creation and released via exp_put() in nfs4_put_stid(). Open stateids record the export from current_fh->fh_export. Lock stateids and delegations inherit the export from their parent open stateid. Layout stateids inherit from their parent stateid. Directory delegations record the export from cstate->current_fh. A subsequent commit uses sc_export to scope state revocation to a specific export, avoiding the need to walk inode dentry aliases at revocation time. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Replace idr_for_each_entry_ul in find_one_sb_stid()Chuck Lever1-2/+4
Replace idr_for_each_entry_ul() with a while loop over idr_get_next_ul() for consistency with find_one_export_stid(), added in a subsequent commit. No change in behavior. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Add NFSD_CMD_UNLOCK_FILESYSTEM netlink commandChuck Lever3-0/+53
Add NFSD_CMD_UNLOCK_FILESYSTEM as a dedicated netlink command for revoking NFS state under a filesystem path, providing a netlink equivalent of /proc/fs/nfsd/unlock_fs. The command requires a "path" string attribute containing the filesystem path whose state should be released. The handler resolves the path to its superblock, then cancels async copies, releases NLM locks, and revokes NFSv4 state on that superblock. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Add NFSD_CMD_UNLOCK_IP netlink commandChuck Lever4-7/+59
The existing write_unlock_ip procfs interface releases NLM file locks held by a specific client IP address, but procfs provides no structured way to extend that operation to other scopes such as revoking NFSv4 state. Add NFSD_CMD_UNLOCK_IP as a dedicated netlink command for releasing NLM locks by client address. The command accepts a binary sockaddr_in or sockaddr_in6 in its address attribute. The handler validates the address family and length, then calls nlmsvc_unlock_all_by_ip() to release matching NLM locks. Because lockd is a single global instance, that call operates across all network namespaces regardless of which namespace the caller inhabits. A separate netlink command for filesystem-scoped unlock is added in a subsequent commit. The nfsd_ctl_unlock_ip tracepoint is updated from string-based address logging to __sockaddr, which stores the binary sockaddr and formats it with %pISpc. This affects both the new netlink path and the existing procfs write_unlock_ip path, giving consistent structured output in both cases. Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Extract revoke_one_stid() utility functionChuck Lever1-76/+75
The per-stateid revocation logic in nfsd4_revoke_states() handles four stateid types in a deeply nested switch. Extract two helpers: revoke_ol_stid() performs admin-revocation of an open or lock stateid with st_mutex already held: marks the stateid as SC_STATUS_ADMIN_REVOKED, closes POSIX locks for lock stateids, and releases file access. revoke_one_stid() dispatches by sc_type, acquires st_mutex with the appropriate lockdep class for open and lock stateids, and handles delegation unhash and layout close inline. No functional change. Preparation for adding export-scoped state revocation which reuses revoke_one_stid(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Handle layout stid in nfsd4_drop_revoked_stid()Chuck Lever1-0/+7
nfsd4_drop_revoked_stid() has no SC_TYPE_LAYOUT case, so when a client sends FREE_STATEID for an admin-revoked layout stid, the default branch releases cl_lock and returns without unhashing or releasing the stid. The stid remains in the IDR and on the per-client list until the client is destroyed. Remove the layout stid from the per-client list and call nfs4_put_stid() to drop the creation reference. When the refcount reaches zero, nfsd4_free_layout_stateid() handles the remaining cleanup: cancelling the fence worker, removing from the per-file list, and freeing the slab object. Fixes: 1e33e1414bec ("nfsd: allow layout state to be admin-revoked.") Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Put cache get-reqs dump attrs under replyChuck Lever1-20/+6
The new get-reqs dump operations added to sunrpc_cache.yaml and nfsd.yaml place the "requests" nested attribute under dump.request. A netlink dump carries an empty request; its payload travels back in the reply. Because the spec names no reply attributes, the YNL C code generator synthesizes a forward reference to a <op>_rsp struct that is never defined, breaking any consumer of these specs. This first surfaced when Thorsten Leemhuis built tools/net/ynl against -next: nfsd-user.h:746: error: field 'obj' has incomplete type struct nfsd_svc_export_get_reqs_rsp obj ... nfsd-user.h:826: error: field 'obj' has incomplete type struct nfsd_expkey_get_reqs_rsp obj ... nfsd-user.c:1211: error: 'nfsd_svc_export_get_reqs_rsp_parse' undeclared sunrpc_cache.yaml has the same defect in ip-map-get-reqs and unix-gid-get-reqs, but nfsd.yaml errors out first in the Makefile's alphabetical build order and hides the sunrpc failures. These bugs were introduced by incorrect merge conflict resolution. Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Closes: https://lore.kernel.org/linux-nfs/f6a3ca6d-e5cb-4a5c-9af2-8d2b1ce33ef0@leemhuis.info/ Fixes: 1045ccf519ce30 ("sunrpc: add netlink upcall for the auth.unix.ip cache") Tested-by: Thorsten Leemhuis <linux@leemhuis.info> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysnfsd: add NFSD_CMD_CACHE_FLUSH netlink commandJeff Layton3-0/+49
Add a new NFSD_CMD_CACHE_FLUSH generic netlink command that allows userspace to flush the nfsd export caches (svc_export and expkey) without writing to /proc/net/rpc/*/flush. An optional NFSD_A_CACHE_FLUSH_MASK u32 attribute selects which caches to flush (bit 1 = svc_export, bit 2 = expkey). If the attribute is omitted, all nfsd caches are flushed. This is used by exportfs to replace its /proc-based cache_flush() with a netlink equivalent, with /proc fallback for older kernels. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysnfsd: add netlink upcall for the nfsd.fh cacheJeff Layton3-0/+304
Add netlink-based cache upcall support for the expkey (nfsd.fh) cache, following the same pattern as the existing svc_export netlink support. Add expkey to the cache-type enum, a new expkey attribute-set with client, fsidtype, fsid, negative, expiry, and path fields, and the expkey-get-reqs / expkey-set-reqs operations to the nfsd YAML spec and generated headers. Implement nfsd_nl_expkey_get_reqs_dumpit() which snapshots pending expkey cache requests and sends each entry's seqno, client name, fsidtype, and fsid over netlink. Implement nfsd_nl_expkey_set_reqs_doit() which parses expkey cache responses from userspace (client, fsidtype, fsid, expiry, and path or negative flag) and updates the cache via svc_expkey_lookup() / svc_expkey_update(). Wire up the expkey_notify() callback in svc_expkey_cache_template so cache misses trigger NFSD_CMD_CACHE_NOTIFY multicast events with NFSD_CACHE_TYPE_EXPKEY. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysnfsd: add netlink upcall for the svc_export cacheJeff Layton5-5/+543
Add netlink-based cache upcall support for the svc_export (nfsd.export) cache to Documentation/netlink/specs/nfsd.yaml and regenerate the resulting files. Implement nfsd_cache_notify() which sends a NFSD_CMD_CACHE_NOTIFY multicast event to the "exportd" group, carrying the cache type so userspace knows which cache has pending requests. Implement nfsd_nl_svc_export_get_reqs_dumpit() which snapshots pending svc_export cache requests and sends each entry's seqno, client name, and path over netlink. Implement nfsd_nl_svc_export_set_reqs_doit() which parses svc_export cache responses from userspace (client, path, expiry, flags, anon uid/gid, fslocations, uuid, secinfo, xprtsec, fsid, or negative flag) and updates the cache via svc_export_lookup() / svc_export_update(). Wire up the svc_export_notify() callback in svc_export_cache_template so cache misses trigger NFSD_CMD_CACHE_NOTIFY multicast events with NFSD_CACHE_TYPE_SVC_EXPORT. Note that the export-flags and xprtsec-mode enums are organized to match their counterparts in include/uapi/linux/nfsd/export.h. The intent is that future export options will only be added to the netlink headers, which should eliminate the need to keep so much in sync. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 dayssunrpc: rename sunrpc_cache_pipe_upcall_timeout()Jeff Layton2-3/+3
This function doesn't have anything to do with a timeout. The only difference is that it warns if there are no listeners. Rename it to sunrpc_cache_upcall_warn(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 dayssunrpc: rename sunrpc_cache_pipe_upcall() to sunrpc_cache_upcall()Jeff Layton1-2/+2
Since it will soon also send an upcall via netlink, if configured. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysnfsd: move struct nfsd_genl_rqstp to nfsctl.cJeff Layton2-15/+15
It's not used outside of that file. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysNFSD: Fix delegation reference leak in nfsd4_revoke_statesChuck Lever1-1/+8
When revoking delegation state, nfsd4_revoke_states() takes an extra reference on the stid before calling unhash_delegation_locked(). If unhash_delegation_locked() returns false (the delegation was already unhashed by a concurrent path), dp is set to NULL and revoke_delegation() is skipped, but the extra reference is never released. Each occurrence permanently pins the stid in memory. The leaked reference also prevents nfs4_put_stid() from decrementing cl_admin_revoked, leaving the counter permanently inflated. Drop the extra reference in the failure path. Fixes: 8dd91e8d31fe ("nfsd: fix race between laundromat and free_stateid") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
4 daysMerge remote-tracking branches 'vfs/vfs-7.2.casefold', ↵Chuck Lever56-161/+800
'vfs/vfs-7.2.directory.delegations' and 'vfs/vfs-7.2.exportfs' into vfs-7.2-merge
4 daysvirtiofs: fix UAF on submount umountMiklos Szeredi1-1/+7
iput() called from fuse_release_end() can Oops if the super block has already been destroyed. Normally this is prevented by waiting for num_waiting to go down to zero before commencing with super block shutdown. This only works, however, for the last submount instance, as the wait counter is per connection, not per superblock. Revert to using synchronous release requests for the auto_submounts case, which is virtiofs only at this time. Reported-by: Aurélien Bombo <abombo@microsoft.com> Cc: Greg Kurz <gkurz@redhat.com> Closes: https://github.com/kata-containers/kata-containers/issues/12589 Fixes: 26e5c67deb2e ("fuse: fix livelock in synchronous file put from fuseblk workers") Cc: stable@vger.kernel.org Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
4 daysntfs3: fix out-of-bounds read in ntfs_dir_emit() and hdr_find_e()Alessandro Schino2-1/+7
The bounds check in ntfs_dir_emit() compares fname->name_len (a character count) against e->size (a byte count) without accounting for the 2-byte-per-character UTF-16LE encoding or the ATTR_FILE_NAME header size: if (fname->name_len + sizeof(struct NTFS_DE) > le16_to_cpu(e->size)) This computes: name_len + 16 > e_size The correct check must account for the ATTR_FILE_NAME header (66 bytes before the name) and the UTF-16LE character size (2 bytes each): sizeof(NTFS_DE) + offsetof(ATTR_FILE_NAME, name) + name_len * sizeof(short) > e_size Which computes: 16 + 66 + name_len * 2 > e_size The correct calculation already exists as fname_full_size() in ntfs.h and is used in cmp_fnames(), namei.c, and fslog.c, but was not used in the readdir path. A crafted NTFS image with an index entry containing a small e->size but large fname->name_len bypasses the current check, causing ntfs_utf16_to_nls() to read past the entry boundary. Additionally, add a key_size validation in hdr_find_e() to ensure the declared key_size does not exceed the available entry data, preventing comparison functions from reading past entry boundaries on the lookup path. Signed-off-by: Alessandro Schino <7991aleschino@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
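The gap between the two checks described above can be reproduced in standalone C. This is only a sketch: the constants 16 and 66 are taken from the arithmetic in the message, while the function names and the sample entry are hypothetical and are not part of ntfs3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes quoted in the commit message, hard-coded here as assumptions. */
#define DE_HEADER_SIZE    16u /* sizeof(struct NTFS_DE) */
#define FNAME_NAME_OFFSET 66u /* offsetof(struct ATTR_FILE_NAME, name) */

/* Old check: compares a character count against a byte count. */
bool old_check_rejects(unsigned int name_len, unsigned int e_size)
{
	return name_len + DE_HEADER_SIZE > e_size;
}

/* Fixed check: both headers plus two bytes per UTF-16LE code unit. */
bool new_check_rejects(unsigned int name_len, unsigned int e_size)
{
	size_t need = DE_HEADER_SIZE + FNAME_NAME_OFFSET +
		      (size_t)name_len * sizeof(uint16_t);

	return need > e_size;
}

/* Hypothetical crafted entry: 100 characters declared in 120 bytes. */
void crafted_entry_example(void)
{
	assert(!old_check_rejects(100, 120)); /* 116 > 120 is false: accepted */
	assert(new_check_rejects(100, 120));  /* 282 > 120 is true: rejected */
}
```

The old form lets the oversized name through because it never multiplies by the character width; the fixed form rejects it before any conversion routine would read the name.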
4 daysfs/ntfs3: fix mount failure on 64K page-size kernelsJamie Nguyen1-5/+1
On 64K page-size kernels, mounting NTFS volumes smaller than ~650 MB fails with EINVAL. The issue is in log_replay(): the initial log page size probe uses PAGE_SIZE (65536) instead of DefaultLogPageSize (4096) when PAGE_SIZE exceeds DefaultLogPageSize * 2. This makes norm_file_page() require the $LogFile to be at least 50 * 65536 = 3.2 MB, but mkfs.ntfs creates a $LogFile of only ~1.5 MB for a typical 300 MB volume. norm_file_page() returns 0 and the mount is rejected with EINVAL. On 4K kernels the #if guard evaluates to true, so use_default=true is passed and DefaultLogPageSize (4096) is used, requiring only ~200 KB. This path works fine. Fix this by always passing use_default=true, which forces the initial probe to use DefaultLogPageSize regardless of the kernel's PAGE_SIZE. This is safe because, after reading the on-disk restart area, log_replay() already re-adjusts log->page_size to match the volume's actual sys_page_size. Also fix read_log_page() to pass log->page_size instead of PAGE_SIZE to ntfs_fix_post_read(), matching the actual buffer size. Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Tested-by: Matthew R. Ochs <mochs@nvidia.com> Signed-off-by: Jamie Nguyen <jamien@nvidia.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
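The size arithmetic in the message above can be checked with a few lines of C. This is a simplified model, assuming the probe only requires 50 pages of the probed size; the helper names are hypothetical and the real norm_file_page() does more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIN_LOG_PAGES 50u /* multiplier quoted in the commit message */

/* Smallest $LogFile that passes the probe for a given page size. */
uint64_t min_logfile_bytes(uint32_t probe_page_size)
{
	return (uint64_t)MIN_LOG_PAGES * probe_page_size;
}

/* Simplified acceptance test: is the log file large enough? */
bool probe_accepts(uint64_t logfile_bytes, uint32_t probe_page_size)
{
	return logfile_bytes >= min_logfile_bytes(probe_page_size);
}

void logfile_probe_example(void)
{
	uint64_t logfile = 1536u * 1024u; /* roughly the 1.5 MB from the message */

	assert(min_logfile_bytes(65536) == 3276800); /* about 3.2 MB */
	assert(min_logfile_bytes(4096) == 204800);   /* about 200 KB */
	assert(!probe_accepts(logfile, 65536));      /* 64K probe: rejected */
	assert(probe_accepts(logfile, 4096));        /* 4K probe: accepted */
}
```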
4 daysntfs3: avoid another -Wmaybe-uninitialized warningArnd Bergmann1-2/+2
The ntfs3-specific -Wmaybe-uninitialized flag found one more false positive, this time with gcc-10 on s390: fs/ntfs3/frecord.c: In function 'ni_expand_list': fs/ntfs3/frecord.c:1370:16: error: 'ins_attr' may be used uninitialized in this function [-Werror=maybe-uninitialized] Add an explicit NULL pointer check before using the pointer, and initialize it to NULL. Fixes: 48d9b57b169f ("fs/ntfs3: add a subset of W=1 warnings for stricter checks") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysntfs3: Allocate iomap inline_data using alloc_pageMihai Brodschi2-5/+9
This fixes a BUG reported in iomap_write_end_inline: iomap_inline_data_valid checks that the inline_data fits within a page. If the inline_data is allocated with kmemdup there's no guarantee that it's page-aligned, so the check sometimes fails. Allocate it with alloc_page to ensure it's page-aligned. Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221446 Fixes: 099ef9a ("fs/ntfs3: implement iomap-based file operations") Signed-off-by: Mihai Brodschi <m.brodschi@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: format code, deal with commentsKonstantin Komarov5-25/+23
Format code according to .clang-format, add useful comments, and remove unhelpful ones. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: reject SEEK_DATA and SEEK_HOLE past EOF earlyKonstantin Komarov2-10/+21
Handle non-data/hole seeks through generic_file_llseek_size() and return -ENXIO immediately when SEEK_DATA or SEEK_HOLE is requested at or past EOF. Handle compressed files in such cases properly as well. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: fold file size handling into ntfs_set_size()Konstantin Komarov2-148/+51
Remove the separate ntfs_extend() and ntfs_truncate() helpers and route file size changes through ntfs_set_size(). This consolidates ntfs3 size updates in one place and lets the write, fallocate, and setattr paths share the same logic for updating i_size, valid data length, and preallocated extents. This patch fixes a few issues found during internal tests. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: force waiting for direct I/O completionKonstantin Komarov1-1/+2
Make ntfs3 wait for direct I/O completion before returning to the caller, instead of allowing the write path to complete asynchronously. The issue was discovered during internal tests. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: fold resident writeback into writepages loopKonstantin Komarov1-27/+15
Remove the separate ntfs_resident_writepage() helper and handle resident writeback directly from ntfs_writepages(). This simplifies the resident writeback path and keeps the folio handling local to ntfs_writepages(). Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: handle delayed allocation overlap in run lookupKonstantin Komarov3-12/+69
Introduce run_lookup_entry_da() to look up data runs while taking delayed allocation into account. ntfs3 may have both committed extents and delayed allocation extents for the same VCN range. The new helper checks delayed allocation first and falls back to the real run, then corrects the returned range when a real run overlaps with a delayed allocation run. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: zero stale pagecache beyond valid data lengthKonstantin Komarov1-2/+28
Zero cached folios beyond the valid data length when closing a writable mapping. This keeps cached data beyond initialized file contents zeroed and prevents stale pagecache exposure after mmap-based writes. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysfs/ntfs3: add fileattr supportKonstantin Komarov4-0/+91
Implement fileattr_get() and fileattr_set() to fix a problem found during internal testing. This allows ntfs3 to expose and modify inode flags through the generic file attribute interface used by FS_IOC_GETFLAGS and FS_IOC_SETFLAGS. Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
4 daysPull fanotify_error_event_equal() cleanupJan Kara1-4/+1
4 daysfanotify: simplify fanotify_error_event_equalThorsten Blum1-4/+1
Return the result of calling fanotify_fsid_equal() directly to simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20260527142233.1256340-3-thorsten.blum@linux.dev Signed-off-by: Jan Kara <jack@suse.cz>
4 daysxfs: Remove mention of PageWritebackMatthew Wilcox (Oracle)1-7/+7
Update a comment to refer to folios instead of pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 daysxfs: abort mount if xfs_fs_reserve_ag_blocks failsChristoph Hellwig1-2/+5
xfs_mountfs currently ignores all errors from xfs_fs_reserve_ag_blocks, which can lead to the mount path continuing on corruption errors. Fix the check to only ignore -ENOSPC as in other callers, and unwind for all other errors. Fixes: 81ed94751b15 ("xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
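The error filtering described above amounts to a very small decision, sketched here in userspace C. The function name is hypothetical, and -EIO merely stands in for whatever non-ENOSPC error the reservation might return.

```c
#include <assert.h>
#include <errno.h>

/*
 * Map the result of a reservation attempt to a mount decision:
 * 0 means keep mounting, a negative errno means unwind and fail.
 */
int mount_error_after_reserve(int error)
{
	if (error == -ENOSPC)
		return 0;	/* tolerated, as in the other callers */
	return error;		/* everything else aborts the mount */
}

void reserve_error_example(void)
{
	assert(mount_error_after_reserve(0) == 0);
	assert(mount_error_after_reserve(-ENOSPC) == 0);
	assert(mount_error_after_reserve(-EIO) == -EIO);
}
```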
4 daysxfs: factor rtgroup geom write pointer reporting into a helperChristoph Hellwig1-16/+22
Sticks out a bit better if we add a separate helper for it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 daysxfs: drop the RTG reference later in xfs_ioc_rtgroup_geometryChristoph Hellwig1-4/+5
Keep the rtgroup reference until after reporting the write pointer, as that uses it. Right now this is not a major issue as we don't support shrinking file systems in a way that makes RTGs go away, but let's stick to the proper reference counting to prepare for that. Fixes: c6ce65cb17aa ("xfs: add write pointer to xfs_rtgroup_geometry") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 daysxfs: fix rtgroup cleanup in CoW fork repairYingjie Gao1-4/+1
xrep_cow_find_bad_rt() initializes scrub rtgroup state before the force-rebuild path calls xrep_cow_mark_file_range(). If that call fails, the code jumps directly to out_rtg, which skips the scrub rtgroup cleanup and only drops the local rtgroup reference. Remove the unnecessary jump so the function falls through to out_sr, ensuring the realtime cursors, lock state, and sr->rtg reference are released before returning. Fixes: fd97fe111208 ("xfs: fix CoW forks for realtime files") Cc: <stable@vger.kernel.org> # v6.14 Signed-off-by: Yingjie Gao <gaoyingjie@uniontech.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 daysxfs: fix error returns in CoW fork repairYingjie Gao1-5/+2
xrep_cow_find_bad() returns success after the cleanup labels even if AG setup, btree queries, or bitmap updates failed. This can make repair continue with an incomplete bad-file-offset bitmap instead of stopping at the original error. The force-rebuild path has a related cleanup problem. If xrep_cow_mark_file_range() fails, the function returns directly and skips the scrub AG context and perag cleanup. Let the force-rebuild path fall through to the existing cleanup code and return the saved error after cleanup. Fixes: dbbdbd008632 ("xfs: repair problems in CoW forks") Cc: <stable@vger.kernel.org> # v6.8 Signed-off-by: Yingjie Gao <gaoyingjie@uniontech.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
4 dayssmb: client: fix chmod and chgrp with SMB3.1.1 POSIX ExtensionsSteve French2-3/+18
chmod and chgrp were being ignored when mounting with the SMB3.1.1 POSIX Extensions. Add support for chmod and chgrp when mounting with the SMB3.1.1 POSIX Extensions. Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayscifs: validate full SID length in security descriptorsQihang1-67/+129
parse_sid() only verified that the fixed SID header fit in the returned security descriptor, but did not verify that the full SID body described by num_subauth was present. A malicious server can return a truncated owner or group SID whose header lies within the descriptor buffer while sub_auth[] extends past the end of the allocation, leading to an out-of-bounds read when the client later parses or copies that SID. Validate the full SID body in parse_sid(), centralize owner/group SID lookup and bounds checking in sid_from_sd(), and use that validation in parse_sec_desc(), build_sec_desc(), and copy_sec_desc() before sub_auth[] is accessed. Signed-off-by: Qihang <q.h.hack.winter@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>
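A full-length SID check of the kind described above might look like the following sketch. It assumes the usual SID layout (one revision byte, one sub-authority count byte, a 6-byte authority, then 4-byte sub-authorities); sid_fits() and its parameters are hypothetical names, not the helpers added by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SID_BASE_SIZE 8u /* revision, num_subauth, 6-byte authority */

/* True if the whole SID at sid_off, including sub_auth[], lies in buf. */
bool sid_fits(const uint8_t *buf, size_t buf_len, size_t sid_off)
{
	size_t remaining, num_subauth;

	if (sid_off > buf_len)
		return false;
	remaining = buf_len - sid_off;
	if (remaining < SID_BASE_SIZE)
		return false;	/* fixed header does not fit */
	num_subauth = buf[sid_off + 1];
	/* The body must fit too, not just the header. */
	return remaining - SID_BASE_SIZE >= num_subauth * sizeof(uint32_t);
}

void truncated_sid_example(void)
{
	uint8_t sd[16] = { 1, 2 }; /* revision 1, two sub-authorities */

	assert(sid_fits(sd, sizeof(sd), 0));	/* 8 + 2 * 4 == 16 bytes */
	sd[1] = 3;				/* now claims 20 bytes */
	assert(!sid_fits(sd, sizeof(sd), 0));
}
```

Subtracting from the remaining length, rather than adding to the offset, keeps the comparison free of overflow even for hostile offsets.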
4 dayssmb: client: resolve SWN tcon from live registrationsMichael Bommarito2-54/+262
cifs_swn_notify() looks up a witness registration by id under cifs_swnreg_idr_mutex, drops the mutex, and then uses the registration's cached tcon pointer. That pointer is not a lifetime reference, and it is not a stable representative once cifs_get_swn_reg() lets multiple tcons for the same net/share name share one registration id. A same-share second mount can keep the cifs_swn_reg alive after the first tcon unregisters and is freed. The registration then still points at the freed first tcon, so taking tc_lock or incrementing tc_count through swnreg->tcon only moves the use-after-free earlier. Taking tc_lock while holding cifs_swnreg_idr_mutex also violates the documented CIFS lock order. Fix this by making the registration store only the stable witness identity: id, net name, share name, and notify flags. When a notify arrives, copy that identity under cifs_swnreg_idr_mutex, drop the mutex, then find and pin a live witness tcon that currently matches the net/share pair under the normal cifs_tcp_ses_lock -> tc_lock order. The notification path uses that pinned tcon directly and drops the reference when done. Registration and unregister messages now use the live tcon passed by the caller instead of a cached tcon in the registration. The final unregister send is folded into cifs_swn_unregister() while the registration is still protected by cifs_swnreg_idr_mutex. This removes the previous find/drop/reacquire raw-pointer window. The release path only removes the idr entry and frees the stable identity strings. This preserves the intended one-registration/many-tcon behavior: a registration id represents a net/share pair, and notify handling acts on a live representative selected at use time. It also preserves CLIENT_MOVE ordering for the representative tcon because the old-IP unregister is sent before cifs_swn_register() sends the new-IP register. 
Fixes: fed979a7e082 ("cifs: Set witness notification handler for messages from userspace daemon") Cc: stable@vger.kernel.org Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayscifs: remove all cifs files before kill superJian Zhang1-0/+3
CIFS files may be put into fileinfo_put_wq while cifs is being unmounted. After the umount is done, cifsFileInfo_put_final is called, which causes the following BUG:
  BUG: kernel NULL pointer dereference, address: 0000000000000000
  ...
  [ 134.222152] list_lru_add+0x64/0x1a0
  [ 134.222399] ? cifs_put_tcon+0x171/0x340 [cifs]
  [ 134.222772] d_lru_add+0x44/0x60
  [ 134.222997] dput+0x1fc/0x210
  [ 134.223213] cifsFileInfo_put_final+0x11a/0x140 [cifs]
  [ 134.223576] process_one_work+0x17c/0x320
  [ 134.223843] worker_thread+0x188/0x280
  [ 134.224084] ? __pfx_worker_thread+0x10/0x10
  [ 134.224366] kthread+0xcc/0x100
  [ 134.224576] ? __pfx_kthread+0x10/0x10
  [ 134.224827] ret_from_fork+0x30/0x50
  [ 134.225063] ? __pfx_kthread+0x10/0x10
  [ 134.225328] ret_from_fork_asm+0x1b/0x30
This can be reproduced with the following:
  unshare -n bash -c "
  mkdir -p ${CIFS_MNT}
  ip netns attach root 1
  ip link add eth0 type veth peer veth0 netns root
  ip link set eth0 up
  ip -n root link set veth0 up
  ip addr add 192.168.0.2/24 dev eth0
  ip -n root addr add 192.168.0.1/24 dev veth0
  ip route add default via 192.168.0.1 dev eth0
  ip netns exec root sysctl net.ipv4.ip_forward=1
  ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
  mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1
  touch ${CIFS_MNT}/a.txt
  ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
  "
  umount ${CIFS_MNT}
Fixes: 340cea84f691 ("cifs: open files should not hold ref on superblock") Signed-off-by: Jian Zhang <zhangjian496@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayssmb: client: fix conflicting option validation for new mount APIHenrique Carvalho1-49/+53
Apply conflicting option validation consistently across all the new mount API paths, for both mount and remount. Some checks were only applied during initial mount validation, while others were handled during option parsing, causing mount and remount/reconfigure to behave differently. Move the conflicting option checks into smb3_handle_conflicting_options() and call it from the common validation paths, including for multichannel/max_channels handling. Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayscifs: invalidate cfid on unlink/rename/rmdirShyam Prasad N1-1/+29
Today we do not invalidate the cached_dirent or the entire parent cfid when a dentry in a dir has been removed/moved. This change invalidates the parent cfid so that we don't serve directory contents from the cache. Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayssmb: client: fix uninitialized variable in smb2_writev_callbackSteve French1-1/+1
Compiling with W=2 pointed out that "written may be used uninitialized". Fixes: 20d72b00ca81 ("netfs: Fix the request's work item to not require a ref") Cc: stable@vger.kernel.org Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
4 dayssmb: client: detect short folioq copy in cifs_copy_folioq_to_iter()Jeremy Erazo1-3/+15
cifs_copy_folioq_to_iter() copies a requested number of bytes from a folio queue into the destination iterator. Since the encrypted SMB2 READ path was changed to pass the server-declared payload length (data_len) instead of the larger folioq buffer length, the caller can ask for fewer bytes than the folio queue holds. In that case the helper continues walking the remaining folios after data_size has reached zero and calls copy_folio_to_iter() with len = 0, which is unnecessary work. The helper also returns 0 (success) when the folio queue is exhausted before data_size bytes have been copied. The caller has no way to distinguish that from a full copy and the reported transfer count ends up larger than the amount of data placed in the iterator. Add an early exit when data_size reaches zero, and return an error when the folio queue is exhausted before all requested bytes have been copied. Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
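The two behaviours added above (stop once the request is satisfied, fail if the source runs dry) can be illustrated with a plain array of buffers standing in for the folio queue. The struct, the function name, and the choice of -EIO are assumptions made for this sketch; the message does not say which error code the patch returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one folio in the queue. */
struct chunk {
	const char *data;
	size_t len;
};

/*
 * Copy exactly `want` bytes from `chunks` into `dst`.
 * Returns 0 on a full copy, -EIO if the chunks run out first.
 */
int copy_chunks(char *dst, const struct chunk *chunks, size_t nchunks,
		size_t want)
{
	assert(dst && chunks);

	for (size_t i = 0; i < nchunks; i++) {
		size_t n;

		if (want == 0)
			break;	/* early exit: nothing left to copy */
		n = chunks[i].len < want ? chunks[i].len : want;
		memcpy(dst, chunks[i].data, n);
		dst += n;
		want -= n;
	}
	return want ? -EIO : 0;	/* short source is an error, not success */
}

void short_copy_example(void)
{
	const struct chunk q[] = { { "abcd", 4 }, { "efgh", 4 } };
	char out[16] = { 0 };

	assert(copy_chunks(out, q, 2, 6) == 0);		/* stops inside chunk 2 */
	assert(memcmp(out, "abcdef", 6) == 0);
	assert(copy_chunks(out, q, 2, 10) == -EIO);	/* only 8 bytes exist */
}
```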
5 daysntfs: Fix spelling mistake "etnry" -> "entry"Colin Ian King1-1/+1
There is a spelling mistake in a ntfs_error message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
5 daysksmbd: fix FSCTL permission bypass by adding a permission check for ↵Sean Shen1-0/+11
FSCTL_SET_SPARSE FSCTL_SET_SPARSE in fsctl_set_sparse() modifies the file's sparse attribute and saves it through xattr without any permission checks. This exposes two issues: 1) A client on a read-only share can change the sparse attribute on files it opened, even though the share is read-only. Other FSCTL write operations already check test_tree_conn_flag(work->tcon, KSMBD_TREE_CONN_FLAG_WRITABLE), but FSCTL_SET_SPARSE does not. 2) Even on writable shares, clients without FILE_WRITE_DATA or FILE_WRITE_ATTRIBUTES access should not modify the sparse attribute. Similar handle-level checks exist in other functions but are missing here. Add both share-level writable check and per-handle access check. Use goto out on error to avoid leaking file references. Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3") Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steve French <smfrench@gmail.com> Signed-off-by: Sean Shen <grayhat@foxmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
5 daysksmbd: release ksmbd_inode ref via ksmbd_inode_put on lookup pathsAleksandr Golovnya1-3/+3
ksmbd_query_inode_status() and ksmbd_lookup_fd_inode() both take a reference on a ksmbd_inode via __ksmbd_inode_lookup() (which performs atomic_inc_not_zero()) and later release it using a bare atomic_dec(&ci->m_count). Unlike ksmbd_inode_put(), a bare atomic_dec() does not check whether the reference count has reached zero, so if the caller happens to drop the last reference, the ksmbd_inode is leaked: it stays in the global inode hash table with m_count == 0, future __ksmbd_inode_lookup() calls reject it via atomic_inc_not_zero(), and ksmbd_inode_free() is never invoked. The race is: T1: __ksmbd_inode_lookup() -> atomic_inc_not_zero(): m_count = 2 T2: ksmbd_inode_put() -> atomic_dec_and_test(): m_count = 1 (not freed) T1: atomic_dec(&ci->m_count) -> m_count = 0 return (LEAK) In ksmbd_lookup_fd_inode() the matched-fp path (which now also uses ksmbd_inode_put()) cannot currently reach m_count == 0 because the matched ksmbd_file holds its own reference on ci, but converting it to the proper API keeps the three call sites consistent and avoids future regressions if the locking changes. Because ksmbd_inode_put() may free the ksmbd_inode if this drops the last reference, the call must happen after up_read(&ci->m_lock) on the two affected paths in ksmbd_lookup_fd_inode(). On the no-match path this is a pure reordering; on the matched path ksmbd_fp_get() is moved above the unlock so that the returned ksmbd_file is pinned before the inode reference is released. Signed-off-by: Aleksandr Golovnya <cofedish@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
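The T1/T2 interleaving above can be replayed deterministically with C11 atomics. This is a userspace sketch with hypothetical names; the `freed` flag stands in for the real free routine, and nothing here is ksmbd code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int refcount;
	bool freed;
};

/* Bare decrement: never notices that the count reached zero. */
void put_bare(struct obj *o)
{
	atomic_fetch_sub(&o->refcount, 1);
}

/* Decrement-and-test: whoever drops the last reference frees. */
void put_checked(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		o->freed = true;	/* stand-in for the free routine */
}

void last_put_example(void)
{
	struct obj a = { .freed = false }, b = { .freed = false };

	atomic_init(&a.refcount, 2);
	put_checked(&a);	/* T2: 2 -> 1, not the last reference */
	put_bare(&a);		/* T1: 1 -> 0, but nobody frees */
	assert(atomic_load(&a.refcount) == 0 && !a.freed);	/* leaked */

	atomic_init(&b.refcount, 2);
	put_checked(&b);
	put_checked(&b);	/* 1 -> 0, last reference frees */
	assert(b.freed);
}
```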
5 daysksmbd: OOB read regression in smb_check_perm_dacl() ACE-walk loopsAli Ganiyev1-4/+4
Commit d07b26f39246 ("ksmbd: require minimum ACE size in smb_check_perm_dacl()") introduced a transposed bounds check: if (offsetof(struct smb_ace, sid) + aces_size < CIFS_SID_BASE_SIZE) Since offsetof(..sid) is 8 and CIFS_SID_BASE_SIZE is 8, this evaluates to `aces_size < 0`. Because `aces_size` is always non-negative, this check becomes dead code and never breaks the loop. Worse, that commit removed the old 4-byte guard, meaning the loop now reads `ace->size` (offset 2) even when `aces_size` is 0-3 bytes. This re-opens a 2-byte heap out-of-bounds (OOB) read past the pntsd allocation during subsequent SMB2_CREATE operations. Fix this by properly transposing the comparison to require at least 16 bytes (8-byte offset + 8-byte SID base), matching the correct form used in smb_inherit_dacl(). Fixes: d07b26f39246 ("ksmbd: require minimum ACE size in smb_check_perm_dacl()") Cc: stable@vger.kernel.org Signed-off-by: Ali Ganiyev <ali.qaniyev@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
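The transposed comparison above is easy to reproduce in isolation. The sketch below hard-codes the two 8-byte values quoted in the message; the function names are hypothetical and the code is not taken from ksmbd.

```c
#include <assert.h>
#include <stdbool.h>

/* Values quoted in the commit message, hard-coded here as assumptions. */
#define ACE_SID_OFFSET 8 /* offsetof(struct smb_ace, sid) */
#define SID_BASE_SIZE  8 /* CIFS_SID_BASE_SIZE */

/* Transposed form: 8 + aces_size < 8 never holds for aces_size >= 0. */
bool broken_ace_too_small(int aces_size)
{
	return ACE_SID_OFFSET + aces_size < SID_BASE_SIZE;
}

/* Intended form: stop unless at least 16 bytes remain. */
bool fixed_ace_too_small(int aces_size)
{
	return aces_size < ACE_SID_OFFSET + SID_BASE_SIZE;
}

void short_ace_example(void)
{
	for (int left = 0; left < 4; left++) {
		assert(!broken_ace_too_small(left));	/* walk continues */
		assert(fixed_ace_too_small(left));	/* walk stops */
	}
	assert(!fixed_ace_too_small(16));
}
```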
5 daysMerge tag 'nfsd-7.1-2' of ↵Linus Torvalds7-32/+63
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Regressions: - Tighten bounds checking for sunrpc cache hash tables - Don't report key material in the ftrace log Stable fix: - Fix lockd's implementation of the NLM TEST procedure" * tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: lockd: fix TEST handling when not all permissions are available. NFSD: Report whether fh_key was actually updated sunrpc: prevent out-of-bounds read in __cache_seq_start()
5 daysMerge branch 'for-7.2/block' into for-nextJens Axboe1-11/+4
* for-7.2/block: block: remove blkdev_write_begin() and blkdev_write_end() mtip32xx: fix use-after-free on service thread failure block: don't set BIO_QUIET for BLK_STS_AGAIN direct-io: remove IOCB_NOWAIT support block: Avoid mounting the bdev pseudo-filesystem in userspace block: switch numa_node to int in blk_mq_hw_ctx and init_request block: skip sync_blockdev() on surprise removal in bdev_mark_dead() blk-mq: add tracepoint block_rq_tag_wait block: partitions: fix of_node refcount leak in of_partition()
5 daysdirect-io: remove IOCB_NOWAIT supportChristoph Hellwig1-11/+4
None of the file systems using the legacy direct I/O code actually sets FMODE_NOWAIT, and if they did this would not work, as the write locking could not handle the retry. Remove this dead code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 daysbtrfs: derive f_fsid from on-disk fsid and dev_tAnand Jain1-8/+33
The f_fsid was originally derived from fs_devices->fsid and the subvolume root ID. However, when temp_fsid is active, fs_devices->fsid is randomized, making the standard derivation inconsistent. Since metadata_uuid is optional, it is not a reliable alternative. This patch instead retrieves the on-disk UUID from fs_info->super_copy->fsid. To prevent f_fsid collisions between original and cloned filesystems, this implementation hashes the dev_t for single-device btrfs filesystems to ensure uniqueness. This is limited to single-device filesystems as cloned mounts are currently only supported for that configuration. Note that f_fsid will change if the device is replaced. Additionally, since the kernel cannot distinguish between the original and the cloned filesystem, this new f_fsid derivation is applied to both. Link: https://lore.kernel.org/linux-btrfs/cover.1772095546.git.asj@kernel.org/ Link: https://lore.kernel.org/linux-btrfs/cover.1774092915.git.asj@kernel.org/ Signed-off-by: Anand Jain <asj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
5 daysbtrfs: use on-disk uuid for s_uuid in temp_fsid mountsAnand Jain1-1/+10
When mounting a cloned filesystem with a temporary fsuuid (temp_fsid), layered modules like overlayfs require a persistent identifier. While internal in-memory fs_devices->fsid must remain unique to the kernel module, let s_uuid carry the original on-disk UUID. Signed-off-by: Anand Jain <asj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
5 daysbtrfs: avoid unnecessary dev stats updatesQu Wenruo1-2/+2
[MINOR PROBLEM]
When mounting a filesystem with a valid DEV_STATS item, we will always update the DEV_STATS again in the next transaction commit, even if there is no change to the values.

[CAUSE]
During the mount, btrfs_device_init_dev_stats() will read out the on-disk DEV_STATS item for each device. Then it calls btrfs_dev_stat_set() to update the in-memory structure. However btrfs_dev_stat_set() does not only set the dev stats value, but also increases device->dev_stats_ccnt. That member determines if we should update the device item at the next transaction commit. Since we have called btrfs_dev_stat_set() for each dev stats member, dev_stats_ccnt will be non-zero and we will update the dev stats item even if it doesn't change at all.

[FIX]
Instead of using btrfs_dev_stat_set() for valid on-disk DEV_STATS values, directly call atomic_set() to set the in-memory values. For other call sites, we still want to use btrfs_dev_stat_set() so that we will force updating/creating the dev stats item.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
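The difference between the two ways of setting a value can be modelled outside the kernel. This is a minimal sketch using C11 atomics; `struct demo_device` and every `demo_` function are hypothetical, and only the idea of a change counter that gates the on-disk update comes from the message above.

```c
#include <stdatomic.h>

#define DEMO_STAT_VALUES 5   /* assumed: write, read, flush, corruption, generation */

struct demo_device {
	atomic_int stat_values[DEMO_STAT_VALUES];
	atomic_int stats_ccnt;   /* non-zero means "rewrite the item at next commit" */
};

void demo_device_init(struct demo_device *dev)
{
	for (int i = 0; i < DEMO_STAT_VALUES; i++)
		atomic_init(&dev->stat_values[i], 0);
	atomic_init(&dev->stats_ccnt, 0);
}

/* Models the "set and mark dirty" helper: value and change counter both move. */
void demo_stat_set(struct demo_device *dev, int idx, int val)
{
	atomic_store(&dev->stat_values[idx], val);
	atomic_fetch_add(&dev->stats_ccnt, 1);
}

/* Models loading a valid on-disk value: only the in-memory value is set. */
void demo_stat_load(struct demo_device *dev, int idx, int val)
{
	atomic_store(&dev->stat_values[idx], val);
}

/* Models the commit-time decision. */
int demo_needs_commit(struct demo_device *dev)
{
	return atomic_load(&dev->stats_ccnt) != 0;
}
```

Loading through `demo_stat_load()` leaves the counter at zero, so a mount with unchanged statistics would not schedule a rewrite in this model.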
5 daysbtrfs: always update/create the dev stats item when adding a new deviceQu Wenruo2-0/+8
[MINOR PROBLEM]
When adding a new btrfs device, the corresponding DEV_STATS item creation can only be triggered by a mount cycle if there is no other error triggered:

  # mkfs.btrfs -f $dev1 $mnt
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30588928 items 6 free space 15853 generation 9 owner DEV_TREE
    item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40 <<<
      persistent item objectid DEV_STATS offset 1
      device stats
      write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
    item 1 key (1 DEV_EXTENT 13631488) itemoff 16195 itemsize 48

Only after a mount cycle and a new transaction, the DEV_STATS for devid 2 can show up:

  # umount $mnt
  # mount $dev1 $mnt
  # touch $mnt
  # sync
  # btrfs ins dump-tree -t dev $dev1
  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 30605312 items 7 free space 15788 generation 10 owner DEV_TREE
    item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
      persistent item objectid DEV_STATS offset 1
      device stats
      write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
    item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
      persistent item objectid DEV_STATS offset 2
      device stats
      write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

[CAUSE]
Btrfs only updates the DEV_STATS item when the device->dev_stats_ccnt counter is not 0. This is to reduce COW for the device tree.

However that dev_stats_ccnt is only increased at the following call sites:

- btrfs_dev_stat_inc()
  This happens when some IO error happened.

- btrfs_dev_stat_read_and_reset()
  This happens for GET_DEV_STATS ioctl with BTRFS_DEV_STATS_RESET flag.

- btrfs_dev_stat_set()
  This happens inside btrfs_device_init_dev_stats().

So when a new device is added, its dev_stats_ccnt is just initialized to 0, and btrfs won't create nor update the corresponding DEV_STATS item at all.
[ENHANCEMENT] When a new device is added, also increase the dev_stats_ccnt by one. This includes both device add ioctl and dev-replace. This will force btrfs to create a new DEV_STATS item or update the existing one with the correct values. This not only makes the DEV_STATS creation early, but also prevents old DEV_STATS items left over from older kernels from causing false alerts for the newly added device. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
5 daysbtrfs: remove the dev stats item when removing a deviceQu Wenruo1-0/+6
[MINOR BUG]
The following script will cause the DEV_STATS item to be left after the corresponding device is removed:

  # mkfs.btrfs -f $dev1
  # mount $dev1 $mnt
  # btrfs dev add $dev2 $mnt
  # umount $mnt
  ## Without real errors, only at mount time btrfs will update
  ## dev->dev_stats_ccnt, thus we need a mount cycle to create the
  ## DEV_STATS item for the new device.
  # mount $dev1 $mnt
  # touch $mnt/foobar
  # sync
  # btrfs dev remove $dev2 $mnt
  # umount $mnt

This will result in the DEV_STATS item for devid 2 still being left in the device tree:

  device tree key (DEV_TREE ROOT_ITEM 0)
  leaf 31064064 items 7 free space 15788 generation 18 owner DEV_TREE
  leaf 31064064 flags 0x1(WRITTEN) backref revision 1
  fs uuid 4bd853ed-f6ef-45fd-bbf1-1c3a2d9987cb
  chunk uuid b496eab1-ec23-46b5-81c1-2f1b3503ca07
    item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
      persistent item objectid DEV_STATS offset 1
      device stats
      write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
    item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
      persistent item objectid DEV_STATS offset 2
      device stats
      write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0

This is not a huge problem, but if the existing DEV_STATS contains errors, and a new device is added into the fs taking the old devid, then after a mount cycle, the new device will suddenly inherit old errors which can give false alerts.

[CAUSE]
Btrfs never had the ability to delete DEV_STATS items. It either creates a new one through update_dev_stat_item(), or reads an existing one through btrfs_device_init_dev_stats().

However update_dev_stat_item() is only called lazily: if a new device is created and there is no new update to dev stats, then it will skip the update of the on-disk item.

So if the old DEV_STATS item exists and a new device is added, and no errors occur during the remaining operations, the old DEV_STATS will not be updated.
Then at the next mount cycle, btrfs_device_init_dev_stats() is called at mount time, which will read out the old records, causing false alerts for the newly added device. [FIX] Manually remove the DEV_STATS item during btrfs_rm_device(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
5 daysbtrfs: remove the dev stats item for replace target deviceQu Wenruo3-1/+41
[MINOR PROBLEM]
When a running dev-replace hits some error for the target device (devid 0), there will be a DEV_STATS with error records created at the next transaction commit.

Unfortunately that item will never be deleted. This means at the next dev-replace, if the replace is interrupted, then at the next mount, the target device will suddenly inherit the old error records from that DEV_STATS item, which can give some false alerts on that device.

This shouldn't affect end users that much, as it requires all the following conditions to be met, which is pretty rare:

- The initial dev-replace hits some error on the target device
  E.g. write errors, but those errors themselves are already a big problem for a running replace.
  This is required to create the DEV_STATS item in the first place.

- The next replace is interrupted
  This is required to allow btrfs to read from the old records.

[CAUSE]
Btrfs just never deletes the DEV_STATS after a replace is finished.

[FIX]
Remove the DEV_STATS item for devid 0 after the replace is finished.

This is not going to completely fix the error, as we still have other error paths, e.g. the fs somehow flips RO and cannot start a new transaction for the DEV_STATS item removal.

But those corner cases will be addressed by later patches which provide a more generic fix to DEV_STATS related problems.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: validate data reloc tree file extent item membersTeng Liu2-4/+45
get_new_location() uses BUG_ON() to crash the kernel if the file extent item it looks up has any of offset, compression, encryption, or other_encoding set non-zero. The data reloc inode is only written by relocation's own paths and the four fields are always 0 in what the kernel writes: - insert_prealloc_file_extent() memsets the stack item to zero and only fills in type, disk_bytenr, disk_num_bytes and num_bytes, so offset/compression/encryption/other_encoding stay 0. - insert_ordered_extent_file_extent() copies oe->compress_type into the file extent's compression field, but the data reloc inode is created with BTRFS_INODE_NOCOMPRESS so compress_type is always 0; encryption and other_encoding are reserved-and-zero in btrfs. A non-zero value here means the leaf decoded from disk does not match what the kernel wrote, i.e. on-disk corruption. A malformed image reaches this code via balance and panics the kernel. A previous attempt to enforce all four constraints in tree-checker's check_extent_data_item() was merged as commit 7d0ee95979e9 ("btrfs: validate data reloc tree file extent item members in tree-checker") and then reverted by commit 1c034697fcaa after btrfs/061 produced false positives on arm64 with 64K pages. The reason: relocation writeback legitimately produces REG file_extent_items with offset != 0 in the data reloc tree. When an ordered extent covers only the back portion of an underlying PREALLOC (num_bytes < ram_bytes on the input file_extent), insert_ordered_extent_file_extent() inserts a REG with offset = oe->offset num_bytes = oe->num_bytes ram_bytes preserved from the original PREALLOC, and this item can reach disk if a transaction commit fires while it is present in the leaf. The four fields belong in different layers: - compression, encryption and other_encoding are universal invariants for every item in the data reloc tree, regardless of cluster geometry. 
Enforce them in tree-checker's check_extent_data_item() so a corrupt leaf is rejected at read time. - offset is only an invariant at the cluster-boundary keys that get_new_location() searches (the key is computed as src_disk_bytenr - reloc_block_group_start). Partial-PREALLOC writebacks legitimately place REG items at non-boundary keys with offset != 0; tree-checker cannot reject these. The cluster- boundary item is always written by either insert_prealloc_file_extent() (offset=0 by memset) or by the front portion of a partial writeback (offset=0 by construction), so a non-zero offset there is corruption. Enforce the universal invariants in check_extent_data_item() with a file_extent_err() rejection. Convert the BUG_ON() in get_new_location() to a -EUCLEAN return paired with btrfs_print_leaf() and btrfs_err() so the offending leaf is logged. The caller in replace_file_extents() already handles non-zero returns from get_new_location() by breaking out of the loop without aborting the transaction. Suggested-by: Qu Wenruo <wqu@suse.com> Suggested-by: David Sterba <dsterba@suse.com> Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2 Signed-off-by: Teng Liu <27rabbitlt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: annotate lockless read of defrag_bytes in should_nocow()Cen Zhang1-1/+1
should_nocow() reads inode->defrag_bytes without holding inode->lock, while btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent() update it under that spinlock. This is a data race. The read is a quick check used to decide whether to fall back to COW for a NOCOW inode: if defrag_bytes is non-zero and the range is tagged EXTENT_DEFRAG, we force COW so that defragmentation can rewrite the extent. Reading a stale value is harmless because: - A missed increment may skip COW once, but the defrag pass will redo the extent later. - A stale non-zero may force an unnecessary COW, which is a minor efficiency loss, not a correctness issue. On 64-bit platforms an aligned u64 load is naturally atomic so tearing cannot happen. On 32-bit platforms u64 may tear, but we only test for zero vs non-zero, so the heuristic stays correct regardless. Use data_race() annotation. Fixes: 47059d930f0e ("Btrfs: make defragment work with nodatacow option") Signed-off-by: Cen Zhang <zzzccc427@gmail.com> [ Use data_race() instead of READ_ONCE() ] Signed-off-by: David Sterba <dsterba@suse.com>
6 daysMerge tag 'mm-hotfixes-stable-2026-05-25-16-22' of ↵Linus Torvalds1-33/+13
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address post-7.1 issues or aren't considered suitable for backporting. All patches are singletons - please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Revert "mm: introduce a new page type for page pool in page type" mm/vmalloc: do not trigger BUG() on BH disabled context MAINTAINERS, mailmap: change email for Eugen Hristev mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page kernel/fork: validate exit_signal in kernel_clone() mm: memcontrol: propagate NMI slab stats to memcg vmstats mm/damon/sysfs-schemes: delete tried region in regions_rmdirs() mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one zram: fix use-after-free in zram_writeback_endio memfd: deny writeable mappings when implying SEAL_WRITE ipc: limit next_id allocation to the valid ID range Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare" MAINTAINERS: .mailmap: update after GEHC spin-off
6 daysbtrfs: send: switch struct fs_path to auto freeingDavid Sterba1-70/+43
The fs_path can use the auto freeing pattern and it's completely contained in send. Define the freeing wrapper and add the cleanup attributes. Almost all conversions are straightforward, replacing goto with direct return. Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: add message format for qgroupidDavid Sterba2-15/+14
The qgroupid has a specific format, add common format specifier, similar to what we have for checksums and keys. Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: zoned: always set max_active_zones for zoned devicesJohannes Thumshirn1-25/+39
When a block device does not report a maximum number of open or active zones, btrfs currently assigns BTRFS_DEFAULT_MAX_ACTIVE_ZONES (128) to the internal limit, if the device has more than BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones. But if the device has fewer than BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones, the internal max_active_zones limit will stay at 0, even if the device has zone resource limits. Furthermore, if the device has a total number of zones that is less than BTRFS_DEFAULT_MAX_ACTIVE_ZONES, max_active_zones should be set to at most the number of zones. Also move the max_active_zones calculation and setting into a dedicated helper, to shrink btrfs_get_dev_zone_info(). Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: use bvec_phys() in compressed_bio_last_folio()Matthew Wilcox (Oracle)1-1/+1
This is open-coded bvec_phys(), also remove direct use of bv_page. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Tested-by: Boris Burkov <boris@bur.io> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: replace __free_page with folio_put() in attach_eb_folio_to_filemap()Matthew Wilcox (Oracle)1-3/+3
Calling __free_page() on folio_page() happens to work today, but won't always. Besides, it's far simpler to call folio_put(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-by: Boris Burkov <boris@bur.io> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysRevert "btrfs: fix the file offset calculation inside ↵Matthew Wilcox (Oracle)1-17/+1
btrfs_decompress_buf2page()" It seems that af566bdaff54 was tested against a tree which did not contain commit 12851bd921d4 ("fs: Turn page_offset() into a wrapper around folio_pos()"). Unfortunately af566bdaff54 has a bug of its own; on 32-bit systems, shifting by PAGE_SHIFT will overflow on files larger than 4GiB. Since page_offset() is now fixed, just revert af566bdaff54. Fixes: af566bdaff54 ("btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Tested-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
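The 32-bit overflow mentioned above is a general C pitfall and can be shown without any kernel code. In this sketch `uint32_t` stands in for a 32-bit `unsigned long` page index and `DEMO_PAGE_SHIFT` assumes 4K pages; the reverted btrfs code itself is not reproduced here.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12   /* assumed 4K pages */

/* The shift is evaluated in 32 bits, so bits above 4GiB are lost before
 * the result is widened to 64 bits. */
uint64_t offset_narrow(uint32_t index)
{
	return index << DEMO_PAGE_SHIFT;
}

/* Widening first keeps the full file offset. */
uint64_t offset_wide(uint32_t index)
{
	return (uint64_t)index << DEMO_PAGE_SHIFT;
}
```

For page index 0x100000 (the page starting at exactly 4GiB) the narrow variant wraps to 0, while the widened variant yields 0x100000000.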
6 daysbtrfs: zoned: fix deadlock waiting for ticket during data relocationJohannes Thumshirn3-0/+15
When performing data relocation on a zoned filesystem, BTRFS can deadlock in handle_reserve_tickets(). The relocation process is waiting on a space reservation ticket that can never be fulfilled, because the relocation itself is the operation responsible for freeing up that space. Fix this by introducing a new flush state, BTRFS_RESERVE_FLUSH_ZONED_RELOCATION, specifically for data chunk allocation during zoned relocation. Like BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE, this state uses priority_reclaim_data_space() instead of the normal flushing path, which avoids re-entering the relocation code and breaking the deadlock cycle. In btrfs_alloc_data_chunk_ondemand(), select this new flush state when the inode belongs to a data relocation root on a zoned filesystem. Fixes: e2a7fd22378f ("btrfs: zoned: add zone reclaim flush state for DATA space_info") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: zoned: don't account data relocation space-info in statfs free spaceJohannes Thumshirn1-1/+2
Don't account the free space in a data relocation space-info sub-group as usable free space in statfs. This is misleading as no user allocations can be made in this space-info sub-group. It is only a target for relocation. Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: zoned: always set data_relocation_bgJohannes Thumshirn1-8/+1
When searching for a data relocation block-group on mount, btrfs_zoned_reserve_data_reloc_bg() is looking for the first empty DATA block-group. But it first checks if the block-group is empty and if yes continues the search, and then checks if it is the first DATA block-group. There is actually no point in looking for the second empty DATA block group as new DATA allocations will just allocate a new chunk for it. Pick the first DATA block-group without any allocations done and set it as the relocation block-group. At first, commit 694ce5e143d6 ("btrfs: zoned: reserve data_reloc block group on mount") introduced the functionality. At that time, we took the second unused (used == 0) block group, as the first one might be a block group used for normal data. Later, commit daa0fde32235 ("btrfs: zoned: fix data relocation block group reservation") switched to looking for an empty block group (alloc_offset == 0). At this point, there is no reason to take the second one anymore. So, this commit is fixing an issue in commit daa0fde32235. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: zoned: document RECLAIM_ZONES flush stateJohannes Thumshirn1-0/+7
Document the purpose of the RECLAIM_ZONES flush state. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
6 daysbtrfs: introduce support for huge foliosQu Wenruo5-17/+42
With all the previous preparations, it's finally time to enable the huge folio support. - The max folio size Here we define BTRFS_MAX_FOLIO_SIZE, which is fixed at 2MiB. This will ensure we have a large enough but not too large folio for btrfs. This limit applies to all systems regardless of page size. Then we also define BTRFS_MAX_BLOCKS_PER_FOLIO, which depends on CONFIG_BTRFS_EXPERIMENTAL. If it's an experimental build, BTRFS_MAX_BLOCKS_PER_FOLIO is 512, otherwise it's BITS_PER_LONG. The filemap max order will be calculated using both BTRFS_MAX_FOLIO_SIZE and BTRFS_MAX_BLOCKS_PER_FOLIO. E.g. for 64K page size with 64K fs block size, the limit will be BTRFS_MAX_FOLIO_SIZE (2M), which limits the filemap max order to 5. This will be lower than the old order (6), but folios larger than 2M are rarely any better for IO performance. Meanwhile excessively large folios can cause other problems like stalling the IO pipeline for too long. For 4K page size and 4K fs block size, the limit will be increased to 2M from the old 256K. This new size is constrained by both BTRFS_MAX_FOLIO_SIZE (2M) and BTRFS_MAX_BLOCKS_PER_FOLIO (512 * 4K), allowing x86_64 to achieve huge folio support, and the filemap max order will be 9. - btrfs_bio_ctrl::submit_bitmap This will be enlarged to contain BTRFS_MAX_BLOCKS_PER_FOLIO bits, and this will be on-stack memory. This will increase on-stack memory usage by 56 bytes compared to the baseline (before the first patch in the series). - Local @delalloc_bitmap inside writepage_delalloc() Unfortunately we cannot afford to handle an allocation error here, thus again we use on-stack memory. Thus this will increase on-stack memory usage by 56 bytes again. So unfortunately this means during the delalloc window, the writeback path will have +112 bytes on-stack memory usage, and for other cases the writeback path will have +56 bytes on-stack memory usage. 
The +56 bytes (btrfs_bio_ctrl::submit_bitmap) can be removed after we have reworked the compression submission, so the current on-stack submit_bitmap is mostly a workaround until then. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
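The figures quoted in this message (max order 5, max order 9, and +56 bytes of stack) can be cross-checked with a few lines of arithmetic. The helpers below are a hypothetical sketch that only reproduces those numbers; they are not the kernel's actual computation, and the `demo_`/`DEMO_` names are invented for illustration.

```c
#include <stdint.h>

#define DEMO_MAX_FOLIO_SIZE (2u * 1024 * 1024)   /* 2MiB cap from the message */

/* Largest order such that (page_size << order) fits both the 2MiB cap and
 * max_blocks * block_size. */
unsigned int demo_max_order(uint32_t page_size, uint32_t block_size,
			    uint32_t max_blocks)
{
	uint64_t limit = (uint64_t)max_blocks * block_size;
	unsigned int order = 0;

	if (limit > DEMO_MAX_FOLIO_SIZE)
		limit = DEMO_MAX_FOLIO_SIZE;
	while (((uint64_t)page_size << (order + 1)) <= limit)
		order++;
	return order;
}

/* Bytes needed for a bitmap with one bit per block. */
unsigned int demo_bitmap_bytes(unsigned int bits)
{
	return bits / 8;
}
```

With 512 blocks per folio, 4K pages and 4K blocks give order 9, and 64K pages with 64K blocks give order 5. Growing a bitmap from 64 bits to 512 bits adds 64 - 8 = 56 bytes, which matches the stack cost stated above for a 64-bit experimental build.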
6 daysfuse: re-lock request before returning from fuse_ref_folio()Joanne Koong1-1/+1
fuse_ref_folio() unlocks the request but does not re-lock it before returning. fuse_chan_abort() can end the request and the async end callback (eg fuse_writepage_free()) can free the args while the subsequent copy chain logic after fuse_ref_folio() accesses them, leading to use-after-free issues. Fix this by locking the request in fuse_ref_folio() before returning. Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
6 daysfuse: re-lock request before replacing page cache folioJoanne Koong1-14/+5
fuse_try_move_folio() unlocks the request on entry but does not re-lock it on the success path. This means fuse_chan_abort() can end the request and free the fuse_io_args (eg fuse_readpages_end()) while the subsequent copy chain logic after fuse_try_move_folio() accesses the fuse_io_args, leading to use-after-free issues. Fix this by calling lock_request() before replace_page_cache_folio(). This ensures the request is locked on the success path which will prevent the fuse_io_args from being freed while the later copying logic runs, and also ensures that the ap->folios[i]->mapping is never null since ap->folios[i] will always point to the newfolio after replace_page_cache_folio(). Fixes: ce534fb05292 ("fuse: allow splice to move pages") Cc: stable@vger.kernel.org Reported-by: Lei Lu <llfamsec@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
6 daysgenirq/proc: Speed up /proc/interrupts iterationThomas Gleixner1-1/+3
Reading /proc/interrupts iterates over the interrupt number space one by one and looks up the descriptors one by one. That's just a waste of time. When CONFIG_GENERIC_IRQ_SHOW is enabled this can utilize the maple tree and cache the descriptor pointer efficiently for the sequence file operations. Implement a CONFIG_GENERIC_IRQ_SHOW specific version in the core code and leave the fs/proc/ variant for the legacy architectures which ignore generic code. This reduces the time wasted for looking up the next record significantly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Link: https://patch.msgid.link/20260517194932.165280601@kernel.org
6 daysx86/irq: Move IOAPIC misrouted and PIC/APIC error counts into irq_statsThomas Gleixner1-4/+0
The special treatment of these counts is just adding extra code for no real value. The irq_stats mechanism allows suppressing the output of counters which should never happen by default and provides a mechanism to enable them for the rare case that they occur. Move the IOAPIC misrouted and the PIC/APIC error counts into irq_stats, mark them suppressed by default and update the sites which increment them. This changes the output format of 'ERR' and 'MIS' in case there are events to the regular per CPU display format and otherwise suppresses them completely. As a side effect this removes the arch_cpu_stat() mechanism from proc/stat which was only there to account for the error interrupts on x86 and failed to take the misrouted ones into account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Radu Rendec <radu@rendec.net> Link: https://patch.msgid.link/20260517194931.361942103@kernel.org
6 daysxfs: fix overlapping extents returned for pNFS LAYOUTGETDai Ngo1-2/+2
xfs_fs_map_blocks() currently passes XFS_BMAPI_ENTIRE to xfs_bmapi_read(), which causes the bmap code to expand the mapping to cover the entire extent rather than the requested range. A single LAYOUTGET request from the client can cause the server to issue multiple calls to xfs_fs_map_blocks() for different offsets within the same extent. Because of the XFS_BMAPI_ENTIRE flag, these calls can produce overlapping mappings. As a result, the LAYOUTGET reply sent to the NFS client may contain overlapping extents. This creates ambiguity in extent selection for a given file range, which can lead to incorrect device selection, inconsistent handling of datastate, and ultimately data corruption or protocol violations on the client side. Problem discovered with xfstest generic/075 test using NFSv4.2 mount with SCSI layout. Fix this by replacing the XFS_BMAPI_ENTIRE flag with '0' so that xfs_bmapi_read() returns only the mapping for the requested range. Fixes: cc6c40e09d7b1 ("NFSD/blocklayout: Support multiple extents per LAYOUTGET") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
6 daysxfs: fix use of uninitialized imap in xfs_fs_map_blocks error pathDai Ngo1-2/+5
xfs_fs_map_blocks() acquires the data map lock and then calls xfs_bmapi_read(). If xfs_bmapi_read() fails, the function currently still falls through to xfs_bmbt_to_iomap(), which consumes an uninitialized imap record and may return invalid data to the caller. Fix this by releasing the data map lock and returning immediately when xfs_bmapi_read() reports an error. This prevents xfs_bmbt_to_iomap() from being called with an uninitialized xfs_bmbt_irec. Fixes: 527851124d10f ("xfs: implement pNFS export operations") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
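The control-flow pattern described here (unlock and return on error, before the output record is consumed) can be sketched with stubs. Everything below is hypothetical: the `demo_` names, the record layout and the integer lock flag only model the shape of the fix and are not XFS code.

```c
#include <errno.h>

struct demo_irec {
	long startoff;
	long blockcount;
};

int demo_locked;   /* models the data map lock: 1 while held */

/* Stub lookup: on failure the output record is left untouched. */
int demo_bmapi_read(struct demo_irec *out, int fail)
{
	if (fail)
		return -EIO;
	out->startoff = 0;
	out->blockcount = 8;
	return 0;
}

int demo_map_blocks(int fail, long *len_out)
{
	struct demo_irec imap;   /* deliberately not initialized */
	int error;

	demo_locked = 1;
	error = demo_bmapi_read(&imap, fail);
	if (error) {
		/* Release the lock and return before imap is ever read. */
		demo_locked = 0;
		return error;
	}
	*len_out = imap.blockcount;
	demo_locked = 0;
	return 0;
}
```

On the failure path the caller's output stays unchanged and the lock flag is cleared, which is the property the fix restores.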
6 daysfs/ntfs3: call _ntfs_bad_inode() when failing to renameHelen Koike1-2/+2
It is safe to call _ntfs_bad_inode on live inodes since: commit 519b078998ce ("fs/ntfs3: Exclude call make_bad_inode for live nodes.") The WARN_ON was added when it wasn't safe by: commit d99208b91933 ("fs/ntfs3: cancle set bad inode after removing name fails") Replace the WARN_ON with a call to _ntfs_bad_inode() to prevent further operations on the inconsistent inode. Reported-by: syzbot+4d8e30dbafb5c1260479@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=4d8e30dbafb5c1260479 Fixes: 519b078998ce ("fs/ntfs3: Exclude call make_bad_inode for live nodes.") Signed-off-by: Helen Koike <koike@igalia.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
6 daysfs/ntfs3: fix wrong LCN in run_remove_range() when splitting a runZhan Xusheng1-1/+4
When run_remove_range() removes a middle portion of a non-sparse run, it splits the run into head and tail parts. The tail is inserted via run_add_entry() but uses the original r->lcn as its starting LCN instead of advancing it by the split offset. For example, removing VCN range [10, 20) from a run {vcn=0, lcn=100, len=30} should produce: {vcn=0, lcn=100, len=10} (head) {vcn=20, lcn=120, len=10} (tail, lcn advanced by 20) But the current code produces: {vcn=0, lcn=100, len=10} {vcn=20, lcn=100, len=10} (wrong: points to same physical clusters) This creates overlapping physical mappings in the in-memory run tree, which can corrupt cluster allocation decisions and lead to data corruption. The correct pattern is already used in run_insert_range(): CLST lcn2 = r->lcn == SPARSE_LCN ? SPARSE_LCN : (r->lcn + len1); Apply the same logic in run_remove_range(). Fixes: 10d7c95af043 ("fs/ntfs3: add delayed-allocation (delalloc) support") Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
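The worked example in this message can be replayed with a simplified model of a run. The `CLST` width, the `SPARSE_LCN` value, `struct demo_run` and `demo_split_remove()` are assumptions made for illustration; only the tail-LCN rule mirrors the expression quoted from run_insert_range().

```c
#include <stdint.h>

typedef uint32_t CLST;                 /* assumed width for this sketch */
#define SPARSE_LCN ((CLST)-1)          /* assumed sentinel for a sparse run */

struct demo_run {
	CLST vcn;
	CLST lcn;
	CLST len;
};

/* Remove [vcn, vcn + len) from the middle of *r. On return *r is the head
 * and *tail describes what is left after the removed range. */
void demo_split_remove(struct demo_run *r, CLST vcn, CLST len,
		       struct demo_run *tail)
{
	CLST end = vcn + len;
	CLST skip = end - r->vcn;   /* clusters from run start to tail start */

	tail->vcn = end;
	tail->len = r->vcn + r->len - end;
	/* Advance the physical start unless the run is sparse. */
	tail->lcn = r->lcn == SPARSE_LCN ? SPARSE_LCN : r->lcn + skip;
	r->len = vcn - r->vcn;
}
```

Removing [10, 20) from {vcn=0, lcn=100, len=30} leaves a head of {0, 100, 10} and a tail of {20, 120, 10}, as in the message; a sparse run keeps SPARSE_LCN in its tail.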
6 daysfs/ntfs3: validate Dirty Page Table capacity in log_replay copy_lcnsYunpeng Tian1-5/+15
In the analysis pass of $LogFile journal replay, log_replay() copies LCNs from each action log record into an existing Dirty Page Table (DPT) entry without bounding the destination index. A crafted NTFS image with DPT entry lcns_follow=1 and an action log record with lcns_follow=2 produces a kernel slab out-of-bounds write at mount time: BUG: KASAN: slab-out-of-bounds in log_replay+0x654c/0xdb60 Write of size 8 at addr ffff8880095e1040 by task mount Two attacker-controlled fields can drive j+i past the allocated page_lcns[] array: 1. dp->lcns_follow (capacity) can be smaller than lrh->lcns_follow. 2. lrh->target_vcn may be smaller than dp->vcn, making the u64 subtraction wrap to a huge size_t. Validate target VCN delta and per-record LCN count against the DPT entry capacity, bail via the existing out: cleanup label with -EINVAL. This mirrors the bounds-check pattern added in commit b2bc7c44ed17 ("fs/ntfs3: Fix slab-out-of-bounds read in DeleteIndexEntryRoot") and commit 0ca0485e4b2e ("fs/ntfs3: validate rec->used in journal-replay file record check"). Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Reported-by: Yunpeng Tian <shionthanatos@gmail.com> Reported-by: Mingda Zhang <npczmd@qq.com> Reported-by: Gongming Wang <gmwgg05@gmail.com> Reported-by: Peiyuan Xu <paulbucket12@gmail.com> Reported-by: Qinrun Dai <jupmouse@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Yunpeng Tian <shionthanatos@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
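The two checks described above (no wrapping subtraction, and no copy past the entry's capacity) might look like the following in isolation. This is a hedged sketch: `demo_check_copy()` and its parameters are hypothetical and do not reflect the actual log_replay() code or its structure fields.

```c
#include <errno.h>
#include <stdint.h>

/*
 * dp_vcn / dp_lcns:        first VCN and capacity of the table entry
 * target_vcn / rec_lcns:   start VCN and LCN count taken from the log record
 * Returns 0 if copying rec_lcns entries starting at (target_vcn - dp_vcn)
 * stays inside the entry, -EINVAL otherwise.
 */
int demo_check_copy(uint64_t dp_vcn, uint32_t dp_lcns,
		    uint64_t target_vcn, uint32_t rec_lcns)
{
	uint64_t delta;

	/* Reject before subtracting, otherwise the u64 result would wrap. */
	if (target_vcn < dp_vcn)
		return -EINVAL;

	delta = target_vcn - dp_vcn;

	/* delta <= dp_lcns is established first, so dp_lcns - delta cannot wrap. */
	if (delta > dp_lcns || rec_lcns > dp_lcns - delta)
		return -EINVAL;

	return 0;
}
```

The case from the message, an entry with capacity 1 and a record carrying 2 LCNs, is rejected, as is any record whose target VCN lies before the entry's first VCN.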
6 daysxfs: handle racing deletions in xfs_zone_gc_iter_irecHans Holmberg1-1/+1
Under heavy garbage collection pressure from RocksDB workloads, filesystem shutdowns can occur in xfs_zone_gc_iter_irec when xfs_iget() returns -EINVAL for deleted files. Fix this by handling -EINVAL just like we handle -ENOENT, allowing zone GC to safely ignore stale mappings. Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
6 daysisofs: replace __get_free_page() with kmalloc()Mike Rapoport (Microsoft)1-2/+3
isofs_readdir() allocates a temporary buffer with __get_free_page(). kmalloc() is a better API for such use and it also provides better scalability and more debugging possibilities. Replace use of __get_free_page() with kmalloc(). Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260523-b4-fs-v1-11-275e36a83f0e@kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
6 daysquota: allocate dquot_hash with kmalloc()Mike Rapoport (Microsoft)1-6/+5
dquot_init() allocates a single page for dquot_hash with __get_free_pages(). kmalloc() is a better API for such use and it also provides better scalability and more debugging possibilities. Replace use of __get_free_pages() with kmalloc() and get rid of the order variable that remained 0 for more than 20 years. Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260523-b4-fs-v1-1-275e36a83f0e@kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
7 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 7.1-rc5Alexei Starovoitov98-752/+1575
Cross-merge BPF and other fixes after downstream PR. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
7 daysMerge branch 'misc-7.1-perf' into next-fixesDavid Sterba6-0/+51
7 daysMerge branch 'misc-7.1' into next-fixesDavid Sterba2-2/+5
7 daysbtrfs: fix invalid pointer dereference in __btrfs_run_delayed_refs()Filipe Manana1-2/+3
In the beginning of the loop, we try to obtain a locked delayed ref head, if 'locked_ref' is currently NULL, by calling btrfs_select_ref_head(), which can return an error pointer. If the error pointer is -EAGAIN we do a continue and go back to the beginning of the loop, which will not try again to call btrfs_select_ref_head() since 'locked_ref' is no longer NULL but it's ERR_PTR(-EAGAIN), and then we do: spin_lock(&locked_ref->lock); against a ERR_PTR(-EAGAIN) value, generating an invalid pointer dereference. Fix this by ensuring that 'locked_ref' is set to NULL when btrfs_select_ref_head() returns ERR_PTR(-EAGAIN) and incrementing 'count' as well, to prevent infinite looping. We do this by doing a goto to the bottom of the loop that already sets 'locked_ref' to NULL and does a cond_resched(), with an increment to 'count' right before the goto. These measures were in place before the refactoring in commit 0110a4c43451 ("btrfs: refactor __btrfs_run_delayed_refs loop") but were unintentionally lost afterwards. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/linux-btrfs/ag8ARRwykv8bpJ87@stanley.mountain/ Fixes: 0110a4c43451 ("btrfs: refactor __btrfs_run_delayed_refs loop") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
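The control flow of the fix can be sketched in user space. Everything below is an assumption made for illustration: the `ERR_PTR()` helpers are minimal re-implementations of the kernel idiom, and `struct selector`, `select_ref_head()` and `run_refs()` are hypothetical names that only model the loop shape described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal user-space versions of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct ref_head { int id; };

/* Hypothetical source of heads: 'busy' -EAGAIN results, then 'remaining' heads. */
struct selector { int busy; int remaining; struct ref_head head; };

static struct ref_head *select_ref_head(struct selector *s)
{
	if (s->busy > 0) {
		s->busy--;
		return ERR_PTR(-EAGAIN);
	}
	if (s->remaining == 0)
		return NULL;
	s->remaining--;
	return &s->head;
}

static int run_refs(struct selector *s, int max_count, int *processed)
{
	struct ref_head *locked_ref = NULL;
	int count = 0;

	*processed = 0;
	do {
		if (!locked_ref) {
			locked_ref = select_ref_head(s);
			if (IS_ERR(locked_ref)) {
				if (PTR_ERR(locked_ref) != -EAGAIN)
					return (int)PTR_ERR(locked_ref);
				count++;      /* bound the retries */
				goto next;    /* resets locked_ref to NULL */
			}
			if (!locked_ref)
				break;
			count++;
		}
		/* Stands in for spin_lock(&locked_ref->lock): needs a real pointer. */
		assert(!IS_ERR(locked_ref));
		(*processed)++;
next:
		locked_ref = NULL;
	} while (count < max_count);

	return 0;
}
```

Replacing `goto next` with a plain `continue` would skip the reset, so the next pass would reach the assertion holding ERR_PTR(-EAGAIN), which is the failure the commit describes.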
7 daysbtrfs: protect sb_write_pointer() with invalidate lockKangNing Liao1-0/+2
sb_write_pointer() reads the super block from the block device page cache using read_cache_page_gfp(). This has the same race with BLKBSZSET as the one fixed by commit 3f29d661e568 ("btrfs: sync read disk super and set block size"). Take the mapping invalidate lock around read_cache_page_gfp() to serialize the read against block size changes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: KangNing Liao <lkangn.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
7 daysntfs: free link name from ntfs_name_cacheDaeMyung Kang1-3/+2
ntfs_link() converts the new link name with ntfs_nlstoucs() using NTFS_MAX_NAME_LEN. In this case ntfs_nlstoucs() allocates the result from ntfs_name_cache, and its contract requires callers to release the buffer with kmem_cache_free(ntfs_name_cache, ...). All other ntfs_nlstoucs() callers in namei.c do that, but ntfs_link() uses kfree(), which mismatches the allocator for successfully converted names. The conversion failure path reaches the common out label with uname == NULL. That was harmless for kfree(), but kmem_cache_free() does not provide the same NULL contract. Return directly on conversion failure and free successful conversions with ntfs_name_cache. Fixes: af0db57d4293 ("ntfs: update inode operations") Signed-off-by: DaeMyung Kang <charsyam@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 dayshpfs: fix a crash if hpfs_map_dnode_bitmap failsMikulas Patocka1-1/+1
If hpfs_map_dnode_bitmap() fails, the code would call hpfs_brelse4() on an uninitialized quad buffer head, causing a crash. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu> Cc: stable@vger.kernel.org
7 daysbtrfs: migrate btrfs_bio_ctrl::submit_bitmap to support larger bitmapsQu Wenruo3-33/+66
[CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. One call site that utilizes this limit is btrfs_bio_ctrl::submit_bitmap, which makes it very simple and straightforward to just grab an unsigned long value and assign it to submit_bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] Instead of using a fixed unsigned long value, change btrfs_bio_ctrl::submit_bitmap to an unsigned long pointer. And for cases where an unsigned long can hold the whole bitmap, introduce @submit_bitmap_value, and just point that pointer to that unsigned long. Then update all direct users of bio_ctrl->submit_bitmap to use the pointer version. There are several call sites that get extra changes: - @range_bitmap inside extent_writepage_io() Which is only utilized to truncate the bitmap. Since we do not want to allocate new memory just for such temporary usage, change the original bitmap_set() and bitmap_and() into bitmap_clear() for the ranges outside of the target range. - Getting dirty subpage bitmap inside writepage_delalloc() Since we're passing an unsigned long pointer now, we need to go with different handling (bs == ps, blocks_per_folio <= BITS_PER_LONG, blocks_per_folio > BITS_PER_LONG). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
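The @range_bitmap change above replaces "build a mask, then AND" with "clear whatever lies outside the range", which needs no temporary bitmap. A minimal user-space sketch of that idea follows; `clear_bit_range()` and `truncate_bitmap_to_range()` are hypothetical helpers written for this illustration, not the kernel's bitmap API.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Clear nbits bits starting at bit 'start' (bit-by-bit for clarity). */
static void clear_bit_range(unsigned long *map, unsigned int start,
			    unsigned int nbits)
{
	unsigned int bit;

	for (bit = start; bit < start + nbits; bit++)
		map[bit / BITS_PER_LONG] &= ~(1UL << (bit % BITS_PER_LONG));
}

/*
 * Keep only the bits in [start, start + nbits) of a total_bits-wide bitmap.
 * Equivalent to AND-ing with a range mask, but works in place for any width.
 */
static void truncate_bitmap_to_range(unsigned long *map, unsigned int total_bits,
				     unsigned int start, unsigned int nbits)
{
	assert(start + nbits <= total_bits);
	clear_bit_range(map, 0, start);
	clear_bit_range(map, start + nbits, total_bits - (start + nbits));
}
```

Because the operation touches only the caller's bitmap, it behaves the same whether the bitmap fits in one unsigned long or spans several words.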
7 daysbtrfs: prepare subpage operations to support more than BITS_PER_LONG sub-bitmapsQu Wenruo3-43/+87
[CURRENT LIMIT] Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger than BITS_PER_LONG. That limit allows us to easily grab an unsigned long without the need to properly allocate memory for a larger bitmap. Unfortunately that limit prevents us from supporting huge folios. For 4K page size and block size, a huge folio (order 9) means 512 blocks inside a 2M folio. [ENHANCEMENT] To allow direct bitmap operations without allocating new memory, introduce two different ways to access the subpage bitmaps: - Return an unsigned long value This only happens if blocks_per_folio <= BITS_PER_LONG. We read out the sub-bitmap into an unsigned long, and return the value. This is the old existing method. This involves get_bitmap_value_##name() helper functions. And this time the helper functions are defined as inline functions instead of macros to provide better type checks. - Return a pointer where the sub-bitmap starts This only happens if blocks_per_folio >= BITS_PER_LONG. This is the new method for sub-bitmaps larger than BITS_PER_LONG. Since the sizes of sub-bitmaps are all aligned to BITS_PER_LONG, we can directly access the start word of the sub-bitmap. This involves get_bitmap_pointer_##name() helper functions. Then change the existing sub-bitmaps users to use the new helpers: - Bitmap dumping Switch between get_bitmap_value_##name() and get_bitmap_pointer_##name() depending on the sub-bitmap size. - btrfs_get_subpage_dirty_bitmap() Rename it to btrfs_get_subpage_dirty_bitmap_value() to follow the new value/pointer naming. Since we do not support huge folios yet, there is no pointer version for the dirty bitmap. Furthermore, add the support for block size == page size cases for btrfs_get_subpage_dirty_bitmap_value(), so that the caller no longer needs to check if the folio needs subpage handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
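The two access styles described above (return a value, or return a pointer to the first word) can be sketched like this. The helper names and signatures are assumptions for illustration and are simpler than the generated get_bitmap_value_##name() and get_bitmap_pointer_##name() helpers the commit mentions.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Value style: read an nbits-wide sub-bitmap starting at bit 'start' into
 * one unsigned long. Only valid when the sub-bitmap fits in a single word.
 */
static unsigned long get_bitmap_value(const unsigned long *map,
				      unsigned int start, unsigned int nbits)
{
	unsigned long value = 0;
	unsigned int i;

	assert(nbits <= BITS_PER_LONG);
	for (i = 0; i < nbits; i++) {
		unsigned int bit = start + i;

		if (map[bit / BITS_PER_LONG] & (1UL << (bit % BITS_PER_LONG)))
			value |= 1UL << i;
	}
	return value;
}

/*
 * Pointer style: a sub-bitmap whose start is aligned to BITS_PER_LONG can be
 * accessed in place through a pointer to its first word, with no copy.
 */
static const unsigned long *get_bitmap_pointer(const unsigned long *map,
					       unsigned int start)
{
	assert(start % BITS_PER_LONG == 0);
	return &map[start / BITS_PER_LONG];
}
```

For example, with 16 blocks per folio the second sub-bitmap starts at bit 16 and is read out by value, while with 512 blocks per folio each sub-bitmap starts on a word boundary and is reached through the pointer form.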
7 daysbtrfs: update the out-of-date comments on subpageQu Wenruo1-34/+5
The comments at the beginning of subpage.c are out of date; many of the limitations have already been resolved. Update them to reflect the latest status. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
7 daysntfs: remove unnecessary NULL checks before kfreeNamjae Jeon2-5/+3
A NULL check before kfree() is unnecessary and triggers coccinelle warnings. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: remove unnecessary ternary boolean conversionNamjae Jeon1-6/+6
Coccinelle warned about unnecessary patterns when assigning to bool variables. Simply assign the condition directly. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: remove unsupported quota handlingDaeMyung Kang5-226/+2
The ntfs driver does not implement quota accounting. It creates new inodes with the NTFS 1.2 $STANDARD_INFORMATION layout and does not maintain the NTFS 3.x owner_id/quota_charged fields or the $Quota usage records that Windows would need for meaningful quota accounting. The only runtime quota path left in the driver is the remount-rw code that tries to mark $Quota/$Q out of date, plus the mount-time code that loads $Quota and its $Q index solely to support that marker. Since the driver does not maintain the per-file quota metadata, setting QUOTA_FLAG_OUT_OF_DATE does not make the quota state meaningful, and failures in this unsupported path can unnecessarily block remount-rw or force a mount read-only. Remove the quota marker, the $Quota/$Q loading state, and the unused quota volume flag. Keep the on-disk quota layout definitions in layout.h so the documented NTFS structures remain available. Suggested-by: Hyunchul Lee <hyc.lee@gmail.com> Link: https://lore.kernel.org/all/CANFS6bYTzioqZjYt=51Kb9RdR3MKXaez_fh_WCLoym093VxFmg@mail.gmail.com/ Signed-off-by: DaeMyung Kang <charsyam@gmail.com> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: use str_plural in ntfs_attr_make_non_residentThorsten Blum1-1/+2
Replace the manual ternary "s" pluralization with str_plural() to simplify the code. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: add bounds check before accessing EA entriesHyunchul Lee1-8/+8
In ntfs_ea_lookup() and ntfs_listxattr(), verify that there is enough space in the EA entry before accessing its next_entry_offset field. Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: validate index entries on readingHyunchul Lee4-68/+60
Validate index entries immediately after reading an index root or index block from disk. This eliminates repeated checks in lookup and readdir, and reduces the risk of missing checks in those paths. Tested-by: woot000 <woot000@woot000.com> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: centalize $INDEX_ROOT header validationHyunchul Lee3-9/+23
Add a dedicated helper to perform stricter validation of $INDEX_ROOT and use it for both directory inodes and named index inodes. This keeps the root size and header geometry checks consistent across both read paths. Tested-by: woot000 <woot000@woot000.com> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: validate index block header more strictlyHyunchul Lee3-61/+81
Modify ntfs_index_block_inconsisent() to perform stricter validation of INDEX_HEADER geometry in INDX blocks, and update ntfs_lookup_inode_by_name() to use that function to validate INDX blocks. Tested-by: woot000 <woot000@woot000.com> Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: avoid heap allocation for free-cluster readahead stateDaeMyung Kang1-16/+6
get_nr_free_clusters() allocates a temporary file_ra_state before it publishes the precomputed free cluster count, sets NVolFreeClusterKnown(), and wakes vol->free_waitq. If that allocation fails, the worker returns without setting the flag or waking waiters, so callers waiting for the free count can block indefinitely. The readahead state is only used synchronously while scanning the bitmap. Keep it on the stack and pass it by address to the readahead helper. This eliminates the early allocation failure path instead of adding a special case that publishes a conservative count and wakes the waitqueue. Zero-initialize the on-stack state because file_ra_state_init() only sets ra_pages and prev_pos. Apply the same treatment to __get_nr_free_mft_records(), which scans the MFT bitmap with the same short-lived readahead state. Signed-off-by: DaeMyung Kang <charsyam@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: skip extent mft records in writeback to prevent deadlockHyunchul Lee1-125/+4
This patch fixes the ABBA deadlock between extent_lock and extent mrec_lock triggered by xfstests generic/113, that occurs since the commit 6994acf33bae ("ntfs: use base mft_no when looking up base inode for extent record"). Path A (inode writeback): VFS writeback -> ntfs_write_inode() -> __ntfs_write_inode() -> mutex_lock(&ni->extent_lock) -> mutex_lock(&tni->mrec_lock) Path B (MFT folio writeback): VFS writeback of $MFT dirty folios -> ntfs_mft_writepages() -> ntfs_write_mft_block() -> ntfs_may_write_mft_record() -> holds one extent mrec_lock from a previous iteration -> tries to acquire another base inode extent_lock By removing all extent_lock and extent mrec_lock acquisition from the MFT folio writeback path, the ABBA lock ordering is eliminated: Path A: __ntfs_write_inode(): extent_lock -> mrec_lock Path B (removed): ntfs_write_mft_block(): mrec_lock -> extent_lock Path B is always redundant for extent records because: 1. mark_mft_record_dirty(ext_ni) does NOT dirty the MFT folio. It only sets NInoDirty(ext_ni) and marks the base VFS inode dirty via __mark_inode_dirty(I_DIRTY_DATASYNC), which triggers Path A. Therefore, normal extent modifications never create a situation where the MFT folio is dirty and Path B is not scheduled. 2. The MFT folio only gets dirtied via ntfs_mft_mark_dirty() inside ntfs_mft_record_alloc(). But all identified callers in attrib.c (ntfs_attr_add, ntfs_attr_record_move_away, ntfs_attr_make_non_resident, ntfs_attr_record_resize) follow through with mark_mft_record_dirty(), which triggers Path A to write the complete record. 3. ntfs_evict_big_inode() calls ntfs_commit_inode() before freeing extent inodes, ensuring all dirty extents are flushed via Path A before the base inode leaves the icache. Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
7 daysntfs: only alias volume $UpCase to default on exact matchDaeMyung Kang1-9/+3
load_and_init_upcase() currently aliases vol->upcase to the global default upcase whenever the shared prefix matches, and then truncates vol->upcase_len to that shorter prefix. The result is correct only by accident: upcase[] accesses in name collation are gated by upcase_len, so the prefix-equality alias produces the same fold output as keeping the volume's own shorter table. Still, prefix equality is not equality: the volume table is logically distinct from the default and should not be replaced by it unless they are byte-for-byte identical. Use memcmp() to compare the complete table in one expression and drop the now-redundant upcase_len rewrite. No user-visible change is expected for compliant volumes whose $UpCase has exactly default_upcase_len entries; shorter volume tables are no longer aliased to the default. Signed-off-by: DaeMyung Kang <charsyam@gmail.com> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
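The difference between prefix equality and full equality can be sketched as a single predicate. This is an illustration only: `upcase_matches_default()` is a hypothetical helper, and uint16_t stands in for the driver's 16-bit character type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Return true only when the volume table is byte-for-byte identical to the
 * default one. A shorter table that merely shares a prefix does not match.
 */
static bool upcase_matches_default(const uint16_t *vol, size_t vol_len,
				   const uint16_t *def, size_t def_len)
{
	assert(vol && def);
	return vol_len == def_len &&
	       memcmp(vol, def, def_len * sizeof(*def)) == 0;
}
```

Checking the lengths first keeps memcmp() from reading past the end of the shorter table.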
7 daysntfs: free volume-wide resources on fill_super failureDaeMyung Kang1-2/+3
ntfs_fill_super()'s err_out_now path frees only the volume struct via kfree(vol), leaving several vol-owned allocations behind on every mount failure: - vol->nls_map, loaded by ntfs_init_fs_context() via load_nls_default() (or replaced by an explicit nls= option in ntfs_parse_param()), is never unload_nls()'d. - vol->volume_label, allocated by load_system_files() through ntfs_ucstonls() once the $Volume name attribute has been parsed, is not released by load_system_files()'s own error labels nor by the fill_super() inline cleanup that only runs on d_make_root() failure. Any later failure inside load_system_files() leaks it. - vol->lcn_empty_bits_per_page was kvfree()'d in unl_upcase_iput_tmp_ino_err_out_now without clearing the pointer, so it could not be folded into a single common cleanup. Because the failure paths never call ntfs_volume_free() and never reach the d_make_root() inline cleanup block (it sits above the label and is jumped over by the load_system_files() / kvmalloc failure gotos), these resources accumulate per failed mount attempt with no chance of recovery short of unloading the module. This is a silent leak: the inodes loaded prior to failure remain hashed but generic_shutdown_super() skips evict_inodes() when sb->s_root is unset, so no CHECK_DATA_CORRUPTION warning is emitted either. Move the per-volume frees down to err_out_now and drop the lcn_empty_bits_per_page kvfree() from the upper label so the cleanup is performed exactly once on every failure path. Using unconditional kvfree() / kfree() / unload_nls() is safe because they all accept NULL and the upper labels that previously freed nls_map (the d_make_root() inline cleanup) already clear the pointer. Signed-off-by: DaeMyung Kang <charsyam@gmail.com> Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
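The "free everything once at the common label" pattern can be sketched in user space. The code below is a simplified model under stated assumptions: `struct volume`, `fill_super()`, the `fail_step` parameter and the allocation counter are all hypothetical, and the success path frees the resources too only so the sketch can be checked for leaks.

```c
#include <assert.h>
#include <stdlib.h>

struct volume {
	char *nls_map;
	char *volume_label;
	char *lcn_empty_bits;
};

static int live_allocs;   /* number of outstanding tracked allocations */

static char *track_alloc(void)
{
	char *p = malloc(16);

	if (p)
		live_allocs++;
	return p;
}

/* Like kfree()/kvfree(), accepts NULL so it can be called unconditionally. */
static void track_free(char *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

/* fail_step selects which setup step fails (0 = none). */
static int fill_super(int fail_step)
{
	struct volume vol = {0};
	int err = -1;

	vol.nls_map = track_alloc();
	if (!vol.nls_map || fail_step == 1)
		goto err_out_now;
	vol.volume_label = track_alloc();
	if (!vol.volume_label || fail_step == 2)
		goto err_out_now;
	vol.lcn_empty_bits = track_alloc();
	if (!vol.lcn_empty_bits || fail_step == 3)
		goto err_out_now;
	err = 0;

err_out_now:
	/* One cleanup site: safe on every path because NULL is accepted. */
	track_free(vol.lcn_empty_bits);
	track_free(vol.volume_label);
	track_free(vol.nls_map);
	return err;
}
```

Zero-initializing the structure is what makes the unconditional frees safe: any member not yet allocated is still NULL when the label is reached.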
7 daysMerge tag 'v7.1-rc5' into driver-core-nextDanilo Krummrich142-1243/+2482
We need the driver-core fixes in here as well to build on top of. Signed-off-by: Danilo Krummrich <dakr@kernel.org>
8 daysbtrfs: don't force DIO writes to be serializedMark Harmstone1-0/+1
Before btrfs switched to the new mount API in 2023, we were setting SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the filesystem may have files which don't have security xattrs, enabling it to do some optimizations. Unfortunately this was missed in the transition, meaning that IS_NOSEC will always return false for a btrfs inode. This means that btrfs_direct_write() calls will always get the inode lock exclusively, meaning that DIO writes to the same file will be serialized. On my machine, this one-line change results in a ~59% improvement in DIO throughput: Before patch: test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64 ... fio-3.39 Starting 32 processes test: Laying out IO file (1 file / 1024MiB) Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s] test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026 write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets bw ( KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808 iops : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808 cpu : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163 IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec After patch: test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64 ... 
fio-3.39 Starting 32 processes test: Laying out IO file (1 file / 1024MiB) Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s] test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026 write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets bw ( MiB/s): min= 619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808 iops : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808 cpu : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160 IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec The script to reproduce that: #!/bin/bash mkfs.btrfs -f /dev/nvme0n1 mount /dev/nvme0n1 /mnt/test mkdir /mnt/test/nocow chattr +C /mnt/test/nocow fio /root/test.fio # cat /root/test.fio [global] rw=randwrite ioengine=io_uring iodepth=64 size=1g direct=1 startdelay=20 force_async=4 ramp_time=5 runtime=60 group_reporting=1 numjobs=32 time_based disk_util=0 clat_percentiles=0 disable_lat=1 disable_clat=1 disable_slat=1 filename=/mnt/test/nocow/fiofile [test] name=test bs=4k stonewall This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed through PCI passthrough. The figures for XFS and ext4 in comparison are both about ~3GB/s. Fixes: ad21f15b0f79 ("btrfs: switch to the new mount API") Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: limit size of bios submitted from writebackJan Kara5-0/+50
Currently btrfs_writepages() just accumulates as large a bio as possible (within writeback_control constraints) and then submits it. This can, however, lead to significant latency in writeback IO submission (I have observed tens of milliseconds) because the submitted bio can easily be over a hundred megabytes. Consequently this leads to IO pipeline stalls and reduced throughput. At the same time, beyond a certain size, submitting such a large bio provides diminishing returns because the bio is split by the block layer immediately anyway. So compute an estimate of the bio size beyond which we are unlikely to improve performance, and submit the bio for writeback once we accumulate that much, to keep the IO pipeline busy. This improves writeback throughput for sequential writes by about 15% on the test machine I was using. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> [ Fix the handling of missing device to avoid NULL pointer dereference. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
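The submit-at-a-threshold idea can be sketched with a small accumulator. This is only a model: `struct bio_ctrl` here, `add_block()`, `submit_bio_now()` and the 1 MiB limit used in the usage note are hypothetical and say nothing about how btrfs actually estimates the limit.

```c
#include <assert.h>
#include <stddef.h>

struct bio_ctrl {
	size_t bio_bytes;      /* bytes accumulated in the pending bio */
	size_t max_bytes;      /* estimated size beyond which growth stops helping */
	unsigned int submitted;/* number of bios submitted so far */
	size_t largest;        /* largest bio submitted so far */
};

/* Submit whatever has been accumulated, if anything. */
static void submit_bio_now(struct bio_ctrl *c)
{
	if (c->bio_bytes == 0)
		return;
	c->submitted++;
	if (c->bio_bytes > c->largest)
		c->largest = c->bio_bytes;
	c->bio_bytes = 0;
}

/* Add one block; submit early once the pending bio reaches the limit. */
static void add_block(struct bio_ctrl *c, size_t len)
{
	assert(c->max_bytes > 0);
	c->bio_bytes += len;
	if (c->bio_bytes >= c->max_bytes)
		submit_bio_now(c);
}
```

With a 1 MiB limit and 1000 blocks of 4 KiB, the model submits three full 1 MiB bios during accumulation and one smaller bio at the final flush, instead of a single bio of roughly 4 MB.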
8 daysbtrfs: simplify how first hit is passed to __btrfs_abort_transaction()David Sterba2-8/+22
Optimize the btrfs_abort_transaction() for size as it (by our convention) must be put right after the error condition is detected. The exact file:line is reported so there's a portion that must be inlined. As this is cold code it bloats functions. In previous patch "btrfs: move transaction abort message to __btrfs_abort_transaction()" the error message was moved to the common helper, saving like 20KiB of btrfs.ko and several instructions per call site and some stack space. There's little left to be optimized, we need to keep the atomic test_and_set_bit() and to convey that as 'first hit' to __btrfs_abort_transaction(). Right now it's a bool, which takes 8 bytes on stack for each call but it's 1 bit of information. We can encode that to some of the other parameters. For that let's use the 'error' parameter, by convention it's negative errno so we can reliably detect if it's the first hit or a later error. Also the negation is usually implemented by a single instruction (NEG on x86_64) so the resulting object code is kept short. This reduces btrfs.ko by 8K and stack in several functions by 8 bytes. Cumulative effect with the other commit is -30K of btrfs.ko. While the encoding is an implementation detail, it's contained within the API. Making the transaction abort calls very light is desired. Signed-off-by: David Sterba <dsterba@suse.com>
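Folding the one-bit "first hit" flag into the sign of the error value can be sketched as below. The direction of the encoding (positive means first hit) is an assumption made for this illustration, as are the helper names; the commit message only states that the sign of the negative-errno convention is used.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct abort_info {
	int error;       /* always a negative errno */
	bool first_hit;  /* true if this call was the first to abort */
};

/* Caller side: negate the error on the first hit (assumed encoding). */
static int encode_abort_error(int error, bool first_hit)
{
	assert(error < 0);
	return first_hit ? -error : error;
}

/* Helper side: recover both pieces of information from one int. */
static struct abort_info decode_abort_error(int encoded)
{
	struct abort_info info;

	assert(encoded != 0);
	info.first_hit = encoded > 0;
	info.error = encoded > 0 ? -encoded : encoded;
	return info;
}
```

The scheme relies on the error never being zero or positive at the call site, which is why a validity check on the error value is a natural companion to it.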
8 daysbtrfs: validate negative error number passed to btrfs_abort_transaction()David Sterba1-0/+24
In preparation to encode more information to the error value add a step that verifies if the value is valid (i.e. < 0). This works for compile-time and runtime (in debugging mode). The compile-time check recognizes direct constants and defines an array type. An invalid condition leads to negative array size which is caught by compiler. The runtime check constructs the array type from the condition and only verifies the correct size, as we don't need to tweak the size to be negative. The sizeof() expressions do not generate any code. In the debugging config the warning adds about 9KiB of btrfs.ko code size. The array size trick is needed as we can't use static_array(), not even with __builtin_constant_p(). Sample error message: In file included from inode.c:40: inode.c: In function ‘__cow_file_range_inline’: transaction.h:261:26: error: size of unnamed array is negative 261 | (void)sizeof(char[-!(__builtin_constant_p(error) ? (error) < 0 : 1)]); \ | ^ transaction.h:275:9: note: in expansion of macro ‘VERIFY_NEGATIVE_ERROR’ 275 | VERIFY_NEGATIVE_ERROR(error); \ | ^~~~~~~~~~~~~~~~~~~~~ inode.c:665:17: note: in expansion of macro ‘btrfs_abort_transaction’ 665 | btrfs_abort_transaction(trans, 17); | ^~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: David Sterba <dsterba@suse.com>
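The negative-array-size trick can be tried outside the kernel with GCC or Clang. The macro below is a variant written for this sketch, not the one from transaction.h: it uses `1 - 2 * !(cond)` so the passing case has size 1 rather than relying on a zero-length array, and `abort_with()` and `abort_checked()` are hypothetical names.

```c
#include <assert.h>

/*
 * Compile-time check: if 'error' is a constant and not negative, the array
 * size becomes -1 and the compiler rejects it. Non-constant values pass.
 * sizeof() is not evaluated, so no code is generated.
 * Requires __builtin_constant_p (GCC/Clang).
 */
#define VERIFY_NEGATIVE_ERROR(error) \
	((void)sizeof(char[1 - 2 * !(__builtin_constant_p(error) ? (error) < 0 : 1)]))

static int abort_with(int error)
{
	assert(error < 0);   /* runtime counterpart, active unless NDEBUG is set */
	return error;
}

/* Wrapper that runs the compile-time check before calling the helper. */
#define abort_checked(error) \
	(VERIFY_NEGATIVE_ERROR(error), abort_with(error))
```

Writing `abort_checked(-5)` compiles, while `abort_checked(17)` is expected to fail the build with a "size of array is negative" style diagnostic, similar to the sample message above.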
8 daysbtrfs: fix invalid pointer dereference in __btrfs_run_delayed_refs()Filipe Manana1-2/+3
In the beginning of the loop, we try to obtain a locked delayed ref head, if 'locked_ref' is currently NULL, by calling btrfs_select_ref_head(), which can return an error pointer. If the error pointer is -EAGAIN we do a continue and go back to the beginning of the loop, which will not try again to call btrfs_select_ref_head() since 'locked_ref' is no longer NULL but it's ERR_PTR(-EAGAIN), and then we do: spin_lock(&locked_ref->lock); against a ERR_PTR(-EAGAIN) value, generating an invalid pointer dereference. Fix this by ensuring that 'locked_ref' is set to NULL when btrfs_select_ref_head() returns ERR_PTR(-EAGAIN) and incrementing 'count' as well, to prevent infinite looping. We do this by doing a goto to the bottom of the loop that already sets 'locked_ref' to NULL and does a cond_resched(), with an increment to 'count' right before the goto. These measures were in place before the refactoring in commit 0110a4c43451 ("btrfs: refactor __btrfs_run_delayed_refs loop") but were unintentionally lost afterwards. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/linux-btrfs/ag8ARRwykv8bpJ87@stanley.mountain/ Fixes: 0110a4c43451 ("btrfs: refactor __btrfs_run_delayed_refs loop") Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: protect sb_write_pointer() with invalidate lockKangNing Liao1-0/+2
sb_write_pointer() reads the super block from the block device page cache using read_cache_page_gfp(). This has the same race with BLKBSZSET as the one fixed by commit 3f29d661e568 ("btrfs: sync read disk super and set block size"). Take the mapping invalidate lock around read_cache_page_gfp() to serialize the read against block size changes. Signed-off-by: KangNing Liao <lkangn.kernel@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for btrfs_sync_log()Filipe Manana1-0/+15
btrfs_sync_log() is one of the main functions called during a fsync. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for btrfs_log_new_name()Filipe Manana1-2/+5
btrfs_log_new_name() is an important function that affects inode logging and is called during link and rename operations. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for btrfs_record_new_subvolume()Filipe Manana1-0/+2
btrfs_record_new_subvolume() is an important operation that affects inode logging and is called during subvolume creation. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for btrfs_record_snapshot_destroy()Filipe Manana1-0/+2
btrfs_record_snapshot_destroy() is an important operation that affects inode logging and is called during subvolume/snapshot deletion as well as during rmdir. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for btrfs_record_unlink_dir()Filipe Manana1-0/+2
btrfs_record_unlink_dir() is an important operation that affects inode logging and is called during unlink and rename operations. Add a trace event for it to help debug issues. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for log_new_delayed_dentries()Filipe Manana1-0/+10
log_new_delayed_dentries() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: use simple assertions where enough during inode logging and replayFilipe Manana1-3/+2
In overwrite_item(): There's no point in printing the root's ID if the assertion fails, since it can only be BTRFS_TREE_LOG_OBJECTID if it fails. In log_new_delayed_dentries(): There's no point in using a verbose assertion to print the value of ctx->logging_new_delayed_dentries because it's a boolean, so if the assertion fails we know its value is true (1). So convert them to simpler assertion to make the code less verbose. It also slightly reduces the object size, at least on x86_64 using Debian's gcc 14.2.0-19 (if CONFIG_BTRFS_ASSERT is enabled in the kernel config, which is the case for SUSE distributions for example). Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2028244 197176 15624 2241044 223214 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 2028228 197176 15624 2241028 223204 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for log_conflicting_inodes()Filipe Manana1-0/+9
log_conflicting_inodes() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for add_conflicting_inode()Filipe Manana1-13/+26
add_conflicting_inode() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for log_new_dir_dentries()Filipe Manana1-2/+8
log_new_dir_dentries() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for log_all_new_ancestors()Filipe Manana1-11/+23
log_all_new_ancestors() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for btrfs_log_all_parents()Filipe Manana1-9/+20
btrfs_log_all_parents() is an important step called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for btrfs_log_inode()Filipe Manana2-5/+12
btrfs_log_inode() is one of the most important steps called during a fsync, as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: use a named enum for the log mode in inode log functionsFilipe Manana2-37/+32
We use an unnamed enum for the log mode and then pass it around log functions as an int type with the odd name "inode_only", which suggests a boolean. So add a name to the enum, change the type everywhere to that enum and rename the parameters to something clearer: "log_mode". Also move the enum into tree-log.h, since it will be used later by new trace events. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
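
A minimal sketch of the kind of change described, in plain C. The enum name, its members and both helper functions are hypothetical, since the commit message does not show the final identifiers; only the parameter names "inode_only" and "log_mode" come from the text above.

```c
/*
 * Hypothetical names throughout: the commit message does not give the
 * enum's name or its members.
 */
enum example_log_mode {
	EXAMPLE_LOG_ALL,     /* log everything about the inode */
	EXAMPLE_LOG_EXISTS,  /* log only enough to record that the inode exists */
};

/* Before: the mode travels as an int whose name suggests a boolean. */
static const char *describe_before(int inode_only)
{
	return inode_only == EXAMPLE_LOG_EXISTS ? "exists" : "all";
}

/* After: the parameter type and name state what is being passed. */
static const char *describe_after(enum example_log_mode log_mode)
{
	switch (log_mode) {
	case EXAMPLE_LOG_ALL:
		return "all";
	case EXAMPLE_LOG_EXISTS:
		return "exists";
	}
	return "unknown";
}
```

With a named enum, a switch that omits a member can be flagged by the compiler (for example with -Wswitch), and a named type is easier to reference from other code, which fits the stated plan to use it in trace events.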
8 daysbtrfs: tracepoints: add trace event for btrfs_log_inode_parent()Filipe Manana1-9/+22
btrfs_log_inode_parent() is one of the most important steps called during a fsync operation as well as during rename and link operations on inodes that were previously logged. Add trace events for when entering and exiting that function. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: tracepoints: add trace event for when fsync finishesFilipe Manana1-1/+3
Currently we only have a trace event for when a fsync operation starts, but this alone is not very helpful. Add a trace event for when fsync finishes, which reports its return value, so that using tracing we can see which other trace events happened in between (several will be added soon for inode logging steps) and even measure execution time. So rename the existing trace event btrfs_sync_file to btrfs_sync_file_enter and add the trace event btrfs_sync_file_exit. The naming is similar to what ext4 does (ext4_sync_file_enter and ext4_sync_file_exit) and with similar information reported. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
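
Why a paired exit event makes timing possible can be sketched in userspace C. Everything here is an assumption made for illustration: struct sync_trace, trace_sync_enter(), trace_sync_exit(), trace_elapsed_ns() and traced_sync() are hypothetical and are not the kernel's trace event definitions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical record of one traced call, not a kernel trace layout. */
struct sync_trace {
	struct timespec enter;  /* timestamp taken by the "enter" event */
	struct timespec exit;   /* timestamp taken by the "exit" event */
	int ret;                /* return value reported by the "exit" event */
};

static void trace_sync_enter(struct sync_trace *t)
{
	clock_gettime(CLOCK_MONOTONIC, &t->enter);
}

static void trace_sync_exit(struct sync_trace *t, int ret)
{
	clock_gettime(CLOCK_MONOTONIC, &t->exit);
	t->ret = ret;
}

/* Duration between the two events, in nanoseconds. */
static int64_t trace_elapsed_ns(const struct sync_trace *t)
{
	return (int64_t)(t->exit.tv_sec - t->enter.tv_sec) * 1000000000LL +
	       (int64_t)(t->exit.tv_nsec - t->enter.tv_nsec);
}

/* Stand-in for the traced operation; simulated_ret models its result. */
static int traced_sync(struct sync_trace *t, int simulated_ret)
{
	int ret;

	trace_sync_enter(t);
	ret = simulated_ret;  /* the real work would happen here */
	trace_sync_exit(t, ret);
	return ret;
}
```

With only the enter event there is no second timestamp and no return value, so neither the duration nor the outcome of the call can be recovered from the trace.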
8 daysbtrfs: remove redundant writeback error check during fsyncFilipe Manana1-8/+4
If we can skip logging the inode during fsync, we check for writeback errors in the inode's mapping by calling filemap_check_wb_err() and then jump to the 'out_release_extents' label, which in turn jumps to the 'out' label under which we check again for a writeback error by calling file_check_and_advance_wb_err(). So the filemap_check_wb_err() ends up being redundant. This happens since commit 333427a505be ("btrfs: minimal conversion to errseq_t writeback error reporting on fsync"). Remove the filemap_check_wb_err() call. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
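
The control flow described above can be modelled in userspace C to show the double check. This is a sketch under assumptions: check_wb_err(), wb_checks, fsync_before() and fsync_after() are hypothetical stand-ins, and only the label names 'out_release_extents' and 'out' are taken from the commit message.

```c
#include <stdbool.h>

static int wb_checks;  /* counts how often the error state is checked */

/* Hypothetical stand-in for a writeback error check. */
static int check_wb_err(int pending_err)
{
	wb_checks++;
	return pending_err;
}

/* Shape before the change: the skip path checks, then 'out' checks again. */
static int fsync_before(bool skip_logging, int pending_err)
{
	int ret = 0;
	int err;

	if (skip_logging) {
		ret = check_wb_err(pending_err);  /* the redundant check */
		goto out_release_extents;
	}
	/* ... inode logging would happen here ... */
out_release_extents:
	goto out;
out:
	err = check_wb_err(pending_err);
	if (!ret)
		ret = err;
	return ret;
}

/* Shape after the change: only the check under 'out' remains. */
static int fsync_after(bool skip_logging, int pending_err)
{
	int ret = 0;
	int err;

	if (skip_logging)
		goto out_release_extents;
	/* ... inode logging would happen here ... */
out_release_extents:
	goto out;
out:
	err = check_wb_err(pending_err);
	if (!ret)
		ret = err;
	return ret;
}
```

In this model both variants return the same value on the skip path, while the first one performs the check twice, which is the redundancy the commit removes.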