| Age | Commit message (Collapse) | Author | Files | Lines |
|
https://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
# Conflicts:
# drivers/cpufreq/Kconfig.x86
# drivers/cpufreq/Makefile
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
|
|
|
|
# Conflicts:
# fs/btrfs/defrag.c
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
|
|
# Conflicts:
# fs/fuse/dev.c
|
|
|
|
|
|
https://github.com/Paragon-Software-Group/linux-ntfs3.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/ntfs.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
# Conflicts:
# fs/exfat/file.c
|
|
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
# Conflicts:
# fs/fuse/dev.c
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs.git
|
|
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
|
|
Processes can write to the last page of a file using mmap, and when the file
size is not a multiple of the page size, this can be used to write beyond the
end of the file. This is sometimes referred to as page poisoning, and it is
not a problem in itself because the data beyond eof will be ignored. However,
we currently fail to clear out any space beyond the end of the file that we
skip over when the file size is increased, so that "poison" can end up getting
exposed. Fix that.
Fixes xfstest generic/363.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The mode argument of fallocate_chunk() became unused in commit
1885867b84d5 ("GFS2: Update i_size properly on fallocate"), so remove it.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
# New commits in x86/cache:
1cfa74c683ea ("fs/resctrl: Document tasks file behaviour for task id 0 and idle tasks")
9a1646211f8c ("fs/resctrl: Document that automatic counter assignment is best effort")
3aec86e4ea01 ("fs/resctrl: Continue counter allocation after failure")
ee3d4c81d89c ("fs/resctrl: Add monitor property 'mbm_cntr_assign_fixed'")
f52abe650241 ("fs/resctrl: Disallow the software controller when MBM counters are assignable")
94a1206522d1 ("x86,fs/resctrl: Create 'event_filter' files read only if they're not configurable")
7625632fed43 ("fs/resctrl: Tidy up the error path in resctrl_mkdir_event_configs()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
# New commits in irq/core:
171cc0d9eed1 ("genirq/proc: Speed up /proc/interrupts iteration")
61b51a167c52 ("genirq/proc: Runtime size the chip name")
7603e0575d8a ("genirq: Expose irq_find_desc_at_or_after() in core code")
1d9c4745bfb6 ("genirq: Add rcuref count to struct irq_desc")
34594da7650d ("genirq/proc: Increase default interrupt number precision to four")
2d62735f1d4a ("genirq: Calculate precision only when required")
4892e5e71ec9 ("genirq: Cache the condition for /proc/interrupts exposure")
3ba92f6a2820 ("genirq/manage: Make NMI cleanup RT safe")
b99dc723b12e ("genirq: Expose nr_irqs in core code")
cca5e6fa791b ("scripts/gdb: Update x86 interrupts to the array based storage")
d6b70b16b4e7 ("x86/irq: Move IOAPIC misrouted and PIC/APIC error counts into irq_stats")
8713f2e596a1 ("x86/irq: Suppress unlikely interrupt stats by default")
2b57c69917ee ("x86/irq: Make irqstats array based")
0179464391af ("genirq/proc: Utilize irq_desc::tot_count to avoid evaluation")
95c33a64f203 ("genirq/proc: Avoid formatting zero counts in /proc/interrupts")
115bbf0c1b60 ("x86/irq: Optimize interrupts decimals printing")
c2c7983c93f5 ("genirq/proc: Size interrupt directory names for 10-digit interrupt numbers")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
# New commits in timers/merge:
3eb4923e6851 ("clocksource: Add devm_clocksource_register_*() helpers")
c8d32a0389fb ("timers: Fix flseep() typo in kernel-doc comment")
5d330d652d7a ("hrtimer: Fix the bogus return type of __hrtimer_start_range_ns()")
3af1f49f415d ("hrtimer: Return ktime_t from hrtimer_get_next_event()/hrtimer_next_event_without()")
33d4bfc49613 ("clocksource: Clean up clocksource_update_freq() functions")
ed3b3c497668 ("alarmtimer: Remove stale return description from alarm_handle_timer()")
b00385b8d081 ("selftests/posix_timers: Use CLOCK_THREAD_CPUTIME_ID for ITIMER_PROF measurements")
cab0cd0130eb ("scripts/timers: Add timer_migration_tree.py")
5a7dfbcbbdb6 ("timers/migration: Handle capacity in connect tracepoints")
098cbaad8e57 ("timers/migration: Split per-capacity hierarchies")
3ba25488380f ("timers/migration: Track CPUs in a hierarchy")
ff65875f80d1 ("timers/migration: Abstract out hierarchy to prepare for CPU capacity awareness")
ed78a7019419 ("alarmtimer: Remove unused interfaces")
12e4311aa5b2 ("netfilter: xt_IDLETIMER: Switch to alarm_start_timer()")
9fa2e38ab749 ("power: supply: charger-manager: Switch to alarm_start_timer()")
7dda99952ced ("fs/timerfd: Use the new alarm/hrtimer functions")
f4b58f61da79 ("alarmtimer: Convert posix timer functions to alarm_start_timer()")
183d00b72713 ("alarmtimer: Provide alarm_start_timer()")
acc071343d29 ("posix-timers: Switch to hrtimer_start_expires_user()")
cfb7fe3fdd4c ("posix-timers: Handle the timer_[re]arm() return value")
6fdb2677a594 ("posix-timers: Expand timer_[re]arm() callbacks with a boolean return value")
b40c927345a9 ("hrtimer: Use hrtimer_start_expires_user() for hrtimer sleepers")
bd5956166d20 ("hrtimer: Provide hrtimer_start_range_ns_user()")
68ed094971b0 ("clocksource/drivers/timer-of: Make the code compatible with modules")
2423405880c2 ("clocksource/drivers/mmio: Make the code compatible with modules")
fed9f727cc3f ("clocksource/drivers/sun5i: Handle error returns from devm_reset_control_get_optional_exclusive()")
045a9dac7eb7 ("clocksource/drivers/timer-rtl-otto: Make rttm_cs variable static")
b385caf91868 ("dt-bindings: timer: fsl,imxgpt: add compatible string fsl,imx25-epit")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
ocfs2_validate_gd_parent() only bounds bg_bits against the parent
allocator's chain geometry. A malicious descriptor can still claim a
bg_size/bg_bits pair that exceeds the bitmap bytes that physically fit in
the group descriptor block, so later bitmap scans and bit updates can run
past bg_bitmap.
Add a physical-cap check based on ocfs2_group_bitmap_size() for the parent
allocator type and reject descriptors whose bg_size or bg_bits exceed that
capacity. Keep the existing chain geometry check so both the on-disk
bitmap layout and the allocator metadata must agree before the descriptor
is used.
Validation reproduced this kernel report:
KASAN use-after-free in _find_next_bit+0x7f/0xc0
Read of size 8
Call trace:
dump_stack_lvl+0x66/0xa0 (?:?)
print_report+0xd0/0x630 (?:?)
_find_next_bit+0x7f/0xc0 (?:?)
srso_alias_return_thunk+0x5/0xfbef5 (?:?)
__virt_addr_valid+0x188/0x2f0 (?:?)
kasan_report+0xe4/0x120 (?:?)
ocfs2_find_max_contig_free_bits+0x35/0x70 (fs/ocfs2/suballoc.c:1375)
ocfs2_block_group_set_bits+0x472/0x4b0 (fs/ocfs2/suballoc.c:1457)
ocfs2_cluster_group_search+0x16b/0x440 (fs/ocfs2/suballoc.c:86)
ocfs2_bg_discontig_fix_result+0x1ef/0x230 (fs/ocfs2/suballoc.c:1786)
ocfs2_search_chain+0x8f8/0x10a0 (fs/ocfs2/suballoc.c:1886)
get_page_from_freelist+0x70e/0x2370 (?:?)
lock_release+0xc6/0x290 (?:?)
do_raw_spin_unlock+0x9a/0x100 (?:?)
kasan_unpoison+0x27/0x60 (?:?)
__bfs+0x147/0x240 (?:?)
get_page_from_freelist+0x83d/0x2370 (?:?)
ocfs2_claim_suballoc_bits+0x38c/0xe70 (fs/ocfs2/suballoc.c:96)
sched_domains_numa_masks_clear+0x70/0xd0 (?:?)
check_irq_usage+0xe8/0xb70 (?:?)
__ocfs2_claim_clusters+0x18d/0x4c0 (fs/ocfs2/suballoc.c:2497)
check_path+0x24/0x50 (?:?)
rcu_is_watching+0x20/0x50 (?:?)
check_prev_add+0xfd/0xd00 (?:?)
ocfs2_add_clusters_in_btree+0x17d/0x810 (fs/ocfs2/suballoc.c:?)
__folio_batch_add_and_move+0x1f5/0x3d0 (?:?)
ocfs2_add_inode_data+0xd9/0x120 (fs/ocfs2/suballoc.c:?)
filemap_add_folio+0x105/0x1f0 (?:?)
ocfs2_write_begin_nolock+0x29f7/0x2f80 (fs/ocfs2/suballoc.c:3043)
ocfs2_read_inode_block+0xb5/0x110 (fs/ocfs2/suballoc.c:?)
down_write+0xf5/0x180 (?:?)
ocfs2_write_begin+0x180/0x240 (fs/ocfs2/suballoc.c:?)
__mark_inode_dirty+0x758/0x9a0 (?:?)
inode_to_bdi+0x41/0x90 (?:?)
balance_dirty_pages_ratelimited_flags+0xf8/0x1d0 (?:?)
generic_perform_write+0x252/0x440 (?:?)
mnt_put_write_access_file+0x16/0x70 (?:?)
file_update_time_flags+0xe4/0x200 (?:?)
ocfs2_file_write_iter+0x80a/0x1320 (fs/ocfs2/suballoc.c:?)
lock_acquire+0x184/0x2f0 (?:?)
ksys_write+0xd2/0x170 (?:?)
apparmor_file_permission+0xf5/0x310 (?:?)
read_zero+0x8d/0x140 (?:?)
lock_is_held_type+0x8f/0x100 (?:?)
Link: https://lore.kernel.org/20260524111248.1429884-1-rollkingzzc@gmail.com
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The locking_state debugfs iterator snapshots struct ocfs2_lock_res by
value under ocfs2_dlm_tracking_lock and later formats that copy in
ocfs2_dlm_seq_show(). That is fine for the inline fields, but the
userspace fsdlm stack stores the LVB through lksb_fsdlm.sb_lvbptr. Once
the iterator drops the tracking lock, a copied non-NULL sb_lvbptr still
points into the original lockres owner, so teardown can free that
container before the debugfs dump walks the raw LVB bytes.
Rebase the copied sb_lvbptr to the copied l_lksb before dumping the raw
LVB. The seq snapshot already carries the inline LVB storage reserved in
struct ocfs2_dlm_lksb, so the debugfs reader can dump the copied bytes
without borrowing the original lockres lifetime.
The buggy scenario involves two paths, with each column showing the order
within that path:
locking_state reader: lockres teardown:
1. ocfs2_dlm_seq_start()/next() 1. file release or another owner
copies struct ocfs2_lock_res teardown reaches
2. ocfs2_dlm_seq_show() formats ocfs2_lock_res_free()
the copied row 2. the lockres is removed from the
3. ocfs2_dlm_lvb() follows the tracking list
copied sb_lvbptr 3. the owner frees the original
lockres container
Validation reproduced this kernel report:
KASAN slab-use-after-free in ocfs2_dlm_seq_show+0x1bd/0x430
RIP: 0033:0x7f8ec4b1e29d
The buggy address belongs to the object at ffff88810a1e0800 which belongs
to the cache kmalloc-1k of size 1024
The buggy address is located 368 bytes inside of freed 1024-byte region
[ffff88810a1e0800, ffff88810a1e0c00)
Read of size 1
Call trace:
dump_stack_lvl+0x66/0xa0
print_report+0xce/0x630
ocfs2_dlm_seq_show+0x1bd/0x430 (fs/ocfs2/dlmglue.c:3137)
srso_alias_return_thunk+0x5/0xfbef5
__virt_addr_valid+0x19f/0x330
kasan_report+0xe0/0x110
seq_read_iter+0x29d/0x790
seq_read+0x20a/0x280
find_held_lock+0x2b/0x80
rcu_read_unlock+0x18/0x70
full_proxy_read+0x9e/0xd0
vfs_read+0x12c/0x590
ksys_read+0xd2/0x170
do_user_addr_fault+0x65a/0x890
do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Allocated by task stack:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
__kasan_kmalloc+0xaa/0xb0
ocfs2_file_open+0x13e/0x300
do_dentry_open+0x233/0x7f0
vfs_open+0x5a/0x1b0
path_openat+0x66d/0x1540
do_file_open+0x186/0x2b0
do_sys_openat2+0xce/0x150
__x64_sys_openat+0xd0/0x140
do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task stack:
kasan_save_stack+0x33/0x60
kasan_save_track+0x14/0x30
kasan_save_free_info+0x3b/0x60
__kasan_slab_free+0x5f/0x80
kfree+0x313/0x590
ocfs2_file_release+0x138/0x260
__fput+0x1df/0x4b0
fput_close_sync+0xd2/0x170
__x64_sys_close+0x55/0x90
do_syscall_64+0x115/0x6a0 (arch/x86/entry/syscall_64.c:87)
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Link: https://lore.kernel.org/20260525041726.4112882-1-rollkingzzc@gmail.com
Fixes: cf4d8d75d8ab ("ocfs2: add fsdlm to stackglue")
Assisted-by: Codex:gpt-5.5
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
READ_ONLY_THP_FOR_FS is no longer present, remove related comment.
Link: https://lore.kernel.org/20260517135416.1434539-11-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
filemap_nr_thps*() are removed, the related field, address_space->nr_thps,
is no longer needed. Remove it. This shrinks struct address_space by 8
bytes on 64-bit systems which may increase the number of inodes we can
cache.
Link: https://lore.kernel.org/20260517135416.1434539-8-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nico Pache <npache@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without
large folio support, so that read-only THPs created in these FSes are not
seen by the FSes when the underlying fd becomes writable. Now read-only
PMD THPs only appear in a FS with large folio support and the supported
orders include PMD_ORDER.
READ_ONLY_THP_FOR_FS was using mapping->nr_thps, inode->i_writecount, and
smp_mb() to prevent writes to a read-only THP and collapsing writable
folios into a THP. In collapse_file(), mapping->nr_thps is increased,
then smp_mb(), and if inode->i_writecount > 0, collapse is stopped, while
do_dentry_open() first increases inode->i_writecount, then a full memory
fence, and if mapping->nr_thps > 0, all read-only THPs are truncated.
Now this mechanism can be removed along with READ_ONLY_THP_FOR_FS code,
since a dirty folio check has been added after try_to_unmap() in
collapse_file() to prevent dirty folios from being collapsed as clean.
Link: https://lore.kernel.org/20260517135416.1434539-7-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "use vma locks for proc/pid/{smaps|numa_maps} reads", v2.
Use per-vma locks when reading /proc/pid/smaps and /proc/pid/numa_maps
similar to /proc/pid/maps to reduce contention on central mmap_lock. One
major difference between maps and smaps/numa_maps reading is that the
latter executes page table walk which can't be done under RCU due to a
possibility of sleeping. Therefore we drop RCU read lock before this walk
while keeping the VMA locked. After the walk we retake RCU read lock,
reset VMA iterator and proceed with the next VMA.
The last two patches extend /proc/pid/maps test to cover /proc/pid/smaps
reading during concurrent address space modification.
This patch (of 3):
proc/pid/{smaps|numa_maps} can be read using the combination of RCU and
VMA read locks, similar to proc/pid/maps. RCU is required to safely
traverse the VMA tree and VMA lock stabilizes the VMA being processed and
the pagetable walk.
Link: https://lore.kernel.org/20260426062718.1238437-1-surenb@google.com
Link: https://lore.kernel.org/20260426062718.1238437-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Liam R. Howlett <liam@infradead.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c",
v3.
These patches merge fs/userfaultfd.c into mm/userfaultfd.c and make
functions used only inside mm/userfaultfd.c static.
This patch (of 2):
Historically userfaultfd implementation has been split between
fs/userfaultfd.c and mm/userfaultfd.c.
The mm/ part implemented memory management operations, while the fs/ part
implemented file descriptor handling and called into the mm/ part for the
actual memory management work.
This separation is quite artificial and fs/userfaultfd.c does not seem to
belong to fs/ because it's only a user if vfs APIs and like for other
users, for example, memfd and secretmem, the file descriptor handling
could live in mm/ as well.
"Append" fs/userfaultfd.c to mm/userfaultfd and update fs/Makefile and
MAINTAINERS accordingly.
No intended functional changes.
Link: https://lore.kernel.org/20260523173759.3964908-1-rppt@kernel.org
Link: https://lore.kernel.org/20260523173759.3964908-2-rppt@kernel.org
Assisted-by: Copilot:claude-opus-4-6
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Christian Brauner (Amutable) <brauner@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Sashiko says:
mremap_userfaultfd_prep() increments ctx->mmap_changing to stall
concurrent operations, but mremap_userfaultfd_fail() does not
decrement it before dropping the context reference.
If an mremap operation fails, ctx->mmap_changing remains elevated. This
will causes subsequent userfaultfd operations like a UFFDIO_COPY to fail
with -EAGAIN.
Decrement ctx->mmap_changing in mremap_userfaultfd_fail().
Link: https://sashiko.dev/#/patchset/20260430113512.115938-1-rppt@kernel.org
Link: https://lore.kernel.org/20260513081416.495963-1-rppt@kernel.org
Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
On a volume mounted without OCFS2_FEATURE_INCOMPAT_SPARSE_ALLOC, a
non-inline regular file with non-zero i_size and zero i_clusters is
structurally malformed: the extent map declares no allocated clusters yet
the size header claims content exists. Keep rejecting that shape, but
express it through a shared predicate so the same invariant is available
to normal inode reads and online filecheck.
The same zero-cluster shape is also malformed for non-inline directories.
ocfs2 directory growth allocates backing storage before advancing i_size,
and ocfs2_dir_foreach_blk_el() later walks until ctx->pos reaches
i_size_read(inode). A forged directory dinode with a huge i_size and no
clusters would repeatedly fail on holes while advancing through the
claimed size.
Sparse regular files remain exempt: on sparse-alloc volumes, truncate can
legitimately grow i_size without allocating clusters. System inodes and
inline-data dinodes also retain their separate storage rules.
Mirror the check in ocfs2_filecheck_validate_inode_block() as well.
filecheck reports through its own error namespace, so malformed
size/cluster state is logged as a filecheck invalid-inode result rather
than via ocfs2_error(), but it must not proceed into
ocfs2_populate_inode().
Link: https://lore.kernel.org/20260519110404.1803902-4-michael.bommarito@gmail.com
Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://sashiko.dev/#/patchset/20260517111015.3187935-1-michael.bommarito%40gmail.com
Assisted-by: Claude:claude-opus-4-7
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
id1.dev1.i_rdev is the device-number arm of the ocfs2_dinode id1 union.
It is only meaningful for character and block device inodes. For any
other user-visible file type the on-disk value must be zero.
ocfs2_populate_inode() currently copies id1.dev1.i_rdev into inode->i_rdev
before the S_IFMT switch decides whether the inode is a special file. A
non-device inode with a non-zero i_rdev can therefore publish stale or
attacker-controlled device state into the in-core inode.
System inodes legitimately use other arms of the same union, so keep the
cross-check restricted to non-system inodes. Factor that predicate into a
helper and use it in both the normal validator and online filecheck path;
filecheck reports the malformed dinode through
OCFS2_FILECHECK_ERR_INVALIDINO instead of ocfs2_error().
Link: https://lore.kernel.org/20260519110404.1803902-3-michael.bommarito@gmail.com
Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "ocfs2: harden inode validators against forged metadata", v2.
This series adds three structural checks to OCFS2 dinode validation so
malformed on-disk fields are rejected before ocfs2_populate_inode() copies
them into the in-core inode.
The checks cover:
- i_mode values whose type bits do not name a canonical POSIX file
type;
- non-device dinodes whose id1.dev1.i_rdev field is non-zero; and
- non-inline dinodes that claim non-zero i_size while i_clusters is
zero, covering directories unconditionally and regular files on
non-sparse volumes.
The normal read path reports these through ocfs2_error(), matching the
existing suballoc-slot, inline-data, chain-list, and refcount checks. The
online filecheck path uses the same structural predicates but keeps its
own reporting contract, returning OCFS2_FILECHECK_ERR_INVALIDINO instead
of calling ocfs2_error().
This patch (of 3):
ocfs2_validate_inode_block() currently accepts any non-zero i_mode value.
ocfs2_populate_inode() then copies that mode verbatim into inode->i_mode
and dispatches on i_mode & S_IFMT to the file/dir/symlink/special_file
iops; an unrecognised type falls through to ocfs2_special_file_iops and
init_special_inode().
Reject dinodes whose type bits do not name one of the seven canonical
POSIX file types. Use fs_umode_to_ftype(), the same generic file-type
conversion helper OCFS2 already uses for directory entries, so the
accepted inode type set matches the kernel file-type vocabulary instead of
open-coding a local switch.
Apply the same structural check to the online filecheck read path.
filecheck keeps its own error namespace, so it reports malformed i_mode
through the filecheck logger and OCFS2_FILECHECK_ERR_INVALIDINO instead of
calling ocfs2_error(), but it must not allow a malformed dinode to proceed
into ocfs2_populate_inode().
Link: https://lore.kernel.org/20260519110404.1803902-1-michael.bommarito@gmail.com
Link: https://lore.kernel.org/20260519110404.1803902-2-michael.bommarito@gmail.com
Fixes: b657c95c1108 ("ocfs2: Wrap inode block reads in a dedicated function.")
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Link: https://sashiko.dev/#/patchset/20260517111015.3187935-1-michael.bommarito%40gmail.com
Assisted-by: Claude:claude-opus-4-7
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding
mutex lock") tried to avoid a refcount leak caused by allowing multiple
threads to call igrab(inode). But addition of osb->system_file_mutex made
locking dependency complicated and is causing lockdep to warn about
possibility of AB-BA deadlock.
Since _ocfs2_get_system_file_inode() returns the same inode for the same
input arguments, we don't need to serialize
_ocfs2_get_system_file_inode(). What we need to make sure is that
igrab(inode) is called for only once(). Therefore, replace
osb->system_file_mutex with cmpxchg()-based locking.
Link: https://lore.kernel.org/fea8d1fd-afb0-4302-a560-c202e2ef7afd@I-love.SAKURA.ne.jp
Fixes: 43b10a20372d ("ocfs2: avoid system inode ref confusion by adding mutex lock")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Stop directly calling into function pointers from users of the RAID6 PQ
API, and provide exported functions with proper documentation and API
guarantees asserts where applicable instead.
Link: https://lore.kernel.org/20260518051804.462141-8-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org> # kunit only on arm64
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Borislav Petkov (AMD)" <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Nan <linan122@huawei.com>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Song Liu <song@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
[BUG]
A corrupt inline xattr header can make ocfs2_reflink_xattr_inline() lock,
copy, and reflink xattr state from an unchecked ibody xattr header.
[CAUSE]
The inline reflink path still trusted di->i_xattr_inline_size to compute
header_off, xh, and new_xh before handing the source header to the reflink
allocator and copy logic.
[FIX]
Validate the source inode's inline xattr header with the shared helper
first, then derive the reflink copy offsets from the validated inline
size/header. This keeps the reflink path from traversing corrupt ibody
xattr geometry.
Link: https://lore.kernel.org/20260508085914.61647-6-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
[BUG]
A corrupt inline xattr header can make ocfs2_xattr_inline_attach_refcount()
feed an unchecked header into the refcount-attachment walk for inline
xattr values.
[CAUSE]
The inline refcount-attach path still derived the header directly from
di->i_xattr_inline_size and then passed it to code that iterates xh_count
and xattr entries.
[FIX]
Use the shared ibody header helper before attaching refcounts to inline
xattr values so corrupt header geometry is rejected with -EFSCORRUPTED
instead of being traversed.
Link: https://lore.kernel.org/20260508085914.61647-5-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
[BUG]
A corrupt inline xattr header can make ocfs2_xattr_ibody_remove() pass an
unchecked header into ocfs2_remove_value_outside() during inode xattr
teardown.
[CAUSE]
ocfs2_xattr_ibody_remove() still rebuilt the ibody xattr header directly
from di->i_xattr_inline_size and then handed it to code that iterates
xh_count and entry geometry.
[FIX]
Validate the inline xattr header with the shared helper before handing it
to the outside-value removal path, and propagate -EFSCORRUPTED on bad
metadata instead of traversing the unchecked header.
Link: https://lore.kernel.org/20260508085914.61647-4-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
[BUG]
A corrupt inline xattr header can make
ocfs2_has_inline_xattr_value_outside() walk xh_count from an unchecked
header while refcount-tree teardown decides whether inline xattrs still
point outside the inode body.
[CAUSE]
ocfs2_has_inline_xattr_value_outside() still computed the inline header
directly from di->i_xattr_inline_size and immediately iterated xh_count.
That is the same unchecked metadata boundary as the ibody lookup bug.
[FIX]
Reuse the shared inline-header helper before iterating xh_count. Because
this helper returns a boolean-style answer to its caller, treat a corrupt
header conservatively as "has outside values" instead of walking it.
Link: https://lore.kernel.org/20260508085914.61647-3-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "ocfs2: validate inline xattr header consumers".
Corrupt i_xattr_inline_size can move the computed inode-body xattr header
outside the dinode block. Several OCFS2 paths then trust xh_count or
xattr entry geometry from that unchecked header.
The reported KASAN splat hits the ibody lookup path:
BUG: KASAN: use-after-free in ocfs2_xattr_find_entry+0x37b/0x3a0
ocfs2_xattr_ibody_get()
ocfs2_xattr_get_nolock()
ocfs2_calc_xattr_init()
The same unchecked header derivation also exists in the outside-value
probe, ibody remove, inline refcount attach, and inline reflink paths.
This series factors the existing ibody list validation into a shared
helper and then converts the remaining inline-header consumers one at a
time.
Patch layout:
1. validate ibody get/find and reuse the helper in ibody list
2. validate the outside-value probe
3. validate ibody remove
4. validate inline refcount attach
5. validate inline reflink
This patch (of 5):
[BUG]
mknodat() can read past the end of a dinode block when ACL inheritance
walks a corrupted inode-body xattr header. Another report shows the same
unchecked lookup later faulting in the VFS open path after create
returns a garbage status.
KASAN: use-after-free in
ocfs2_xattr_find_entry+0x37b/0x3a0 fs/ocfs2/xattr.c:1078
Read of size 2 at addr ffff88801c520300 by task syz.0.10/360
Trace:
...
ocfs2_xattr_find_entry+0x37b/0x3a0 fs/ocfs2/xattr.c:1078
ocfs2_xattr_ibody_get fs/ocfs2/xattr.c:1178 [inline]
ocfs2_xattr_get_nolock+0x2ee/0x1110 fs/ocfs2/xattr.c:1309
ocfs2_calc_xattr_init+0x716/0xac0 fs/ocfs2/xattr.c:628
ocfs2_mknod+0x935/0x2400 fs/ocfs2/namei.c:333
ocfs2_create+0x158/0x390 fs/ocfs2/namei.c:676
vfs_create fs/namei.c:3493 [inline]
vfs_create+0x445/0x6f0 fs/namei.c:3477
do_mknodat+0x2d8/0x5e0 fs/namei.c:4372
__do_sys_mknodat fs/namei.c:4400 [inline]
__se_sys_mknodat fs/namei.c:4397 [inline]
__x64_sys_mknodat+0xb6/0xf0 fs/namei.c:4397
...
Another report:
BUG: unable to handle page fault for address: fffffbfff3e40ec0
RIP: 0010:__d_entry_type include/linux/dcache.h:414 [inline]
RIP: 0010:d_can_lookup include/linux/dcache.h:429 [inline]
RIP: 0010:d_is_dir include/linux/dcache.h:439 [inline]
RIP: 0010:path_openat+0xe2f/0x2ce0 fs/namei.c:4134
Trace:
...
do_filp_open+0x1f6/0x430 fs/namei.c:4161
do_sys_openat2+0x117/0x1c0 fs/open.c:1437
__x64_sys_openat+0x15b/0x220 fs/open.c:1463
...
[CAUSE]
ocfs2_xattr_ibody_list() already validates the inline xattr size and
entry count, but ocfs2_xattr_ibody_get() and ocfs2_xattr_ibody_find()
still derive the inline header directly from di->i_xattr_inline_size and
then trust xh_count. A corrupted inline size or entry count can therefore
move the computed header outside the dinode block before get/find start
walking it. That can either make ocfs2_xattr_find_entry() dereference
xs->header->xh_count outside the block or make ocfs2_xattr_get_nolock()
bubble a garbage status back through ocfs2_calc_xattr_init() into the
create/open path.
[FIX]
Factor the existing ibody header geometry checks into a shared helper.
Use it in ocfs2_xattr_ibody_get() and ocfs2_xattr_ibody_find(), and have
ocfs2_xattr_ibody_list() reuse the same helper instead of open-coding
the validation. Reject corrupt ibody metadata with -EFSCORRUPTED before
the lookup path can walk bogus xattr geometry or return a garbage status.
Link: https://lore.kernel.org/20260508085914.61647-1-gality369@gmail.com
Link: https://lore.kernel.org/20260508085914.61647-2-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
[BUG]
A fuzzed OCFS2 image can corrupt the current slot journal dinode while
mount is still in progress. The mount path first reports the invalid
journal block and then crashes in shutdown:
kernel BUG at fs/ocfs2/journal.c:1034!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034
Call Trace:
ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116
ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline]
ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083
get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698
get_tree_bdev+0x24/0x40 fs/super.c:1721
ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184
vfs_get_tree+0x9a/0x370 fs/super.c:1758
fc_mount fs/namespace.c:1199 [inline]
do_new_mount_fc fs/namespace.c:3642 [inline]
do_new_mount fs/namespace.c:3718 [inline]
path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
do_mount fs/namespace.c:4041 [inline]
__do_sys_mount fs/namespace.c:4229 [inline]
__se_sys_mount fs/namespace.c:4206 [inline]
__x64_sys_mount+0x282/0x320 fs/namespace.c:4206
...
[CAUSE]
ocfs2_journal_toggle_dirty() used to return -EIO when journal->j_bh no
longer contained a valid dinode, because the startup and shutdown paths
already handled that failure. Commit 10995aa2451a
("ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks.") changed
the check to a BUG_ON() under the assumption that the journal dinode had
already been validated. That turns an unexpected invalid journal dinode
during mount teardown into a kernel crash instead of a normal mount
failure.
[FIX]
Replace the BUG_ON() with WARN_ON() and return -EIO. This keeps the
invariant warning for debugging, but restores the original behavior of
failing startup or shutdown cleanly instead of panicking the kernel.
Link: https://lore.kernel.org/20260512024115.4036371-1-gality369@gmail.com
Fixes: 10995aa2451a ("ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks.")
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
[BUG]
openat(..., O_WRONLY|O_CREAT|O_TRUNC) can hit:
kernel BUG at fs/ocfs2/file.c:454!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:ocfs2_truncate_file+0x1204/0x13c0 fs/ocfs2/file.c:454
Call Trace:
ocfs2_setattr+0xa6d/0x1fd0 fs/ocfs2/file.c:1212
notify_change+0x4b5/0x1030 fs/attr.c:546
do_truncate+0x1d2/0x230 fs/open.c:68
handle_truncate fs/namei.c:3596 [inline]
do_open fs/namei.c:3979 [inline]
path_openat+0x260f/0x2ce0 fs/namei.c:4134
do_filp_open+0x1f6/0x430 fs/namei.c:4161
do_sys_openat2+0x117/0x1c0 fs/open.c:1437
do_sys_open fs/open.c:1452 [inline]
__do_sys_openat fs/open.c:1468 [inline]
__se_sys_openat fs/open.c:1463 [inline]
__x64_sys_openat+0x15b/0x220 fs/open.c:1463
...
[CAUSE]
ocfs2_truncate_file() treats di_bh->i_size matching inode->i_size as an
internal code invariant and BUGs if it is broken.
That assumption is too strong for corrupted metadata. The dinode block can
still be structurally valid enough to pass ocfs2_read_inode_block() while
no longer matching an already-instantiated VFS inode. On local mounts,
ocfs2_inode_lock_update() skips refresh entirely, so truncate can
observe the mismatch directly and crash instead of rejecting the
corruption.
[FIX]
Turn the BUG_ON into normal OCFS2 corruption handling. If truncate sees
di_bh->i_size disagree with inode->i_size, report it with ocfs2_error() and
abort before touching truncate state.
This keeps the fix at the first boundary that actually requires the
sizes to match and avoids widening checks into hotter generic
inode-lock paths
Link: https://lore.kernel.org/20260512021601.3936417-1-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Clean up inconsistent indentation (mixing tabs and spaces) and remove
extraneous whitespace in several Kconfig files across the tree. This is a
purely cosmetic change to improve readability.
Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:
$ sed -e 's/^ /\t/' -i */Kconfig
Link: https://lore.kernel.org/20260407053945.14116-1-linux.amoon@gmail.com
Signed-off-by: Anand Moon <linux.amoon@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> [fs]
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> [mm]
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> [mm]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add KUnit coverage for fat_clus_to_blknr() and fat_get_blknr_offset()
using stub msdos_sb_info values so cluster-to-sector and i_pos split math
stays correct.
Link: https://lore.kernel.org/20260405011920.28622-1-adinata.softwareengineer@gmail.com
Signed-off-by: Adi Nata <adinata.softwareengineer@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
ocfs2 quota recovery allocates a bitmap buffer with kmalloc and does not
fully initialize it. This can lead to use of uninitialized bits during
quota recovery from a corrupted filesystem image.
Use kzalloc instead to ensure the bitmap is zero-initialized.
Link: https://lore.kernel.org/20260418131048.1052507-1-tristmd@gmail.com
Reported-by: syzbot+7ea0b96c4ddb49fd1a70@syzkaller.appspotmail.com
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace strlen(fn) with strnlen(fn, NAME_MAX + 1) when validating the
final path component in __proc_create().
This preserves the existing name limit while bounding the length scan to
one byte past the maximum name length. Handle empty names separately, and
treat names longer than NAME_MAX as too long.
Link: https://lore.kernel.org/20260421122648.56723-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: wangzijie <wangzijie1@honor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
* deduplicate "iter.tgid += 1" line,
Right now it is done once inside next_tgid() itself and second time
inside "for" loop.
* deduplicate next_tgid() call itself with different loop style:
auto it = make_tgid_iter();
while (next_tgid(&it)) {
...
}
gcc seems to inline it twice:
$ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
add/remove: 0/1 grow/shrink: 1/0 up/down: 100/-245 (-145)
Function old new delta
proc_pid_readdir 531 631 +100
next_tgid 245 - -245
* make tgid_iter.pid_ns const
it never changes during readdir anyway
[akpm@linux-foundation.org: remove newline]
Link: https://lore.kernel.org/20260422191745.435556-2-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
next_tgid() accepts pid namespace as an argument, but it never changes
during readdir (which would be unthinkable thing to do anyway).
Move it inside iterator type and hide from direct usage.
Link: https://lore.kernel.org/20260422191745.435556-1-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, when cache=loose is enabled, file reads are cached in the
page cache, but symlink reads are not. This patch allows the results
of p9_client_readlink() to be stored in the page cache, eliminating
the need for repeated 9P transactions on subsequent symlink accesses.
This change improves performance for workloads that involve frequent
symlink resolution.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <982462d17c0c0d2856763266a25eb04d080c1dbb.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
For cache=loose mounts, set the default negative dentry cache retention
time to 24 hours.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <b5beca3e70890ab8a4f0b9e99bd69cb97f5cb9eb.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Introduce a new mount option, negtimeout, for v9fs that allows users
to specify how long negative dentries are retained in the cache. The
retention time can be set in milliseconds (e.g. negtimeout=10000 for
a 10secs retention time) or a negative value (e.g. negtimeout=-1) to
keep negative entries until the buffer cache management removes them.
For consistency reasons, this option should only be used in exclusive
or read-only mount scenarios, aligning with the cache=loose usage.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <b2d66500aa5a2f6540347c4aa46a4be10dd01bc6.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Not caching negative dentries can result in poor performance for
workloads that repeatedly look up non-existent paths. Each such
lookup triggers a full 9P transaction with the server, adding
unnecessary overhead.
A typical example is source compilation, where multiple cc1 processes
are spawned and repeatedly search for the same missing header files
over and over again.
This change enables caching of negative dentries, so that lookups for
known non-existent paths do not require a full 9P transaction. The
cached negative dentries are retained for a configurable duration
(expressed in milliseconds), as specified by the ndentry_timeout
field in struct v9fs_session_info. If set to -1, negative dentries
are cached indefinitely.
This optimization reduces lookup overhead and improves performance for
workloads involving frequent access to non-existent paths.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <e542317dd03bbadb5249abd3ea6aecfdca692c19.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
When mkdir succeeds, v9fs_vfs_mkdir_dotl() and v9fs_vfs_mkdir() return
ERR_PTR(0) which is incorrect. They should return NULL instead for
success and ERR_PTR() only with negative error codes for failure.
Return NULL instead of passing to ERR_PTR while err is zero
Fixes smatch warnings:
fs/9p/vfs_inode_dotl.c:420 v9fs_vfs_mkdir_dotl() warn: passing zero to 'ERR_PTR'
fs/9p/vfs_inode.c:695 v9fs_vfs_mkdir() warn: passing zero to 'ERR_PTR'
The v9fs_vfs_mkdir() code was further simplified because v9fs_create()
can never return NULL, so we do not need to check for fid being set
separately, and the error path can be a simple return immediately after
v9fs_create() failure.
There is no intended functional change.
Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
Suggested-by: David Laight <david.laight.linux@gmail.com>
Acked-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
Message-ID: <20260520022650.14217-1-zenghongling@kylinos.cn>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
* dev:
crypto: pkcs7: export verify_pkcs7_message_sig() as EXPORT_SYMBOL_GPL
ipe: restore the kdoc comments for evaluate_property()
hornet: depend on CONFIG_SECURITY and CONFIG_BPF_SYSCALL
ipe: Add BPF program load policy enforcement via Hornet integration
selftests/hornet: Add a selftest for the Hornet LSM
hornet: Add a light skeleton data extractor scripts
hornet: Introduce gen_sig
lsm: introduce the Hornet LSM
lsm: add additional enum values for bpf integrity checks
lsm: framework for BPF integrity verification
crypto: pkcs7: add tests for pkcs7_get_authattr
crypto: pkcs7: add ability to extract signed attributes by OID
crypto: pkcs7: add flag for validated trust on a signed info block
security,fs,nfs,net: update security_inode_listsecurity() interface
|
|
Commit 8a81f16de64f ("NFSD: Add a "default" block size") introduced
NFSSVC_DEFBLKSIZE at 1MB, well below the 4MB NFSSVC_MAXBLKSIZE
ceiling, with the stated intent that a later change would raise the
default.
Raising the default reduces per-RPC overhead on fast networks by
amortizing header processing and scheduling costs across larger
payloads. The halving loop in nfsd_get_default_max_blksize()
constrains the returned value to 1/4096 of available RAM, so the
new 4MB default takes effect only on systems with at least 16GB of
RAM. Smaller machines continue to receive the same computed value
as before. Administrators can still override the computed value
through /proc/fs/nfsd/max_block_size.
On systems where the new default takes effect,
svc_sock_setbufsize() sizes each service socket's send and receive
buffers as nreqs * max_mesg * 2. Quadrupling max_mesg therefore
quadruples the per-socket buffer reservation at a fixed thread
count, which operators tuning large thread pools should account
for.
Note well: Your NFS client implementation must support large read
and write size settings to benefit from this change.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Roland Mainz <roland.mainz@nrubsig.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When NFSD_CMD_UNLOCK_EXPORT revokes NFSv4 state for an export path,
GC-managed nfsd_file entries for files under that path may remain
in the file cache. These cached handles hold the underlying
filesystem busy, preventing a subsequent unmount.
Add nfsd_file_close_export(), which walks the nfsd_file hash table
and closes GC-eligible entries whose underlying file resides on the
same filesystem and is a descendant of the export path. Because
nfsd_file entries do not carry an export reference, the ancestry
check uses is_subdir() on the file's dentry. False positives --
closing a cached handle that did not originate from the target
export -- are harmless; the handle is simply reopened on the next
access.
The handler calls nfsd_file_close_export() before revoking NFSv4
state, mirroring the order used by NFSD_CMD_UNLOCK_FILESYSTEM
(which cancels copies and releases NLM locks before revoking
state). Both calls run under nfsd_mutex.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When a filesystem is exported to NFS clients, NFSv4 state
(opens, locks, delegations, layouts) holds references that
prevent the underlying filesystem from being unmounted.
NFSD_CMD_UNLOCK_FILESYSTEM addresses this at superblock
granularity, but administrators unexporting a single path on a
shared filesystem (e.g., one of several exports on the same device)
need finer control.
Add NFSD_CMD_UNLOCK_EXPORT, which revokes NFSv4 state acquired
through exports of a specific path. Matching is by path identity
(dentry + vfsmount) via the sc_export field on each nfs4_stid,
so multiple svc_export objects for the same path -- one per
auth_domain -- are handled correctly without requiring the caller
to name a specific client.
The command takes a single "path" attribute. Userspace (exportfs
-u) sends this after removing the last client for a given path,
enabling the underlying filesystem to be unmounted. When multiple
clients share an export path, individual unexports do not trigger
state revocation; only the final one does.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add an sc_export field to struct nfs4_stid so that each stateid
records the export under which it was acquired. The export
reference is taken via exp_get() at stateid creation and released
via exp_put() in nfs4_put_stid().
Open stateids record the export from current_fh->fh_export.
Lock stateids and delegations inherit the export from their
parent open stateid. Layout stateids inherit from their
parent stateid. Directory delegations record the export from
cstate->current_fh.
A subsequent commit uses sc_export to scope state revocation to a
specific export, avoiding the need to walk inode dentry aliases at
revocation time.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Replace idr_for_each_entry_ul() with a while loop over
idr_get_next_ul() for consistency with find_one_export_stid(),
added in a subsequent commit.
No change in behavior.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add NFSD_CMD_UNLOCK_FILESYSTEM as a dedicated netlink command for
revoking NFS state under a filesystem path, providing a netlink
equivalent of /proc/fs/nfsd/unlock_fs.
The command requires a "path" string attribute containing the
filesystem path whose state should be released. The handler
resolves the path to its superblock, then cancels async copies,
releases NLM locks, and revokes NFSv4 state on that superblock.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The existing write_unlock_ip procfs interface releases NLM file
locks held by a specific client IP address, but procfs provides
no structured way to extend that operation to other scopes such as
revoking NFSv4 state.
Add NFSD_CMD_UNLOCK_IP as a dedicated netlink command for
releasing NLM locks by client address. The command accepts a
binary sockaddr_in or sockaddr_in6 in its address attribute.
The handler validates the address family and length, then calls
nlmsvc_unlock_all_by_ip() to release matching NLM locks. Because
lockd is a single global instance, that call operates across
all network namespaces regardless of which namespace the caller
inhabits.
A separate netlink command for filesystem-scoped unlock is added in
a subsequent commit.
The nfsd_ctl_unlock_ip tracepoint is updated from string-based
address logging to __sockaddr, which stores the binary sockaddr
and formats it with %pISpc. This affects both the new netlink path
and the existing procfs write_unlock_ip path, giving consistent
structured output in both cases.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The per-stateid revocation logic in nfsd4_revoke_states() handles
four stateid types in a deeply nested switch. Extract two helpers:
revoke_ol_stid() performs admin-revocation of an open or lock
stateid with st_mutex already held: marks the stateid as
SC_STATUS_ADMIN_REVOKED, closes POSIX locks for lock stateids,
and releases file access.
revoke_one_stid() dispatches by sc_type, acquires st_mutex with
the appropriate lockdep class for open and lock stateids, and
handles delegation unhash and layout close inline.
No functional change. Preparation for adding export-scoped state
revocation which reuses revoke_one_stid().
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
nfsd4_drop_revoked_stid() has no SC_TYPE_LAYOUT case, so when a
client sends FREE_STATEID for an admin-revoked layout stid, the
default branch releases cl_lock and returns without unhashing or
releasing the stid. The stid remains in the IDR and on the
per-client list until the client is destroyed.
Remove the layout stid from the per-client list and call
nfs4_put_stid() to drop the creation reference. When the
refcount reaches zero, nfsd4_free_layout_stateid() handles the
remaining cleanup: cancelling the fence worker, removing from
the per-file list, and freeing the slab object.
Fixes: 1e33e1414bec ("nfsd: allow layout state to be admin-revoked.")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
The new get-reqs dump operations added to sunrpc_cache.yaml and
nfsd.yaml place the "requests" nested attribute under dump.request.
A netlink dump carries an empty request; its payload travels back
in the reply. Because the spec names no reply attributes, the YNL
C code generator synthesizes a forward reference to a
<op>_rsp struct that is never defined, breaking any consumer of
these specs.
This first surfaced when Thorsten Leemhuis built tools/net/ynl
against -next:
nfsd-user.h:746: error: field 'obj' has incomplete type
struct nfsd_svc_export_get_reqs_rsp obj ...
nfsd-user.h:826: error: field 'obj' has incomplete type
struct nfsd_expkey_get_reqs_rsp obj ...
nfsd-user.c:1211: error: 'nfsd_svc_export_get_reqs_rsp_parse'
undeclared
sunrpc_cache.yaml has the same defect in ip-map-get-reqs and
unix-gid-get-reqs, but nfsd.yaml errors out first in the Makefile's
alphabetical build order and hides the sunrpc failures.
These bugs were introduced by incorrect merge conflict resolution.
Reported-by: Thorsten Leemhuis <linux@leemhuis.info>
Closes: https://lore.kernel.org/linux-nfs/f6a3ca6d-e5cb-4a5c-9af2-8d2b1ce33ef0@leemhuis.info/
Fixes: 1045ccf519ce30 ("sunrpc: add netlink upcall for the auth.unix.ip cache")
Tested-by: Thorsten Leemhuis <linux@leemhuis.info>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add a new NFSD_CMD_CACHE_FLUSH generic netlink command that allows
userspace to flush the nfsd export caches (svc_export and expkey)
without writing to /proc/net/rpc/*/flush.
An optional NFSD_A_CACHE_FLUSH_MASK u32 attribute selects which caches
to flush (bit 1 = svc_export, bit 2 = expkey). If the attribute is
omitted, all nfsd caches are flushed.
This is used by exportfs to replace its /proc-based cache_flush() with a
netlink equivalent, with /proc fallback for older kernels.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add netlink-based cache upcall support for the expkey (nfsd.fh) cache,
following the same pattern as the existing svc_export netlink support.
Add expkey to the cache-type enum, a new expkey attribute-set with
client, fsidtype, fsid, negative, expiry, and path fields, and the
expkey-get-reqs / expkey-set-reqs operations to the nfsd YAML spec
and generated headers.
Implement nfsd_nl_expkey_get_reqs_dumpit() which snapshots pending
expkey cache requests and sends each entry's seqno, client name,
fsidtype, and fsid over netlink.
Implement nfsd_nl_expkey_set_reqs_doit() which parses expkey cache
responses from userspace (client, fsidtype, fsid, expiry, and path
or negative flag) and updates the cache via svc_expkey_lookup() /
svc_expkey_update().
Wire up the expkey_notify() callback in svc_expkey_cache_template
so cache misses trigger NFSD_CMD_CACHE_NOTIFY multicast events with
NFSD_CACHE_TYPE_EXPKEY.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Add netlink-based cache upcall support for the svc_export (nfsd.export)
cache to Documentation/netlink/specs/nfsd.yaml and regenerate the
resulting files.
Implement nfsd_cache_notify() which sends a NFSD_CMD_CACHE_NOTIFY
multicast event to the "exportd" group, carrying the cache type so
userspace knows which cache has pending requests.
Implement nfsd_nl_svc_export_get_reqs_dumpit() which snapshots
pending svc_export cache requests and sends each entry's seqno,
client name, and path over netlink.
Implement nfsd_nl_svc_export_set_reqs_doit() which parses svc_export
cache responses from userspace (client, path, expiry, flags, anon
uid/gid, fslocations, uuid, secinfo, xprtsec, fsid, or negative
flag) and updates the cache via svc_export_lookup() /
svc_export_update().
Wire up the svc_export_notify() callback in svc_export_cache_template
so cache misses trigger NFSD_CMD_CACHE_NOTIFY multicast events with
NFSD_CACHE_TYPE_SVC_EXPORT.
Note that the export-flags and xprtsec-mode enums are organized to match
their counterparts in include/uapi/linux/nfsd/export.h. The intent is
that future export options will only be added to the netlink headers,
which should eliminate the need to keep so much in sync.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
This function doesn't have anything to do with a timeout. The only
difference is that it warns if there are no listeners. Rename it to
sunrpc_cache_upcall_warn().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
Since it will soon also send an upcall via netlink, if configured.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
It's not used outside of that file.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
When revoking delegation state, nfsd4_revoke_states() takes an extra
reference on the stid before calling unhash_delegation_locked(). If
unhash_delegation_locked() returns false (the delegation was already
unhashed by a concurrent path), dp is set to NULL and
revoke_delegation() is skipped, but the extra reference is never
released. Each occurrence permanently pins the stid in memory. The
leaked reference also prevents nfs4_put_stid() from decrementing
cl_admin_revoked, leaving the counter permanently inflated.
Drop the extra reference in the failure path.
Fixes: 8dd91e8d31fe ("nfsd: fix race between laundromat and free_stateid")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
'vfs/vfs-7.2.directory.delegations' and 'vfs/vfs-7.2.exportfs' into vfs-7.2-merge
|
|
iput() called from fuse_release_end() can Oops if the super block has
already been destroyed. Normally this is prevented by waiting for
num_waiting to go down to zero before commencing with super block shutdown.
This only works, however, for the last submount instance, as the wait
counter is per connection, not per superblock.
Revert to using synchronous release requests for the auto_submounts case,
which is virtiofs only at this time.
Reported-by: Aurélien Bombo <abombo@microsoft.com>
Cc: Greg Kurz <gkurz@redhat.com>
Closes: https://github.com/kata-containers/kata-containers/issues/12589
Fixes: 26e5c67deb2e ("fuse: fix livelock in synchronous file put from fuseblk workers")
Cc: stable@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The bounds check in ntfs_dir_emit() compares fname->name_len (a
character count) against e->size (a byte count) without accounting
for the 2-byte-per-character UTF-16LE encoding or the ATTR_FILE_NAME
header size:
if (fname->name_len + sizeof(struct NTFS_DE) > le16_to_cpu(e->size))
This computes: name_len + 16 > e_size
The correct check must account for the ATTR_FILE_NAME header (66 bytes
before the name) and the UTF-16LE character size (2 bytes each):
sizeof(NTFS_DE) + offsetof(ATTR_FILE_NAME, name) +
name_len * sizeof(short) > e_size
Which computes: 16 + 66 + name_len * 2 > e_size
The correct calculation already exists as fname_full_size() in ntfs.h
and is used in cmp_fnames(), namei.c, and fslog.c, but was not used
in the readdir path.
A crafted NTFS image with an index entry containing a small e->size
but large fname->name_len bypasses the current check, causing
ntfs_utf16_to_nls() to read past the entry boundary.
Additionally, add a key_size validation in hdr_find_e() to ensure the
declared key_size does not exceed the available entry data, preventing
comparison functions from reading past entry boundaries on the lookup
path.
Signed-off-by: Alessandro Schino <7991aleschino@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
On 64K page-size kernels, mounting NTFS volumes smaller than ~650 MB
fails with EINVAL. The issue is in log_replay(): the initial log page
size probe uses PAGE_SIZE (65536) instead of DefaultLogPageSize (4096)
when PAGE_SIZE exceeds DefaultLogPageSize * 2.
This makes norm_file_page() require the $LogFile to be at least
50 * 65536 = 3.2 MB, but mkfs.ntfs creates a $LogFile of only ~1.5 MB
for a typical 300 MB volume. norm_file_page() returns 0 and the mount
is rejected with EINVAL.
On 4K kernels the #if guard evaluates to true, so use_default=true is
passed and DefaultLogPageSize (4096) is used, requiring only ~200 KB.
This path works fine.
Fix this by always passing use_default=true, which forces the initial
probe to use DefaultLogPageSize regardless of the kernel's PAGE_SIZE.
This is safe because, after reading the on-disk restart area, log_replay()
already re-adjusts log->page_size to match the volume's actual
sys_page_size.
Also fix read_log_page() to pass log->page_size instead of PAGE_SIZE to
ntfs_fix_post_read(), matching the actual buffer size.
Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Tested-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Jamie Nguyen <jamien@nvidia.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
The ntfs3 specific -Wmaybe-uninitialized flag found one more false-postive,
this time with gcc-10 on s390:
fs/ntfs3/frecord.c: In function 'ni_expand_list':
fs/ntfs3/frecord.c:1370:16: error: 'ins_attr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Add an explicit NULL pointer check before using the pointer, and
initialize it to NULL.
Fixes: 48d9b57b169f ("fs/ntfs3: add a subset of W=1 warnings for stricter checks")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
This fixes a BUG reported in iomap_write_end_inline:
iomap_inline_data_valid checks that the inline_data fits within
a page. If the inline_data is allocated with kmemdup there's no
guarantee that it's page-aligned, so the check sometimes fails.
Allocate it with alloc_page to ensure it's page-aligned.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221446
Fixes: 099ef9a ("fs/ntfs3: implement iomap-based file operations")
Signed-off-by: Mihai Brodschi <m.brodschi@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
format code according to .clang-format, add useful comments and remove
non-useful comments.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Handle non-data/hole seeks through generic_file_llseek_size() and return
-ENXIO immediately when SEEK_DATA or SEEK_HOLE is requested at or past
EOF. Handle compressed files in such cases properly as well.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Remove the separate ntfs_extend() and ntfs_truncate() helpers and route
file size changes through ntfs_set_size().
This consolidates ntfs3 size updates in one place and lets the write,
fallocate, and setattr paths share the same logic for updating i_size,
valid data length, and preallocated extents.
This patch fixes a few issues found during internal tests.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
It makes ntfs3 wait for direct I/O completion before returning to the
caller, instead of allowing the write path to complete asynchronously.
The issue was discovered during internal tests.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Remove the separate ntfs_resident_writepage() helper and handle resident
writeback directly from ntfs_writepages(). This simplifies the resident
writeback path and keeps the folio handling local to ntfs_writepages().
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Introduce run_lookup_entry_da() to look up data runs while taking
delayed allocation into account.
ntfs3 may have both committed extents and delayed allocation extents for
the same VCN range. The new helper checks delayed allocation first and
falls back to the real run, then corrects the returned range when a real
run overlaps with a delayed allocation run.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Zero cached folios beyond the valid data length when closing a writable
mapping. This keeps cached data beyond initialized file contents zeroed
and prevents stale pagecache exposure after mmap-based writes.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Implement fileattr_get() and fileattr_set() to fix a problem found
during the internal testing.
This allows ntfs3 to expose and modify inode flags through the generic
file attribute interface used by FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
|
|
Return the result of calling fanotify_fsid_equal() directly to simplify
the code.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://patch.msgid.link/20260527142233.1256340-3-thorsten.blum@linux.dev
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Update a comment to refer to folios instead of pages.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xfs_mountfs currently ignores all errors from xfs_fs_reserve_ag_blocks,
which can lead to the mount path continuing on corruption errors.
Fix the check to only ignore -ENOSPC as in other callers, and unwind for
all other errors.
Fixes: 81ed94751b15 ("xfs: fix log intent recovery ENOSPC shutdowns when inactivating inodes")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Sticks out a bit better if we add a separate helper for it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Keep the rtgroup reference until after reporting the write pointer, as
that uses it. Right now this is not a major issue as we don't support
shrinking file systems in a way that makes RTGs go away, but let's stick
to the proper reference counting to prepare for that.
Fixes: c6ce65cb17aa ("xfs: add write pointer to xfs_rtgroup_geometry")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xrep_cow_find_bad_rt() initializes scrub rtgroup state before the
force-rebuild path calls xrep_cow_mark_file_range(). If that call
fails, the code jumps directly to out_rtg, which skips the scrub
rtgroup cleanup and only drops the local rtgroup reference.
Remove the unnecessary jump so the function falls through to out_sr,
ensuring the realtime cursors, lock state, and sr->rtg reference are
released before returning.
Fixes: fd97fe111208 ("xfs: fix CoW forks for realtime files")
Cc: <stable@vger.kernel.org> # v6.14
Signed-off-by: Yingjie Gao <gaoyingjie@uniontech.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xrep_cow_find_bad() returns success after the cleanup labels even if
AG setup, btree queries, or bitmap updates failed. This can make
repair continue with an incomplete bad-file-offset bitmap instead of
stopping at the original error.
The force-rebuild path has a related cleanup problem. If
xrep_cow_mark_file_range() fails, the function returns directly and
skips the scrub AG context and perag cleanup.
Let the force-rebuild path fall through to the existing cleanup code
and return the saved error after cleanup.
Fixes: dbbdbd008632 ("xfs: repair problems in CoW forks")
Cc: <stable@vger.kernel.org> # v6.8
Signed-off-by: Yingjie Gao <gaoyingjie@uniontech.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
chmod and chgrp were being ignored when mouting with the SMB3.1.1
POSIX Extensions. Add support for chmod and chgrp when mounting
with the SMB3.1.1 POSIX Extensions.
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
parse_sid() only verified that the fixed SID header fit in the
returned security descriptor, but did not verify that the full SID
body described by num_subauth was present.
A malicious server can return a truncated owner or group SID whose
header lies within the descriptor buffer while sub_auth[] extends
past the end of the allocation, leading to an out-of-bounds read
when the client later parses or copies that SID.
Validate the full SID body in parse_sid(), centralize owner/group SID
lookup and bounds checking in sid_from_sd(), and use that validation
in parse_sec_desc(), build_sec_desc(), and copy_sec_desc() before
sub_auth[] is accessed.
Signed-off-by: Qihang <q.h.hack.winter@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cifs_swn_notify() looks up a witness registration by id under
cifs_swnreg_idr_mutex, drops the mutex, and then uses the registration's
cached tcon pointer. That pointer is not a lifetime reference, and it is
not a stable representative once cifs_get_swn_reg() lets multiple tcons
for the same net/share name share one registration id.
A same-share second mount can keep the cifs_swn_reg alive after the first
tcon unregisters and is freed. The registration then still points at the
freed first tcon, so taking tc_lock or incrementing tc_count through
swnreg->tcon only moves the use-after-free earlier. Taking tc_lock while
holding cifs_swnreg_idr_mutex also violates the documented CIFS lock
order.
Fix this by making the registration store only the stable witness
identity: id, net name, share name, and notify flags. When a notify
arrives, copy that identity under cifs_swnreg_idr_mutex, drop the mutex,
then find and pin a live witness tcon that currently matches the net/share
pair under the normal cifs_tcp_ses_lock -> tc_lock order. The notification
path uses that pinned tcon directly and drops the reference when done.
Registration and unregister messages now use the live tcon passed by the
caller instead of a cached tcon in the registration. The final unregister
send is folded into cifs_swn_unregister() while the registration is still
protected by cifs_swnreg_idr_mutex. This removes the previous
find/drop/reacquire raw-pointer window. The release path only removes the
idr entry and frees the stable identity strings.
This preserves the intended one-registration/many-tcon behavior: a
registration id represents a net/share pair, and notify handling acts on a
live representative selected at use time. It also preserves CLIENT_MOVE
ordering for the representative tcon because the old-IP unregister is sent
before cifs_swn_register() sends the new-IP register.
Fixes: fed979a7e082 ("cifs: Set witness notification handler for messages from userspace daemon")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Cifs files may be put into fileinfo_put_wq during umounting cifs.
After umount done, cifsFileInfo_put_final is called, which cause
following BUG:
BUG: kernel NULL pointer dereference, address: 0000000000000000
...
[ 134.222152] list_lru_add+0x64/0x1a0
[ 134.222399] ? cifs_put_tcon+0x171/0x340 [cifs]
[ 134.222772] d_lru_add+0x44/0x60
[ 134.222997] dput+0x1fc/0x210
[ 134.223213] cifsFileInfo_put_final+0x11a/0x140 [cifs]
[ 134.223576] process_one_work+0x17c/0x320
[ 134.223843] worker_thread+0x188/0x280
[ 134.224084] ? __pfx_worker_thread+0x10/0x10
[ 134.224366] kthread+0xcc/0x100
[ 134.224576] ? __pfx_kthread+0x10/0x10
[ 134.224827] ret_from_fork+0x30/0x50
[ 134.225063] ? __pfx_kthread+0x10/0x10
[ 134.225328] ret_from_fork_asm+0x1b/0x30
This can be reproduce by following:
unshare -n bash -c "
mkdir -p ${CIFS_MNT}
ip netns attach root 1
ip link add eth0 type veth peer veth0 netns root
ip link set eth0 up
ip -n root link set veth0 up
ip addr add 192.168.0.2/24 dev eth0
ip -n root addr add 192.168.0.1/24 dev veth0
ip route add default via 192.168.0.1 dev eth0
ip netns exec root sysctl net.ipv4.ip_forward=1
ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o
${DEV} -j MASQUERADE
mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o
vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1
touch ${CIFS_MNT}/a.txt
ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o
${DEV} -j MASQUERADE
"
umount ${CIFS_MNT}
Fixes: 340cea84f691 ("cifs: open files should not hold ref on superblock")
Signed-off-by: Jian Zhang <zhangjian496@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Apply conflicting option validation consistently across all the new
mount API paths, for both mount and remount.
Some checks were only applied during initial mount validation, while
others were handled during option parsing, causing mount and
remount/reconfigure to behave differently.
Move the conflicting option checks into smb3_handle_conflicting_options()
and call it from the common validation paths, including for
multichannel/max_channels handling.
Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api")
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Today we do not invalidate the cached_dirent or the entire
parent cfid when a dentry in a dir has been removed/moved.
This change invalidates the parent cfid so that we don't serve
directory contents from the cache.
Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
compiling with W=2 pointed out that "written may be used uninitialized"
Fixes: 20d72b00ca81 ("netfs: Fix the request's work item to not require a ref")
Cc: stable@vger.kernel.org
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
cifs_copy_folioq_to_iter() copies a requested number of bytes from
a folio queue into the destination iterator. Since the encrypted
SMB2 READ path was changed to pass the server-declared payload
length (data_len) instead of the larger folioq buffer length, the
caller can ask for fewer bytes than the folio queue holds.
In that case the helper continues walking the remaining folios after
data_size has reached zero and calls copy_folio_to_iter() with
len = 0, which is unnecessary work.
The helper also returns 0 (success) when the folio queue is
exhausted before data_size bytes have been copied. The caller has
no way to distinguish that from a full copy and the reported
transfer count ends up larger than the amount of data placed in the
iterator.
Add an early exit when data_size reaches zero, and return an error
when the folio queue is exhausted before all requested bytes have
been copied.
Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
There is a spelling mistake in a ntfs_error message. Fix it.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
FSCTL_SET_SPARSE
FSCTL_SET_SPARSE in fsctl_set_sparse() modifies the file's sparse
attribute and saves it through xattr without any permission checks.
This exposes two issues:
1) A client on a read-only share can change the sparse attribute
on files it opened, even though the share is read-only.
Other FSCTL write operations already check
test_tree_conn_flag(work->tcon, KSMBD_TREE_CONN_FLAG_WRITABLE),
but FSCTL_SET_SPARSE does not.
2) Even on writable shares, clients without FILE_WRITE_DATA or
FILE_WRITE_ATTRIBUTES access should not modify the sparse
attribute. Similar handle-level checks exist in other functions
but are missing here.
Add both share-level writable check and per-handle access check.
Use goto out on error to avoid leaking file references.
Fixes: e2f34481b24d ("cifsd: add server-side procedures for SMB3")
Cc: Namjae Jeon <linkinjeon@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: Sean Shen <grayhat@foxmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
ksmbd_query_inode_status() and ksmbd_lookup_fd_inode() both take a
reference on a ksmbd_inode via __ksmbd_inode_lookup() (which performs
atomic_inc_not_zero()) and later release it using a bare
atomic_dec(&ci->m_count). Unlike ksmbd_inode_put(), a bare
atomic_dec() does not check whether the reference count has reached
zero, so if the caller happens to drop the last reference, the
ksmbd_inode is leaked: it stays in the global inode hash table with
m_count == 0, future __ksmbd_inode_lookup() calls reject it via
atomic_inc_not_zero(), and ksmbd_inode_free() is never invoked.
The race is:
T1: __ksmbd_inode_lookup() -> atomic_inc_not_zero(): m_count = 2
T2: ksmbd_inode_put() -> atomic_dec_and_test(): m_count = 1
(not freed)
T1: atomic_dec(&ci->m_count) -> m_count = 0
return (LEAK)
In ksmbd_lookup_fd_inode() the matched-fp path (which now also uses
ksmbd_inode_put()) cannot currently reach m_count == 0 because the
matched ksmbd_file holds its own reference on ci, but converting it to
the proper API keeps the three call sites consistent and avoids
future regressions if the locking changes.
Because ksmbd_inode_put() may free the ksmbd_inode if this drops the
last reference, the call must happen after up_read(&ci->m_lock) on the
two affected paths in ksmbd_lookup_fd_inode(). On the no-match path
this is a pure reordering; on the matched path ksmbd_fp_get() is
moved above the unlock so that the returned ksmbd_file is pinned
before the inode reference is released.
Signed-off-by: Aleksandr Golovnya <cofedish@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Commit d07b26f39246 ("ksmbd: require minimum ACE size in
smb_check_perm_dacl()") introduced a transposed bounds check:
if (offsetof(struct smb_ace, sid) + aces_size < CIFS_SID_BASE_SIZE)
Since offsetof(..sid) is 8 and CIFS_SID_BASE_SIZE is 8, this evaluates
to `aces_size < 0`. Because `aces_size` is always non-negative, this
check becomes dead code and never breaks the loop.
Worse, that commit removed the old 4-byte guard, meaning the loop now
reads `ace->size` (offset 2) even when `aces_size` is 0-3 bytes. This
re-opens a 2-byte heap out-of-bounds (OOB) read past the pntsd allocation
during subsequent SMB2_CREATE operations.
Fix this by properly transposing the comparison to require at least
16 bytes (8-byte offset + 8-byte SID base), matching the correct form
used in smb_inherit_dacl().
Fixes: d07b26f39246 ("ksmbd: require minimum ACE size in smb_check_perm_dacl()")
Cc: stable@vger.kernel.org
Signed-off-by: Ali Ganiyev <ali.qaniyev@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
"Regressions:
- Tighten bounds checking for sunrpc cache hash tables
- Don't report key material in the ftrace log
Stable fix:
- Fix lockd's implementation of the NLM TEST procedure"
* tag 'nfsd-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
lockd: fix TEST handling when not all permissions are available.
NFSD: Report whether fh_key was actually updated
sunrpc: prevent out-of-bounds read in __cache_seq_start()
|
|
* for-7.2/block:
block: remove blkdev_write_begin() and blkdev_write_end()
mtip32xx: fix use-after-free on service thread failure
block: don't set BIO_QUIET for BLK_STS_AGAIN
direct-io: remove IOCB_NOWAIT support
block: Avoid mounting the bdev pseudo-filesystem in userspace
block: switch numa_node to int in blk_mq_hw_ctx and init_request
block: skip sync_blockdev() on surprise removal in bdev_mark_dead()
blk-mq: add tracepoint block_rq_tag_wait
block: partitions: fix of_node refcount leak in of_partition()
|
|
None of the file systems using the legacy direct I/O code actually sets
FMODE_NOWAIT, and if they did this would not work, as the write locking
could not handle the retry. Remove this dead code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://patch.msgid.link/20260518063336.507369-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The f_fsid was originally derived from fs_devices->fsid and the
subvolume root ID. However, when temp_fsid is active, fs_devices->fsid
is randomized, making the standard derivation inconsistent.
Since metadata_uuid is optional, it is not a reliable alternative. This
patch instead retrieves the on-disk UUID from fs_info->super_copy->fsid.
To prevent f_fsid collisions between original and cloned filesystems,
this implementation hashes the dev_t for single-device btrfs filesystems
to ensure uniqueness. This is limited to single-device filesystems as
cloned mounts are currently only supported for that configuration. Note
that f_fsid will change if the device is replaced.
Additionally, since the kernel cannot distinguish between the original
and the cloned filesystem, this new f_fsid derivation is applied to
both.
Link: https://lore.kernel.org/linux-btrfs/cover.1772095546.git.asj@kernel.org/
Link: https://lore.kernel.org/linux-btrfs/cover.1774092915.git.asj@kernel.org/
Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When mounting a cloned filesystem with a temporary fsuuid (temp_fsid),
layered modules like overlayfs require a persistent identifier.
While internal in-memory fs_devices->fsid must remain unique to
the kernel module, let s_uuid carry the original on-disk UUID.
Signed-off-by: Anand Jain <asj@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[MINOR PROBLEM]
When mounting a filesystem with a valid DEV_STATS item, we will always
update the DEV_STATS again in the next transaction commit, even if there
is no change the values.
[CAUSE]
During the mount, btrfs_device_init_dev_stats() will read out the
on-disk DEV_STATS item for each device.
Then it calls btrfs_dev_stat_set() to update the in-memory structure.
However btrfs_dev_stat_set() does not only set the dev stats value, but
also increase device->dev_stats_ccnt.
That member determines if we should update the device item at the next
transaction commit. Since we have called btrfs_dev_stat_set() for each
dev status member, dev_stats_ccnt will be non-zero and we will update
the dev stats item even it doesn't change at all.
[FIX]
Instead of using btrfs_dev_stat_set() for valid on-disk DEV_STATUS
values, directly call atomic_set() to set the in-memory values.
For other call sites, we still want to use btrfs_dev_stat_set() so that
we will force updating/creating the dev stats item.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[MINOR PROBLEM]
When adding a new btrfs device, the corresponding DEV_STATS item creation
can only triggered by a mount cycle if there is no other error
triggered:
# mkfs.btrfs -f $dev1 $mnt
# mount $dev1 $mnt
# btrfs dev add $dev2 $mnt
# sync
# btrfs ins dump-tree -t dev $dev1
device tree key (DEV_TREE ROOT_ITEM 0)
leaf 30588928 items 6 free space 15853 generation 9 owner DEV_TREE
item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40 <<<
persistent item objectid DEV_STATS offset 1
device stats
write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
item 1 key (1 DEV_EXTENT 13631488) itemoff 16195 itemsize 48
Only after a mount cycle and a new transaction, the DEV_STATS for devid
2 can show up:
# umount $mnt
# mount $dev1 $mnt
# touch $mnt
# sync
# btrfs ins dump-tree -t dev $dev1
device tree key (DEV_TREE ROOT_ITEM 0)
leaf 30605312 items 7 free space 15788 generation 10 owner DEV_TREE
item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
persistent item objectid DEV_STATS offset 1
device stats
write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
persistent item objectid DEV_STATS offset 2
device stats
write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
[CAUSE]
Btrfs only updates the DEV_STATS item when the device->dev_stats_ccnt
counter is not 0.
This is to reduce COW for the device tree. However that dev_stats_ccnt is
only increased at the following call sites:
- btrfs_dev_stat_inc()
This happens when some IO error happened.
- btrfs_dev_stat_read_and_reset()
This happens for GET_DEV_STATS ioctl with BTRFS_DEV_STATS_RESET flag.
- btrfs_dev_stat_set()
This happens inside btrfs_device_init_dev_stats().
So when a new device is added, its dev_stats_ccnt is just initialized to
0, and btrfs won't create nor update the corresponding DEV_STATS item at
all.
[ENHANCEMENT]
When a new device is added, also increase the dev_stats_ccnt by one.
This includes both device add ioctl and dev-replace.
This will force btrfs to create a new DEV_STATS item or update the
existing one with the correct values.
This not only makes the DEV_STATS creation early, but also prevents
old DEV_STATS left from older kernels to cause false alerts for the
newly added device.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[MINOR BUG]
The following script will cause DEV_STATS item to be left after the
corresponding device is removed:
# mkfs.btrfs -f $dev1
# mount $dev1 $mnt
# btrfs dev add $dev2 $mnt
# umount $mnt
## Without real errors, only at mount time btrfs will update
## dev->dev_stats_ccnt, thus we need a mount cycle to create the
## DEV_STATS item for the new device.
# mount $dev1 $mnt
# touch $mnt/foobar
# sync
# btrfs dev remove $dev2 $mnt
# umount $mnt
This will result the DEV_STATS item for devid 2 still left in device
tree:
device tree key (DEV_TREE ROOT_ITEM 0)
leaf 31064064 items 7 free space 15788 generation 18 owner DEV_TREE
leaf 31064064 flags 0x1(WRITTEN) backref revision 1
fs uuid 4bd853ed-f6ef-45fd-bbf1-1c3a2d9987cb
chunk uuid b496eab1-ec23-46b5-81c1-2f1b3503ca07
item 0 key (DEV_STATS PERSISTENT_ITEM 1) itemoff 16243 itemsize 40
persistent item objectid DEV_STATS offset 1
device stats
write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
item 1 key (DEV_STATS PERSISTENT_ITEM 2) itemoff 16203 itemsize 40
persistent item objectid DEV_STATS offset 2
device stats
write_errs 0 read_errs 0 flush_errs 0 corruption_errs 0 generation 0
This is not a huge problem, but if the existing DEV_STATS contains
errors, and a new device is added into the fs taking the old devid, then
after a mount cycle, the new device will suddenly inherit old errors
which can give false alerts.
[CAUSE]
Btrfs never has the ability to delete DEV_STATS items.
It either create a new one through update_dev_stat_item(), or read an
existing one through btrfs_device_init_dev_stats().
However update_dev_stat_item() is only called lazily, if a new device is
created and no new update to dev stats, then it will skip the update of
the on-disk item.
So if the old DEV_STATS item exists and a new device is added, and no
errors during the remaining operations, the old DEV_STATS will not be
updated.
Then at the next mount cycle, btrfs_device_init_dev_stats() is called at
mount time, which will read out the old records, causing false alerts to
the newly added device.
[FIX]
Manually remove the DEV_STATS item during btrfs_rm_device().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[MINOR PROBLEM]
When a running dev-replace hits some error for the target device (devid
0), there will be a DEV_STATS with error records created at the next
transaction commit.
Unfortunately that item will never to be deleted.
This means at the next dev-replace, if the replace is interrupted, then
at the next mount, the target device will suddenly inherit the old error
records from that DEV_STATS item, which can give some false alerts on
that device.
This shouldn't affect end users that much, as it requires all the
following conditions to be met, which is pretty rare:
- The initial dev-replace hits some error on the target device
E.g. write errors, but those errors itself is already a big problem
for a running replace.
This is required to create the DEV_STATS item in the first place.
- The next replace is interrupted
This is required to allow btrfs to read from the old records.
[CAUSE]
Btrfs just never deletes the DEV_STATS after a replace is finished.
[FIX]
Remove the DEV_STATS item for devid 0 after the replace is finished.
This is not going to completely fix the error, as we still have other
error paths, e.g. by somehow the fs flips RO and can not start a new
transaction for the DEV_STATS item removal.
But those corner cases will be addressed by later patches which provide
a more generic fix to DEV_STATS related problems.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
get_new_location() uses BUG_ON() to crash the kernel if the file extent
item it looks up has any of offset, compression, encryption, or
other_encoding set non-zero. The data reloc inode is only written by
relocation's own paths and the four fields are always 0 in what the
kernel writes:
- insert_prealloc_file_extent() memsets the stack item to zero and
only fills in type, disk_bytenr, disk_num_bytes and num_bytes, so
offset/compression/encryption/other_encoding stay 0.
- insert_ordered_extent_file_extent() copies oe->compress_type into
the file extent's compression field, but the data reloc inode is
created with BTRFS_INODE_NOCOMPRESS so compress_type is always 0;
encryption and other_encoding are reserved-and-zero in btrfs.
A non-zero value here means the leaf decoded from disk does not match
what the kernel wrote, i.e. on-disk corruption. A malformed image
reaches this code via balance and panics the kernel.
A previous attempt to enforce all four constraints in tree-checker's
check_extent_data_item() was merged as commit 7d0ee95979e9 ("btrfs:
validate data reloc tree file extent item members in tree-checker")
and then reverted by commit 1c034697fcaa after btrfs/061 produced
false positives on arm64 with 64K pages. The reason: relocation
writeback legitimately produces REG file_extent_items with offset != 0
in the data reloc tree. When an ordered extent covers only the back
portion of an underlying PREALLOC (num_bytes < ram_bytes on the input
file_extent), insert_ordered_extent_file_extent() inserts a REG with
offset = oe->offset
num_bytes = oe->num_bytes
ram_bytes preserved from the original PREALLOC,
and this item can reach disk if a transaction commit fires while it
is present in the leaf.
The four fields belong in different layers:
- compression, encryption and other_encoding are universal
invariants for every item in the data reloc tree, regardless of
cluster geometry. Enforce them in tree-checker's
check_extent_data_item() so a corrupt leaf is rejected at read
time.
- offset is only an invariant at the cluster-boundary keys that
get_new_location() searches (the key is computed as
src_disk_bytenr - reloc_block_group_start). Partial-PREALLOC
writebacks legitimately place REG items at non-boundary keys with
offset != 0; tree-checker cannot reject these. The cluster-
boundary item is always written by either
insert_prealloc_file_extent() (offset=0 by memset) or by the
front portion of a partial writeback (offset=0 by construction),
so a non-zero offset there is corruption.
Enforce the universal invariants in check_extent_data_item() with a
file_extent_err() rejection. Convert the BUG_ON() in
get_new_location() to a -EUCLEAN return paired with btrfs_print_leaf()
and btrfs_err() so the offending leaf is logged. The caller in
replace_file_extents() already handles non-zero returns from
get_new_location() by breaking out of the loop without aborting the
transaction.
Suggested-by: Qu Wenruo <wqu@suse.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reported-by: syzbot+3e20d8f3d41bac5dc9a2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=3e20d8f3d41bac5dc9a2
Signed-off-by: Teng Liu <27rabbitlt@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
should_nocow() reads inode->defrag_bytes without holding inode->lock,
while btrfs_set_delalloc_extent() and btrfs_clear_delalloc_extent()
update it under that spinlock.
This is a data race. The read is a quick check used to decide whether
to fall back to COW for a NOCOW inode: if defrag_bytes is non-zero and
the range is tagged EXTENT_DEFRAG, we force COW so that defragmentation
can rewrite the extent. Reading a stale value is harmless because:
- A missed increment may skip COW once, but the defrag pass will
redo the extent later.
- A stale non-zero may force an unnecessary COW, which is a minor
efficiency loss, not a correctness issue.
On 64-bit platforms an aligned u64 load is naturally atomic so tearing
cannot happen. On 32-bit platforms u64 may tear, but we only test for
zero vs non-zero, so the heuristic stays correct regardless. Use
data_race() annotation.
Fixes: 47059d930f0e ("Btrfs: make defragment work with nodatacow option")
Signed-off-by: Cen Zhang <zzzccc427@gmail.com>
[ Use data_race() instead of READ_ONCXE() ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "mm: introduce a new page type for page pool in page type"
mm/vmalloc: do not trigger BUG() on BH disabled context
MAINTAINERS, mailmap: change email for Eugen Hristev
mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
kernel/fork: validate exit_signal in kernel_clone()
mm: memcontrol: propagate NMI slab stats to memcg vmstats
mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
zram: fix use-after-free in zram_writeback_endio
memfd: deny writeable mappings when implying SEAL_WRITE
ipc: limit next_id allocation to the valid ID range
Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
MAINTAINERS: .mailmap: update after GEHC spin-off
|
|
The fs_path can use the auto freeing pattern and it's completely
contained in send. Define the freeing wrapper and add the cleanup
attributes.
Almost all conversions are straightforward, replacing goto with direct
return.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The qgroupid has a specific format, add common format specifier, similar
to what we have for checksums and keys.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a block device does not report a maximum number of open or active
zones, currently assign BTRFS_DEFAULT_MAX_ACTIVE_ZONES (128) to
the internal limit, if the device has more than
BTRFS_DEFAULT_MAX_ACTIVE_ZONES zones.
But if the device has less than BTRFS_DEFAULT_MAX_ACTIVE_ZONES the
internal max_active_zones limit will stay at 0, even if the device has
zone resource limits. Furthermore, if the device has a total number of
zones that is less than BTRFS_DEFAULT_MAX_ACTIVE_ZONE, max_active_zones
should be set to at most the number of zones.
Also move the max_active_zone calculation and setting into a dedicated
helper, to shrink btrfs_get_dev_zone_info().
Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is open-coded bvec_phys(), also remove direct use of bv_page.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Calling __free_page() on folio_page() happens to work today, but
won't always. Besides, it's far simpler to call folio_put().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_decompress_buf2page()"
It seems that af566bdaff54 was tested against a tree which did not
contain commit 12851bd921d4 ("fs: Turn page_offset() into a wrapper
around folio_pos()). Unfortunately it has a bug of its own; on 32-bit
systems, shifting by PAGE_SHIFT will overflow on files larger than 4GiB.
Since page_offset() is now fixed, just revert af566bdaff54.
Fixes: af566bdaff54 (btrfs: fix the file offset calculation inside btrfs_decompress_buf2page())
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Tested-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When performing data relocation on a zoned filesystem, BTRFS can deadlock
in handle_reserve_tickets(). The relocation process is waiting on a space
reservation ticket that can never be fulfilled, because the relocation
itself is the operation responsible for freeing up that space.
Fix this by introducing a new flush state,
BTRFS_RESERVE_FLUSH_ZONED_RELOCATION, specifically for data chunk
allocation during zoned relocation. Like
BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE, this state uses
priority_reclaim_data_space() instead of the normal flushing path, which
avoids re-entering the relocation code and breaking the deadlock cycle.
In btrfs_alloc_data_chunk_ondemand(), select this new flush state when the
inode belongs to a data relocation root on a zoned filesystem.
Fixes: e2a7fd22378f ("btrfs: zoned: add zone reclaim flush state for DATA space_info")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Don't account the free space in a data relocation space-info sub-group as
usable free space in statfs.
This is misleading as no user allocations can be made in this space-info
sub-group. It is only a target for relocation.
Fixes: f92ee31e031c ("btrfs: introduce btrfs_space_info sub-group")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When searching for a data relocation block-group on mount,
btrfs_zoned_reserve_data_reloc_bg() is looking for the first empty DATA
block-group. But it first checks if the block-group is empty and if yes
continues the search, and then checks if it is the first DATA block-group.
There is actually no point in looking for the second empty DATA block
group as new DATA allocations will just allocate a new chunk for it. Pick
the first DATA block-group without any allocations done and set it as
relocation block-group.
At first, the commit 694ce5e143d6 ("btrfs: zoned: reserve data_reloc
block group on mount") introduced the functionality. At that time, we
took second unused (used == 0) block group, as the first one might be a
block group used for normal data. Later, commit daa0fde32235 ("btrfs:
zoned: fix data relocation block group reservation") switched to look
for an empty block group (alloc_offset == 0). At this point, there is no
reason taking the second one anymore. So, this commit is fixing an issue
in commit daa0fde32235.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Document the purpose of the RECLAIM_ZONES flush state.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With all the previous preparations, it's finally time to enable the
huge folio support.
- The max folio size
Here we define BTRFS_MAX_FOLIO_SIZE, which is fixed at 2MiB.
This will ensure we have a large enough but not too large folio for
btrfs. This limit applies to all systems regardless of page size.
Then we also define BTRFS_MAX_BLOCKS_PER_FOLIO, which depends on
CONFIG_BTRFS_EXPERIMENTAL.
If it's an experimental build, BTRFS_MAX_BLOCKS_PER_FOLIO is 512,
otherwise it's BITS_PER_LONG.
The filemap max order will be calculated using both
BTRFS_MAX_FOLIO_SIZE and BTRFS_MAX_BLOCKS_PER_FOLIO.
E.g. for 64K page size with 64K fs block size, the limit will be
BTRFS_MAX_FOLIO_SIZE (2M), which limits the filemap max order to 5.
This will be lower than the old order (6), but folios larger than 2M
are rarely any better for IO performance. Meanwhile excessively large
folios can cause other problems like stalling the IO pipeline for too
long.
For 4K page size and 4K fs block size, the limit will be increased to
2M from the old 256K.
This new size is constrained by both BTRFS_MAX_FOLIO_SIZE (2M) and
BTRFS_MAX_BLOCKS_PER_FOLIO (512 * 4K), allowing x86_64 to achieve huge
folio support, and the filemap max order will be 9.
- btrfs_bio_ctrl::submit_bitmap
This will be enlarged to contain BTRFS_MAX_BLOCKS_PER_FOLIO bits, and
this will be on-stack memory.
This will increase on-stack memory usage by 56 bytes compared to the
baseline (before the first patch in the series).
- Local @delalloc_bitmap inside writepage_delalloc()
Unfortunately we cannot afford to handle an allocation error here, thus
again we use on-stack memory.
Thus this will increase on-stack memory usage by 56 bytes again.
So unfortunately this means during the delalloc window, the writeback path
will have +112 bytes on-stack memory usage, and for other cases the
writeback path will have +56 bytes on-stack memory usage.
The +56 bytes (btrfs_bio_ctrl::submit_bitmap) can be removed
after we have reworked the compression submission, so the current
on-stack submit_bitmap is mostly a workaround until then.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
fuse_ref_folio() unlocks the request but does not re-lock it before
returning. fuse_chan_abort() can end the request and the async end
callback (eg fuse_writepage_free()) can free the args while the
subsequent copy chain logic after fuse_ref_folio() accesses them,
leading to use-after-free issues.
Fix this by locking the request in fuse_ref_folio() before returning.
Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_try_move_folio() unlocks the request on entry but does not
re-lock it on the success path. This means fuse_chan_abort() can end the
request and free the fuse_io_args (eg fuse_readpages_end()) while the
subsequent copy chain logic after fuse_try_move_folio() accesses the
fuse_io_args, leading to use-after-free issues.
Fix this by calling lock_request() before replace_page_cache_folio().
This ensures the request is locked on the success path which will
prevent the fuse_io_args from being freed while the later copying logic
runs, and also ensures that the ap->folios[i]->mapping is never null
since ap->folios[i] will always point to the newfolio after
replace_page_cache_folio().
Fixes: ce534fb05292 ("fuse: allow splice to move pages")
Cc: stable@vger.kernel.org
Reported-by: Lei Lu <llfamsec@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Reading /proc/interrupts iterates over the interrupt number space one by
one and looks up the descriptors one by one. That's just a waste of time.
When CONFIG_GENERIC_IRQ_SHOW is enabled this can utilize the maple tree and
cache the descriptor pointer efficiently for the sequence file operations.
Implement a CONFIG_GENERIC_IRQ_SHOW specific version in the core code and
leave the fs/proc/ variant for the legacy architectures which ignore generic
code.
This reduces the time wasted for looking up the next record significantly.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Link: https://patch.msgid.link/20260517194932.165280601@kernel.org
|
|
The special treatment of these counts is just adding extra code for no real
value. The irq_stats mechanism allows to suppress output of counters, which
should never happen by default and provides a mechanism to enable them for
the rare case that they occur.
Move the IOAPIC misrouted and the PIC/APIC error counts into irq_stats,
mark them suppressed by default and update the sites which increment them.
This changes the output format of 'ERR' and 'MIS' in case there are events
to the regular per CPU display format and otherwise suppresses them
completely.
As a side effect this removes the arch_cpu_stat() mechanism from proc/stat
which was only there to account for the error interrupts on x86 and missed
to take the misrouted ones into account.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Radu Rendec <radu@rendec.net>
Link: https://patch.msgid.link/20260517194931.361942103@kernel.org
|
|
xfs_fs_map_blocks() currently passes XFS_BMAPI_ENTIRE to xfs_bmapi_read(),
which causes the bmap code to expand the mapping to cover the entire
extent rather than the requested range.
A single LAYOUTGET request from the client can cause the server to
issue multiple calls to xfs_fs_map_blocks() for different offsets
within the same extent. Because the use of XFS_BMAPI_ENTIRE flag,
these calls can produce overlapping mappings.
As a result, the LAYOUTGET reply sent to the NFS client may contain
overlapping extents. This creates ambiguity in extent selection for a
given file range, which can lead to incorrect device selection,
inconsistent handling of datastate, and ultimately data corruption or
protocol violations on the client side.
Problem discovered with xfstest generic/075 test using NFSv4.2 mount
with SCSI layout.
Fix this by replacing the XFS_BMAPI_ENTIRE flag with '0' so that
xfs_bmapi_read() returns only the mapping for the requested range.
Fixes: cc6c40e09d7b1 ("NFSD/blocklayout: Support multiple extents per LAYOUTGET").
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
xfs_fs_map_blocks() acquires the data map lock and then calls
xfs_bmapi_read(). If xfs_bmapi_read() fails, the function currently
still falls through to xfs_bmbt_to_iomap(), which consumes an
uninitialized imap record and may return invalid data to the caller.
Fix this by releasing the data map lock and returning immediately when
xfs_bmapi_read() reports an error. This prevents xfs_bmbt_to_iomap()
from being called with an uninitialized xfs_bmbt_irec.
Fixes: 527851124d10f ("xfs: implement pNFS export operations")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
It is safe to call _ntfs_bad_inode on live inodes since:
commit 519b078998ce ("fs/ntfs3: Exclude call make_bad_inode for live nodes.")
The WARN_ON was added when it wasn't safe by:
commit d99208b91933 ("fs/ntfs3: cancle set bad inode after removing name fails")
Replace the WARN_ON with a call to _ntfs_bad_inode() to prevent further
operations on the inconsistent inode.
Reported-by: syzbot+4d8e30dbafb5c1260479@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4d8e30dbafb5c1260479
Fixes: 519b078998ce ("fs/ntfs3: Exclude call make_bad_inode for live nodes.")
Signed-off-by: Helen Koike <koike@igalia.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
When run_remove_range() removes a middle portion of a non-sparse run,
it splits the run into head and tail parts. The tail is inserted via
run_add_entry() but uses the original r->lcn as its starting LCN
instead of advancing it by the split offset.
For example, removing VCN range [10, 20) from a run
{vcn=0, lcn=100, len=30} should produce:
{vcn=0, lcn=100, len=10} (head)
{vcn=20, lcn=120, len=10} (tail, lcn advanced by 20)
But the current code produces:
{vcn=0, lcn=100, len=10}
{vcn=20, lcn=100, len=10} (wrong: points to same physical clusters)
This creates overlapping physical mappings in the in-memory run tree,
which can corrupt cluster allocation decisions and lead to data
corruption.
The correct pattern is already used in run_insert_range():
CLST lcn2 = r->lcn == SPARSE_LCN ? SPARSE_LCN : (r->lcn + len1);
Apply the same logic in run_remove_range().
Fixes: 10d7c95af043 ("fs/ntfs3: add delayed-allocation (delalloc) support")
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
In the analysis pass of $LogFile journal replay, log_replay() copies
LCNs from each action log record into an existing Dirty Page Table
(DPT) entry without bounding the destination index. A crafted NTFS
image with DPT entry lcns_follow=1 and an action log record with
lcns_follow=2 produces a kernel slab out-of-bounds write at mount
time:
BUG: KASAN: slab-out-of-bounds in log_replay+0x654c/0xdb60
Write of size 8 at addr ffff8880095e1040 by task mount
Two attacker-controlled fields can drive j+i past the allocated
page_lcns[] array:
1. dp->lcns_follow (capacity) can be smaller than lrh->lcns_follow.
2. lrh->target_vcn may be smaller than dp->vcn, making the u64
subtraction wrap to a huge size_t.
Validate target VCN delta and per-record LCN count against the
DPT entry capacity, bail via the existing out: cleanup label with
-EINVAL.
This mirrors the bounds-check pattern added in commit b2bc7c44ed17
("fs/ntfs3: Fix slab-out-of-bounds read in DeleteIndexEntryRoot")
and commit 0ca0485e4b2e ("fs/ntfs3: validate rec->used in
journal-replay file record check").
Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal")
Reported-by: Yunpeng Tian <shionthanatos@gmail.com>
Reported-by: Mingda Zhang <npczmd@qq.com>
Reported-by: Gongming Wang <gmwgg05@gmail.com>
Reported-by: Peiyuan Xu <paulbucket12@gmail.com>
Reported-by: Qinrun Dai <jupmouse@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Yunpeng Tian <shionthanatos@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
|
|
Under heavy garbage collection pressure from RocksDB workloads,
filesystem shutdowns can occur in xfs_zone_gc_iter_irec when
xfs_iget() returns -EINVAL for deleted files.
Fix this by handling -EINVAL just like we handle -ENOENT, allowing
zone GC to safely ignore stale mappings.
Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection")
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
isofs_readdir() allocates a temporary buffer with __get_free_page().
kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.
Replace use of __get_free_page() with kmalloc().
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-11-275e36a83f0e@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
dquot_init() allocates a single page for dquot_hash with
__get_free_pages().
kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.
Replace use of __get_free_pages() with kmalloc() and get rid of the order
variable that remained 0 for more than 20 years.
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-1-275e36a83f0e@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Cross-merge BPF and other fixes after downstream PR.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
|
|
|
|
In the beginning of the loop, we try to obtain a locked delayed ref head,
if 'locked_ref' is currently NULL, by calling btrfs_select_ref_head(),
which can return an error pointer. If the error pointer is -EAGAIN we do
a continue and go back to the beginning of the loop, which will not try
again to call btrfs_select_ref_head() since 'locked_ref' is no longer
NULL but it's ERR_PTR(-EAGAIN), and then we do:
spin_lock(&locked_ref->lock);
against a ERR_PTR(-EAGAIN) value, generating an invalid pointer
dereference.
Fix this by ensuring that 'locked_ref' is set to NULL when
btrfs_select_ref_head() returns ERR_PTR(-EAGAIN) and incrementing 'count'
as well, to prevent infinite looping. We do this by doing a goto to the
bottom of the loop that already sets 'locked_ref' to NULL and does a
cond_resched(), with an increment to 'count' right before the goto.
These measures were in place before the refactoring in commit 0110a4c43451
("btrfs: refactor __btrfs_run_delayed_refs loop") but were unintentionally
lost afterwards.
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ag8ARRwykv8bpJ87@stanley.mountain/
Fixes: 0110a4c43451 ("btrfs: refactor __btrfs_run_delayed_refs loop")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
sb_write_pointer() reads the super block from the block device page cache
using read_cache_page_gfp(). This has the same race with BLKBSZSET as the
one fixed by commit 3f29d661e568 ("btrfs: sync read disk super and set
block size").
Take the mapping invalidate lock around read_cache_page_gfp() to
serialize the read against block size changes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: KangNing Liao <lkangn.kernel@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
ntfs_link() converts the new link name with ntfs_nlstoucs() using
NTFS_MAX_NAME_LEN. In this case ntfs_nlstoucs() allocates the result
from ntfs_name_cache, and its contract requires callers to release the
buffer with kmem_cache_free(ntfs_name_cache, ...).
All other ntfs_nlstoucs() callers in namei.c do that, but ntfs_link()
uses kfree(), which mismatches the allocator for successfully converted
names.
The conversion failure path reaches the common out label with uname ==
NULL. That was harmless for kfree(), but kmem_cache_free() does not
provide the same NULL contract. Return directly on conversion failure
and free successful conversions with ntfs_name_cache.
Fixes: af0db57d4293 ("ntfs: update inode operations")
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
If hpfs_map_dnode_bitmap fails, the code would call hpfs_brelse4 on
uninitialized quad buffer head, causing a crash.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Cc: stable@vger.kernel.org
|
|
[CURRENT LIMIT]
Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger
than BITS_PER_LONG.
One call site that utilizes this limit is btrfs_bio_ctrl::submit_bitmap,
which makes it very simple and straightforward to just grab an unsigned
long value and assign it to submit_bitmap.
Unfortunately that limit prevents us from supporting huge folios.
For 4K page size and block size, a huge folio (order 9) means 512 blocks
inside a 2M folio.
[ENHANCEMENT]
Instead of using a fixed unsigned long value, change
btrfs_bio_ctrl::submit_bitmap to an unsigned long pointer.
And for cases where an unsigned long can hold the whole bitmap,
introduce @submit_bitmap_value, and just point that pointer to that
unsigned long.
Then update all direct users of bio_ctrl->submit_bitmap to use the
pointer version.
There are several call sites that get extra changes:
- @range_bitmap inside extent_writepage_io()
Which is only utilized to truncate the bitmap.
Since we do not want to allocate new memory just for such temporary
usage, change the original bitmap_set() and bitmap_and() into
bitmap_clear() for the ranges outside of the target range.
- Getting dirty subpage bitmap inside writepage_delalloc()
Since we're passing an unsigned long pointer now, we need to go with
different handling (bs == ps, blocks_per_folio <= BITS_PER_LONG,
blocks_per_folio > BITS_PER_LONG).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[CURRENT LIMIT]
Btrfs currently only supports sub-bitmaps (e.g. dirty bitmap) no larger
than BITS_PER_LONG.
That limit allows us to easily grab an unsigned long without the need to
properly allocate memory for a larger bitmap.
Unfortunately that limit prevents us from supporting huge folios.
For 4K page size and block size, a huge folio (order 9) means 512 blocks
inside a 2M folio.
[ENHANCEMENT]
To allow direct bitmap operations without allocating new memory,
introduce two different ways to access the subpage bitmaps:
- Return an unsigned long value
This only happens if blocks_per_folio <= BITS_PER_LONG.
We read out the sub-bitmap into an unsigned long, and return the
value.
This is the old existing method.
This involves get_bitmap_value_##name() helper functions.
And this time the helper functions are defined as inline functions
instead of macros to provide better type checks.
- Return a pointer where the sub-bitmap starts
This only happens if blocks_per_folio >= BITS_PER_LONG.
This is the new method for sub-bitmaps larger than BITS_PER_LONG.
Since the sizes of sub-bitmaps are all aligned to BITS_PER_LONG, we
can directly access the start word of the sub-bitmap.
This involves get_bitmap_pointer_##name() helper functions.
Then change the existing sub-bitmaps users to use the new helpers:
- Bitmap dumping
Switch between get_bitmap_value_##name() and
get_bitmap_pointer_##name() depending on the sub-bitmap size.
- btrfs_get_subpage_dirty_bitmap()
Rename it to btrfs_get_subpage_dirty_bitmap_value() to follow the new
value/pointer naming.
Since we do not support huge folios yet, there is no pointer version
for the dirty bitmap.
Furthermore, add the support for block size == page size cases for
btrfs_get_subpage_dirty_bitmap_value(), so that the caller no longer
needs to check if the folio needs subpage handling.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The comments at the beginning of subpage.c are out-of-date, a lot of the
limitations have been already resolved.
Update them to reflect the latest status.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
NULL check before kfree() is unnecessary and triggers coccinelle warnings.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
Coccinelle warned about unnecessary patterns when
assigning to bool variables.
Simply assign the condition directly.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
The ntfs driver does not implement quota accounting. It creates
new inodes with the NTFS 1.2 $STANDARD_INFORMATION layout and does
not maintain the NTFS 3.x owner_id/quota_charged fields or the
$Quota usage records that Windows would need for meaningful quota
accounting.
The only runtime quota path left in the driver is the remount-rw
code that tries to mark $Quota/$Q out of date, plus the mount-time
code that loads $Quota and its $Q index solely to support that
marker.
Since the driver does not maintain the per-file quota metadata,
setting QUOTA_FLAG_OUT_OF_DATE does not make the quota state
meaningful, and failures in this unsupported path can unnecessarily
block remount-rw or force a mount read-only.
Remove the quota marker, the $Quota/$Q loading state, and the
unused quota volume flag. Keep the on-disk quota layout definitions
in layout.h so the documented NTFS structures remain available.
Suggested-by: Hyunchul Lee <hyc.lee@gmail.com>
Link: https://lore.kernel.org/all/CANFS6bYTzioqZjYt=51Kb9RdR3MKXaez_fh_WCLoym093VxFmg@mail.gmail.com/
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
Replace the manual ternary "s" pluralization with str_plural() to
simplify the code.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
in ntfs_ea_lookup and ntfs_listxattr, this verifies that there is enough
space in the EA entry before accessing the next_entry_offset field of
the EA entry.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
Validate index entries immediately after reading an index root or index
block from disk. This eliminates repeated checks in lookup and readdir,
and reduce the risk of missing checks in those paths.
Tested-by: woot000 <woot000@woot000.com>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
Add a dedicated helper to perform stricter validation of $INDEX_ROOT and
use it for both directory inodes and named index inodes. This keeps the
root size and header geometry checks consistent across both read paths.
Tested-by: woot000 <woot000@woot000.com>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
Modify ntfs_index_block_inconsisent() to perform stricter validation of
INDEX_HEADER geometry in INDX blocks, and update
ntfs_lookup_inode_by_name() to use that function to validate INDX
blocks.
Tested-by: woot000 <woot000@woot000.com>
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
get_nr_free_clusters() allocates a temporary file_ra_state before it
publishes the precomputed free cluster count, sets NVolFreeClusterKnown(),
and wakes vol->free_waitq. If that allocation fails, the worker returns
without setting the flag or waking waiters, so callers waiting for the free
count can block indefinitely.
The readahead state is only used synchronously while scanning the bitmap.
Keep it on the stack and pass it by address to the readahead helper. This
eliminates the early allocation failure path instead of adding a special
case that publishes a conservative count and wakes the waitqueue.
Zero-initialize the on-stack state because file_ra_state_init() only sets
ra_pages and prev_pos.
Apply the same treatment to __get_nr_free_mft_records(), which scans the
MFT bitmap with the same short-lived readahead state.
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
This patch fixes the ABBA deadlock between extent_lock and extent
mrec_lock triggered by xfstests generic/113, that occurs since the commit
6994acf33bae ("ntfs: use base mft_no when looking up base inode for
extent record").
Path A (inode writeback):
VFS writeback
-> ntfs_write_inode()
-> __ntfs_write_inode()
-> mutex_lock(&ni->extent_lock)
-> mutex_lock(&tni->mrec_lock)
Path B (MFT folio writeback):
VFS writeback of $MFT dirty folios
-> ntfs_mft_writepages()
-> ntfs_write_mft_block()
-> ntfs_may_write_mft_record()
-> holds one extent mrec_lock from a previous iteration
-> tries to acquire another base inode extent_lock
By removing all extent_lock and extent mrec_lock acquisition from the MFT
folio writeback path, the ABBA lock ordering is eliminated:
Path A: __ntfs_write_inode(): extent_lock -> mrec_lock
Path B (removed): ntfs_write_mft_block(): mrec_lock -> extent_lock
Path B is always redundant for extent records because:
1. mark_mft_record_dirty(ext_ni) does NOT dirty the MFT folio.
It only sets NInoDirty(ext_ni) and marks the base VFS inode dirty
via __mark_inode_dirty(I_DIRTY_DATASYNC), which triggers Path A.
Therefore, normal extent modifications never create a situation where
the MFT folio is dirty and Path B is not scheduled.
2. The MFT folio only gets dirtied via ntfs_mft_mark_dirty() inside
ntfs_mft_record_alloc(). But all identified callers in attrib.c
(ntfs_attr_add, ntfs_attr_record_move_away,
ntfs_attr_make_non_resident, ntfs_attr_record_resize) follow through
with mark_mft_record_dirty(), which triggers Path A to write the
complete record.
3. ntfs_evict_big_inode() calls ntfs_commit_inode() before freeing extent
inodes, ensuring all dirty extents are flushed via Path A before the
base inode leaves the icache.
Signed-off-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
load_and_init_upcase() currently aliases vol->upcase to the global
default upcase whenever the shared prefix matches, and then truncates
vol->upcase_len to that shorter prefix. The result is correct only by
accident: upcase[] accesses in name collation are gated by upcase_len,
so the prefix-equality alias produces the same fold output as keeping
the volume's own shorter table.
Still, prefix equality is not equality: the volume table is logically
distinct from the default and should not be replaced by it unless they
are byte-for-byte identical. Use memcmp() to compare the complete table
in one expression and drop the now-redundant upcase_len rewrite.
No user-visible change is expected for compliant volumes whose $UpCase
has exactly default_upcase_len entries; shorter volume tables are no
longer aliased to the default.
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
ntfs_fill_super()'s err_out_now path frees only the volume struct via
kfree(vol), leaving several vol-owned allocations behind on every mount
failure:
- vol->nls_map, loaded by ntfs_init_fs_context() via
load_nls_default() (or replaced by an explicit nls= option in
ntfs_parse_param()), is never unload_nls()'d.
- vol->volume_label, allocated by load_system_files() through
ntfs_ucstonls() once the $Volume name attribute has been parsed, is
not released by load_system_files()'s own error labels nor by the
fill_super() inline cleanup that only runs on d_make_root()
failure. Any later failure inside load_system_files() leaks it.
- vol->lcn_empty_bits_per_page was kvfree()'d in
unl_upcase_iput_tmp_ino_err_out_now without clearing the pointer,
so it could not be folded into a single common cleanup.
Because the failure paths never call ntfs_volume_free() and never reach
the d_make_root() inline cleanup block (it sits above the label and is
jumped over by the load_system_files() / kvmalloc failure gotos), these
resources accumulate per failed mount attempt with no chance of
recovery short of unloading the module. This is a silent leak: the
inodes loaded prior to failure remain hashed but generic_shutdown_super()
skips evict_inodes() when sb->s_root is unset, so no CHECK_DATA_CORRUPTION
warning is emitted either.
Move the per-volume frees down to err_out_now and drop the
lcn_empty_bits_per_page kvfree() from the upper label so the cleanup is
performed exactly once on every failure path. Using unconditional
kvfree() / kfree() / unload_nls() is safe because they all accept NULL
and the upper labels that previously freed nls_map (the d_make_root()
inline cleanup) already clear the pointer.
Signed-off-by: DaeMyung Kang <charsyam@gmail.com>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
|
|
We need the driver-core fixes in here as well to build on top of.
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
Before btrfs switched to the new mount API in 2023, we were setting
SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the
filesystem may have files which don't have security xattrs, enabling it
to do some optimizations.
Unfortunately this was missed in the transition, meaning that IS_NOSEC
will always return false for a btrfs inode. This means that
btrfs_direct_write() calls will always get the inode lock exclusively,
meaning that DIO writes to the same file will be serialized.
On my machine, this one-line change results in a ~59% improvement in DIO
throughput:
Before patch:
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
...
fio-3.39
Starting 32 processes
test: Laying out IO file (1 file / 1024MiB)
Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
bw ( KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
iops : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
cpu : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec
After patch:
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
...
fio-3.39
Starting 32 processes
test: Laying out IO file (1 file / 1024MiB)
Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
bw ( MiB/s): min= 619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
iops : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
cpu : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec
The script to reproduce that:
#!/bin/bash
mkfs.btrfs -f /dev/nvme0n1
mount /dev/nvme0n1 /mnt/test
mkdir /mnt/test/nocow
chattr +C /mnt/test/nocow
fio /root/test.fio
# cat /root/test.fio
[global]
rw=randwrite
ioengine=io_uring
iodepth=64
size=1g
direct=1
startdelay=20
force_async=4
ramp_time=5
runtime=60
group_reporting=1
numjobs=32
time_based
disk_util=0
clat_percentiles=0
disable_lat=1
disable_clat=1
disable_slat=1
filename=/mnt/test/nocow/fiofile
[test]
name=test
bs=4k
stonewall
This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed
through PCI passthrough. The figures for XFS and ext4 in comparison are
both about ~3GB/s.
Fixes: ad21f15b0f79 ("btrfs: switch to the new mount API")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_writepages() just accumulates as large bio as possible
(within writeback_control constraints) and then submits it. This can
however lead to significant latency in writeback IO submission (I have
observed tens of milliseconds) because the submitted bio easily has over
hundred of megabytes. Consequently this leads to IO pipeline stalls and
reduced throughput.
At the same time beyond certain size submitting so large bio provides
diminishing returns because the bio is split by the block layer
immediately anyway. So compute (estimate of) bio size beyond which we
are unlikely to improve performance and just submit the bio for
writeback once we accumulate that much to keep the IO pipeline busy.
This improves writeback throughput for sequential writes by about 15% on
the test machine I was using.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
[ Fix the handling of missing device to avoid NULL pointer dereference. ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Optimize the btrfs_abort_transaction() for size as it (by our
convention) must be put right after the error condition is detected.
The exact file:line is reported so there's a portion that must be
inlined. As this is cold code it bloats functions. In previous patch
"btrfs: move transaction abort message to __btrfs_abort_transaction()"
the error message was moved to the common helper, saving like 20KiB of
btrfs.ko and several instructions per call site and some stack space.
There's little left to be optimized, we need to keep the atomic
test_and_set_bit() and to convey that as 'first hit' to
__btrfs_abort_transaction().
Right now it's a bool, which takes 8 bytes on stack for each call but
it's 1 bit of information. We can encode that to some of the other
parameters.
For that let's use the 'error' parameter, by convention it's negative
errno so we can reliably detect if it's the first hit or a later error.
Also the negation is usually implemented by a single instruction (NEG on
x86_64) so the resulting object code is kept short.
This reduces btrfs.ko by 8K and stack in several functions by 8 bytes.
Cumulative effect with the other commit is -30K of btrfs.ko. While the
encoding is an implementation detail, it's contained within the API.
Making the transaction abort calls very light is desired.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In preparation to encode more information to the error value add a step
that verifies if the value is valid (i.e. < 0). This works for
compile-time and runtime (in debugging mode).
The compile-time check recognizes direct constants and defines an array
type. An invalid condition leads to negative array size which is caught
by compiler.
The runtime check constructs the array type from the condition and only
verifies the correct size, as we don't need to tweak the size to be
negative.
The sizeof() expressions do not generate any code. In the debugging
config the warning adds about 9KiB of btrfs.ko code size.
The array size trick is needed as we can't use static_array(), not even
with __builtin_constant_p().
Sample error message:
In file included from inode.c:40:
inode.c: In function ‘__cow_file_range_inline’:
transaction.h:261:26: error: size of unnamed array is negative
261 | (void)sizeof(char[-!(__builtin_constant_p(error) ? (error) < 0 : 1)]); \
| ^
transaction.h:275:9: note: in expansion of macro ‘VERIFY_NEGATIVE_ERROR’
275 | VERIFY_NEGATIVE_ERROR(error); \
| ^~~~~~~~~~~~~~~~~~~~~
inode.c:665:17: note: in expansion of macro ‘btrfs_abort_transaction’
665 | btrfs_abort_transaction(trans, 17);
| ^~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the beginning of the loop, we try to obtain a locked delayed ref head,
if 'locked_ref' is currently NULL, by calling btrfs_select_ref_head(),
which can return an error pointer. If the error pointer is -EAGAIN we do
a continue and go back to the beginning of the loop, which will not try
again to call btrfs_select_ref_head() since 'locked_ref' is no longer
NULL but it's ERR_PTR(-EAGAIN), and then we do:
spin_lock(&locked_ref->lock);
against a ERR_PTR(-EAGAIN) value, generating an invalid pointer
dereference.
Fix this by ensuring that 'locked_ref' is set to NULL when
btrfs_select_ref_head() returns ERR_PTR(-EAGAIN) and incrementing 'count'
as well, to prevent infinite looping. We do this by doing a goto to the
bottom of the loop that already sets 'locked_ref' to NULL and does a
cond_resched(), with an increment to 'count' right before the goto.
These measures were in place before the refactoring in commit 0110a4c43451
("btrfs: refactor __btrfs_run_delayed_refs loop") but were unintentionally
lost afterwards.
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ag8ARRwykv8bpJ87@stanley.mountain/
Fixes: 0110a4c43451 ("btrfs: refactor __btrfs_run_delayed_refs loop")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
sb_write_pointer() reads the super block from the block device page cache
using read_cache_page_gfp(). This has the same race with BLKBSZSET as the
one fixed by commit 3f29d661e568 ("btrfs: sync read disk super and set
block size").
Take the mapping invalidate lock around read_cache_page_gfp() to
serialize the read against block size changes.
Signed-off-by: KangNing Liao <lkangn.kernel@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_sync_log() is one of the main functions called during a fsync.
Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_log_new_name() is an important function that affects inode logging
and is called during link and rename operations. Add trace events for when
entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_record_new_subvolume() is an important operation that affects
inode logging and is called during subvolume creation. Add a trace event
for it to help debug issues.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_record_snapshot_destroy() is an important operation that affects
inode logging and is called during subvolume/snapshot deletion as well as
during rmdir. Add a trace event for it to help debug issues.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_record_unlink_dir() is an important operation that affects inode
logging and is called during unlink and rename operations. Add a trace
event for it to help debug issues.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
log_new_delayed_dentries() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In overwrite_item():
There's no point in printing the root's ID if the assertion fails, since
it can only be BTRFS_TREE_LOG_OBJECTID if it fails.
In log_new_delayed_dentries():
There's no point in using a verbose assertion to print the value of
ctx->logging_new_delayed_dentries because it's a boolean, so if the
assertion fails we know its value is true (1).
So convert them to simpler assertion to make the code less verbose.
It also slightly reduces the object size, at least on x86_64 using
Debian's gcc 14.2.0-19 (if CONFIG_BTRFS_ASSERT is enabled in the kernel
config, which is the case for SUSE distributions for example).
Before:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
2028244 197176 15624 2241044 223214 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
2028228 197176 15624 2241028 223204 fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
log_conflicting_inodes() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
add_conflicting_inode() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
log_new_dir_dentries() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
log_all_new_ancestors() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_log_all_parents() is an important step called during a fsync, as
well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_log_inode() is one of the most important steps called during a fsync,
as well as during rename and link operations on inodes that were previously
logged. Add trace events for when entering and exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We use this unnamed enum for the log mode and then pass it around log
functions as an int type with the odd name "inode_only" which suggests a
boolean. So add a name to the enum and change the type everywhere to that
enum and rename the parameters to something more clear - "log_mode".
Also move the enum into tree-log.h - it will be used later by new trace
events.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_log_inode_parent() is one of the most important steps called during
a fsync operation as well as during rename and link operations on inodes
that were previously logged. Add trace events for when entering and
exiting that function.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we only have a trace event for when a fsync operation starts,
but this alone is not very helpful. Add a trace event for when fsync
finishes, which reports its return value, so that using tracing we can
see which other trace events happened in between (several will be added
soon for inode logging steps) and even measure execution time.
So rename the existing trace event btrfs_sync_file to
btrfs_sync_file_enter and add the trace event btrfs_sync_file_exit.
The naming is similar to what ext4 does (ext4_sync_file_enter and
ext4_sync_file_exit) and with similar information reported.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we can skip logging the inode during fsync, we check for writeback
errors in the inode's mapping by calling filemap_check_wb_err() and then
jump to the 'out_release_extents' label, which in turn jumps to the 'out'
label under which we check again for a writeback error by calling
file_check_and_advance_wb_err(). So the filemap_check_wb_err() ends up
being redundant. This happens since commit 333427a505be ("btrfs: minimal
conversion to errseq_t writeback error reporting on fsync").
Remove the filemap_check_wb_err() call.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|