path: root/mm
Age | Commit message | Author | Files | Lines
2 days | Merge branch 'slab/for-next' of https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git | Mark Brown | 6 | -389/+413
2 days | Merge branch 'driver-core-next' of https://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core.git | Mark Brown | 1 | -1/+1
2 days | Merge branch 'next' of https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git | Mark Brown | 2 | -47/+39
2 days | Merge branch 'fs-next' of linux-next | Mark Brown | 1 | -1/+0
# Conflicts:
#	fs/btrfs/defrag.c
2 days | Merge branch 'mm-unstable' of https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Mark Brown | 53 | -1929/+5994
2 days | Merge branch 'mm-nonmm-stable' of https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Mark Brown | 2 | -2/+7
2 days | Merge branch 'mm-stable' of https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm | Mark Brown | 41 | -597/+1342
2 days | Merge branch 'fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock.git | Mark Brown | 2 | -6/+17
3 days | next-20260522/vfs-brauner | Mark Brown | 1 | -1/+0
# Conflicts:
#	fs/fuse/dev.c
3 days | Merge branch 'slab/for-7.2/alloc_bulk' into slab/for-next | Vlastimil Babka (SUSE) | 3 | -37/+41
3 days | mm/slab: improve kmem_cache_alloc_bulk | Christoph Hellwig | 3 | -37/+41
The kmem_cache_alloc_bulk return value is weird. It returns the number of allocated objects, but that must always be 0 or the requested number based on the implementations and the handling in the callers, but that assumption is not actually documented anywhere, which confuses automated review tools.

Fix this by returning a bool if the allocation succeeded and adding a kerneldoc comment explaining the API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> # skbuff
Link: https://patch.msgid.link/20260528093437.2519248-2-hch@lst.de
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
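The all-or-nothing contract described above can be illustrated in user space. This is only a sketch under stated assumptions: `bulk_alloc_all_or_nothing()` is a hypothetical helper built on `malloc()`, not the kernel function, and it only mirrors the idea that a bulk allocation either fills every slot and reports success or leaves nothing allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical user-space stand-in for a bulk allocator with
 * all-or-nothing semantics: either every slot in @objs is filled and
 * true is returned, or nothing stays allocated and false is returned.
 */
static bool bulk_alloc_all_or_nothing(size_t obj_size, size_t nr, void **objs)
{
	size_t i;

	assert(objs != NULL);

	for (i = 0; i < nr; i++) {
		objs[i] = malloc(obj_size);
		if (!objs[i])
			goto undo;
	}
	return true;

undo:
	/* Roll back the partial allocation so the caller never sees one. */
	while (i--) {
		free(objs[i]);
		objs[i] = NULL;
	}
	return false;
}
```

A caller then needs a single boolean check instead of comparing a returned count against the requested number.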
3 days | Merge branch 'slab/for-7.2/alloc_token' into slab/for-next | Vlastimil Babka (SUSE) | 5 | -162/+121
3 days | mm/slub: detach and reattach partial slabs in batch | Hao Li | 1 | -8/+20
get_partial_node_bulk() moves each selected slab from the node's partial list to the local pc->slabs list using a remove_partial() and list_add() pair.

In practice, the loop often detaches several adjacent slabs. Doing this individually repeatedly manipulates list pointers while holding n->list_lock, which causes unnecessary churn.

To demonstrate this, the counts below show how often single vs. multiple consecutive slabs are retrieved during a will-it-scale mmap stress test:

    consecutive_slabs_count    frequency
    = 1                        277345324
    = 2                        335238023
    = 3                        175717884
    >= 4                       88862337

The data confirms that retrieving multiple contiguous slabs is highly frequent.

To optimize this, track contiguous runs of matching slabs and move each run in a single operation using list_bulk_move_tail(). This reduces list pointer churn inside the lock critical section.

Apply the same optimization to __refill_objects_node() when reattaching leftover partial slabs back to the node's partial list.

The will-it-scale mmap benchmark shows a 2% ~ 5% performance improvement after applying this patch.

Signed-off-by: Hao Li <hao.li@linux.dev>
Link: https://patch.msgid.link/20260529035120.81304-3-hao.li@linux.dev
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
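To make the batching idea concrete, here is a minimal user-space sketch of moving a whole run of adjacent nodes with one splice. The `struct list_head` and helpers below are simplified re-implementations for illustration; `bulk_move_tail()` is modeled on the kernel's `list_bulk_move_tail()`, and the names `list_init()`, `list_add_tail_node()` and `list_len()` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Move the run first..last (inclusive, already linked in that order on
 * some list) to the tail of @head. The pointer updates are the same
 * however long the run is, which is the point of batching.
 */
static void bulk_move_tail(struct list_head *head,
			   struct list_head *first,
			   struct list_head *last)
{
	assert(head && first && last);

	/* Unlink the whole run from its current list. */
	first->prev->next = last->next;
	last->next->prev = first->prev;

	/* Splice the run in just before @head, i.e. at the tail. */
	head->prev->next = first;
	first->prev = head->prev;
	last->next = head;
	head->prev = last;
}

static size_t list_len(const struct list_head *head)
{
	size_t n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

Moving a run of k nodes this way costs six pointer writes in total, whereas k separate remove-and-add pairs touch list pointers k times over.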
3 days | mm/slub: introduce helpers for node partial slab state | Hao Li | 1 | -6/+17
Wrap partial slab count inc/dec and flag set/clear into helper functions to reduce code duplication.

Note that __add_partial() is called locklessly in early_kmem_cache_node_alloc(), but since there is no such use case for removal, __remove_partial() does not exist.

Suggested-by: Harry Yoo <harry@kernel.org>
Signed-off-by: Hao Li <hao.li@linux.dev>
Link: https://patch.msgid.link/20260529035120.81304-2-hao.li@linux.dev
Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
3 days | mm/migrate: find_mm_struct: fix race between security checks and suid exec | Oleg Nesterov | 1 | -4/+9
The target task can execute a setuid binary between ptrace_may_access() and get_task_mm(). Protect this critical section with exec_update_lock.

I don't think cpuset_mems_allowed(task) should be called under exec_update_lock, but this patch just tries to add the minimal fix. Perhaps we can later add a common helper which can be used by find_mm_struct() and kernel_migrate_pages().

Link: https://lore.kernel.org/ahWxQ3JxdR5ff2qf@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | mm: remove mentions of PageWriteback | Matthew Wilcox (Oracle) | 3 | -12/+12
Update two comments to refer to writeback in general instead of the specific flag. Convert the large comment in memory.c to be entirely folio-based.

Link: https://lore.kernel.org/20260526195650.353196-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | memcg: multi objcg charge support | Shakeel Butt | 1 | -58/+142
Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") split a memcg's single obj_cgroup into one per NUMA node so that reparenting LRU folios can take per-node lru locks. As a side effect, the per-CPU obj_stock_pcp -- which caches exactly one cached_objcg -- thrashes on workloads where threads of the same memcg run on different NUMA nodes. The kernel test robot reported a 67.7% regression on stress-ng.switch.ops_per_sec from this pattern.

Mirror the multi-slot pattern already used by memcg_stock_pcp: turn nr_bytes and cached_objcg into NR_OBJ_STOCK-element arrays, scan all slots on consume/refill/account, prefer empty slots when inserting, and evict a slot round-robin only when full. With multiple slots a CPU can hold the per-node objcg variants of one memcg plus a few siblings without ever forcing a drain.

A single int8_t index records which slot the cached slab stats belong to; the stats are flushed on slot or pgdat change.

With NR_OBJ_STOCK = 5 the layout (verified with pahole) is:

    offset 0  : lock(1) + index(1) + node_id(2) + slab stats(4) = 8B
    offset 8  : nr_bytes[5]                                     = 10B
    offset 18 : padding                                         = 6B
    offset 24 : cached[5]                                       = 40B
    offset 64 : (line 2) work_struct + flags (cold)

so consume_obj_stock, refill_obj_stock and the slab account path each touch exactly one 64-byte cache line on non-debug 64-bit builds.

Link: https://lore.kernel.org/20260526033931.1760588-5-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <qi.zheng@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
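The slot policy this commit message describes (scan all slots, prefer an empty one, evict round-robin only when full) can be sketched in isolation. Everything below is a hypothetical user-space model: `struct slot_cache`, `slot_find()` and `slot_get()` are names invented for the example, and draining the evicted slot's bytes is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 5	/* mirrors NR_OBJ_STOCK = 5 from the commit message */

struct slot_cache {
	const void *key[NR_SLOTS];	/* NULL marks an empty slot */
	uint16_t bytes[NR_SLOTS];
	unsigned int next_victim;	/* round-robin eviction cursor */
};

/* Return the slot holding @key, or -1 when it is not cached. */
static int slot_find(const struct slot_cache *c, const void *key)
{
	for (int i = 0; i < NR_SLOTS; i++)
		if (c->key[i] == key)
			return i;
	return -1;
}

/*
 * Return a slot for @key: reuse a hit, else take an empty slot, else
 * evict round-robin. *evicted reports the key that was dropped (NULL
 * if none); a real implementation would drain its bytes first.
 */
static int slot_get(struct slot_cache *c, const void *key,
		    const void **evicted)
{
	int i;

	assert(key != NULL);
	*evicted = NULL;

	i = slot_find(c, key);
	if (i >= 0)
		return i;

	i = slot_find(c, NULL);		/* prefer an empty slot */
	if (i < 0) {
		i = (int)c->next_victim;
		c->next_victim = (c->next_victim + 1) % NR_SLOTS;
		*evicted = c->key[i];
	}
	c->key[i] = key;
	c->bytes[i] = 0;
	return i;
}
```

With five slots, five distinct keys coexist without any eviction; only a sixth key forces one slot out.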
3 days | memcg: int16_t for cached slab stats | Shakeel Butt | 1 | -13/+12
Currently struct obj_stock_pcp stores cached slab stats in 'int' which is 4 bytes per counter on 64-bit machines. Switch them to int16_t to shrink the cached metadata.

The existing PAGE_SIZE flush in __account_obj_stock() bounds *bytes at PAGE_SIZE on 4KiB and 16KiB page archs, well within int16_t. On 64KiB pages PAGE_SIZE is well above S16_MAX so that flush never fires, and a sufficiently long run of accumulations would overflow the cache.

Add an explicit S16_MAX guard before each add: when the next add would push abs(*bytes) past S16_MAX, fold the cached value into @nr and flush directly via mod_objcg_mlstate() before the accumulation.

Link: https://lore.kernel.org/20260526033931.1760588-4-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Qi Zheng <qi.zheng@linux.dev>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
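The overflow guard can be modeled with a few lines of user-space C. This is a sketch, not the patch: `add_cached_s16()` is a hypothetical helper, and returning the folded amount stands in for the direct flush that the commit message attributes to mod_objcg_mlstate().

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Add @nr to the small cached counter *cached. If the sum would not
 * fit in int16_t, fold the cached value into @nr, reset the cache and
 * return the combined amount for the caller to flush. Otherwise keep
 * the sum cached and return 0.
 */
static int add_cached_s16(int16_t *cached, int nr)
{
	int total;

	/* Keep the int arithmetic below free of overflow in this sketch. */
	assert(nr > INT_MIN / 2 && nr < INT_MAX / 2);

	total = *cached + nr;
	if (abs(total) > INT16_MAX) {
		*cached = 0;
		return total;	/* flush this much directly */
	}
	*cached = (int16_t)total;
	return 0;
}
```

For example, with 30000 already cached, adding 5000 does not wrap the 16-bit counter; the full 35000 is handed back for flushing and the cache restarts at zero.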
3 days | memcg: uint16_t for nr_bytes in obj_stock_pcp | Shakeel Butt | 1 | -6/+19
Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int' which is 4 bytes on 64-bit machines. Switch the field to uint16_t to shrink the per-CPU cache.

The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1 == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds U16_MAX. The accumulator also needs to stay within uint16_t between page-aligned flushes on 64KiB pages where PAGE_SIZE itself is U16_MAX + 1.

Accumulate the new total in an 'unsigned int' local, then on PAGE_SHIFT <= 16 flush whenever the accumulator would hit U16_MAX; together with the existing allow_uncharge flush at PAGE_SIZE this keeps the uint16_t safe.

On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and powerpc 44x, both 32-bit), uint16_t cannot represent the sub-page remainder. Define obj_stock_bytes_t as 'unsigned int' on those archs so nr_bytes can hold the full remainder and the normal page-boundary flush in __refill_obj_stock() and the page extraction in drain_obj_stock() both work correctly. The single-cache-line layout target only applies to PAGE_SHIFT <= 16; those archs are 32-bit embedded and not the optimization target.

Link: https://lore.kernel.org/20260526033931.1760588-3-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Qi Zheng <qi.zheng@linux.dev>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
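The page-size arithmetic in this commit message is easy to check directly. The sketch below is a simplified model and not the patch's exact flush policy: `remainder_fits_u16()` and `refill_u16()` are hypothetical helpers that show why a sub-page remainder fits in `uint16_t` up to a page shift of 16, and how widening to `unsigned int` before splitting off whole pages keeps the narrow field safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when every sub-page remainder (0 .. PAGE_SIZE - 1) fits in uint16_t. */
static bool remainder_fits_u16(unsigned int page_shift)
{
	assert(page_shift < 32);
	return ((UINT32_C(1) << page_shift) - 1) <= UINT16_MAX;
}

/*
 * Add @bytes to a uint16_t stock. The sum is computed in a wider
 * 'unsigned int' local first; whole pages are split off and returned
 * for flushing, and only the sub-page remainder is stored back.
 */
static unsigned int refill_u16(uint16_t *stock, unsigned int bytes,
			       unsigned int page_shift)
{
	unsigned int total, pages;

	assert(page_shift <= 16);	/* remainder must fit in uint16_t */

	total = *stock + bytes;
	pages = total >> page_shift;
	*stock = (uint16_t)(total & ((1u << page_shift) - 1));
	return pages;
}
```

With 64KiB pages (shift 16), a stock of 65000 plus 1000 new bytes yields one whole page to flush and a remainder of 464, which still fits in 16 bits. With a shift of 18 the remainder can reach 0x3FFFF, so the first helper returns false.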
3 days | memcg: store node_id instead of pglist_data pointer | Shakeel Butt | 1 | -7/+19
Patch series "memcg: shrink obj_stock_pcp and cache multiple objcgs", v3.

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") split a memcg's single obj_cgroup into one per NUMA node so that reparenting LRU folios can take per-node lru locks. As a side effect, the per-CPU obj_stock_pcp -- which caches a single cached_objcg pointer -- thrashes on workloads where threads of the same memcg run on different NUMA nodes. The kernel test robot reported a 67.7% regression on stress-ng.switch.ops_per_sec from this pattern.

Commit d0211878ce06 ("memcg: cache obj_stock by memcg, not by objcg pointer") landed as a temporary fix by treating sibling per-node objcgs as equivalent for the cache lookup, intended to be reverted once per-node kmem accounting is introduced.

This series takes a more general approach: cache multiple objcgs per CPU using the multi-slot pattern memcg_stock_pcp already uses, so the per-node objcg variants of one memcg can all coexist in the stock without ever forcing a drain. The temporary fix can then be reverted.

To avoid increasing the per-CPU cache footprint, the first three patches shrink the existing single-slot obj_stock_pcp fields. The final patch converts cached_objcg and nr_bytes into NR_OBJ_STOCK=5 slot arrays and reorders the struct so the entire consume/refill/account hot path fits within a single 64-byte cache line on non-debug 64-bit builds (verified with pahole).

This patch (of 4):

The struct obj_stock_pcp stores a pointer to pglist_data for the slab stats cached on the cpu. On 64-bit machines, this costs 8 bytes. The pointer is not strictly required: NODE_DATA() can recover it from the node id.

Replace cached_pgdat with int16_t node_id and use NUMA_NO_NODE as the "no stats cached" sentinel. At the moment all the archs limit MAX_NUMNODES to 1024 so int16_t is plenty; a BUILD_BUG_ON() makes sure we notice if that ever changes.

Link: https://lore.kernel.org/20260526033931.1760588-1-shakeel.butt@linux.dev
Link: https://lore.kernel.org/20260526033931.1760588-2-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Tested-by: kernel test robot <oliver.sang@intel.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Qi Zheng <qi.zheng@linux.dev>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
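The space saving and the compile-time guard can be sketched outside the kernel. The names below (`struct stats_cache`, `NO_NODE`, `MAX_NUMNODES_ASSUMED`) are hypothetical stand-ins. `_Static_assert` plays the role that BUILD_BUG_ON() plays in the kernel, and the value 1024 is taken from the commit message rather than from any header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NUMNODES_ASSUMED 1024	/* upper bound quoted in the commit message */
#define NO_NODE (-1)			/* stands in for NUMA_NO_NODE */

/* Compile-time guard in the spirit of BUILD_BUG_ON(). */
_Static_assert(MAX_NUMNODES_ASSUMED <= INT16_MAX,
	       "node id must fit in int16_t");

struct stats_cache {
	int16_t node_id;	/* NO_NODE means "no stats cached" */
	int16_t bytes;
};

/* Record which node the cached stats belong to. */
static void stats_cache_set_node(struct stats_cache *c, int node)
{
	assert(node >= 0 && node < MAX_NUMNODES_ASSUMED);
	c->node_id = (int16_t)node;
}

static bool stats_cached(const struct stats_cache *c)
{
	return c->node_id != NO_NODE;
}
```

The 2-byte id replaces an 8-byte pointer on 64-bit builds, and a lookup keyed by the id (NODE_DATA() in the kernel) recovers the full structure when it is needed.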
3 days | mm/dmapool: use static key for boot-time debug enablement | Li RongQing | 1 | -23/+29
Replace the #ifdef CONFIG_SLUB_DEBUG_ON conditional compilation with a static key (dmapool_debug_enabled). This allows enabling dmapool debugging at boot time via:

    dmapool_debug

instead of requiring CONFIG_SLUB_DEBUG_ON at compile time.

Benefits:
- Debugging can be enabled without rebuilding the kernel
- Uses standard kernel static_key mechanism with minimal overhead

Link: https://lore.kernel.org/20260524034015.1830-1-lirongqing@baidu.com
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | mm: shmem: refactor thpsize_shmem_enabled_show() with helper arrays | Ran Xiaokai | 1 | -12/+22
Replace the hardcoded if/else chain of test_bit() calls and string literals in thpsize_shmem_enabled_show() with a loop over huge_shmem_orders_by_mode[] and huge_shmem_enabled_mode_strings[] arrays. This makes thpsize_shmem_enabled_show() consistent with thpsize_shmem_enabled_store() and eliminates duplicated mode name strings.

Link: https://lore.kernel.org/20260525102700.68707-3-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | mm: shmem: refactor thpsize_shmem_enabled_store() with sysfs_match_string() | Ran Xiaokai | 1 | -46/+59
Patch series "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()", v4.

This patch (of 2):

Inspired by commit 82d9ff648c6c ("mm: huge_memory: refactor anon_enabled_store() with set_anon_enabled_mode()"), refactor thpsize_shmem_enabled_store() using sysfs_match_string(). This eliminates the duplicated spin_lock/unlock() and set/clear_bit() calls across all branches, reducing code duplication.

Behavioral change: Call start_stop_khugepaged() only when the mode actually changes. If unchanged, call set_recommended_min_free_kbytes() to preserve legacy watermark behavior. This avoids unnecessary khugepaged restarts.

Tested with selftests ./run_kselftest.sh -t mm:ksft_thp.sh, all test cases passed.

Link: https://lore.kernel.org/20260525102700.68707-1-ranxiaokai627@163.com
Link: https://lore.kernel.org/20260525102700.68707-2-ranxiaokai627@163.com
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Breno Leitao <leitao@debian.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
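The table-driven style that both shmem commits move toward can be illustrated with a user-space lookup. This is a sketch under assumptions: `mode_strings[]` and `match_mode()` are hypothetical, the list of mode names is assumed rather than taken from the patch, and the trailing-newline handling only imitates what a sysfs string matcher has to tolerate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mode names for the example; the exact list is an assumption. */
static const char * const mode_strings[] = {
	"always", "inherit", "within_size", "advise", "never",
};

/*
 * Return the index of @buf in mode_strings[], ignoring one trailing
 * newline (writes to sysfs files usually end with one), or -1 when
 * nothing matches.
 */
static int match_mode(const char *buf)
{
	size_t len;

	assert(buf != NULL);

	len = strlen(buf);
	if (len && buf[len - 1] == '\n')
		len--;

	for (size_t i = 0; i < sizeof(mode_strings) / sizeof(mode_strings[0]); i++) {
		if (strlen(mode_strings[i]) == len &&
		    !strncmp(mode_strings[i], buf, len))
			return (int)i;
	}
	return -1;
}
```

Once the input is reduced to an index, a store handler can take the lock once and update per-mode state in a loop, and a show handler can walk the same array to print the names, so each mode string exists in only one place.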
3 days | mm: make mmap_miss accounting symmetric for VM_SEQ_READ | Usama Arif | 1 | -2/+12
do_sync_mmap_readahead() skips both the mmap_miss increment and the MMAP_LOTSAMISS check for VM_SEQ_READ mappings, since sequential access is non-speculative and should always read ahead.

The two decrement sites in do_async_mmap_readahead() and filemap_map_pages() do not mirror this skip, so concurrent faults on a VM_SEQ_READ mapping can still drive ra->mmap_miss down to zero through the decrement paths even though nothing in the sync path ever increments it. The counter itself is per-file (file->f_ra.mmap_miss), so it can be moved by any VMA mapping the file, not just the one currently faulting.

Skip the decrement for VM_SEQ_READ in both decrement sites so the counter only moves for mappings that also participate in the increment side. No functional change for VM_SEQ_READ users, since the increment-side gate already prevents the counter from being consulted on their behalf, but it stops a VM_SEQ_READ mapping from biasing the counter for other mappings of the same file.

Link: https://lore.kernel.org/20260525145751.2671248-1-usama.arif@linux.dev
Signed-off-by: Usama Arif <usama.arif@linux.dev>
Closes: https://lore.kernel.org/all/8edc8cd0-f65c-4456-9b3f-362e744c9a96@linux.dev/
Reviewed-by: William Kucharski <william.kucharski@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | percpu: fix hint invariant breakage | Joonwon Kang | 1 | -25/+93
The invariant "scan_hint_start > contig_hint_start if and only if scan_hint == contig_hint" should be kept for hint management. However, it could be broken in some cases:

- if (new contig == contig_hint == scan_hint) && (contig_hint_start < scan_hint_start < new contig start) && the new contig is to become a new contig_hint due to its better alignment, then scan_hint should be invalidated instead of keeping the old value.
- if (new contig == contig_hint > scan_hint) && (new contig start < contig_hint_start) && the new contig is not to become a new contig_hint, then scan_hint should not be updated to the new contig.

This commit mainly fixes this invariant breakage and includes more:

- Handle the cases where the new contig overlaps with the contig_hint or with scan_hint.
- Merge the new contig with other hints when it overlaps with them and treat it as a whole free region instead of a separate small region.
- Fix the invariant breakage and also optimize scan_hint further.

Some of the optimization cases when no overlap occurs are:

- if (new contig > contig_hint > scan_hint) && (scan_hint_start < new contig start < contig_hint_start), then keep scan_hint instead of invalidating it.
- if (new contig > contig_hint == scan_hint) && (contig_hint_start < new contig start < scan_hint_start), then update scan_hint to the old contig_hint instead of invalidating it.
- if (new contig == contig_hint > scan_hint) && (new contig start < contig_hint_start) && the new contig is to become a new contig_hint due to its better alignment, then update scan_hint to the old contig_hint instead of invalidating or keeping it.

Link: https://lore.kernel.org/20260513085117.1024175-4-joonwonkang@google.com
Signed-off-by: Joonwon Kang <joonwonkang@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | percpu: introduce struct pcpu_region | Joonwon Kang | 3 | -138/+130
The new struct is introduced to prevent misuse of hint start and size and to improve readability.

Link: https://lore.kernel.org/all/aegQgyf3KuIZMK9x@palisades.local/
Link: https://lore.kernel.org/20260513085117.1024175-3-joonwonkang@google.com
Signed-off-by: Joonwon Kang <joonwonkang@google.com>
Suggested-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | percpu: do not trust hint starts when they are not set | Joonwon Kang | 1 | -7/+13
contig_hint_start can be trusted outside the hint update function since it will be updated every time contig_hint is broken. On the other hand, scan_hint_start might still be invalid anywhere in the code due to the broken scan_hint not being updated promptly. If those starts are trusted when they are not set, it could lead to false invalidation or update of the hints.

Link: https://lore.kernel.org/20260513085117.1024175-2-joonwonkang@google.com
Signed-off-by: Joonwon Kang <joonwonkang@google.com>
Reviewed-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | percpu: fix wrong chunk hints update | Joonwon Kang | 1 | -1/+1
Chunk end offset was set to a block end offset, which could prevent chunk hints from being updated correctly. It was observed that the chunk free size becomes negative or smaller than the actual free size due to this. This commit fixes it.

Link: https://lore.kernel.org/20260513085117.1024175-1-joonwonkang@google.com
Fixes: 92c14cab4326 ("percpu: convert chunk hints to be based on pcpu_block_md")
Signed-off-by: Joonwon Kang <joonwonkang@google.com>
Reviewed-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | mm/alloc_tag: replace fixed-size early PFN array with dynamic linked list | Hao Ge | 1 | -6/+6
Pages allocated before page_ext is available have their codetag left uninitialized. Track these early PFNs and clear their codetag in clear_early_alloc_pfn_tag_refs() to avoid "alloc_tag was not set" warnings when they are freed later.

Currently a fixed-size array of 8192 entries is used, with a warning if the limit is exceeded. However, the number of early allocations depends on the number of CPUs and can be larger than 8192.

Replace the fixed-size array with a dynamically allocated linked list of pfn_pool structs. Each node is allocated via alloc_page() and mapped to a pfn_pool containing a next pointer, an atomic slot counter, and a PFN array that fills the remainder of the page.

The tracking pages themselves are allocated via alloc_page(), which would trigger __pgalloc_tag_add() -> alloc_tag_add_early_pfn() and recurse indefinitely. Introduce __GFP_NO_CODETAG (reuses the %__GFP_NO_OBJ_EXT bit) and pass gfp_flags through pgalloc_tag_add() so that the early path can skip recording allocations that carry this flag.

Link: https://lore.kernel.org/20260506022256.32664-1-hao.ge@linux.dev
Signed-off-by: Hao Ge <hao.ge@linux.dev>
Suggested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
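A page-sized pool node of the kind described above might look roughly like the following. This is a user-space sketch with C11 atomics, not the kernel code: the field names, the 4096-byte `SKETCH_PAGE_SIZE`, and `pfn_pool_add()` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Slots left after the next pointer and the (padded) counter. */
#define PFN_POOL_SLOTS \
	((SKETCH_PAGE_SIZE - sizeof(void *) - sizeof(unsigned long)) / \
	 sizeof(unsigned long))

struct pfn_pool {
	struct pfn_pool *next;			/* next pool page, or NULL */
	atomic_uint used;			/* number of claimed slots */
	unsigned long pfns[PFN_POOL_SLOTS];	/* fills the rest of the page */
};

_Static_assert(sizeof(struct pfn_pool) <= SKETCH_PAGE_SIZE,
	       "a pool node must fit in one page");

/* Claim the next free slot and store @pfn; false when this pool is full. */
static bool pfn_pool_add(struct pfn_pool *pool, unsigned long pfn)
{
	unsigned int idx;

	assert(pool != NULL);

	idx = atomic_load(&pool->used);
	do {
		if (idx >= PFN_POOL_SLOTS)
			return false;	/* caller would chain a new page */
	} while (!atomic_compare_exchange_weak(&pool->used, &idx, idx + 1));

	pool->pfns[idx] = pfn;
	return true;
}
```

Because each node is exactly one page, capacity grows one page at a time by chaining nodes through `next`, which removes the fixed 8192-entry ceiling mentioned in the commit message.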
3 days | mm/khugepaged: avoid underflow in madvise_collapse for sub-PMD MADV_COLLAPSE | Chen Wandun | 1 | -3/+6
madvise_collapse() computes the THP-aligned window:

    hstart = ALIGN(start, HPAGE_PMD_SIZE);     /* round up */
    hend = ALIGN_DOWN(end, HPAGE_PMD_SIZE);    /* round down */

The following case will cause hstart > hend and result in an underflow in the return statement:

    madvise(PMD-aligned + PAGE_SIZE, PAGE_SIZE, MADV_COLLAPSE);

Avoid it by returning zero early when hstart > hend. Zero is returned because the input is valid for madvise() and there is nothing to collapse.

In addition, kmalloc_obj(), mmgrab() and lru_add_drain_all() are unnecessary when hstart == hend, so skip these operations by returning early too.

Link: https://lore.kernel.org/20260513055428.1664898-1-chenwandun@lixiang.com
Signed-off-by: Chen Wandun <chenwandun@lixiang.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
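The window arithmetic is easy to reproduce in user space. The sketch below assumes a 2 MiB PMD size and uses hypothetical names (`SKETCH_PMD_SIZE`, `collapse_window()`); it only demonstrates why a sub-PMD range produces hstart > hend and why an early return avoids an unsigned underflow.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PMD_SIZE  (2UL << 20)	/* assumed 2 MiB PMD size */
#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))
#define ALIGN_DOWN(x, a) ((x) & ~((a) - 1))

/*
 * Compute the PMD-aligned window inside [start, end). Returns true
 * only when the window is non-empty; callers should bail out early
 * otherwise, because hend - hstart would wrap around as an unsigned
 * value when hstart > hend.
 */
static bool collapse_window(unsigned long start, unsigned long end,
			    unsigned long *hstart, unsigned long *hend)
{
	assert(start <= end);

	*hstart = ALIGN_UP(start, SKETCH_PMD_SIZE);	/* round up */
	*hend = ALIGN_DOWN(end, SKETCH_PMD_SIZE);	/* round down */
	return *hstart < *hend;
}
```

For a one-page range that starts one page past a PMD boundary, hstart lands on the next boundary while hend falls back to the previous one, so the window is empty and there is nothing to collapse.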
3 days | mm/khugepaged: enable clean pagecache folio collapse for writable files | Zi Yan | 2 | -7/+10
collapse_file() is capable of collapsing pagecache folios from writable files to PMD folios. Now enable clean pagecache folio collapse in addition to read-only pagecache folio collapse by removing the inode_is_open_for_write() from file_thp_enabled() and only performing filemap_flush() if the file is read-only.

This means userspace needs to explicitly flush the content of pagecache folios before khugepaged can collapse the folios, or use madvise(MADV_COLLAPSE), which does the flush in the retry.

The reason is that blindly enabling collapse of dirty pagecache folios from writable files makes khugepaged flush these folios all the time. It is undesirable to cause system level pagecache flushes.

To properly support dirty pagecache folio collapse, filemap_flush() needs to be avoided. Potentially, merging associated buffer instead of dropping it with filemap_release_folio() might be needed.

NOTE: this breaks khugepaged selftests for writable file pagecache collapse, which is set to fail all the time. The next commit fixes it.

Link: https://lore.kernel.org/20260517135416.1434539-14-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | mm/truncate: use folio_split() in truncate_inode_partial_folio() | Zi Yan | 1 | -4/+4
After READ_ONLY_THP_FOR_FS is removed, a FS either supports large folios or it does not. folio_split() can be used on a FS with large folio support without worrying about getting a THP on a FS without large folio support.

When READ_ONLY_THP_FOR_FS was present, a PMD large pagecache folio could appear in a FS without large folio support after khugepaged or madvise(MADV_COLLAPSE) created it. During truncate_inode_partial_folio(), such a PMD large pagecache folio is split and if the FS does not support large folios, it needs to be split to order-0 ones and cannot be split non-uniformly to ones with various orders. try_folio_split_to_order() was added to handle this situation by checking folio_check_splittable(..., SPLIT_TYPE_NON_UNIFORM) to detect if the large folio is created due to READ_ONLY_THP_FOR_FS and the FS does not support large folios.

Now that READ_ONLY_THP_FOR_FS is removed, all large pagecache folios are created with FSes supporting large folios, so this function is no longer needed and all large pagecache folios can be split non-uniformly.

Link: https://lore.kernel.org/20260517135416.1434539-10-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | mm/huge_memory: remove folio split check for READ_ONLY_THP_FOR_FS | Zi Yan | 1 | -27/+3
Without READ_ONLY_THP_FOR_FS, large file-backed folios cannot be created by a FS without large folio support. The check is no longer needed.

Link: https://lore.kernel.org/20260517135416.1434539-9-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days | mm: fs: remove filemap_nr_thps*() functions and their users | Zi Yan | 3 | -30/+0
They are used by READ_ONLY_THP_FOR_FS to handle writes to FSes without large folio support, so that read-only THPs created in these FSes are not seen by the FSes when the underlying fd becomes writable. Now read-only PMD THPs only appear in a FS with large folio support and the supported orders include PMD_ORDER.

READ_ONLY_THP_FOR_FS was using mapping->nr_thps, inode->i_writecount, and smp_mb() to prevent writes to a read-only THP and collapsing writable folios into a THP. In collapse_file(), mapping->nr_thps is increased, then smp_mb(), and if inode->i_writecount > 0, collapse is stopped, while do_dentry_open() first increases inode->i_writecount, then a full memory fence, and if mapping->nr_thps > 0, all read-only THPs are truncated.

Now this mechanism can be removed along with READ_ONLY_THP_FOR_FS code, since a dirty folio check has been added after try_to_unmap() in collapse_file() to prevent dirty folios from being collapsed as clean.

Link: https://lore.kernel.org/20260517135416.1434539-7-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Sterba <dsterba@suse.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: remove READ_ONLY_THP_FOR_FS Kconfig optionZi Yan1-11/+0
After removing READ_ONLY_THP_FOR_FS check in file_thp_enabled(), khugepaged and MADV_COLLAPSE can run on FSes with PMD THP pagecache support even without READ_ONLY_THP_FOR_FS enabled. Remove the Kconfig first so that no one can use READ_ONLY_THP_FOR_FS as upcoming commits remove mapping->nr_thps, which its safe guard mechanism relies on. Link: https://lore.kernel.org/20260517135416.1434539-6-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Nico Pache <npache@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: remove READ_ONLY_THP_FOR_FS check in hugepage_enabled()Zi Yan1-10/+16
Remove the READ_ONLY_THP_FOR_FS gate and khugepaged for file-backed pmd-sized hugepages are enabled by the global transparent hugepage control. khugepaged can still be enabled by per-size control for anon and shmem when the global control is off. Add shmem_hpage_pmd_enabled() stub for !CONFIG_SHMEM to remove IS_ENABLED(SHMEM) in hugepage_enabled(). Clean up hugepage_enabled() by moving anon code to anon_hpage_enabled(). Link: https://lore.kernel.org/20260517135416.1434539-5-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/huge_memory: remove READ_ONLY_THP_FOR_FS from file_thp_enabled()Zi Yan1-3/+3
Replace it with a check on the max folio order of the file's address space mapping, making sure PMD folio is supported. Keep the inode open-for-write check, since even if collapse_file() now makes sure all to-be-collapsed folios are clean and the created PMD file THP can be handled by FSes properly, the filemap_flush() could perform undesirable write back. Link: https://lore.kernel.org/20260517135416.1434539-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: add folio dirty check after try_to_unmap()Zi Yan1-4/+24
This check ensures the correctness of read-only PMD folio collapse after it is enabled for all FSes supporting PMD pagecache folios and replaces READ_ONLY_THP_FOR_FS. READ_ONLY_THP_FOR_FS only supports read-only fd and uses mapping->nr_thps and inode->i_writecount to prevent any write to read-only to-be-collapsed folios. In upcoming commits, READ_ONLY_THP_FOR_FS will be removed and the aforementioned mechanism will go away too. To ensure khugepaged functions as expected after the changes, skip if any folio is dirty after try_to_unmap(), since a dirty folio at that point means this read-only folio can get writes between try_to_unmap() and try_to_unmap_flush() via cached TLB entries and khugepaged does not support writable pagecache folio collapse yet. Link: https://lore.kernel.org/20260517135416.1434539-3-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: remove READ_ONLY_THP_FOR_FS checkZi Yan1-2/+8
Patch series "Remove CONFIG_READ_ONLY_THP_FOR_FS and enable file THP for writable files", v6. This patch (of 14): collapse_file() requires FSes supporting large folio with at least PMD_ORDER, so replace the READ_ONLY_THP_FOR_FS check with that. MADV_COLLAPSE ignores shmem huge config, so exclude the check for shmem. While at it, replace VM_BUG_ON with VM_WARN_ON_ONCE. Add a helper function mapping_pmd_folio_support() for FSes supporting large folio with at least PMD_ORDER. Link: https://lore.kernel.org/20260517135416.1434539-1-ziy@nvidia.com Link: https://lore.kernel.org/20260517135416.1434539-2-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Nico Pache <npache@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Barry Song <baohua@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <songliubraving@fb.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: run khugepaged for all ordersBaolin Wang1-16/+20
If any order (m)THP is enabled we should allow running khugepaged to attempt scanning and collapsing mTHPs. In order for khugepaged to operate when only mTHP sizes are specified in sysfs, we must modify the predicate function that determines whether it ought to run to do so. This function is currently called hugepage_pmd_enabled(), this patch renames it to hugepage_enabled() and updates the logic to check to determine whether any valid orders may exist which would justify khugepaged running. We must also update collapse_allowable_orders() to check all orders if the vma is anonymous and the collapse is khugepaged. After this patch khugepaged mTHP collapse is fully enabled. Link: https://lore.kernel.org/20260522150009.121603-14-npache@redhat.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: avoid unnecessary mTHP collapse attemptsNico Pache1-1/+23
There are cases where, if an attempted collapse fails, all subsequent orders are guaranteed to also fail. Avoid these collapse attempts by bailing out early. Link: https://lore.kernel.org/20260522150009.121603-13-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysfix potential use-after-free of vma in mthp_collapse()Nico Pache1-5/+5
Between V17 and v18, one reviewer (Wei) brought up that we are not doing the uffd-armed check until deep in the collapse operation. While not functionally incorrect, it can lead to unnecessary work. We optimized this by passing the vma variable to mthp_collapse() and using the collapse_max_ptes_none() function to check the state of uffd-armed preventing the wasted work later in the collapse. mthp_collapse() is called after mmap_read_unlock(), so the vma pointer can become stale. Remove the vma parameter and pass NULL to collapse_max_ptes_none() instead. Link: https://lore.kernel.org/2b2cda8c-358a-4a5c-989c-ae42593ef2ea@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: introduce mTHP collapse supportNico Pache1-9/+172
Enable khugepaged to collapse to mTHP orders. This patch implements the main scanning logic using a bitmap to track occupied pages and a stack structure that allows us to find optimal collapse sizes. Prior to this patch, PMD collapse had 3 main phases: a lightweight scanning phase (mmap_read_lock) that determines a potential PMD collapse, an alloc phase (mmap unlocked), then finally a heavier collapse phase (mmap_write_lock). To enable mTHP collapse we make the following changes: During the PMD scan phase, track occupied pages in a bitmap. When mTHP orders are enabled, we remove the restriction of max_ptes_none during the scan phase to avoid missing potential mTHP collapse candidates. Once we have scanned the full PMD range and updated the bitmap to track occupied pages, we use the bitmap to find the optimal mTHP size. Implement collapse_scan_bitmap() to perform binary recursion on the bitmap and determine the best eligible order for the collapse. A stack structure is used instead of traditional recursion to manage the search, which avoids deep call chains given that kernel stack space is limited. The algorithm recursively splits the bitmap into smaller chunks to find the highest order mTHPs that satisfy the collapse criteria. We start by attempting the PMD order, then move on to consecutively lower orders (mTHP collapse). The stack maintains a pair of variables (offset, order), indicating the number of PTEs from the start of the PMD, and the order of the potential collapse candidate. The algorithm for consuming the bitmap works as follows:
1) push (0, HPAGE_PMD_ORDER) onto the stack
2) pop the stack
3) check if the number of set bits in that (offset, order) pair satisfies the max_ptes_none threshold for that order
4) if yes, attempt collapse
5) if no (or collapse fails), push two new stack items representing the left and right halves of the current bitmap range, at the next lower order
6) repeat from step (2) until the stack is empty.
Below is a diagram representing the algorithm and stack items:
              offset   mid_offset
                |         |
                |         |
                v         v
      ____________________________________
     |          PTE Page Table            |
      ------------------------------------
                <-------><------->
                 order-1  order-1
mTHP collapses reject regions containing swapped out or shared pages. This is because adding new entries can lead to new none pages, and these may lead to constant promotion into a higher order mTHP. A similar issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse introducing at least 2x the number of pages, which on a future scan will satisfy the promotion condition once again. This issue is prevented via the collapse_max_ptes_none() function, which imposes the max_ptes_none restrictions above. We currently only support mTHP collapse for max_ptes_none values of 0 and HPAGE_PMD_NR - 1, resulting in the following behavior:
- max_ptes_none=0: Never introduce new empty pages during collapse
- max_ptes_none=HPAGE_PMD_NR-1: Always try collapse to the highest available mTHP order
Any other max_ptes_none value will emit a warning and default mTHP collapse to max_ptes_none=0. There should be no behavior change for PMD collapse. Once we determine which mTHP size fits best in that PMD range, a collapse is attempted. A minimum collapse order of 2 is used as this is the lowest order supported by anon memory as defined by THP_ORDERS_ALL_ANON. Currently madv_collapse is not supported and will only attempt PMD collapse. We can also remove the check for is_khugepaged inside the PMD scan as the collapse_max_ptes_none() function handles this logic now.
Link: https://lore.kernel.org/20260522150009.121603-12-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> 
Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
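The stack-based bitmap walk described in the commit message above can be sketched in userspace C roughly as follows. This is a simplified model under stated assumptions, not the kernel's collapse_scan_bitmap(): the names scan_bitmap(), scaled_max_none(), record_collapse() and struct collapse_log are hypothetical, a bool array stands in for the kernel bitmap, PMD_ORDER assumes 4 KiB pages with 2 MiB PMDs, and the per-order threshold of (1 << order) - 1 for max_ptes_none=HPAGE_PMD_NR-1 is an assumed reading of "always try collapse to the highest available mTHP order".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PMD_ORDER 9                    /* assumption: 4 KiB pages, 2 MiB PMD */
#define PMD_NR    (1u << PMD_ORDER)
#define MIN_ORDER 2                    /* lowest anon mTHP order, per the text */
#define STACK_MAX (PMD_ORDER - MIN_ORDER + 1)

struct range { unsigned int offset, order; };

/* Hypothetical callback standing in for the real collapse attempt. */
typedef bool (*collapse_fn)(unsigned int offset, unsigned int order, void *ctx);

/* Demo callback: always succeeds and records what it was asked to do. */
struct collapse_log { unsigned int count, last_offset, last_order; };

static bool record_collapse(unsigned int offset, unsigned int order, void *ctx)
{
	struct collapse_log *log = ctx;

	log->count++;
	log->last_offset = offset;
	log->last_order = order;
	return true;
}

/* Only 0 and PMD_NR - 1 are supported for mTHP orders; anything else
 * falls back to 0. The PMD order keeps the value unchanged. */
static unsigned int scaled_max_none(unsigned int max_ptes_none, unsigned int order)
{
	if (order == PMD_ORDER)
		return max_ptes_none;
	if (max_ptes_none == PMD_NR - 1)
		return (1u << order) - 1;   /* assumed per-order scaling */
	return 0;
}

/* Returns the number of PTE slots covered by successful collapses. */
static unsigned int scan_bitmap(const bool present[PMD_NR],
				unsigned long enabled_orders,
				unsigned int max_ptes_none,
				collapse_fn try_collapse, void *ctx)
{
	struct range stack[STACK_MAX];
	unsigned int top = 0, collapsed = 0;

	stack[top++] = (struct range){ 0, PMD_ORDER };           /* step 1 */
	while (top > 0) {
		struct range r = stack[--top];                    /* step 2 */
		unsigned int nr = 1u << r.order, set = 0;

		for (unsigned int i = r.offset; i < r.offset + nr; i++)
			set += present[i];
		if ((enabled_orders & (1ul << r.order)) &&        /* step 3 */
		    nr - set <= scaled_max_none(max_ptes_none, r.order) &&
		    try_collapse(r.offset, r.order, ctx)) {       /* step 4 */
			collapsed += nr;
			continue;
		}
		if (r.order > MIN_ORDER) {                        /* step 5 */
			unsigned int next = r.order - 1;

			assert(top + 2 <= STACK_MAX);
			/* Push right half first so the left half is popped first. */
			stack[top++] = (struct range){ r.offset + (1u << next), next };
			stack[top++] = (struct range){ r.offset, next };
		}
	}                                                         /* step 6 */
	return collapsed;
}
```

The explicit stack never holds more than PMD_ORDER - MIN_ORDER + 1 entries, because each split replaces one popped item with two items one order lower. For example, with only the first 16 slots occupied, orders 9 and 4 enabled, and max_ptes_none=0, the walk rejects the PMD-sized range and ends up attempting a single order-4 collapse at offset 0.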
3 daysmm/khugepaged: introduce collapse_allowable_orders helper functionNico Pache1-3/+12
Add collapse_allowable_orders() to generalize THP order eligibility. The function determines which THP orders are permitted based on collapse context (khugepaged vs madv_collapse). This consolidates collapse configuration logic and provides a clean interface for future mTHP collapse support where the orders may be different. Link: https://lore.kernel.org/20260522150009.121603-11-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: improve tracepoints for mTHP ordersNico Pache1-4/+5
Add the order to the mm_collapse_huge_page<_swapin,_isolate> tracepoints to give better insight into what order is being operated at for. Link: https://lore.kernel.org/20260522150009.121603-10-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: add per-order mTHP collapse failure statisticsNico Pache2-2/+26
Add three new mTHP statistics to track collapse failures for different orders when encountering swap PTEs, excessive none PTEs, and shared PTEs:
- collapse_exceed_swap_pte: Counts when mTHP collapse fails due to encountering a swap PTE.
- collapse_exceed_none_pte: Counts when mTHP collapse fails due to exceeding the none PTE threshold for the given order.
- collapse_exceed_shared_pte: Counts when mTHP collapse fails due to encountering a shared PTE.
These statistics complement the existing THP_SCAN_EXCEED_* events by providing per-order granularity for mTHP collapse attempts. The stats are exposed via sysfs under `/sys/kernel/mm/transparent_hugepage/hugepages-*/stats/` for each supported hugepage size. As we currently do not support collapsing mTHPs that contain a swap or shared entry, those statistics keep track of how often we are encountering failed mTHP collapses due to these restrictions. We will add support for mTHP collapse for anonymous pages next; let's also track when this happens at the PMD level within the per-mTHP stats.
Link: https://lore.kernel.org/20260522150009.121603-9-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
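A minimal userspace sketch of reading one of these per-order counters is shown below. The helper names mthp_stat_path() and read_mthp_stat() are hypothetical, and the sketch assumes the `hugepages-*` wildcard expands to `hugepages-<size>kB` as with the existing per-size THP sysfs directories; the counter files only exist on kernels carrying this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path for one per-size stat.
 * Returns 0 on success, -1 if the buffer is too small. */
static int mthp_stat_path(char *buf, size_t len, unsigned int size_kb,
			  const char *stat)
{
	int n;

	assert(buf != NULL && stat != NULL);
	n = snprintf(buf, len,
		     "/sys/kernel/mm/transparent_hugepage/hugepages-%ukB/stats/%s",
		     size_kb, stat);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read a counter; returns -1 if the file is missing or unparsable. */
static long long read_mthp_stat(unsigned int size_kb, const char *stat)
{
	char path[256];
	long long value = -1;
	FILE *fp;

	if (mthp_stat_path(path, sizeof(path), size_kb, stat) != 0)
		return -1;
	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;
	if (fscanf(fp, "%lld", &value) != 1)
		value = -1;
	fclose(fp);
	return value;
}
```

Calling read_mthp_stat(64, "collapse_exceed_none_pte") would then report how often 64 kB collapses were rejected for having too many none PTEs, or -1 where the file is absent.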
3 daysmm/khugepaged: skip collapsing mTHP to smaller ordersNico Pache1-0/+8
khugepaged may try to collapse a mTHP to a smaller mTHP, resulting in some pages being unmapped. Skip these cases until we have a way to check if its ok to collapse to a smaller mTHP size (like in the case of a partially mapped folio). This check is also not done during the scan phase as the current collapse order is unknown at that time. This patch is inspired by Dev Jain's work on khugepaged mTHP support [1]. Link: https://lore.kernel.org/20260522150009.121603-8-npache@redhat.com Link: https://lore.kernel.org/lkml/20241216165105.56185-11-dev.jain@arm.com/ [1] Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysadd a clarifying comment and change warn_onNico Pache1-1/+8
Add a clarifying comment describing how the locking/mmu notifier is handled and change the WARN_ON_ONCE to VM_WARN_ON_ONCE per David's suggestion. Link: https://lore.kernel.org/a48032dd-7881-43c0-b439-5cda6124ea58@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Liam R. Howlett" <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: generalize collapse_huge_page for mTHP collapseNico Pache1-38/+55
Pass an order and offset to collapse_huge_page to support collapsing anon memory to arbitrary orders within a PMD. order indicates what mTHP size we are attempting to collapse to, and offset indicates where in the PMD to start the collapse attempt. For non-PMD collapse we must leave the anon VMA write locked until after we collapse the mTHP; in the PMD case all the pages are isolated, but in the mTHP case this is not true, and we must keep the lock to prevent access/changes to the page tables. This can happen if the rmap walkers hit a pmd_none while the PMD entry is currently unavailable due to being temporarily removed during the collapse phase. Link: https://lore.kernel.org/20260522150009.121603-7-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: require collapse_huge_page to enter/exit with the lock droppedNico Pache1-8/+8
Currently collapse_huge_page() must be entered with the mmap read lock held and exits with it dropped. This patch moves the unlock into the caller, so that the function is always entered and exited with the lock dropped. Future patches rely on this: in mTHP collapse we may already have dropped the lock, and we do not want to check for that conditionally by passing the lock_dropped variable through. No functional change is expected, as one of the first things collapse_huge_page() does is drop this lock before allocating the hugepage. Link: https://lore.kernel.org/20260522150009.121603-6-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 dayscleanup collapse_max_ptes_noneNico Pache1-5/+4
Make max_ptes_none a const and clean up the pr_warn_once. Link: https://lore.kernel.org/b5fa19c5-4b3e-40b8-8e78-fc31169a7a79@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Liam R. Howlett" <liam@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: generalize __collapse_huge_page_* for mTHP supportNico Pache1-33/+88
Generalize the order of the __collapse_huge_page_* and collapse_max_* functions to support future mTHP collapse. The current mechanism for determining collapse with the khugepaged_max_ptes_none value is not designed with mTHP in mind. This raises a key design issue: if we support user-defined max_ptes_none values (even those scaled by order), a collapse of a lower order can introduce a feedback loop, or "creep", when max_ptes_none is set to a value greater than HPAGE_PMD_NR / 2. [1] With this configuration, a successful collapse to order N will populate enough pages to satisfy the collapse condition on order N+1 on the next scan. This leads to unnecessary work and memory churn. To fix this issue, introduce a helper function that limits mTHP collapse support to two max_ptes_none values, 0 and HPAGE_PMD_NR - 1. This effectively supports two modes: [2] - max_ptes_none=0: never collapse if an empty PTE or a PTE that maps the shared zeropage is encountered. Consequently, no memory bloat. - max_ptes_none=511 (on 4k page size): always collapse to the highest available mTHP order. This removes the possibility of "creep", and a warning will be emitted if any unsupported max_ptes_none value is configured with mTHP enabled. Any intermediate value will default mTHP collapse to max_ptes_none=0. mTHP collapse will not honor the khugepaged_max_ptes_shared or khugepaged_max_ptes_swap parameters, and will fail if it encounters a shared or swapped entry. No functional changes in this patch; however, it defines future behavior for mTHP collapse. 
Link: https://lore.kernel.org/20260522150009.121603-5-npache@redhat.com Link: https://lore.kernel.org/all/e46ab3ab-a3d7-4fb7-9970-d0704bd5d05a@arm.com [1] Link: https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com [2] Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Usama Arif <usama.arif@linux.dev> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
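The two supported max_ptes_none modes described in the commit above can be illustrated with a small user-space sketch. This is not the kernel helper: the name toy_mthp_max_ptes_none, the TOY_* constants (4k pages, PMD order 9), the per-order scaling of the "511" mode to (1 << order) - 1, and the pass-through for PMD order are all assumptions made for illustration only.

```c
/* Assumed geometry for the sketch: 4k pages, so a PMD covers 512 PTEs. */
#define TOY_PMD_ORDER    9u
#define TOY_HPAGE_PMD_NR (1u << TOY_PMD_ORDER)

/*
 * Hypothetical helper: map the configured max_ptes_none to the limit used
 * for a collapse attempt at the given order.
 *
 * - PMD order: assumed to keep the configured value unchanged.
 * - mTHP order with max_ptes_none == HPAGE_PMD_NR - 1: assumed to scale to
 *   "all but one PTE may be empty" for that order.
 * - mTHP order with any other value (including intermediate ones): 0,
 *   i.e. no empty PTEs are tolerated.
 */
static unsigned int toy_mthp_max_ptes_none(unsigned int max_ptes_none,
					   unsigned int order)
{
	if (order == TOY_PMD_ORDER)
		return max_ptes_none;
	if (max_ptes_none == TOY_HPAGE_PMD_NR - 1)
		return (1u << order) - 1;
	return 0;
}
```

Under these assumptions an intermediate setting such as 256 behaves like 0 for every mTHP order, which is what removes the "creep" feedback loop: a collapse at order N can no longer fill in enough PTEs to qualify order N+1 on the next scan.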
3 daysmm/khugepaged: rework max_ptes_* handling with helper functionsNico Pache1-36/+84
The following cleanup reworks all the max_ptes_* handling into helper functions. This increases code readability and will later be used to implement the mTHP handling of these variables. With these changes we abstract all the madvise_collapse() special casing (which does not respect the sysctls) away from the functions that use these values. The helpers will also be used later in this series to cleanly restrict the mTHP collapse behavior. No functional change is intended; however, we are now only reading the sysfs variables once per scan, whereas before these variables were being read on each loop iteration. Link: https://lore.kernel.org/20260522150009.121603-4-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: generalize alloc_charge_folio()Dev Jain2-7/+17
Pass order to alloc_charge_folio() and update mTHP statistics. Link: https://lore.kernel.org/20260522150009.121603-3-npache@redhat.com Signed-off-by: Dev Jain <dev.jain@arm.com> Co-developed-by: Nico Pache <npache@redhat.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: generalize hugepage_vma_revalidate for mTHP supportNico Pache1-8/+12
Patch series "khugepaged: add mTHP collapse support", v18. The following series provides khugepaged with the capability to collapse anonymous memory regions to mTHPs. To achieve this we generalize the khugepaged functions to no longer depend on PMD_ORDER. Then during the PMD scan, we use a bitmap to track individual pages that are occupied (!none/zero). After the PMD scan is done, we use the bitmap to find the optimal mTHP sizes for the PMD range. The restriction on max_ptes_none is removed during the scan, to make sure we account for the whole PMD range in the bitmap. When no mTHP size is enabled, the legacy behavior of khugepaged is maintained. We currently only support max_ptes_none values of 0 or HPAGE_PMD_NR - 1 (ie 511). If any other value is specified, the kernel will emit a warning and mTHP collapse will default to max_ptes_none=0. If a mTHP collapse is attempted, but contains swapped out, or shared pages, we don't perform the collapse. It is now also possible to collapse to mTHPs without requiring the PMD THP size to be enabled. These limitations are to prevent collapse "creep" behavior. This prevents constantly promoting mTHPs to the next available size, which would occur because a collapse introduces more non-zero pages that would satisfy the promotion condition on subsequent scans. Patch 1-2: Generalize hugepage_vma_revalidate and alloc_charge_folio for arbitrary orders. 
Patch 3: Rework max_ptes_* handling into helper functions
Patch 4: Generalize __collapse_huge_page_* for mTHP support
Patch 5: Require collapse_huge_page to enter/exit with the lock dropped
Patch 6: Generalize collapse_huge_page for mTHP collapse
Patch 7: Skip collapsing mTHP to smaller orders
Patch 8-9: Add per-order mTHP statistics and tracepoints
Patch 10: Introduce collapse_allowable_orders helper function
Patch 11-13: Introduce bitmap and mTHP collapse support, fully enabled
Patch 14: Documentation

Testing:
- Built for x86_64, aarch64, ppc64le, and s390x
- Ran all arches on test suites provided by the kernel-tests project
- Internal testing suites: functional testing and performance testing
- selftests mm
- I created a test script that I used to push khugepaged to its limits while monitoring a number of stats and tracepoints. The code is available here [1] (run in legacy mode for these changes and set mTHP sizes to inherit). The summary from my testing was that no significant regression was noticed through this test. In some cases my changes had better collapse latencies and were able to scan more pages in the same amount of time/work, but for the most part the results were consistent.
- redis testing. I did some testing with these changes along with my defer changes (see followup [2] post for more details). We've decided to get the mTHP changes merged first before attempting the defer series.
- Some basic testing on 64k page size.
- Lots of general use.

This patch (of 14): For khugepaged to support different mTHP orders, we must generalize hugepage_vma_revalidate() to check that the PMD is not shared by another VMA and that the order is enabled. No functional change in this patch. Also correct a comment about the functionality of the revalidation and fix a double-space issue. 
Link: https://lore.kernel.org/20260522150009.121603-1-npache@redhat.com Link: https://lore.kernel.org/20260522150009.121603-2-npache@redhat.com Link: https://gitlab.com/npache/khugepaged_mthp_test [1] Link: https://lore.kernel.org/lkml/20250515033857.132535-1-npache@redhat.com/ [2] Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Usama Arif <usama.arif@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
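The cover letter above describes recording occupied PTEs in a bitmap during the PMD scan and then using that bitmap to choose mTHP sizes. The following user-space sketch illustrates that idea only; it is not the series' implementation. All toy_* names, the fixed 512-PTE geometry, the lower bound of order 2, and the reduction of the two max_ptes_none modes to a single allow_none flag are assumptions made for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PMD_ORDER 9u
#define TOY_NR_PTES   (1u << TOY_PMD_ORDER)

/* One bit per PTE in the PMD range; a set bit means "occupied". */
struct toy_bitmap {
	uint64_t words[TOY_NR_PTES / 64];
};

static void toy_set(struct toy_bitmap *b, unsigned int i)
{
	b->words[i / 64] |= 1ULL << (i % 64);
}

static bool toy_test(const struct toy_bitmap *b, unsigned int i)
{
	return (b->words[i / 64] >> (i % 64)) & 1;
}

/* Count occupied PTEs in the 2^order region starting at offset. */
static unsigned int toy_count_occupied(const struct toy_bitmap *b,
				       unsigned int offset, unsigned int order)
{
	unsigned int i, n = 0;

	for (i = offset; i < offset + (1u << order); i++)
		n += toy_test(b, i);
	return n;
}

/*
 * Return the largest enabled order (bit N of enabled_orders enables order N)
 * for which the naturally aligned region at offset qualifies, or -1.
 * allow_none == false models max_ptes_none=0 (every PTE must be occupied);
 * allow_none == true models the HPAGE_PMD_NR - 1 mode (one occupied PTE
 * is enough).
 */
static int toy_best_order(const struct toy_bitmap *b, unsigned int offset,
			  unsigned int enabled_orders, bool allow_none)
{
	int order;

	for (order = TOY_PMD_ORDER; order >= 2; order--) {
		unsigned int nr = 1u << order;
		unsigned int occupied;

		if (!(enabled_orders & (1u << order)))
			continue;
		if (offset & (nr - 1))
			continue;	/* region would not be naturally aligned */
		occupied = toy_count_occupied(b, offset, order);
		if (occupied == nr || (allow_none && occupied > 0))
			return order;
	}
	return -1;
}
```

For instance, with only the first 16 PTEs occupied and orders 4 and 9 enabled, the strict mode selects order 4 at offset 0, while the permissive mode selects order 9 because a single occupied PTE is enough to qualify the whole PMD range.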
3 daysmm/gup: honour FOLL_PIN in NOMMU __get_user_pages_locked()Greg Kroah-Hartman1-3/+10
The !CONFIG_MMU implementation of __get_user_pages_locked() takes a bare get_page() reference for each page regardless of foll_flags: if (pages[i]) get_page(pages[i]); This is reached from pin_user_pages*() with FOLL_PIN set. unpin_user_page() is shared between MMU and NOMMU configurations and unconditionally calls gup_put_folio(..., FOLL_PIN), which subtracts GUP_PIN_COUNTING_BIAS (1024) from the folio refcount. This means that pin adds 1, and then unpin will subtract 1024. If a user maps a page (refcount 1), registers it 1023 times as an io_uring fixed buffer (1023 pin_user_pages calls -> refcount 1024), then unregisters: the first unpin_user_page subtracts 1024, refcount hits 0, the page is freed and returned to the buddy allocator. The remaining 1022 unpins write into whatever was reallocated, and the user's VMA still maps the freed page (NOMMU has no MMU to invalidate it). Reallocating the page for an io_uring pbuf_ring then lets userspace corrupt the new owner's data through the stale mapping. Use try_grab_folio() which adds GUP_PIN_COUNTING_BIAS for FOLL_PIN and 1 for FOLL_GET, mirroring the CONFIG_MMU path so pin and unpin are symmetric. While at it, don't return NULL pointers in the page array, as this is really not expected for GUP users; instead, just fail and return -EFAULT. [david@kernel.org: changelog update] https://lore.kernel.org/e9c5cf89-fa4c-4b83-ae70-9d3c72542ee9@kernel.org Link: https://lore.kernel.org/2026042303-vendor-outright-b9d2@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Assisted-by: David Hildenbrand <david@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Reported-by: Anthropic Assisted-by: gkh_clanker_t1000 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
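The refcount arithmetic in the commit message above can be checked with a toy user-space model. This sketch is not kernel code: struct toy_folio and the toy_* functions are hypothetical stand-ins, and only the bias value (1024) and the "+1 on pin, -1024 on unpin" asymmetry are taken from the changelog.

```c
#include <stdbool.h>

/* Value quoted in the commit message above. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Toy stand-in for a folio; only the reference count is modelled. */
struct toy_folio {
	int refcount;
};

/* Old NOMMU behaviour: a bare get_page()-style +1, even for FOLL_PIN. */
static void toy_pin_old(struct toy_folio *f)
{
	f->refcount += 1;
}

/* Fixed behaviour: a pin adds the full bias, mirroring the MMU path. */
static void toy_pin_fixed(struct toy_folio *f)
{
	f->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Unpin always subtracts the bias; returns true if the count hits zero. */
static bool toy_unpin(struct toy_folio *f)
{
	f->refcount -= GUP_PIN_COUNTING_BIAS;
	return f->refcount <= 0;
}

/*
 * Start with one reference (the user's mapping), pin npins times, then
 * unpin once. Returns true if that first unpin would free the folio
 * while it is still mapped and still has pins outstanding.
 */
static bool toy_first_unpin_frees(int npins, bool fixed)
{
	struct toy_folio f = { .refcount = 1 };
	int i;

	for (i = 0; i < npins; i++) {
		if (fixed)
			toy_pin_fixed(&f);
		else
			toy_pin_old(&f);
	}
	return toy_unpin(&f);
}
```

With the old accounting, 1023 pins take the count from 1 to 1024, so the first unpin drops it to 0. With symmetric accounting the same sequence leaves the count well above zero after the first unpin.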
3 daysmm/vmscan: unify writeback reclaim statistic and throttlingKairui Song1-49/+43
Currently MGLRU and non-MGLRU handle reclaim statistics and writeback very differently, especially throttling. Basically, MGLRU just ignored the throttling part. Unify this part and use a helper to deduplicate the code, so both setups share the same behavior.

Test using the following bash reproducer:

  echo "Setup a slow device using dm delay"
  dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
  LOOP=$(losetup --show -f /var/tmp/backing)
  mkfs.ext4 -q $LOOP
  echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
      dmsetup create slow_dev
  mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow

  echo "Start writeback pressure"
  sync && echo 3 > /proc/sys/vm/drop_caches
  mkdir /sys/fs/cgroup/test_wb
  echo 128M > /sys/fs/cgroup/test_wb/memory.max
  (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
      dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)

  echo "Clean up"
  echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
  dmsetup resume slow_dev
  umount -l /mnt/slow && sync
  dmsetup remove slow_dev

Before this commit, `dd` gets OOM killed immediately if MGLRU is enabled. Classic LRU is fine. After this commit, throttling is effective and there is no more spinning on the LRU or premature OOM. Stress tests on other workloads also look good. Global throttling is not here yet; we will fix that separately later. 
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-15-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Chen Ridong <chenridong@huaweicloud.com> Tested-by: Leno Hou <lenohou@gmail.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/vmscan: remove sc->unqueued_dirtyKairui Song1-2/+0
No one is using it now, just remove it. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-14-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/vmscan: remove sc->file_takenKairui Song1-3/+0
No one is using it now, just remove it. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-13-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: remove no longer used reclaim argument for folio protectionKairui Song1-7/+4
Now dirty reclaim folios are handled after isolation, not before, since dirty reactivation must take the folio off LRU first, and that helps to unify the dirty handling logic. So this argument is no longer needed. Just remove it. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-12-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: simplify and improve dirty writeback handlingKairui Song1-25/+16
Right now the flusher wakeup mechanism for MGLRU is less responsive and less likely to trigger than that of the classical LRU. The classical LRU wakes the flusher if one batch of folios passed to shrink_folio_list is unevictable due to being under writeback. MGLRU instead checks and handles this after the whole reclaim loop is done. We previously even saw OOM problems due to the passive flusher, which were fixed, but the fix is still not perfect [1]. We have just unified the dirty folio counting and activation routine; now move the dirty flush into the loop, right after shrink_folio_list. This improves performance a lot for workloads involving heavy writeback and prepares for throttling too.

Testing with YCSB workloadb showed a major performance improvement:

Before this series:
  Throughput(ops/sec): 62485.02962831822
  AverageLatency(us): 500.9746963330107
  pgpgin 159347462
  workingset_refault_file 34522071

After this commit:
  Throughput(ops/sec): 80857.08510208207
  AverageLatency(us): 386.653262968934
  pgpgin 112233121
  workingset_refault_file 19516246

The performance is a lot better, with significantly fewer refaults. We also observed similar or higher performance gains for other real-world workloads. We were concerned that the dirty flush could cause more wear for SSDs: that should not be a problem here, since the wakeup condition is that dirty folios have been pushed to the tail of the LRU, which indicates that memory pressure is so high that writeback is already blocking the workload. 
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-11-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1] Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
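The per-batch wakeup condition described in the commit above can be modelled roughly as follows. This is only a sketch and not the kernel code: the struct, its field names and sketch_should_wake_flusher() are hypothetical, and the assumption is that "one batch is unevictable" means every folio taken in the batch was dirty and not yet queued for writeback.

```c
#include <stdbool.h>

/* Hypothetical per-batch counters, filled in after one shrink step. */
struct sketch_batch_stat {
	unsigned long nr_taken;          /* folios isolated for this batch */
	unsigned long nr_unqueued_dirty; /* dirty folios not yet queued for writeback */
};

/*
 * Assumed rule: wake the flusher only when the whole batch was blocked
 * by dirty folios. An empty batch never triggers a wakeup.
 */
bool sketch_should_wake_flusher(const struct sketch_batch_stat *st)
{
	return st->nr_taken && st->nr_unqueued_dirty == st->nr_taken;
}
```

Evaluating this inside the loop, once per batch, rather than once after the whole reclaim pass is what makes the wakeup more responsive in this model.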
3 daysmm/mglru: use the common routine for dirty/writeback reactivationKairui Song1-19/+0
Currently MGLRU moves dirty and writeback folios to the second oldest gen instead of reactivating them like the classical LRU. This might help to reduce LRU contention as it skips the isolation. But as a result we see these folios at the LRU tail more frequently, leading to inefficient reclaim. Besides, the dirty / writeback check after isolation in shrink_folio_list is more accurate and covers more cases.

So instead, just drop the special handling for dirty and writeback folios, use the common routine and re-activate them like the classical LRU. This should in theory improve the scan efficiency. These folios will be rotated back to the LRU tail once writeback is done, so there is no risk of hotness inversion. And now each reclaim loop will have a higher success rate. This also prepares for unifying the writeback and throttling mechanism with the classical LRU: we keep these folios far from the tail, so detecting the tail batch will follow a similar pattern to the classical LRU.

The micro optimization that avoids LRU contention by skipping the isolation is gone, which should be fine. Compared to IO and writeback cost, the isolation overhead is trivial. And using the common routine also keeps the folio's referenced bits (tier bits), which could improve metrics in the long term. There is also no longer any need to clear the reclaim bit, as the common routine will make use of it.

Note the common routine updates a few throttling and writeback counters, which are not used, and never have been, for the MGLRU case. We will start making use of these in later commits.
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-10-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: remove redundant swap constrained check upon isolationKairui Song1-6/+0
Remove the swap-constrained early reject check upon isolation. This check is a micro optimization when swap IO is not allowed, so folios are rejected early. But it is redundant and overly broad since shrink_folio_list() already handles all these cases with proper granularity. Notably, this check wrongly rejected lazyfree folios, and it doesn't cover all rejection cases. shrink_folio_list() uses may_enter_fs(), which distinguishes non-SWP_FS_OPS devices from filesystem-backed swap and does all the checks after folio is locked, so flags like swap cache are stable. This check also covers dirty file folios, which are not a problem now since sort_folio() already bumps dirty file folios to the next generation, but causes trouble for unifying dirty folio writeback handling. And there should be no performance impact from removing it. We may have lost a micro optimization, but unblocked lazyfree reclaim for NOIO contexts, which is not a common case in the first place. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-9-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: don't abort scan immediately right after agingKairui Song1-3/+9
Right now, if eviction triggers aging, the reclaimer will abort. This is not the optimal strategy for several reasons. Aborting the reclaim early wastes a reclaim cycle when under pressure, and for concurrent reclaim, if the LRU is under aging, all concurrent reclaimers might fail. And if the age has just finished, new cold folios exposed by the aging are not reclaimed until the next reclaim iteration. What's more, the current aging trigger is quite lenient, having 3 gens with a reclaim priority lower than default will trigger aging, and blocks reclaiming from one memcg. This wastes reclaim retry cycles easily. And in the worst case, if the reclaim is making slower progress and all following attempts fail due to being blocked by aging, it triggers unexpected early OOM. And if a lruvec requires aging, it doesn't mean it's hot. Instead, the lruvec could be idle for quite a while, and hence it might contain lots of cold folios to be reclaimed. While it's helpful to rotate memcg LRU after aging for global reclaim, as global reclaim fairness is coupled with the rotation in shrink_many, memcg fairness is instead handled by cgroup iteration in shrink_node_memcgs. So, for memcg level pressure, this abort is not the key part for keeping the fairness. And in most cases, there is no need to age, and fairness must be achieved by upper-level reclaim control. So instead, just keep the scanning going unless one whole batch of folios failed to be isolated or enough folios have been scanned, which is triggered by evict_folios returning 0. And only abort for global reclaim after one batch, so when there are fewer memcgs, progress is still made, and the fairness mechanism described above still works fine. And in most cases, the one more batch attempt for global reclaim might just be enough to satisfy what the reclaimer needs, hence improving global reclaim performance by reducing reclaim retry cycles. 
Rotation is still there after the reclaim is done, which still follows the comment in mmzone.h. And fairness still looks good. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-8-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: use a smaller batch for reclaimKairui Song1-1/+1
With a fixed number to reclaim calculated at the beginning, making each following step smaller should reduce the lock contention and avoid over-aggressive reclaim of folios, as it will abort earlier when the number of folios to be reclaimed is reached. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-7-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: avoid reclaim type fall back when isolation makes no progressBarry Song (Xiaomi)1-2/+7
While isolation makes no progress in scan_folios(), we quickly fall back to the other type in isolate_folios(). This is incorrect, as the current type may still have sufficient folios. Falling back can undermine the positive_ctrl_err() result from get_type_to_scan(), which is derived from swappiness. So just continue scanning this type for another round. Worth noting if the cold generations are all reclaimed, scan will no longer make any progress either, which may undermine the swappiness again. This is not a new issue and hence better be fixed later [1]. Link: https://lore.kernel.org/linux-mm/CAGsJ_4zjdOYEtuO6gNjABm7NDxW0skzBFNRNee-k2D6VwsYEQA@mail.gmail.com/ [1] Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-6-02fabb92dc43@tencent.com Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: scan and count the exact number of foliosKairui Song1-29/+29
Make the scan helpers return the exact number of folios being scanned or isolated. Since the reclaim loop now has a natural scan budget that controls the scan progress, returning the scan number and consuming the budget makes the scan more accurate and easier to follow.

The number of scanned folios for each iteration is always larger than 0, unless the reclaim must stop for a forced aging, so there is no more need for any special handling when there is no progress made:

- `return isolated || !remaining ? scanned : 0` in scan_folios: both the function and the call now just return the exact scan count, combined with the scan budget introduced in the previous commit to avoid livelock or under scan.
- `scanned += try_to_inc_min_seq` in evict_folios: adding a bool as a scan count was kind of confusing and is no longer needed, as the scan number should never be zero as long as there are still evictable gens. We may encounter an empty old gen that returns a 0 scan count; to avoid that, do a try_to_inc_min_seq before isolation, which has little to no overhead in most cases.
- `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios: the per-type get_nr_gens == MIN_NR_GENS check in scan_folios naturally returns 0 when only two gens remain and breaks the loop.

Also change try_to_inc_min_seq to return void, as its return value is no longer used by any caller. Call it before isolate_folios to flush any empty gens left by external folio freeing, and again after isolate_folios when scanning moved or protected folios may have emptied the oldest gen.

The scan still stops if only two gens are left, as the scan number will be zero. This matches the previous behavior. This forced gen protection may be removed or softened later to improve reclaim further.
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-5-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: restructure the reclaim loopKairui Song1-36/+36
The current loop will calculate the scan number on each iteration. The number of folios to scan is based on the LRU length, with some unclear behaviors, e.g, the scan number is only shifted by reclaim priority when aging is not needed or when at the default priority, and it couples the number calculation with aging and rotation. Adjust, simplify it, and decouple aging and rotation. Just calculate the scan number for once at the beginning of the reclaim, always respect the reclaim priority, and make the aging and rotation more explicit. This slightly changes how aging and offline memcg reclaim works: Previously, aging was skipped at DEF_PRIORITY even when eviction was no longer possible, so the reclaimer wasted an iteration until the priority escalated. Now aging runs immediately whenever it is needed to make progress; the DEF_PRIORITY skip only applies when eviction is still viable. This may avoid wasted iterations that over-reclaim slab and break reclaim balance in multi-cgroup setups. Similar for offline memcg. Previously, offline memcg wouldn't be aged unless it didn't have any evictable folios. Now, we might age it if it has only 3 generations, which should be fine. On one hand, offline memcg might still hold long-term folios, and in fact, a long-existing offline memcg must be pinned by some long-term folios like shmem. These folios might be used by other memcg, so aging them as ordinary memcg seems correct. Besides, aging enables further reclaim of an offlined memcg, which will certainly happen if we keep shrinking it. And offline memcg might soon be no longer an issue with reparenting. Overall, the memcg LRU rotation, as described in mmzone.h, remains the same. Note that because the scan budget is now pinned at loop entry, tiny lruvec might skip this reclaim pass, also skipping aging, which could be beneficial as aging is not helpful since it will still be un-reclaimable after aging. Reclaim will go on as usual once priority escalates. 
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-4-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
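The budget-driven loop that this commit and its neighbours describe can be sketched as below. It is a simplified user-space model, not the mm/vmscan.c implementation: SKETCH_BATCH, the function names and the "pool" standing in for scannable folios are all hypothetical, and the only behaviour taken from the commit messages is that the budget is computed once at loop entry, is always shifted by the reclaim priority, is consumed by the exact per-batch scan count, and that a zero scan count ends the loop.

```c
#define SKETCH_BATCH 64UL /* hypothetical per-step batch size */

/* Budget fixed once at loop entry: evictable size shifted by priority. */
unsigned long sketch_scan_budget(unsigned long evictable, int priority)
{
	return evictable >> priority;
}

/* Stand-in for one eviction step: returns the exact number scanned. */
unsigned long sketch_evict_batch(unsigned long *pool, unsigned long batch)
{
	unsigned long scanned = *pool < batch ? *pool : batch;

	*pool -= scanned;
	return scanned;
}

/* The caller splits the budget into batches and stops on no progress. */
unsigned long sketch_reclaim_loop(unsigned long evictable, int priority,
				  unsigned long *pool)
{
	unsigned long budget = sketch_scan_budget(evictable, priority);
	unsigned long total = 0;

	while (budget) {
		unsigned long batch = budget < SKETCH_BATCH ? budget : SKETCH_BATCH;
		unsigned long scanned = sketch_evict_batch(pool, batch);

		if (!scanned)
			break; /* nothing left to scan in this model */
		budget -= scanned;
		total += scanned;
	}
	return total;
}
```

In this model a tiny lruvec at a high priority value gets a zero budget and the loop body never runs, which mirrors the note above that such a lruvec may skip the reclaim pass until the priority escalates.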
3 daysmm/mglru: relocate the LRU scan batch limit to callersKairui Song1-7/+9
Same as active / inactive LRU, MGLRU isolates and scans folios in batches. The batch split is done hidden deep in the helper, which makes the code harder to follow. The helper's arguments are also confusing since callers usually request more folios than the batch size, so the helper almost never processes the full requested amount. Move the batch splitting into the top loop to make it cleaner, there should be no behavior change. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-3-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: rename variables related to aging and rotationKairui Song1-7/+7
The current variable name isn't helpful. Make the variable names more meaningful. Only naming change, no behavior change. Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-2-02fabb92dc43@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Suggested-by: Barry Song <baohua@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Leno Hou <lenohou@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: consolidate common code for retrieving evictable sizeKairui Song1-22/+14
Patch series "mm/mglru: improve reclaim loop and dirty folio", v7. This series cleans up and slightly improves MGLRU's reclaim loop and dirty writeback handling. As a result, we can see an up to ~30% increase in some workloads like MongoDB with YCSB and a huge decrease in file refault, no swap involved. Other common benchmarks have no regression, and LOC is reduced, with less unexpected OOM, too. Some of the problems were found in our production environment, and others were mostly exposed while stress testing during the development of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up the code base and fixes several performance issues, preparing for further work. MGLRU's reclaim loop is a bit complex, and hence these problems are somehow related to each other. The aging, scan number calculation, and reclaim loop are coupled together, and the dirty folio handling logic is quite different, making the reclaim loop hard to follow and the dirty flush ineffective. This series slightly cleans up and improves these issues using a scan budget by calculating the number of folios to scan at the beginning of the loop, and decouples aging from the reclaim calculation helpers. Then, move the dirty flush logic inside the reclaim loop so it can kick in more effectively. These issues are somehow related, and this series handles them and improves MGLRU reclaim in many ways. Test results: All tests are done on a 48c96t NUMA machine with 2 nodes and a 128G memory machine using NVME as storage. Classical (non-MGLRU) LRU numbers are included as "MGLRU disabled" for each benchmark below; see [8] and [9] for the longer write-up. MongoDB ======= Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000, threads:32), which does 95% read and 5% update to generate mixed read and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and the WiredTiger cache size is set to 4.5G, using NVME as storage. 
This is close to the case we observed regressing in our production environment: mixed read and writeback pressure, so it is a practical case for evaluation. Not using SWAP. The intent is to isolate the file LRU writeback path. Enabling SWAP would just add noise from anonymous reclaim. MGLRU Before: Throughput(ops/sec): 60653.502655 workingset_refault_file 12904916 pgpgin 165366622 pgpgout 5219588 MGLRU After: Throughput(ops/sec): 82384.354760 (+35.8%, higher is better) workingset_refault_file 7128285 (-44.7%, lower is better) pgpgin 113170693 (-31.5%, lower is better) pgpgout 5639724 MGLRU Disabled: Throughput(ops/sec): 93713.640901 workingset_refault_file 15013443 pgpgin 85365614 pgpgout 5866508 We can see a significant performance improvement after this series. The test is done on NVME and the performance gap would be even larger for slow devices, such as HDD or network storage. We observed over 100% gain for some workloads with slow IO. Note, classical LRU is still faster for this benchmark, MGLRU may catch up later with further work [7]. Chrome & Node.js [3] ==================== Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2 nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64 workers. Many memcgs each applying roughly equal pressure exercises the LRU's ability to detect/protect each tenant's working set and to balance reclamation fairly between tenants, which makes this a meaningful test for the reclaim mechanism. Fairness is reported via Jain's fairness index (1.0 means all tenants get exactly equal allocation, lower is worse). Under equal pressure, all memcgs should make roughly equal forward progress. See [8] for the longer rationale and per-memcg breakdown. 
MGLRU before: Total requests: 81898 Per-worker mean: 1279.7 Per-worker 95% CI (mean): [ 1259.0, 1300.4] Jain's fairness index: 0.995893 (1.0 = perfectly fair) Latency: Bucket Count Pct Cumul [0,1)s 28392 34.67% 34.67% [1,2)s 8022 9.80% 44.46% [2,4)s 6130 7.48% 51.95% [4,8)s 39354 48.05% 100.00% MGLRU after: Total requests: 82901 Per-worker mean: 1295.3 Per-worker 95% CI (mean): [ 1265.3, 1325.4] Jain's fairness index: 0.991607 (1.0 = perfectly fair) Latency: Bucket Count Pct Cumul [0,1)s 28128 33.93% 33.93% [1,2)s 8756 10.56% 44.49% [2,4)s 7028 8.48% 52.97% [4,8)s 38989 47.03% 100.00% MGLRU disabled: Total requests: 62399 Per-worker mean: 975.0 Per-worker 95% CI (mean): [ 941.9, 1008.1] Jain's fairness index: 0.982156 (1.0 = perfectly fair) Latency: Bucket Count Pct Cumul [0,1)s 20051 32.13% 32.13% [1,2)s 2255 3.61% 35.75% [2,4)s 6149 9.85% 45.60% [4,8)s 33927 54.37% 99.97% [8,16)s 17 0.03% 100.00% Reclaim is still fair and effective, total requests number seems slightly better. OOM issue with aging and throttling =================================== For the throttling OOM issue, it can be easily reproduced using dd and cgroup limit as demonstrated and fixed by a later patch in this series. The aging OOM is a bit tricky, a specific reproducer can be used to simulate what we encountered in production environment [4]: Spawns multiple workers that keep reading the given file using mmap, and pauses for 120ms after one file read batch. It also spawns another set of workers that keep allocating and freeing a given size of anonymous memory. The total memory size exceeds the memory limit (eg. 14G anon + 8G file, which is 22G vs a 16G memcg limit). - MGLRU disabled: Finished 128 iterations. 
- MGLRU enabled: OOM with following info after about ~10-20 iterations: [ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0 [ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460 [ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0 [ 62.640823] Memory cgroup stats for /demo: [ 62.641017] anon 10604879872 [ 62.641941] file 6574858240 OOM occurs despite there being still evictable file folios. - MGLRU enabled after this series: Finished 128 iterations. Worth noting there is another OOM related issue reported in V1 of this series, which is tested and looking OK now [5]. MySQL: ====== Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using ZRAM as swap and test command: sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \ --tables=48 --table-size=2000000 --threads=48 --time=600 run A 24G InnoDB buffer pool inside a 2G memcg with ZRAM as swap forces aggressive eviction of cached database anon pages, which exercises the LRU's hot page detection and the eviction path under swap pressure. The workload is practical, and the pressure is higher than what we usually see in production but it is intended to expose the extreme case. MGLRU before: 17313.688333 tps MGLRU after: 17286.195000 tps MGLRU disabled: 16245.330000 tps Seems only noise level changes, no regression. FIO: ==== Testing with the following command, where /mnt/ramdisk is a 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg, 6 test run each: fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \ --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \ --rw=randread --norandommap --time_based \ --ramp_time=1m --runtime=5m --group_reporting Random buffered mmap read on a ramdisk strips out storage variance and stresses purely the LRU's ability to evict and recycle the page cache under heavy random read pressure. 
MGLRU before: 9033.91 MB/s MGLRU after: 9065.72 MB/s MGLRU disabled: 8254.54 MB/s Also seem only noise level changes and no regression or slightly better. Build kernel: ============= Build kernel test using ZRAM as swap, kernel source on tmpfs, in a memcg with memory.max=3G, using make -j96 and defconfig, measuring system time, 6 test run each. Building the kernel is a classical mixed anon + file workload (lots of small file reads/writes plus parallel anon allocations from cc/ld) and is representative of many real compilation jobs. MGLRU before: 2823.13s MGLRU after: 2801.26s MGLRU disabled: 5023.50s Also seem only noise level changes, no regression or very slightly better. Android: ======== Xinyu reported a performance gain on Android, too, with this series. The test consisted of cold-starting multiple applications sequentially under moderate system load [6]; this is a real Android user-visible scenario, dominated by the LRU's ability to keep the right working set resident and re-fault launch-critical pages quickly. Before: Launch Time Summary (all apps, all runs) Mean 868.0ms P50 888.0ms P90 1274.2ms P95 1399.0ms After: Launch Time Summary (all apps, all runs) Mean 850.5ms (-2.07%) P50 861.5ms (-3.04%) P90 1179.0ms (-8.05%) P95 1228.0ms (-12.2%) This patch (of 15): Merge commonly used code for counting evictable folios in a lruvec. No behavior change. 
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-0-02fabb92dc43@tencent.com Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-1-02fabb92dc43@tencent.com Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1] Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2] Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3] Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4] Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5] Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6] Link: https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/ [7] Link: https://lore.kernel.org/linux-mm/CAMgjq7BzQAPp8u_3-9e3ueXmRCoW=2sydok0hFM=MYL7VC1YYg@mail.gmail.com/ [8] Link: https://lore.kernel.org/linux-mm/CAMgjq7D+4QmiWe73OPFuH0s+ZKCUJoo+MfcWOdJcV+VO-T2Wmg@mail.gmail.com/ [9] Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Yuanchu Xie <yuanchu@google.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Stevens <stevensd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Xu <weixugc@google.com> Cc: Yafang <laoar.shao@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Leno Hou <lenohou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
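For reference, Jain's fairness index as reported in the Chrome & Node.js results above is conventionally defined as (sum of x_i)^2 / (n * sum of x_i^2). The helper below is a minimal sketch of that standard formula; the function name is hypothetical, and feeding it per-worker request counts is an assumption, since the cover letter does not show how the test script computes the value.

```c
#include <stddef.h>

/*
 * Jain's fairness index over n non-negative samples.
 * Returns 1.0 when all samples are equal and 1/n when a single
 * sample holds everything. Returns 0.0 for empty or all-zero input,
 * which is a choice made for this sketch.
 */
double sketch_jain_index(const double *x, size_t n)
{
	double sum = 0.0, sum_sq = 0.0;
	size_t i;

	if (n == 0)
		return 0.0;

	for (i = 0; i < n; i++) {
		sum += x[i];
		sum_sq += x[i] * x[i];
	}

	if (sum_sq == 0.0)
		return 0.0;

	return (sum * sum) / ((double)n * sum_sq);
}
```

Values such as 0.995893 and 0.991607 are therefore both very close to the perfectly fair case, which is consistent with the cover letter's conclusion that reclaim remains fair.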
3 daysuserfaultfd: make functions that are not used outside uffd staticMike Rapoport (Microsoft)1-12/+12
After merging fs/userfaultfd.c into mm/userfaultfd.c, several functions that were previously shared between the two files are now only used within mm/userfaultfd.c. Make them static and remove their declarations from include/linux/userfaultfd_k.h. Link: https://lore.kernel.org/20260523173759.3964908-3-rppt@kernel.org Assisted-by: Copilot:claude-opus-4-6 Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysuserfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.cMike Rapoport (Microsoft)1-0/+2215
Patch series "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c", v3. These patches merge fs/userfaultfd.c into mm/userfaultfd.c and make functions used only inside mm/userfaultfd.c static. This patch (of 2): Historically the userfaultfd implementation has been split between fs/userfaultfd.c and mm/userfaultfd.c. The mm/ part implemented memory management operations, while the fs/ part implemented file descriptor handling and called into the mm/ part for the actual memory management work. This separation is quite artificial and fs/userfaultfd.c does not seem to belong to fs/ because it is only a user of VFS APIs and, as with other such users, for example memfd and secretmem, the file descriptor handling could live in mm/ as well. "Append" fs/userfaultfd.c to mm/userfaultfd.c and update fs/Makefile and MAINTAINERS accordingly. No intended functional changes. Link: https://lore.kernel.org/20260523173759.3964908-1-rppt@kernel.org Link: https://lore.kernel.org/20260523173759.3964908-2-rppt@kernel.org Assisted-by: Copilot:claude-opus-4-6 Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Christian Brauner (Amutable) <brauner@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Hildenbrand <david@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/page_alloc: remove VM_BUG_ON()s from pindex helpersBrendan Jackman1-8/+1
Vlastimil pointed out that the VM_BUG_ON()s have fallen out of favour, so remove them. Link: https://lore.kernel.org/20260526-page_alloc-unmapped-prep-v2-1-412f4d486115@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Link: https://lore.kernel.org/all/4074a816-9e75-45a6-8141-25459bcc106b@kernel.org/ Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/mglru: use folio_mark_accessed to replace folio_set_activeBarry Song (Xiaomi)3-8/+24
MGLRU gives high priority to folios mapped in page tables. As a result, folio_set_active() is invoked for all folios read during page faults. In practice, however, readahead can bring in many folios that are never accessed via page tables. A previous attempt by Lei Liu proposed introducing a separate LRU for readahead[1] to make readahead pages easier to reclaim, but that approach is likely over-engineered. Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"), folios with PG_active were always placed in the youngest generation, leading to over-protection and increased refaults. After that commit, PG_active folios are placed in the second youngest generation, which is still too optimistic given the presence of readahead. In contrast, the classic active/inactive scheme is more conservative. This patch switches to using folio_mark_accessed() and begins prefaulted file folios from the second oldest generation instead of active generations. We should also adjust the following accordingly: - WORKINGSET_ACTIVATE: aligned with setting active for refaulted workingset folios; - lru_gen_folio_seq(): place (pre)faulted file folios into the second oldest generation; - promote second-scanned folios to workingset in folio_check_references(): we now have to depend on folio_lru_refs() > 1, since we previously relied on PG_referenced being set during the first scan, but PG_referenced is now set earlier. On x86, running a kernel build inside a memcg with a 1GB memory limit using 20 threads. 
                 w/o patch      w/ patch       active/inactive LRU
real             1m50.764s      1m48.879s      1m49.928s
user             25m32.305s     25m29.224s     25m28.196s
sys              4m0.012s       3m37.421s      3m40.740s
pswpin:          1333245        568480         463452
pswpout:         4366443        2322657        2309119
pgpgin:          6962592        4073416        4438856
pgpgout:         17780712       9613408        9568628
swpout_zero:     1019603        593275         743704
swpin_zero:      14764          9118           7244
refault_file:    287794         262505         562555
refault_anon:    1347963        577550         470694
Lance and Xueyuan made a huge contribution to this patch through testing. Link: https://lore.kernel.org/20260526130938.66253-1-baohua@kernel.org Link: https://lore.kernel.org/linux-mm/20250916072226.220426-1-liulei.rjpt@vivo.com/ [1] Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Tested-by: Xueyuan Chen <xueyuan.chen21@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Kairui Song <kasong@tencent.com> Cc: Qi Zheng <qi.zheng@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: wangzicheng <wangzicheng@honor.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lei Liu <liulei.rjpt@vivo.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 dayskasan/test: only do kmalloc_double_kzfree for generic modeWang Wensheng1-0/+10
kmalloc_double_kzfree() could corrupt kernel memory if the just-freed memory was allocated by another thread before the second call to kfree_sensitive() and the new allocation tag happened to match the old one. This cannot happen in GENERIC mode, as it uses a quarantine. Link: https://lore.kernel.org/20260524031053.381776-1-wsw9603@163.com Signed-off-by: Wang Wensheng <wsw9603@163.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: trace esz at first setupSeongJae Park1-0/+2
DAMON traces effective size quota from the second update, only if a change has been made by the update. Tracing only changed updates was an intentional decision to avoid unnecessary same value tracing. Always skipping the first value is just an unintended mistake. The mistake makes the tracepoint based investigation incomplete, because the first effective size quota is never traced. It is not a big issue when the 'consist' quota tuner is used, because it keeps changing the quota in the usual setup. However, when the 'temporal' tuner is used, the quota value is not changed before the goal achievement status is completely changed. For example, if the DAMOS scheme is started with an under-achieved goal, the quota is set to the maximum value, and kept the same value until the goal is achieved. Because DAMON skips the first value, the user cannot know what effective quota the current scheme is using. Only after the goal is achieved, the effective quota is changed to zero, and traced. Unconditionally trace the initial quota value to fix this problem. Note that the 'temporal' quota tuner was introduced by commit af738a6a00c1 ("mm/damon/core: introduce DAMOS_QUOTA_GOAL_TUNER_TEMPORAL"), which was added to 7.1-rc1. But even with the 'consist' quota tuner, the tracing is unintentionally incomplete. Hence this commit marks the introduction of the trace event as the broken commit. Link: https://lore.kernel.org/20260520150311.80925-1-sj@kernel.org Fixes: a86d695193bf ("mm/damon: add trace event for effective size quota") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.17.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
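The intended tracing rule (always emit the first value, afterwards emit only on change) can be sketched in userspace C as follows. This is a minimal model, not the DAMON code: struct sk_quota, its traced_once field and sk_update_esz() are hypothetical names introduced for illustration, and the return value merely stands in for "the trace event would fire".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the quota state; not the kernel structure. */
struct sk_quota {
	unsigned long esz;	/* last effective size quota */
	bool traced_once;	/* has any value been traced yet? */
};

/*
 * Record a new effective size quota. Returns true when a trace event
 * would be emitted: unconditionally for the first value, and only on
 * change for later values.
 */
static bool sk_update_esz(struct sk_quota *q, unsigned long new_esz)
{
	bool emit;

	assert(q != NULL);

	emit = !q->traced_once || q->esz != new_esz;
	q->esz = new_esz;
	q->traced_once = true;
	return emit;
}
```

With a rule like this, a scheme that starts at the maximum quota and keeps it until the goal is achieved still produces one event at setup, so the value in use is visible from the start.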
3 daysmm/damon/tests/core-kunit: add damon_set_regions() test casesSeongJae Park1-22/+120
damon_set_regions() is one of the main DAMON kernel API functions that set up the monitoring target memory region boundaries. Implement unit tests for verifying its basic functionalities. Link: https://lore.kernel.org/20260522154026.80546-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: remove damon_verify_nr_regions()SeongJae Park1-19/+0
When CONFIG_DAMON_DEBUG_SANITY is enabled, damon_verify_nr_regions() is called for each damon_nr_regions() invocation. damon_verify_nr_regions() iterates all regions. damon_nr_regions() is called for each region in kdamond_reset_aggregated() and damos_apply_scheme(). Hence it imposes O(n**2) overhead where n is the number of regions. Though the verification is enabled only under DAMON_DEBUG_SANITY, which is not for production use cases, the overhead could be too high. Meanwhile, damon_verify_ctx() is doing the damon_nr_regions() test. Because damon_verify_ctx() is called for each kdamond_call(), the test coverage from damon_verify_ctx() could be sufficient. Remove damon_nr_regions() verification. Link: https://lore.kernel.org/20260522154026.80546-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: add kdamond_call() debug_sanity checkSeongJae Park1-0/+33
kdamond_call() is the place where DAMON API callers are allowed to access the DAMON context's public internal state including the monitoring results. Hence it is important to ensure it is called with the expected DAMON context state. Do the check under DAMON_DEBUG_SANITY. Link: https://lore.kernel.org/20260522154026.80546-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: hide damon_destroy_region()SeongJae Park1-1/+2
damon_destroy_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: hide damon_insert_region()SeongJae Park1-0/+11
damon_insert_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: hide damon_add_region()SeongJae Park1-1/+1
damon_add_region() is being used by only DAMON core, but exposed to DAMON API callers. Exposing something that is not really being used by others will only increase the maintenance cost. Hide it. Link: https://lore.kernel.org/20260522154026.80546-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/tests/vaddr-kunit: replace damon_add_region() with damon_set_regions()SeongJae Park1-7/+20
The DAMON virtual address operation set (vaddr) unit tests use damon_add_region() to set up DAMON monitoring target region boundaries. But damon_set_regions() is designed for exactly that purpose, and all other DAMON API callers use it for that. Replace damon_add_region() usage in the unit tests with damon_set_regions(), to unify the use case and reduce the maintenance cost. Link: https://lore.kernel.org/20260522154026.80546-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: do not use region out of a loop in damon_set_regions()SeongJae Park1-2/+9
damon_set_regions() assumes the DAMON region iterator still references the last region after the region iteration loop has completed. The code does behave that way, but that is not a documented safe behavior. Hence it is unreliable and difficult to read. Clean up the code to avoid the case. No behavioral change is intended. Link: https://lore.kernel.org/20260522154026.80546-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: safely handle no region case in damon_set_regions()SeongJae Park1-0/+13
Patch series "mm/damon: minor improvements for code readability and tests". Implement minor improvements on code readability and tests for DAMON. The first seven patches are for DAMON code readability and the resulting maintenance. Patches 1 and 2 make damon_set_regions() safer and easier to read. Patches 3 and 4 remove fragmented DAMON API use cases. Patches 5-7 hide unused core functions that are unnecessarily exposed to API callers. The following seven patches are for DAMON tests improvement. Patches 8 and 9 add and remove DAMON_DEBUG_SANITY verifications to ensure reasonable test coverage without too high overhead. Patch 10 adds a new kunit test for damon_set_regions(). Patch 11 makes the sysfs.py selftest finish more gracefully under test failures. Patches 12-13 add simple sysfs.sh test cases for the monitoring intervals goal directory, the addr_unit file and the pause file. This patch (of 14): damon_set_regions() calls damon_first_region() regardless of the number of DAMON regions in a given DAMON target. damon_first_region() internally uses list_first_entry(), which clearly documents that the list is expected to be non-empty. Due to the internal implementation of the macro, damon_set_regions() is safe for now. But the internal implementation of the macro can be changed in the future. Refactor the function to explicitly and safely handle the empty region list case without depending on the internal implementation. No behavioral change is intended. Link: https://lore.kernel.org/20260522154026.80546-1-sj@kernel.org Link: https://lore.kernel.org/20260522154026.80546-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
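The hazard described above, taking the "first entry" of a possibly empty circular list, can be illustrated with a small userspace model. This is only a sketch: the sk_ types and helpers below are hypothetical and merely mimic the shape of the kernel's list_head; how damon_set_regions() was actually refactored is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct sk_list {
	struct sk_list *next, *prev;
};

/* Hypothetical region type embedding a list node. */
struct sk_region {
	unsigned long start, end;
	struct sk_list link;
};

static void sk_list_init(struct sk_list *head)
{
	head->next = head->prev = head;
}

static void sk_list_add_tail(struct sk_list *node, struct sk_list *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/*
 * Return the first region, or NULL when the list is empty. Without the
 * emptiness check, the pointer arithmetic below would be applied to the
 * head itself and yield a bogus "region".
 */
static struct sk_region *sk_first_region_or_null(struct sk_list *head)
{
	assert(head != NULL);

	if (head->next == head)
		return NULL;
	return (struct sk_region *)((char *)head->next -
				    offsetof(struct sk_region, link));
}
```

The kernel's list API offers list_empty() and list_first_entry_or_null() for the same purpose; the explicit NULL return makes the empty case visible to the caller instead of relying on how the macro happens to be implemented.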
3 daysmm/vma: eliminate mmap_action->error_hook, introduce error_filterLorenzo Stoakes1-8/+21
Rather than providing a hook, simplify things by providing the ability to filter errors. This allows us to more carefully validate the value provided and thus ensure only a valid error code is specified, and simplifies the interface. This way, we eliminate all hooks but mmap_prepare and allow only mmap actions to be specified (which core mm controls). This significantly improves robustness and eliminates any unnecessary code duplication in driver mmap hooks. We also update the /dev/mem logic (the only user) to use mmap_action->error_filter instead. Link: https://lore.kernel.org/e770b28427937057fa953ac380a134b24acd8bb4.1779462249.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/vma: remove mmap_action->success_hookLorenzo Stoakes1-2/+0
This hook was introduced to work around code that seemed to absolutely require access to a VMA pointer upon mmap(). However, providing this hook leaves a backdoor to drivers getting access to the very thing mmap_prepare eliminates - a pointer to the VMA. Let's solve this contradiction by removing it. The key intended user was hugetlb, however it seems that the best course now is to avoid allowing all drivers the ability to work around mmap_prepare, and find a different solution there. Link: https://lore.kernel.org/2521c19866f3f10f9085d094cc4f06769042be71.1779462249.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysdrivers/char/mem: eliminate unnecessary use of success_hookLorenzo Stoakes2-0/+4
Patch series "remove mmap_action success, error hooks", v2. The mmap_action->success_hook was a strange beast added to enable code which appeared to absolutely require access to a VMA pointer to work correctly. Primarily this was for hugetlb, however a different approach will be taken there, as clearly more work is required to figure out a sensible way of converting hugetlb to use mmap_prepare. The other user was the memory char driver, specifically /dev/zero which has the unusual property of explicitly setting file-backed VMAs anonymous. Providing the success hook was always foolish, as it allowed drivers a way to workaround the restriction that they should not access a pointer to a not-yet-correctly-initialised VMA - which defeats the purpose of the mmap_prepare work. We can achieve the same thing in memory char driver without needing the success hook, so this series removes that, then removes the success hook altogether. The error hook is also unnecessary - the motivation for this was for functions which need to filter the error code when performing an mmap action in order to avoid breaking userspace. We can achieve this by just providing a field for the error code. Doing this means we don't have to worry about the hook doing anything odd. We also add a check to ensure the error code is in fact valid. Again the memory char driver is the only current user of this, so this series updates it to use that. After this change mmap_action has no custom hooks at all, which seems rather more cromulent than before. This patch (of 3): /dev/zero, uniquely, marks memory mapped there as anonymous. This is currently achieved using the mmap_action->success_hook. However this hook circumvents the abstraction of VMA initialisation so it's preferable to do things a different way. To achieve this, this patch firstly defaults the VMA descriptor's vm_ops field to the dummy VMA operations, which is what file-backed VMAs default this field to. 
That way, we can detect whether a driver sets this field to NULL in order to mark it anonymous. We then introduce vma_desc_set_anonymous() to do this explicitly, and invoke it in mmap_zero_prepare(). This way, any driver which does not explicitly set desc->vm_ops, retains the dummy vm_ops as they would previously. We also update set_vma_user_defined_fields() to make clear that we are either setting vma->vm_ops to what is provided by the driver (or defaulting to dummy_vm_ops if not set), or setting the VMA anonymous. This lays the groundwork for removing the success hook. Link: https://lore.kernel.org/cover.1779462249.git.ljs@kernel.org Link: https://lore.kernel.org/5d1e8bd29d6e070218ba7a03461df562e372b91e.1779462249.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/page_alloc: fix defrag_mode for non-reclaimable allocationsDmitry Ilvokhin1-1/+12
When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent migratetype fallbacks and keep pageblocks clean. The allocator relies on reclaim and compaction to free pages of the correct type before allowing fallback as a last resort. However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke direct reclaim or compaction. With defrag_mode=1, these allocations hit the !can_direct_reclaim bailout in __alloc_pages_slowpath() with ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback. This causes a large number of SLUB allocation failures for skbuff_head_cache under network-heavy workloads, despite free memory being available in other migratetype freelists. We observed it on a few of the Meta workloads that adopted defrag_mode=1. For the service under load there were 85509 SLUB allocation failures messages in dmesg within 2 hours. All of them are GFP_ATOMIC allocations for skbuff_head_cache, despite free pages being available in other migratetype freelists (~13 GB free). Since it is networking path from the practical point of view, this means dropped packets, failed RPC requests, tail latency spikes and overall service degradation. Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely speculative allocations like GFP_TRANSHUGE_LIGHT that don't set __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable fallbacks and should not cause fragmentation. Link: https://lore.kernel.org/20260520122228.201550-1-d@ilvokhin.com Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode") Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
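The decision described above can be modeled in a few lines of userspace C. This is a sketch under stated assumptions, not the code in __alloc_pages_slowpath(): the SK_ flag values and sk_relax_nofragment() are hypothetical, chosen only to mirror the roles of __GFP_DIRECT_RECLAIM, __GFP_KSWAPD_RECLAIM and ALLOC_NOFRAGMENT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values for this sketch; the kernel's real flags differ. */
#define SK_GFP_DIRECT_RECLAIM	0x1u
#define SK_GFP_KSWAPD_RECLAIM	0x2u
#define SK_ALLOC_NOFRAGMENT	0x1u

/*
 * Model of the bailout for allocations that cannot direct-reclaim.
 * Returns true if the caller should retry with migratetype fallbacks
 * allowed (the NOFRAGMENT bit is cleared), false otherwise.
 */
static bool sk_relax_nofragment(unsigned int gfp, unsigned int *alloc_flags)
{
	assert(alloc_flags != NULL);

	if (gfp & SK_GFP_DIRECT_RECLAIM)
		return false;	/* reclaim and compaction can still run */
	if (!(*alloc_flags & SK_ALLOC_NOFRAGMENT))
		return false;	/* nothing to relax */
	if (!(gfp & SK_GFP_KSWAPD_RECLAIM))
		return false;	/* speculative allocation: left to fail */

	*alloc_flags &= ~SK_ALLOC_NOFRAGMENT;
	return true;
}
```

In this model a GFP_ATOMIC-like request (kswapd reclaim set, direct reclaim clear) gets exactly one retry with fallbacks allowed, while a GFP_TRANSHUGE_LIGHT-like request with neither bit set keeps failing fast.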
3 daysmm/damon/core: clarify next_intervals_tune_sis update pathniecheng1-0/+3
Patch series "mm/damon: documentation and comment fixes". This patch (of 3): damon_set_attrs() updates next_aggregation_sis and next_ops_update_sis for online attrs updates, but it does not update next_intervals_tune_sis there. This can look like a missing update when reading damon_set_attrs() alone, while next_intervals_tune_sis is actually updated in kdamond_fn(). Add a short comment to make this explicit. Link: https://lore.kernel.org/20260520012104.93602-1-sj@kernel.org Link: https://lore.kernel.org/20260520012104.93602-2-sj@kernel.org Suggested-by: SeongJae Park <sj@kernel.org> Signed-off-by: niecheng <niecheng1@uniontech.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Sakurai Shun <ssh1326@icloud.com> Cc: Zenghui Yu <zenghui.yu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/vaddr: attempt per-vma lock during page table walkKefeng Wang1-26/+43
Currently, DAMON virtual address operations use mmap_read_lock during page table walks, which can cause unnecessary contention under high concurrency. Introduce damon_va_walk_page_range() to first attempt acquiring a per-vma lock. If the VMA is found and the range is fully contained within it, the page table walk proceeds with the per-vma lock instead of mmap_read_lock. This optimization is expected to be particularly effective for damon_va_young() and damon_va_mkold(), which are frequently called and typically operate within a single VMA. Link: https://lore.kernel.org/20260512151523.2092638-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
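The lock selection rule described above (use the per-vma lock only when a VMA is found and fully contains the range, otherwise fall back to mmap_read_lock) can be sketched as a pure decision function. The names below are hypothetical and no real locking is performed; the actual damon_va_walk_page_range() is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Which lock the walk would run under in this model. */
enum sk_lock_kind {
	SK_LOCK_PER_VMA,
	SK_LOCK_MMAP_READ,
};

/* Hypothetical stand-in for a VMA covering [start, end). */
struct sk_vma {
	unsigned long start, end;
};

/*
 * Pick the lock for walking [start, end). vma is the result of a
 * per-vma lookup and may be NULL when no VMA could be locked.
 */
static enum sk_lock_kind sk_pick_lock(const struct sk_vma *vma,
				      unsigned long start, unsigned long end)
{
	assert(start <= end);

	if (vma && start >= vma->start && end <= vma->end)
		return SK_LOCK_PER_VMA;
	return SK_LOCK_MMAP_READ;	/* range spans VMAs or lookup failed */
}
```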
3 daysmm/memory-failure: use zone_pcp_disable() for poison handlingKaitao Cheng1-15/+3
__page_handle_poison() used drain_all_pages() instead of zone_pcp_disable() because dissolve_free_hugetlb_folio() could restore HVO vmemmap pages and decrement hugetlb_optimize_vmemmap_key. That static key update took cpu_hotplug_lock through static_key_slow_dec(), while zone_pcp_disable() holds pcp_batch_high_lock. CPU hotplug takes the locks in the opposite order through page_alloc_cpu_online/dead(), so the combination could deadlock. That dependency no longer exists. Commit da3e2d1ca43d ("mm/hugetlb: remove hugetlb_optimize_vmemmap_key static key") removed the HVO static key and the static_branch_dec() from hugetlb_vmemmap_restore_folio(). The dissolve_free_hugetlb_folio() path no longer reaches static_key_slow_dec(). Use zone_pcp_disable() again while dissolving the hugetlb folio and taking the target page off the buddy allocator. This prevents the drained PCP lists from being refilled before take_page_off_buddy() runs, making the page isolation deterministic. Link: https://lore.kernel.org/20260514085754.84097-1-kaitao.cheng@linux.dev Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/vmalloc: free unused pages on vrealloc() shrinkShivam Kalra1-4/+52
When vrealloc() shrinks an allocation and the new size crosses a page boundary, unmap and free the tail pages that are no longer needed. This reclaims physical memory that was previously wasted for the lifetime of the allocation. The heuristic is simple: always free when at least one full page becomes unused. Huge page allocations (page_order > 0) are skipped, as partial freeing would require splitting. Allocations with VM_FLUSH_RESET_PERMS are also skipped, as their direct-map permissions must be reset before pages are returned to the page allocator, which is handled by vm_reset_perms() during vfree(). Additionally, allocations with VM_USERMAP are skipped because remap_vmalloc_range_partial() validates mapping requests against the unchanged vm->size; freeing tail pages would cause vmalloc_to_page() to return NULL for the unmapped range. To protect concurrent readers, the shrink path uses Node lock to synchronize before freeing the pages. Finally, we notify kmemleak of the reduced allocation size using kmemleak_free_part() to prevent the kmemleak scanner from faulting on the newly unmapped virtual addresses. The virtual address reservation (vm->size / vmap_area) is intentionally kept unchanged, preserving the address for potential future grow-in-place support. Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-4-70b96ee3e9c9@zohomail.in Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Suggested-by: Danilo Krummrich <dakr@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
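The shrink heuristic can be sketched as plain arithmetic: count the pages still needed for the new size and free the rest, unless one of the skip conditions applies. This is a userspace model with assumptions, not the vrealloc() implementation: SK_PAGE_SIZE is assumed to be 4096, and struct sk_vm with the SK_VM_ flag values is a hypothetical stand-in for vm_struct and its flags.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-ins for VM_FLUSH_RESET_PERMS and VM_USERMAP. */
#define SK_VM_FLUSH_RESET_PERMS	0x1u
#define SK_VM_USERMAP		0x2u

struct sk_vm {
	size_t nr_pages;		/* physical pages currently backing the area */
	unsigned int page_order;	/* > 0 means huge page allocation */
	unsigned int flags;
};

/* Pages needed to back size bytes, rounding up. */
static size_t sk_pages_for(size_t size)
{
	return size / SK_PAGE_SIZE + (size % SK_PAGE_SIZE != 0);
}

/* Number of tail pages that become unused when shrinking to new_size. */
static size_t sk_shrink_tail_pages(const struct sk_vm *vm, size_t new_size)
{
	size_t keep;

	assert(vm != NULL && new_size > 0);

	if (vm->page_order > 0)
		return 0;	/* partial free would require splitting */
	if (vm->flags & (SK_VM_FLUSH_RESET_PERMS | SK_VM_USERMAP))
		return 0;	/* skipped, as described above */

	keep = sk_pages_for(new_size);
	return keep < vm->nr_pages ? vm->nr_pages - keep : 0;
}
```

For a four-page area, shrinking to three pages plus one byte frees nothing, while shrinking to exactly three pages frees one tail page.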
3 daysmm/vmalloc: use physical page count in vread_iter() for VM_ALLOC areasShivam Kalra1-1/+12
For VM_ALLOC areas in vread_iter(), derive the vm area size from vm->nr_pages rather than get_vm_area_size(). Only VM_ALLOC areas are subject to vrealloc() shrinking, which frees pages without reducing the virtual reservation size. Switch to using vm->nr_pages for VM_ALLOC areas so the reader remains correct once shrink support is added. Other mapping types (vmap, ioremap) do not initialize nr_pages and will continue using get_vm_area_size(). [shivamkalra98@zohomail.in: add an nr_pages check] Link: https://lore.kernel.org/aff47da5-4fd5-481d-be18-e1eb99639490@zohomail.in Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-3-70b96ee3e9c9@zohomail.in Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/vmalloc: use physical page count for vrealloc() grow-in-place checkShivam Kalra1-1/+7
Update the grow-in-place check in vrealloc() to compare the requested size against the actual physical page count (vm->nr_pages) rather than the virtual area size (alloced_size, derived from get_vm_area_size()). Currently both values are equivalent, but the upcoming vrealloc() shrink functionality will free pages without reducing the virtual reservation size. After such a shrink, the old alloced_size-based comparison would incorrectly allow a grow-in-place operation to succeed and attempt to access freed pages. Switch to vm->nr_pages now so the check remains correct once shrink support is added. Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-2-70b96ee3e9c9@zohomail.in Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/vmalloc: extract vm_area_free_pages() helper from vfree()Shivam Kalra1-7/+27
Patch series "mm/vmalloc: free unused pages on vrealloc() shrink", v14. This series implements the TODO in vrealloc() to unmap and free unused pages when shrinking across a page boundary. Problem: When vrealloc() shrinks an allocation, it updates bookkeeping (requested_size, KASAN shadow) but does not free the underlying physical pages. This wastes memory for the lifetime of the allocation. Solution: - Patch 1: Extracts a vm_area_free_pages(vm, start_idx, end_idx) helper from vfree() that frees a range of pages with memcg and nr_vmalloc_pages accounting. Freed page pointers are set to NULL to prevent stale references. - Patch 2: Update the grow-in-place check in vrealloc() to compare the requested size against the actual physical page count (vm->nr_pages) rather than the virtual area sizes. This is a prerequisite for shrinking. - Patch 3: For VM_ALLOC areas in vread_iter(), derive the vm area size from vm->nr_pages rather than get_vm_area_size(), which would overestimate the mapped range after a shrink. Other mapping types (vmap, ioremap) don't set nr_pages and keep using get_vm_area_size(). - Patch 4: Uses the helper to free tail pages when vrealloc() shrinks across a page boundary. - Patch 5: Adds a vrealloc test case to lib/test_vmalloc that exercises grow-realloc, shrink-across-boundary, shrink-within-page, and grow-in-place paths. The virtual address reservation is kept intact to preserve the range for potential future grow-in-place support. A concrete user is the Rust binder driver's KVVec::shrink_to [1], which performs explicit vrealloc() shrinks for memory reclamation. This patch (of 5): Extract page freeing and NR_VMALLOC stat accounting from vfree() into a reusable vm_area_free_pages() helper. The helper operates on a range [start_idx, end_idx) of pages from a vm_struct, making it suitable for both full free (vfree) and partial free (upcoming vrealloc shrink). 
Freed page pointers in vm->pages[] are set to NULL to prevent stale references when the vm_struct outlives the free (as in vrealloc shrink). Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-0-70b96ee3e9c9@zohomail.in Link: https://lore.kernel.org/20260519-vmalloc-shrink-v14-1-70b96ee3e9c9@zohomail.in Link: https://lore.kernel.org/all/20260216-binder-shrink-vec-v3-v6-0-ece8e8593e53@zohomail.in/ [1] Signed-off-by: Shivam Kalra <shivamkalra98@zohomail.in> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
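A range-based free helper of the shape described above might look like the following userspace sketch. It is illustrative only: struct sk_area and sk_area_free_pages() are hypothetical, malloc()/free() stand in for page allocation, and the returned count stands in for the memcg and nr_vmalloc_pages accounting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vm_struct's page array. */
struct sk_area {
	void **pages;
	size_t nr_pages;
};

/*
 * Free the pages in [start_idx, end_idx) and clear their slots so no
 * stale pointer survives when the area outlives the free. Returns the
 * number of pages actually freed, for the caller's accounting.
 */
static size_t sk_area_free_pages(struct sk_area *area, size_t start_idx,
				 size_t end_idx)
{
	size_t freed = 0;

	assert(area != NULL);
	assert(start_idx <= end_idx && end_idx <= area->nr_pages);

	for (size_t i = start_idx; i < end_idx; i++) {
		if (!area->pages[i])
			continue;	/* already freed by an earlier shrink */
		free(area->pages[i]);
		area->pages[i] = NULL;
		freed++;
	}
	return freed;
}
```

A full free passes the range [0, nr_pages), while a shrink passes only the tail; because freed slots are NULL, a later full free over the same array does not free a page twice.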
3 daysmm/damon/sysfs: setup damon_filter->memcg_id from pathSeongJae Park2-1/+12
Find and set the memcg_id for damon_filter from the user-passed memory cgroup path when updating the DAMON input parameters. Link: https://lore.kernel.org/20260518234119.97569-27-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs-schemes: move memcg_path_to_id() to sysfs-commonSeongJae Park3-41/+43
The next commit will need to find the memcg id from the user-passed path to the memory cgroup, from sysfs.c. memcg_path_to_id() is doing that, but defined in sysfs-schemes.c as a static function. Move the function to sysfs-common.c and mark it as non-static, so that the next commit can reuse the function. Link: https://lore.kernel.org/20260518234119.97569-26-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs: add filters/<F>/path fileSeongJae Park1-0/+44
Introduce a new DAMON sysfs file that lets users set the target memory cgroup for monitoring the belonging memory cgroup attribute. The file is named 'path' and is located under the probe filter directory. Users set the target memory cgroup by writing its path, relative to the cgroup mount point, to the file. Link: https://lore.kernel.org/20260518234119.97569-25-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/paddr: support DAMON_FILTER_TYPE_MEMCGSeongJae Park1-0/+14
Implement the support of DAMON_FILTER_TYPE_MEMCG on the DAMON operation set implementation for the physical address space. Link: https://lore.kernel.org/20260518234119.97569-24-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: introduce DAMON_FILTER_TYPE_MEMCGSeongJae Park1-0/+14
Belonging memory cgroup is another data attribute that can be useful to monitor. Introduce a new DAMON filter type, namely DAMON_FILTER_TYPE_MEMCG, for monitoring of this attribute. Link: https://lore.kernel.org/20260518234119.97569-23-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon: trace probe_hitsSeongJae Park1-0/+9
Introduce a new tracepoint for exposing the per-region per-probe positive sample count via tracefs. Link: https://lore.kernel.org/20260518234119.97569-19-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs-schemes: implement probe/hits fileSeongJae Park1-7/+34
Implement sysfs file for showing the per-region per-probe hits count. Link: https://lore.kernel.org/20260518234119.97569-18-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs-schemes: implement probe dirSeongJae Park1-6/+95
Implement sysfs directory for showing per-probe hits count of each region. Link: https://lore.kernel.org/20260518234119.97569-17-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs-schemes: implement tried_regions/<r>/probes/SeongJae Park1-4/+63
Implement a sysfs directory for showing the per-region probe hit counts. It is named 'probes/' and located under the DAMOS tried region directory. Link: https://lore.kernel.org/20260518234119.97569-16-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs: setup probes on DAMON core API parametersSeongJae Park1-0/+37
Add user-installed data probes to DAMON core API parameters, so that user inputs for data probes are passed to DAMON core. Link: https://lore.kernel.org/20260518234119.97569-15-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs: implement filter dir filesSeongJae Park1-0/+114
Implement sysfs files under the data probe filter directory for letting users configure each filter. Link: https://lore.kernel.org/20260518234119.97569-14-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs: implement filter dirSeongJae Park1-1/+124
Implement a sysfs directory for letting users configure each data probe filter. Link: https://lore.kernel.org/20260518234119.97569-13-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs: implement filters directorySeongJae Park1-1/+64
Implement a directory for letting users install data probe filters. Link: https://lore.kernel.org/20260518234119.97569-12-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs: implement probe dirSeongJae Park1-0/+119
Implement sysfs directory for letting users install each data probe. Link: https://lore.kernel.org/20260518234119.97569-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/sysfs: implement probes dirSeongJae Park1-0/+46
Implement sysfs directory that can be used by the users to install data probes. Link: https://lore.kernel.org/20260518234119.97569-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/paddr: support data attributes monitoringSeongJae Park1-0/+62
Implement and register damon_operations->apply_probes() callback to support data attributes monitoring. Link: https://lore.kernel.org/20260518234119.97569-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: do data attributes monitoringSeongJae Park1-0/+6
Implement the data attributes monitoring execution. Update kdamond to invoke the probes application callback, and reset the aggregated number of per-region per-probe positive samples for every aggregation interval. Link: https://lore.kernel.org/20260518234119.97569-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: introduce damon_region->probe_hitsSeongJae Park1-0/+10
Add an array for the per-region per-probe positive samples count. For simple and efficient implementation, add a limit to the number of data probes and set the array to support only the limited number of counters. Link: https://lore.kernel.org/20260518234119.97569-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
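A rough sketch of the bounded per-region counter array, assuming a hypothetical limit of four probes. TOY_MAX_PROBES, toy_region and the helper names are illustrative only and are not the identifiers or the limit used by DAMON.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical cap on installed probes; the real limit may differ. */
#define TOY_MAX_PROBES 4

struct toy_region {
	unsigned long start, end;
	/* One positive-sample counter per probe, fixed size by design. */
	unsigned int probe_hits[TOY_MAX_PROBES];
};

/* Count one positive sample; reject probe indexes beyond the cap. */
static int toy_record_hit(struct toy_region *r, unsigned int probe_idx)
{
	if (probe_idx >= TOY_MAX_PROBES)
		return -1;
	r->probe_hits[probe_idx]++;
	return 0;
}

/* Clear all counters, as would happen at each aggregation interval. */
static void toy_reset_hits(struct toy_region *r)
{
	memset(r->probe_hits, 0, sizeof(r->probe_hits));
}
```

A fixed-size array avoids a per-region allocation, at the cost of capping how many probes can be installed at once.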
3 daysmm/damon/core: commit probesSeongJae Park1-0/+104
Update damon_commit_ctx() to commit installed data probes, too. Link: https://lore.kernel.org/20260518234119.97569-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: introduce damon_filterSeongJae Park1-0/+30
Define a data structure for constructing damon_probe's attributes check, namely damon_filter. It is very similar to damos_filter but works only for monitoring purposes. Also embed that into damon_probe, implement essential handling of the link, with fundamental helpers. Link: https://lore.kernel.org/20260518234119.97569-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: embed damon_probe objects in damon_ctxSeongJae Park1-0/+38
Let damon_probe objects be able to be installed on a given damon_ctx, by adding a linked list header for storing the objects. Add initialization and cleanup of the new field with helper functions, too. Link: https://lore.kernel.org/20260518234119.97569-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/memory-failure: remove hugetlb output parameter from ↵Ye Liu1-10/+11
try_memory_failure_hugetlb() Use -ENOENT return value to distinguish "not a hugetlb page" from "hugetlb handled", instead of carrying an extra output parameter. Link: https://lore.kernel.org/20260515020144.164941-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Suggested-by: Oscar Salvador <osalvador@suse.de> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
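The return-value convention can be sketched in plain C as below. This is an illustration under assumptions, not the kernel code: toy_page, toy_try_hugetlb and toy_memory_failure are hypothetical names, and the -EBUSY and -EIO results are placeholders for whatever the real paths return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page model: just enough state to show the control flow. */
struct toy_page {
	bool is_hugetlb;
	bool recoverable;
};

/*
 * Returns -ENOENT when the page is not hugetlb, so the caller knows to
 * fall through to the generic path. Any other value means the hugetlb
 * path handled the page and that value is the final result.
 */
static int toy_try_hugetlb(const struct toy_page *p)
{
	if (!p->is_hugetlb)
		return -ENOENT;
	return p->recoverable ? 0 : -EBUSY;
}

static int toy_memory_failure(const struct toy_page *p)
{
	int ret = toy_try_hugetlb(p);

	if (ret != -ENOENT)
		return ret;	/* hugetlb page: handled, success or not */

	/* Generic (non-hugetlb) handling would go here. */
	return p->recoverable ? 0 : -EIO;
}
```

The caller needs no extra bool output argument; a single comparison against -ENOENT carries the same information.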
3 daysmm, swap: merge zeromap into swap tableKairui Song6-116/+192
By allocating one additional bit in the swap table entry's flags field alongside the count, we can store the zeromap inline. For 64-bit systems, the zeromap is stored in the swap table, avoiding a separate zeromap allocation and reducing allocated memory. That is the happy path. For certain 32-bit archs, there might not be enough bits in the swap table to contain both PFN and flags. Therefore, conditionally let each cluster have a zeromap field at build time, and use that instead. If the swapfile cluster is not fully used, it will still save memory for zeromap. The empty cluster does not allocate a zeromap. In the worst case, all clusters are fully populated, and we use memory similar to the previous zeromap implementation. A few macros were moved to different headers for build time struct definition. [akpm@linux-foundation.org: swap_cluster_alloc_table(): remove unused local `ret'] [akpm@linux-foundation.org: fix unused label `err_free'] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-12-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Youngjun Park <youngjun.park@lge.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
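To illustrate the idea of keeping the zero flag inline with the count, here is a small sketch assuming a made-up entry layout: a 7-bit count in the low bits and the zero flag in the next bit. The TOY_* macros and helper names are hypothetical; the real swap table encoding is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bits 0..6 hold the count, bit 7 is the zero flag. */
#define TOY_COUNT_BITS	7
#define TOY_COUNT_MASK	((UINT64_C(1) << TOY_COUNT_BITS) - 1)
#define TOY_ZERO_FLAG	(UINT64_C(1) << TOY_COUNT_BITS)

/* Set or clear the zero flag without disturbing the other bits. */
static inline uint64_t toy_entry_set_zero(uint64_t entry, bool zero)
{
	return zero ? (entry | TOY_ZERO_FLAG) : (entry & ~TOY_ZERO_FLAG);
}

static inline bool toy_entry_is_zero(uint64_t entry)
{
	return (entry & TOY_ZERO_FLAG) != 0;
}

static inline unsigned int toy_entry_count(uint64_t entry)
{
	return (unsigned int)(entry & TOY_COUNT_MASK);
}
```

Because the flag lives in the entry itself, no separate bitmap has to be allocated when the entry has a spare bit, which is the 64-bit case the message calls the happy path.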
3 daysmm/memcg: remove no longer used swap cgroup arrayKairui Song6-188/+0
Now all swap cgroup records are stored in the swap cluster directly, the static array is no longer needed. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-11-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmemcgv1: don't compile swap functions when CONFIG_SWAP=nAndrew Morton1-0/+2
Stub these out to save some dead code and to fix a build error with the upcoming "mm/memcg, swap: store cgroup id in cluster table directly". Link: https://lore.kernel.org/202605281711.bSeZlErK-lkp@intel.com Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/memcg, swap: store cgroup id in cluster table directlyKairui Song7-37/+131
Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table instead. The per-cluster memcg table is 1024 / 512 bytes on most archs, and does not need RCU protection: the cgroup data is only read and written under the cluster lock. That keeps things simple, lets the allocation use plain kmalloc with immediate kfree (no deferred free), and keeps fragmentation acceptable. [akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm, swap: consolidate cluster allocation helpersKairui Song1-61/+49
Swap cluster table management is spread across several narrow helpers. As a result, the allocation and fallback sequences are open-coded in multiple places. A few more per-cluster tables will be added soon, so avoid duplicating these sequences per table type. Fold the existing pairs into cluster-oriented helpers, and rename for consistency. No functional change, only a few sanity checks are slightly adjusted. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-9-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm, swap: delay and unify memcg lookup and charging for swapinKairui Song3-24/+24
Instead of checking the cgroup private ID during page table walk in swap_pte_batch(), move the memcg lookup into __swap_cache_add_check() under the cluster lock. The first pre-alloc check is speculative and skips the memcg check since the post-alloc stable check ensures all slots covered by the folio belong to the same memcg. It is very rare for contiguous and aligned entries across a contiguous region of a page table of the same process or shmem mapping to belong to different memcgs. This also prepares for recording the memcg info in the cluster's table. Also make the order check and fallback more compact. There should be no user-observable behavior change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm, swap: support flexible batch freeing of slots in different memcgsKairui Song1-4/+29
Instead of requiring the caller to ensure all slots are in the same memcg, make the function handle different memcgs at once. This is both a micro optimization and required for removing the memcg lookup in the page table layer, so it can be unified at the swap layer. We are not removing the memcg lookup in the page table in this commit. It has to be done after the memcg lookup is deferred to the swap layer. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-7-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
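One way to handle mixed memcgs in a single pass is to split the batch into runs of equal ids and uncharge once per run. The sketch below assumes that approach; it is not taken from the patch, and toy_free_slots, toy_uncharge and the bookkeeping globals are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_IDS 8

/* Bookkeeping for the sketch: slots uncharged per id, and call count. */
static unsigned int toy_uncharged[TOY_MAX_IDS];
static unsigned int toy_uncharge_calls;

/* Stand-in for uncharging nr swap slots from one memcg. */
static void toy_uncharge(unsigned short id, unsigned int nr)
{
	assert(id < TOY_MAX_IDS);
	toy_uncharged[id] += nr;
	toy_uncharge_calls++;
}

/*
 * Free nr contiguous slots whose owners are given by ids[]. Consecutive
 * slots with the same id are batched into a single uncharge call, so a
 * batch that spans several memcgs needs no splitting by the caller.
 */
static void toy_free_slots(const unsigned short *ids, size_t nr)
{
	size_t i, run_start = 0;

	for (i = 1; i <= nr; i++) {
		if (i < nr && ids[i] == ids[run_start])
			continue;	/* still inside the current run */
		toy_uncharge(ids[run_start], (unsigned int)(i - run_start));
		run_start = i;
	}
}
```

When all slots share one memcg, the loop degenerates to a single uncharge call, so the common case costs the same as before.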
3 daysmm/memcg, swap: tidy up cgroup v1 memsw swap helpersKairui Song6-23/+29
The cgroup v1 swap helpers always operate on swap cache folios whose swap entry is stable: the folio is locked and in the swap cache. There is no need to pass the swap entry or page count as separate parameters when they can be derived from the folio itself. Simplify the redundant parameters and add sanity checks to document the required preconditions. Also rename memcg1_swapout to __memcg1_swapout to indicate it requires special calling context: the folio must be isolated and dying, and the call must be made with interrupts disabled. No functional change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm, swap: unify large folio allocationKairui Song5-278/+78
Now that direct large order allocation is supported in the swap cache, both anon and shmem can use it instead of implementing their own methods. This unifies the fallback and swap cache check, which also reduces the TOCTOU race window of swap cache state: previously, high order swapin required checking swap cache states first, then allocating and falling back separately. Now all these steps happen in the same compact loop. Order fallback and statistics are also unified; callers just need to check and pass the acceptable order bitmask. There is basically no behavior change. This only makes things more unified and prepares for later commits. Cgroup and zero map checks can also be moved into the compact loop, further reducing race windows and redundancy. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-5-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm, swap: add support for stable large allocation in swap cache directlyKairui Song3-71/+170
To make it possible to allocate large folios directly in swap cache, provide a new infrastructure helper to handle the swap cache status check, allocation, and order fallback in the swap cache layer. The new helper replaces the existing swap_cache_alloc_folio. Based on this, all the separate swap folio allocation previously done by anon / shmem is converted to use this helper directly, unifying folio allocation for anon, shmem, and readahead. This slightly consolidates how allocation is synchronized, making it more stable and less prone to errors. The slot-count and cache-conflict check is now always performed with the cluster lock held before allocation, and repeated under the same lock right before cache insertion. This double check produces a stable result compared to the previous anon and shmem mTHP allocation implementation and avoids the false-negative conflict checks that the lockless path can return, so large allocations no longer have to be unwound because the range turned out to be occupied. It also aborts early for already-freed slots, which helps ordinary swapin and especially readahead, with only a marginal increase in cluster-lock contention (the lock is very lightly contended and stays local in the first place). Hence, callers of swap_cache_alloc_folio() no longer need to check the swap slot count or swap cache status themselves. And now whoever first successfully allocates a folio in the swap cache will be the one who charges it and performs the swap-in. The race window of swapping is also reduced since the loop is much more compact.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-4-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/huge_memory: move THP gfp limit helper into headerKairui Song1-27/+3
Shmem has some special requirements for THP GFP and has to limit it in certain zones or provide a more lenient fallback. We'll use this helper for generic swap THP allocation, which needs to support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper is basically a no-op. But it's necessary for certain shmem users, mostly drivers. No feature change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-3-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm, swap: move common swap cache operations into standalone helpersKairui Song1-46/+100
Move a few swap cache checking, adding, and deletion operations into standalone helpers to be used later. And while at it, add proper kernel doc. No feature or behavior change. Link: https://lore.kernel.org/20260517-swap-table-p4-v5-2-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm, swap: simplify swap cache allocation helperKairui Song3-103/+103
Patch series "mm, swap: swap table phase IV: unify allocation", v5.

This series unifies the allocation and charging of anon and shmem swap-in folios, provides better synchronization, consolidates the metadata management, hence dropping the static array and map, and improves the performance.

The static metadata overhead is now close to zero, and workload performance is slightly improved. For example, mounting a 1TB swap device saves about 512MB of memory:

Before:
free -m
        total    used  free     shared  buff/cache  available
Mem:    1464     805   346      1       382         658
Swap:   1048575  0     1048575

After:
free -m
        total    used  free     shared  buff/cache  available
Mem:    1464     277   899      1       356         1187
Swap:   1048575  0     1048575

Memory usage is ~512M lower, and we now have a close to 0 static overhead. It was about 2 bytes per slot before, now roughly 0.09375 bytes per slot (48 bytes ci info per cluster, which is 512 slots).

Performance tests also look good. Testing Redis in a 2G VM using 6G ZRAM as swap:

valkey-server --maxmemory 2560M
redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

Before: 3385017.283654 RPS
After:  3433309.307292 RPS (1.42% better)

Testing with a kernel build under global pressure on a 48c96t system, limiting the total memory to 8G, using 12G ZRAM, 24 test runs, enabling THP:

make -j96, using defconfig

Before: user time 2904.59s system time 4773.99s
After:  user time 2909.38s system time 4641.55s (2.77% better)

Testing with usemem on a 32c machine using 48G brd ramdisk and 16G RAM, 12 test runs:

usemem --init-time -O -y -x -n 48 1G

Before: Throughput (Sum): 6482.58 MB/s Free Latency: 371371.67us
After:  Throughput (Sum): 6539.28 MB/s Free Latency: 363059.88us

Seems similar, or slightly better.

This series also reduces memory thrashing. I no longer see any "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF" messages, which showed up several times during stress testing before this series when under great pressure:

Before: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 18
After:  grep -Ri VM_FAULT_OOM <test logs> | wc -l => 0

This patch (of 12):

Instead of trying to return the existing folio if the entry is already cached in swap_cache_alloc_folio, simply return an error pointer if the allocation failed, and drop the output argument that indicates what kind of folio is actually returned.

And add a proper wrapper, swap_cache_read_folio, that decouples and handles the actual requirement: read in the folio, or return the already read folio in cache. This is what async swapin and readahead actually required.

As for zswap swap out, the caller just needs to abort if the allocation fails because the entry is gone or already cached, so removing the output argument simplifies the return handling, making it cleaner.

No feature change.

Link: https://lore.kernel.org/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com Link: https://lore.kernel.org/20260517-swap-table-p4-v5-1-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
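The per-slot overhead figures quoted in the cover letter can be checked with a little arithmetic. The sketch below is only an illustration and assumes 4 KiB pages (one swap slot per page); the 2 bytes per slot, 48 bytes per cluster, and 512 slots per cluster come from the message, while the function and macro names are made up for this example.

```c
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages, so one swap slot covers 4096 bytes. */
#define SLOT_BYTES        4096ULL
#define SLOTS_PER_CLUSTER 512ULL  /* from the commit message */
#define CI_BYTES          48ULL   /* per-cluster info size, from the commit message */

/* Number of swap slots on a device of the given size. */
static uint64_t swap_slots(uint64_t device_bytes)
{
	return device_bytes / SLOT_BYTES;
}

/* Old scheme: roughly 2 bytes of static metadata per slot. */
static uint64_t old_overhead_bytes(uint64_t device_bytes)
{
	return swap_slots(device_bytes) * 2;
}

/* New scheme: 48 bytes per 512-slot cluster, i.e. 48/512 = 0.09375 bytes per slot. */
static uint64_t new_overhead_bytes(uint64_t device_bytes)
{
	return swap_slots(device_bytes) / SLOTS_PER_CLUSTER * CI_BYTES;
}
```

Under these assumptions a 1 TiB device has 2^28 slots, so the old scheme costs 512 MiB and the new one 24 MiB, which is in line with the "~512M lower" observation above.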
3 daysmm/page_alloc: document that alloc_pages_nolock() uses RCUBrendan Jackman1-2/+2
The allocator interacts with cgroups which rely on RCU. RCU does not work everywhere, so the "any context" claim is slightly overstated here. This should already be enforced by objtool, since this function is not marked noinstr the x86 build should fail if you call it from a place where RCU is not watching. But, expecting readers to make that connection for themselves seems a bit cruel (I don't think there is even any documentation of what noinstr means at all, let alone the connection with RCU). Note this is not claiming that any cgroup code called from the allocator would actually break if this restriction was violated, it could very well be that there's no real way for the allocator to act on a cgroup that can disappear concurrently. But, since it's likely nobody has verified this one way or another, better to just be safe and declare that RCU is required. Allocating from an RCU-unsafe context seems a bit crazy anyway. Link: https://lore.kernel.org/20260519-nolock-rcu-comment-v1-1-4a630c8794e5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Junaid Shahid <junaids@google.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/page_alloc: drop a misleading __always_inlineBrendan Jackman1-1/+1
get_pfnblock_migratetype() is called from outside page_alloc.c, so it cannot always be inlined. Remove the annotation to avoid misleading readers. At least in my minimal config, with GCC, this doesn't change mm/page_alloc.o at all. Link: https://lore.kernel.org/all/20260517-b4-drop-always-inline-v1-1-97b90930e8b8@google.com/ Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Vlastimil Babka <vbabka@kernel.org> Link: https://lore.kernel.org/all/016c8bef-57ef-44ef-bf60-86dbfd368dcd@kernel.org/ Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/page_alloc: remove ifdefs from pindex helpersBrendan Jackman1-16/+14
The ifdefs are not technically needed here, everything used here is always defined. Switching to IS_ENABLED() makes the code a bit less tiresome to read. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-4-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: rejig pageblock mask definitionsBrendan Jackman1-9/+9
- Add a PAGEBLOCK_ prefix to the names to avoid polluting the "global namespace" too much. - This new prefix makes MIGRATETYPE_AND_ISO_MASK look pretty long. Well, that global mask only exists for quite a specific purpose, and is quite a weird thing to have a name for anyway. So drop it and take advantage of the newly-defined PAGEBLOCK_ISO_MASK. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-3-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/page_alloc: don't overload migratetype in find_suitable_fallback()Brendan Jackman3-20/+35
This function currently returns a signed integer that encodes status in-band, as negative numbers, along with a migratetype. Switch to a more explicit/verbose style that encodes the status and migratetype separately. In the spirit of making things more explicit, also create an enum to avoid using magic integer literals with special meanings. This enables documenting the values at their definition instead of in one of the callers. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-2-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: introduce for_each_free_list()Brendan Jackman1-4/+7
Patch series "mm: misc cleanups from __GFP_UNMAPPED series". In v2 of the __GFP_UNMAPPED series [0], we realised that some of the patches could potentially be merged as independent cleanups. These are all independent of one another; if you think some are useful cleanups and others are pointless churn, it should be fine to just pick whatever subset you prefer. No functional change intended. This patch (of 4): There are a couple of places that iterate over the freelists with awareness of the data structures' layout. Ideally, code outside of mm should not be aware of the page allocator's freelists at all. This patch doesn't hide them completely; it is just a meek incremental step in that direction: provide a macro to iterate over them without needing to be aware of the actual struct fields. Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-0-dacdf5402be8@google.com Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-1-dacdf5402be8@google.com Link: https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/ [0] Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/filemap: fix page_cache_prev_miss() when no hole is foundTal Zussman1-6/+7
page_cache_prev_miss() is documented to return a value outside the searched range when no gap is found. However, the no-gap-found path returns xas.xa_index, which after a successful loop is the first index in the range. As such, that index is misreported as a gap. The sole caller, page_cache_sync_ra(), uses the return value to estimate the cached run preceding a sequential read. In some cases, the buggy return value can undercount the contiguous range by one, shrinking the readahead window or pushing borderline requests into the small-random-read branch. Fix this by returning the start of the range - 1 when no hole is found. Update page_cache_next_miss() for clarity as well. Both helpers were previously fixed together in commit 9425c591e06a ("page cache: fix page_cache_next/prev_miss off by one"), but the fix was reverted because it caused a hugetlb performance regression. hugetlb no longer uses these functions and next_miss was subsequently refixed in commit 901a269ff3d5 ("filemap: fix page_cache_next_miss() when no hole found") and commit bbcaee20e03e ("readahead: fix return value of page_cache_next_miss() when no hole is found"), but prev_miss was not addressed. This was found by pointing Claude Opus 4.7 at mm/filemap.c. Link: https://lore.kernel.org/20260512-prev_miss_fix-v2-1-4af8e5c1ae62@columbia.edu Fixes: 0d3f92966629 ("page cache: Convert hole search to XArray") Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Tal Zussman <tz2294@columbia.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
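The return-value contract described above can be illustrated with a small userspace model. This is only a sketch: it replaces the XArray walk with a plain bool array, ignores index wraparound, and the name prev_miss is hypothetical rather than the kernel function itself.

```c
#include <stdbool.h>

/*
 * Toy model: cached[i] says whether index i is present in the page cache.
 * Scan backwards from `index` over at most `max_scan` entries and return
 * the first index that is not cached (a gap).
 *
 * If every scanned index is cached, return the index just below the
 * scanned range, i.e. a value outside the range, so that the lowest
 * scanned index is never misreported as a gap.
 */
static long prev_miss(const bool *cached, long index, unsigned long max_scan)
{
	long i = index;

	while (max_scan-- && i >= 0) {
		if (!cached[i])
			return i;	/* real gap inside the range */
		i--;
	}
	return i;			/* no gap: start of range - 1 */
}
```

With indexes 3..5 all cached and max_scan = 3, this model returns 2, so a caller computing the length of the cached run as index minus the return value gets 3 rather than 2.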
3 daysmm/shrinker: simplify shrinker_memcg_alloc() using guard()wangxuewen1-8/+5
Use guard(mutex) to automatically handle shrinker_mutex locking and unlocking in shrinker_memcg_alloc(). This removes the explicit mutex_unlock() call, the goto-based error path, and the redundant ret variable, resulting in cleaner and more concise code. Link: https://lore.kernel.org/20260513075214.2655710-1-18810879172@163.com Signed-off-by: wangxuewen <wangxuewen@kylinos.cn> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Xuewen Wang <wangxuewen@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
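For readers unfamiliar with scope-based guards, the idea can be sketched in userspace with the GCC/Clang cleanup attribute. Everything below (toy_lock, TOY_GUARD, alloc_with_guard) is hypothetical and only mimics the shape of the pattern; it is not the kernel's guard() implementation.

```c
/* Hypothetical toy lock standing in for a real mutex. */
struct toy_lock {
	int held;
	int unlock_count;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	l->held = 1;
}

static void toy_lock_release(struct toy_lock *l)
{
	l->held = 0;
	l->unlock_count++;
}

/* Cleanup handler: receives a pointer to the guard variable at scope exit. */
static void toy_guard_exit(struct toy_lock **lp)
{
	if (*lp)
		toy_lock_release(*lp);
}

/* Take the lock now and arrange for its release when the scope ends. */
#define TOY_GUARD(lock) \
	struct toy_lock *toy_guard_ __attribute__((cleanup(toy_guard_exit))) = \
		(toy_lock_acquire(lock), (lock))

/* Early returns need no explicit unlock and no goto-based error path. */
static int alloc_with_guard(struct toy_lock *l, int fail)
{
	TOY_GUARD(l);

	if (fail)
		return -12;	/* stands in for -ENOMEM; lock is dropped automatically */
	return 0;
}
```

The benefit is the same one the commit describes: every return path releases the lock, so the error path cannot forget to.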
3 daysmm, swap: avoid leaving unused extend table after alloc raceKairui Song1-8/+34
Allocating an extend table requires dropping the ci lock first. While the lock is dropped, a concurrent put can decrease the slot's swap count to a value that is no longer maxed out, so the extend table is no longer required. The current allocation path still attaches the new extend table to the cluster anyway, leaving it unused. The next maxed out count on the same cluster may still reuse the table and free it properly, but swapoff could indeed leak it. To eliminate the waste, re-check under the ci lock that the extend table is still needed before publishing it, and free the local allocation otherwise. Also close the check window by ensuring every count decrement that brings a slot below SWP_TB_COUNT_MAX - 1 runs swap_extend_table_try_free(), not just the MAX to MAX - 1 transition. With this, a freshly published extend table that becomes redundant due to a racing put is freed on the very next decrement, restoring the invariant that an empty cluster never has a non-NULL ci->extend_table. The added overhead is negligible. [kasong@tencent.com: v2] Link: https://lore.kernel.org/20260515-swap-extend-table-fix-v2-1-833d72ad53e5@tencent.com Link: https://lore.kernel.org/20260513-swap-extend-table-fix-v1-1-a71dea851fb3@tencent.com Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count") Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: Breno Leitao <leitao@debian.org> Closes: https://lore.kernel.org/linux-mm/agG6Dp0umhs6O1SY@gmail.com/ Tested-by: Breno Leitao <leitao@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/readahead: no PG_readahead on EOFFrederick Mayle1-3/+12
When readahead pulls in all the remaining pages for a file, setting the readahead bit is counterproductive. The async readahead it would trigger would almost certainly be a no-op. Additionally, for mmap'd file IO, the readahead bit limits the fault around [1], causing an extra minor fault when the page is accessed. This was discovered when looking at /sys/kernel/tracing/events/readahead traces for a simple program. With the patch applied, fewer page_cache_ra_unbounded calls are observed. [1] do_fault_around calls filemap_map_pages, which finds eligible pages by calling next_uptodate_folio [2]. next_uptodate_folio skips pages with PG_readahead set [3]. Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3921-L3939 [2] Link: https://github.com/torvalds/linux/blob/v7.0/mm/filemap.c#L3721-L3722 [3] Link: https://lore.kernel.org/20260508181237.670645-1-fmayle@google.com Signed-off-by: Frederick Mayle <fmayle@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/hugetlb_cma: restrict hugetlb_cma parameter to gigantic-page alignmentSang-Heon Jeon1-19/+16
The existing hugetlb_cma parameter handling rejects sizes smaller than one gigantic page, but rounds up larger sizes that are not a multiple of it. The two behaviors are inconsistent and neither is documented. To remove this inconsistent and undefined behavior, restrict the hugetlb_cma parameter to only accept multiples of the gigantic page size. After this restriction, the redundant round_up() in the allocation loop can be removed. The new restriction is also documented in kernel-parameters.txt. Also include other minor readability changes with no functional change. Link: https://lore.kernel.org/20260503084225.415980-1-ekffu200098@gmail.com Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Suggested-by: Muchun Song <muchun.song@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
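The rule "accept only whole multiples of the gigantic page size" amounts to a single modulo check. The helper below is a hypothetical sketch, not the kernel's parser; in particular, treating a size of zero as invalid is an assumption made for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check: a requested CMA size is accepted only if
 * it is a non-zero whole multiple of the gigantic page size. Sizes below
 * one gigantic page and sizes with a remainder are both rejected, so no
 * rounding is needed later.
 */
static bool toy_hugetlb_cma_size_ok(uint64_t size, uint64_t gigantic_sz)
{
	return size != 0 && gigantic_sz != 0 && size % gigantic_sz == 0;
}
```

With a 1 GiB gigantic page, 2 GiB passes while 512 MiB and 1.5 GiB are both rejected, which is the consistent behavior the commit asks for.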
3 daysmm/mseal: use min/max in mseal_applyThorsten Blum1-2/+3
Use the type-checked min()/max() macros instead of MIN()/MAX(), which are supposed to be used "for obvious constants only". Link: https://lore.kernel.org/20260503115915.18680-3-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/readahead: simplify page_cache_ra_unbounded loop counter resetFrederick Mayle1-2/+2
Minor cleanup, no behavior change intended. `read_pages` ensures that `ractl->_nr_pages` is zero before it returns, so the `ractl->_nr_pages` term in these expressions contributes nothing. This seems to have been true since the statements were introduced in commit f615bd5c4725f ("mm/readahead: Handle ractl nr_pages being modified"). The new expression has an intuitive explanation. When filesystems perform readahead, they increment `ractl->_index` by the number of pages processed, so, after `read_pages` returns, `ractl->_index` points to the first page after those already processed. `index` points to the first page considered in the loop. So, `ractl->_index - index` is the number of pages processed by the loop so far. Link: https://lore.kernel.org/20260512203154.754075-3-fmayle@google.com Signed-off-by: Frederick Mayle <fmayle@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/readahead: add kerneldoc for read_pagesFrederick Mayle1-0/+11
Patch series "mm: document read_pages and simplify usage". Add a kerneldoc for read_pages() to formalize an invariant and then use it to simplify the callers in page_cache_ra_unbounded(). This patch (of 2): Formalize one of the invariants provided by the current implementation so that callers can depend on it, as discussed in [1]. Link: https://lore.kernel.org/all/20260501061146.6e61392d125cf1847d7cc181@linux-foundation.org/ [1] Link: https://lore.kernel.org/20260512203154.754075-2-fmayle@google.com Signed-off-by: Frederick Mayle <fmayle@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/shrinker: avoid out-of-bounds read in set_shrinker_bit()David Carlier1-2/+3
set_shrinker_bit() reads info->unit[shrinker_id_to_index(shrinker_id)] before checking shrinker_id against info->map_nr_max, so an id past the currently visible map_nr_max reads past the unit[] array before the WARN_ON_ONCE() catches it. Determined from code inspection. Move the load into the bounded branch. Link: https://lore.kernel.org/20260510183700.102475-1-devnexen@gmail.com Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}") Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Qi Zheng <qi.zheng@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
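The shape of this fix, checking the bound before dereferencing the array, can be shown with a small model. The struct layout, UNIT_BITS, and toy_set_bit below are all hypothetical simplifications; the real shrinker_info uses per-unit structures rather than plain words.

```c
#include <stdint.h>

#define UNIT_BITS 64	/* hypothetical unit size for this sketch */

struct toy_info {
	int map_nr_max;		/* number of valid ids */
	uint64_t *unit;		/* one word per UNIT_BITS ids */
};

/* Returns 0 on success, -1 if the id is out of range. */
static int toy_set_bit(struct toy_info *info, int id)
{
	if (id >= 0 && id < info->map_nr_max) {
		/* Index unit[] only once the id is known to be in range. */
		uint64_t *word = &info->unit[id / UNIT_BITS];

		*word |= (uint64_t)1 << (id % UNIT_BITS);
		return 0;
	}
	/* Out of range: report it without ever touching unit[]. */
	return -1;
}
```

Loading info->unit[...] before the range check, as the pre-fix code did, would read past the array for an oversized id even though the later check rejects it.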
3 daysmm/khugepaged: fix inconsistent MMF_VM_HUGEPAGE flag due to allocation ↵Ye Liu1-2/+5
failure order __khugepaged_enter() sets MMF_VM_HUGEPAGE before allocating the corresponding mm_slot. If mm_slot_alloc() fails, the function returns with the flag set but without inserting the mm into the khugepaged tracking structures, leaving the mm in an inconsistent state where future registration attempts are skipped. Fix this by reordering: allocate the mm_slot first, then check and set the flag. If the flag is already set, free the allocated slot and return. This ensures the flag is only set when the mm is successfully registered in the khugepaged tracking structures. Link: https://lore.kernel.org/20260511025408.54035-1-ye.liu@linux.dev Fixes: 16618670276a ("mm: khugepaged: avoid pointless allocation for "struct mm_slot"") Signed-off-by: Ye Liu <liuye@kylinos.cn> Suggested-by: David Hildenbrand <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
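The reordering can be modelled in a few lines of userspace C. This is a single-threaded sketch with hypothetical names (toy_mm, toy_enter); the kernel uses an atomic test-and-set on the mm flag, which this model replaces with a plain bool.

```c
#include <stdbool.h>
#include <stdlib.h>

struct toy_mm {
	bool registered_flag;	/* stands in for MMF_VM_HUGEPAGE */
	void *slot;		/* stands in for the mm_slot */
};

/* Hypothetical allocator hook so the failure path can be exercised. */
static void *toy_slot_alloc(bool fail)
{
	return fail ? NULL : malloc(16);
}

static int toy_enter(struct toy_mm *mm, bool alloc_fails)
{
	/* Allocate first: a failure here leaves the flag untouched. */
	void *slot = toy_slot_alloc(alloc_fails);

	if (!slot)
		return -1;

	/* Already registered: drop the now-unneeded slot and return. */
	if (mm->registered_flag) {
		free(slot);
		return 0;
	}

	/* The flag is set only once registration is certain to succeed. */
	mm->registered_flag = true;
	mm->slot = slot;
	return 0;
}
```

Because the flag is never set on the failure path, a later call can retry registration instead of being skipped.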
3 daysmm/percpu-internal.h: optimise pcpu_chunk struct to save memoryzenghongling1-3/+3
Using pahole, we can see that there are some padding holes in the current pcpu_chunk structure. Adjusting the layout of pcpu_chunk can reduce these holes, decreasing its size from 192 bytes to 128 bytes and eliminating a wasted cache line.

With allmodconfig (CONFIG_PERCPU_STATS + NEED_PCPUOBJ_EXT)

Before: /* size: 256, cachelines: 4, members: 19 */
After:  /* size: 192, cachelines: 3, members: 19 */

With NEED_PCPUOBJ_EXT

Before:

struct pcpu_chunk {
	struct list_head           list;                 /*     0    16 */
	int                        free_bytes;           /*    16     4 */
	struct pcpu_block_md       chunk_md;             /*    20    32 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int *        bound_map;            /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	void *                     base_addr __attribute__((__aligned__(64))); /*    64     8 */
	long unsigned int *        alloc_map;            /*    72     8 */
	struct pcpu_block_md *     md_blocks;            /*    80     8 */
	void *                     data;                 /*    88     8 */
	bool                       immutable;            /*    96     1 */
	bool                       isolated;             /*    97     1 */

	/* XXX 2 bytes hole, try to pack */

	int                        start_offset;         /*   100     4 */
	int                        end_offset;           /*   104     4 */

	/* XXX 4 bytes hole, try to pack */

	struct obj_cgroup * *      obj_cgroups;          /*   112     8 */
	int                        nr_pages;             /*   120     4 */
	int                        nr_populated;         /*   124     4 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	int                        nr_empty_pop_pages;   /*   128     4 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int          populated[];          /*   136     0 */

	/* size: 192, cachelines: 3, members: 17 */
	/* sum members: 122, holes: 4, sum holes: 14 */
	/* padding: 56 */
	/* forced alignments: 1 */
} __attribute__((__aligned__(64)));

After:

struct pcpu_chunk {
	struct list_head           list;                 /*     0    16 */
	int                        free_bytes;           /*    16     4 */
	struct pcpu_block_md       chunk_md;             /*    20    32 */

	/* XXX 4 bytes hole, try to pack */

	long unsigned int *        bound_map;            /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	void *                     base_addr __attribute__((__aligned__(64))); /*    64     8 */
	long unsigned int *        alloc_map;            /*    72     8 */
	struct pcpu_block_md *     md_blocks;            /*    80     8 */
	void *                     data;                 /*    88     8 */
	bool                       immutable;            /*    96     1 */
	bool                       isolated;             /*    97     1 */

	/* XXX 2 bytes hole, try to pack */

	int                        start_offset;         /*   100     4 */
	int                        end_offset;           /*   104     4 */
	int                        nr_pages;             /*   108     4 */
	int                        nr_populated;         /*   112     4 */
	int                        nr_empty_pop_pages;   /*   116     4 */
	struct obj_cgroup * *      obj_cgroups;          /*   120     8 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	long unsigned int          populated[];          /*   128     0 */

	/* size: 128, cachelines: 2, members: 17 */
	/* sum members: 122, holes: 2, sum holes: 6 */
	/* forced alignments: 1 */
} __attribute__((__aligned__(64)));

Link: https://lore.kernel.org/20260511070309.44044-1-zenghongling@kylinos.cn Signed-off-by: zenghongling <zenghongling@kylinos.cn> Suggested-by: Dennis Zhou <dennis@kernel.org> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/reclaim: validate min_region_size to be power of 2Liew Rui Yan1-0/+5
Problem
=======

When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx() correctly detects this and returns -EINVAL, it sets the 'maybe_corrupted' flag during this process. This flag causes the running kdamond to terminate. While the termination is a safety measure, it is suboptimal in this case because the error is just a simple invalid input from the user, which shouldn't necessitate stopping the kdamond.

Reproduction
============

1. Enable DAMON_RECLAIM
2. Set addr_unit=3
3. Commit inputs via 'commit_inputs'
4. Observe kdamond termination

Solution
========

Add an early validation in damon_reclaim_apply_parameters() to check 'min_region_sz' before any state change occurs. If it is non-power-of-2, return -EINVAL immediately, preventing 'maybe_corrupted' from being set.

Link: https://lore.kernel.org/20260501013750.71704-3-aethernet65535@gmail.com Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
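The "validate before touching any state" approach can be sketched as follows. The names is_pow2 and toy_apply_parameters are hypothetical, and the sketch does not model how 'min_region_sz' is derived from 'addr_unit'; it only shows the early-return ordering.

```c
#include <stdbool.h>

/* A value is a power of two if it is non-zero and has exactly one bit set. */
static bool is_pow2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical parameter-apply step: reject bad input up front so that
 * nothing is modified (and nothing can be flagged as corrupted) when the
 * user supplies an invalid value.
 */
static int toy_apply_parameters(unsigned long min_region_sz,
				unsigned long *committed)
{
	if (!is_pow2(min_region_sz))
		return -22;	/* stands in for -EINVAL; no state touched */

	*committed = min_region_sz;
	return 0;
}
```

A rejected value leaves the previously committed setting intact, which is the property that keeps the running worker alive in the scenario above.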
3 daysmm/damon/lru_sort: validate min_region_size to be power of 2Liew Rui Yan1-0/+5
Patch series "mm/damon: validate min_region_size to be power of 2", v5.

Problem
=======

When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_LRU_SORT or DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx() correctly detects this and returns -EINVAL, it sets the 'maybe_corrupted' flag during this process. This flag causes the running kdamond to terminate. While the termination is a safety measure, it is suboptimal in this case because the error is just a simple invalid input from the user, which shouldn't necessitate stopping the kdamond.

Solution
========

Add an early validation in damon_lru_sort_apply_parameters() and damon_reclaim_apply_parameters() to check 'min_region_sz' before any state change occurs. If it is non-power-of-2, return -EINVAL immediately, preventing 'maybe_corrupted' from being set.

Patch 1 fixes the issue for DAMON_LRU_SORT.
Patch 2 fixes the issue for DAMON_RECLAIM.

This patch (of 2):

Problem
=======

When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_LRU_SORT, 'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx() correctly detects this and returns -EINVAL, it sets the 'maybe_corrupted' flag during this process. This flag causes the running kdamond to terminate. While the termination is a safety measure, it is suboptimal in this case because the error is just a simple invalid input from the user, which shouldn't necessitate stopping the kdamond.

Reproduction
============

1. Enable DAMON_LRU_SORT
2. Set addr_unit=3
3. Commit inputs via 'commit_inputs'
4. Observe kdamond termination

Solution
========

Add an early validation in damon_lru_sort_apply_parameters() to check 'min_region_sz' before any state change occurs. If it is non-power-of-2, return -EINVAL immediately, preventing 'maybe_corrupted' from being set.
Link: https://lore.kernel.org/20260501013750.71704-1-aethernet65535@gmail.com Link: https://lore.kernel.org/20260501013750.71704-2-aethernet65535@gmail.com Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
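The early validation described in this series can be illustrated with a small user-space sketch. It is not the kernel patch: the `sketch_*` names, the `struct sketch_params` type, and the assumed derivation `min_region_sz = max(4096 / addr_unit, 1)` are hypothetical stand-ins chosen for illustration. The point it shows is only the ordering: reject a non-power-of-2 value before any state is changed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the module's parameter state (not a kernel type). */
struct sketch_params {
	unsigned long addr_unit;
	bool committed;		/* set only after the input has been accepted */
};

/* True if n has exactly one bit set, mirroring what is_power_of_2() checks. */
bool sketch_is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Assumption for this sketch: min_region_sz = max(4096 / addr_unit, 1). */
unsigned long sketch_min_region_sz(unsigned long addr_unit)
{
	unsigned long sz;

	assert(addr_unit != 0);
	sz = 4096 / addr_unit;
	return sz ? sz : 1;
}

/* Validate first, and touch the state only when the input is acceptable. */
int sketch_apply_parameters(struct sketch_params *p, unsigned long addr_unit)
{
	if (addr_unit == 0)
		return -EINVAL;
	if (!sketch_is_power_of_2(sketch_min_region_sz(addr_unit)))
		return -EINVAL;	/* early exit: nothing was modified */

	p->addr_unit = addr_unit;
	p->committed = true;
	return 0;
}
```

Under the assumed derivation, addr_unit=3 gives 1365, which is not a power of 2, so the call fails without modifying the parameter state.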
3 daysmm/damon/sysfs-schemes: fix double increment of nr_regionsVineet Agarwal1-1/+1
damos_sysfs_populate_region_dir() increments sysfs_regions->nr_regions twice when adding a new region: once explicitly before kobject_init_and_add(), and once again through the post-increment used for the kobject name. As a result, nr_regions no longer matches the actual number of live regions, and region directory names skip numbers (1, 3, 5, ...). Use the already incremented value for naming instead of incrementing nr_regions a second time. Link: https://lore.kernel.org/20260512041157.109845-1-agarwal.vineet2006@gmail.com Fixes: 66178e4ec30a ("mm/damon/sysfs: use damos_walk() for update_schemes_tried_{bytes,regions}") Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
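The double increment can be reproduced with a minimal user-space sketch. The `sketch_*` names are hypothetical, and the fixed variant simply names the entry with the already incremented counter; whether the real fix uses that value directly or adjusts it is an assumption not confirmed by the message above.

```c
/* Hypothetical stand-in for the sysfs regions directory state. */
struct sketch_regions {
	int nr_regions;
};

/*
 * Buggy pattern: the counter is bumped explicitly, and the post-increment
 * used to build the name bumps it a second time.
 * Returns the number that would be used as the directory name.
 */
int sketch_add_region_buggy(struct sketch_regions *r)
{
	r->nr_regions++;
	return r->nr_regions++;
}

/* Fixed pattern: name the entry with the already incremented value. */
int sketch_add_region_fixed(struct sketch_regions *r)
{
	r->nr_regions++;
	return r->nr_regions;
}
```

With the buggy variant, consecutive calls yield the names 1, 3, 5 and the counter ends up at twice the number of regions added.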
3 daysdrivers/base/memory: make memory block get/put explicitMuchun Song1-3/+2
Rename the memory block lookup helper to make the acquired reference explicit, add memory_block_put() to wrap put_device(), remove find_memory_block(), and use memory_block_get() as the single block-id based lookup interface. This makes it clearer to callers that a successful lookup holds a reference that must be dropped, reducing the chance of forgetting the matching put and leaking the memory block device reference. Link: https://lore.kernel.org/linux-mm/7887915D-E598-42B3-9AFE-BFFBACE8DE2D@linux.dev/#t Link: https://lore.kernel.org/20260512072635.3969576-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> #s390 Cc: Richard Cheng <icheng@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Doug Anderson <dianders@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 dayspowerpc/mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODEDavid Hildenbrand (Arm)1-1/+1
register_page_bootmem_info_node() essentially only calls register_page_bootmem_memmap(). However, on powerpc that function is a nop. So there is no benefit in using CONFIG_HAVE_BOOTMEM_INFO_NODE anymore, let's just drop it. We can stop including bootmem_info.h. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-8-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/bootmem_info: stop marking mem_section_usage as MIX_SECTION_INFODavid Hildenbrand (Arm)1-11/+1
We never free the ms->usage data for boot memory sections (see section_deactivate()). And to identify whether ms->usage was allocated from memblock, we simply identify it by looking at PG_reserved. Consequently, there is no need to mark ms->usage as MIX_SECTION_INFO. Let's just stop doing that. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-6-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/bootmem_info: stop marking the pgdat as NODE_INFODavid Hildenbrand (Arm)1-8/+1
We removed the last user of NODE_INFO in commit 119c31caa59e ("mm/sparse: remove !CONFIG_SPARSEMEM_VMEMMAP leftovers for CONFIG_MEMORY_HOTPLUG"). But it really was never used besides for safety-checks ever since it was introduced in commit 04753278769f ("memory hotplug: register section/node id to free"), where we had the comment: 5) The node information like pgdat has similar issues. But, this will be able to be solved too by this. (Not implemented yet, but, remembering node id in the pages.) Of course, that never happened, and we are not planning on freeing the node data (pgdat/pglist_data), during memory hotunplug. So let's just stop marking the pgdat as NODE_INFO. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-5-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/bootmem_info: remove call to kmemleak_free_part_phys()David Hildenbrand (Arm)1-1/+0
The call to kmemleak_free_part_phys() was added in 2022 in commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem"). In 2025, commit b2aad24b5333 ("mm/memmap: prevent double scanning of memmap by kmemleak") started to use MEMBLOCK_ALLOC_NOLEAKTRACE when allocating the memmap to skip the kmemleak_alloc_phys() in the buddy. So remove the call to kmemleak_free_part_phys(). If this would still be required for other purposes, either free_reserved_page() should take care of it, or selected users. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-4-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/bootmem_info: stop using PG_privateDavid Hildenbrand (Arm)1-2/+0
Nobody checks PG_private for these pages, and we can happily use set_page_private() without setting PG_private. So let's just stop setting/clearing PG_private. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-3-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/bootmem_info: drop initialization of page->lruDavid Hildenbrand (Arm)1-1/+0
In the past, we used to store the type in page->lru.next, introduced by commit 5f24ce5fd34c ("thp: remove PG_buddy"). The location changed over the years; ever since commit 0386aaa6e9c8 ("bootmem: stop using page->index"), we store it alongside the info in page->private. Consequently, there is no need to reset page->lru anymore. Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-2-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/memory_hotplug: factor out altmap freeing checksMuchun Song1-7/+9
Use a small helper to centralize altmap freeing after verifying that all vmemmap pages were released. This keeps the check consistent between the normal teardown path and the memory hotplug error paths. Link: https://lore.kernel.org/20260511084307.1827127-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon: replace damon_rand() with a per-ctx lockless PRNGJiayuan Chen4-17/+38
damon_rand() on the sampling_addr hot path called get_random_u32_below(), which takes a local_lock_irqsave() around a per-CPU batched entropy pool and periodically refills it with ChaCha20. At elevated nr_regions counts (20k+), the lock_acquire / local_lock pair plus __get_random_u32_below() dominate kdamond perf profiles. Replace the helper with a lockless lfsr113 generator (struct rnd_state) held per damon_ctx and seeded from get_random_u64() in damon_new_ctx(). kdamond is the single consumer of a given ctx, so no synchronization is required. Range mapping uses traditional reciprocal multiplication, similar to get_random_u32_below(); for spans larger than U32_MAX (only reachable on 64-bit) the slow path combines two u32 outputs and uses mul_u64_u64_shr() at 64-bit width. On 32-bit the slow path is dead code and gets eliminated by the compiler. The new helper takes a ctx parameter; damon_split_regions_of() and the kunit tests that call it directly are updated accordingly. lfsr113 is a linear PRNG and MUST NOT be used for anything security-sensitive. DAMON's sampling_addr is not exposed to userspace and is only consumed as a probe point for PTE accessed-bit sampling, so a non-cryptographic PRNG is appropriate here. Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU usage reduced from ~72% to ~50% of one core. Link: https://lore.kernel.org/20260505145212.108644-1-jiayuan.chen@linux.dev Link: https://lore.kernel.org/damon/20260426173346.86238-1-sj@kernel.org/T/#m4f1fd74112728f83a41511e394e8c3fef703039c Link: https://lore.kernel.org/20260509011816.85145-1-sj@kernel.org Signed-off-by: Jiayuan Chen <jiayuan.chen@shopee.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Shu Anzai <shu17az@gmail.com> Cc: Quanmin Yan <yanquanmin1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
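The range mapping mentioned above (multiply a 32-bit random value by the span and keep the high 32 bits) can be sketched in user-space C. This is only an illustration under stated assumptions: it uses a xorshift32 generator as a stand-in rather than the kernel's lfsr113 (struct rnd_state), it covers only spans that fit in 32 bits, and all `sketch_*` names are hypothetical. Like lfsr113, this stand-in is not suitable for anything security-sensitive.

```c
#include <assert.h>
#include <stdint.h>

/* Per-context generator state; a single consumer needs no locking. */
struct sketch_rng {
	uint32_t state;		/* must be seeded with a non-zero value */
};

/* xorshift32 step: a stand-in PRNG for this sketch, NOT lfsr113. */
uint32_t sketch_rng_next(struct sketch_rng *rng)
{
	uint32_t x = rng->state;

	assert(x != 0);
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	rng->state = x;
	return x;
}

/* Map a uniform 32-bit value into [0, span) via multiply and shift. */
uint32_t sketch_map_to_span(uint32_t r, uint32_t span)
{
	return (uint32_t)(((uint64_t)r * span) >> 32);
}

/* Pick a value in [l, r) from the per-context generator. */
uint32_t sketch_rand_range(struct sketch_rng *rng, uint32_t l, uint32_t r)
{
	return l + sketch_map_to_span(sketch_rng_next(rng), r - l);
}
```

The multiply-and-shift mapping avoids a division on the hot path; it carries a small bias for spans that do not divide 2^32, which is generally acceptable for choosing a sampling address.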
3 daysfooAndrew Morton41-597/+1342
3 daysmm/damon/lru_sort: handle ctx allocation failureSeongJae Park1-0/+4
DAMON_LRU_SORT allocates the damon_ctx object for its kdamond in its init function. damon_lru_sort_enabled_store() wrongly assumes the allocation will always succeed once tried. If the damon_ctx allocation failed, code execution therefore reaches damon_commit_ctx() while 'ctx' is NULL. As a result, it dereferences the NULL 'ctx' pointer. Avoid the NULL dereference by returning -ENOMEM if 'ctx' is NULL. Link: https://lore.kernel.org/20260529000104.7006-3-sj@kernel.org Fixes: c4a8e662c839 ("mm/damon/lru_sort: use damon_initialized()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/reclaim: handle ctx allocation failureSeongJae Park1-0/+4
Patch series "mm/damon/{reclaim,lru_sort}: handle ctx allocation failures". DAMON_RECLAIM and DAMON_LRU_SORT could dereference NULL pointers if their damon_ctx object allocations fail. The bugs are expected to happen infrequently because the allocations are arguably too small to fail on common setups. But theoretically they are possible and the consequences are bad. Fix those. The issues were discovered [1] by Sashiko. This patch (of 2): DAMON_RECLAIM allocates the damon_ctx object for its kdamond in its init function. damon_reclaim_enabled_store() wrongly assumes the allocation will always succeed once tried. If the damon_ctx allocation failed, code execution therefore reaches damon_commit_ctx() while 'ctx' is NULL. As a result, it dereferences the NULL 'ctx' pointer. Avoid the NULL dereference by returning -ENOMEM if 'ctx' is NULL. Link: https://lore.kernel.org/20260529000104.7006-2-sj@kernel.org Link: https://lore.kernel.org/20260419014800.877-1-sj@kernel.org [1] Fixes: 3f7a914ab9a5 ("mm/damon/reclaim: use damon_initialized()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
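The guard described in these two patches amounts to checking the context pointer before it is handed to the commit step. A minimal user-space sketch, with hypothetical `sketch_*` names standing in for the module's store handler and for damon_commit_ctx():

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the context allocated at module init. */
struct sketch_ctx {
	int committed;
};

/* Stand-in for the commit step; it dereferences ctx unconditionally. */
int sketch_commit_ctx(struct sketch_ctx *ctx)
{
	ctx->committed = 1;
	return 0;
}

/*
 * Stand-in for the "enabled" store handler. A NULL ctx models an
 * allocation that failed at init time and was never retried.
 */
int sketch_enabled_store(struct sketch_ctx *ctx)
{
	if (!ctx)
		return -ENOMEM;	/* bail out before any dereference */
	return sketch_commit_ctx(ctx);
}
```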
3 daysuserfaultfd: remove redundant check in vm_uffd_ops()Mike Rapoport (Microsoft)1-1/+1
Lorenzo says: static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma) { if (vma_is_anonymous(vma)) return &anon_uffd_ops; return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL; } This is doing a redundant check _and_ making life confusing, as if !vma->vm_ops is a condition that can be reached there, it can't, as vma_is_anonymous() is literally a !vma->vm_ops check :) Remove the redundant check. Link: https://lore.kernel.org/20260527184751.4147364-4-rppt@kernel.org Fixes: 0f48947c4232 ("userfaultfd: introduce vm_uffd_ops") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Suggested-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: David Carlier <devnexen@gmail.com> Cc: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
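For illustration, the helper without the redundant check might look like the following user-space sketch. The `sketch_*` types are hypothetical stand-ins for the kernel structures, and the anonymous check is modeled as the `!vma->vm_ops` test the message describes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures involved. */
struct sketch_uffd_ops {
	int id;
};

struct sketch_vm_ops {
	const struct sketch_uffd_ops *uffd_ops;
};

struct sketch_vma {
	const struct sketch_vm_ops *vm_ops;
};

const struct sketch_uffd_ops sketch_anon_uffd_ops = { .id = 1 };

/* Modeled as in the message: anonymous means "no vm_ops". */
bool sketch_vma_is_anonymous(const struct sketch_vma *vma)
{
	return !vma->vm_ops;
}

const struct sketch_uffd_ops *sketch_vma_uffd_ops(const struct sketch_vma *vma)
{
	if (sketch_vma_is_anonymous(vma))
		return &sketch_anon_uffd_ops;

	/* vm_ops is known to be non-NULL here, so no second check is needed. */
	return vma->vm_ops->uffd_ops;
}
```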
3 daysuserfaultfd: refuse to __mfill_atomic_pte() for unsupported VMAsMike Rapoport (Microsoft)1-0/+5
__mfill_atomic_pte() unconditionally dereferences ops because there is an assumption that VMAs that can undergo mfill_* operations are vetted on registration and must have valid vm_uffd_ops. Add a guard against potential bugs and make sure __mfill_atomic_pte() bails out if ops is NULL. Link: https://lore.kernel.org/20260527184751.4147364-3-rppt@kernel.org Fixes: ad9ac3081332 ("userfaultfd: introduce vm_uffd_ops->alloc_folio()") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Suggested-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: David CARLIER <devnexen@gmail.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Michael Bommarito <michael.bommarito@gmail.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysuserfaultfd: verify VMA state across UFFDIO_COPY retryMike Rapoport (Microsoft)1-12/+73
Patch series "userfaultfd: verify VMA state across UFFDIO_COPY retry", v2. ... and two more small fixes. This patch (of 3): mfill_copy_folio_retry() drops the VMA lock for copy_from_user() and reacquires it afterwards. The destination VMA can be replaced during that window. The existing check compares vma_uffd_ops() before and after the retry, but if a shmem VMA with MAP_SHARED is replaced with a shmem VMA with MAP_PRIVATE (or vice versa) the replacement goes undetected. The change from MAP_PRIVATE to MAP_SHARED will treat the folio allocated with shmem_alloc_folio() as anonymous, and this causes BUG() when mfill_atomic_install_pte() tries to call folio_add_new_anon_rmap(). The change from MAP_SHARED to MAP_PRIVATE allows injection of folios into the page cache of the original VMA. No change is needed for hugetlb because it never uses mfill_copy_folio_retry(). Introduce helpers for more comprehensive comparison of VMA state: - mfill_retry_state_save() to save the relevant VMA state into a struct mfill_retry_state (original uffd_ops, relevant VMA flags, vm_file and pgoff) before dropping the lock - mfill_retry_state_changed() to compare the saved state with the state of the VMA acquired after retaking the locks - mfill_retry_state_put() to release vm_file pinning. Use DEFINE_FREE() cleanup to wrap mfill_retry_state_put() to avoid complicating error handling paths in mfill_copy_folio_retry().
Link: https://lore.kernel.org/20260527184751.4147364-1-rppt@kernel.org Link: https://lore.kernel.org/20260527184751.4147364-2-rppt@kernel.org Fixes: 292411fda25b ("mm/userfaultfd: detect VMA type change after copy retry in mfill_copy_folio_retry()") Fixes: 6ab703034f14 ("userfaultfd: mfill_atomic(): remove retry logic") Co-developed-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Suggested-by: Peter Xu <peterx@redhat.com> Co-developed-by: David Carlier <devnexen@gmail.com> Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/huge_memory: update file PMD counter before folio_put()Yin Tirui1-0/+2
__split_huge_pmd_locked() updates the file/shmem RSS counter after dropping the PMD mapping's folio reference. If folio_put() drops the last reference, mm_counter_file() can later read freed folio state via folio_test_swapbacked(). Move the counter update before folio_put(). Link: https://lore.kernel.org/20260526101337.1984081-1-yintirui@huawei.com Fixes: fadae2953072 ("thp: use mm_file_counter to determine update which rss counter") Signed-off-by: Yin Tirui <yintirui@huawei.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Jun <chenjun102@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/huge_memory: update file PUD counter before folio_put()Yin Tirui1-1/+1
__split_huge_pud_locked() updates the file/shmem RSS counter after dropping the PUD mapping's folio reference. If folio_put() drops the last reference, mm_counter_file() can later read freed folio state via folio_test_swapbacked(). Move the counter update before folio_put(). Link: https://lore.kernel.org/20260526101355.1984244-1-yintirui@huawei.com Fixes: dbe54153296d ("mm/huge_memory: add vmf_insert_folio_pud()") Signed-off-by: Yin Tirui <yintirui@huawei.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (arm) <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chen Jun <chenjun102@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
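The ordering issue fixed by these two patches can be shown with a simplified user-space model. The `sketch_*` types and functions are hypothetical; they only mimic a reference-counted object whose state selects which counter to update, and a put that may free the object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted folio. */
struct sketch_folio {
	int refcount;
	bool swapbacked;	/* selects the shmem vs. file counter */
};

/* Hypothetical stand-in for the per-mm RSS counters. */
struct sketch_mm {
	long file_rss;
	long shmem_rss;
};

/* Drop one reference; the object is freed when the last one goes away. */
void sketch_folio_put(struct sketch_folio *folio)
{
	assert(folio->refcount > 0);
	if (--folio->refcount == 0)
		free(folio);
}

/*
 * Correct ordering: read the folio state and update the counter while
 * our reference still pins the folio, and only then drop the reference.
 */
void sketch_unmap_file_folio(struct sketch_mm *mm,
			     struct sketch_folio *folio, long nr_pages)
{
	if (folio->swapbacked)
		mm->shmem_rss -= nr_pages;
	else
		mm->file_rss -= nr_pages;

	sketch_folio_put(folio);	/* folio must not be touched after this */
}
```

Swapping the two steps would read `swapbacked` from memory that may already have been freed, which is the pattern the patches remove.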
3 daysmm/hugetlb_vmemmap: fix incorrect vmemmap restore in rollbackMuchun Song1-18/+18
vmemmap_restore_pte() rebuilds restored vmemmap pages from a tail-page template derived from compound_head(). This is wrong when the current PTE already maps a page whose contents are not tail-page metadata. In the rollback path of vmemmap_remap_free(), the first restored PTE is backed by vmemmap_head and contains head-page metadata. Reconstructing that page from a tail-page template overwrites the head-page state and corrupts the restored vmemmap page. Fix this by copying the full page from the page currently mapped by the PTE. Also pass vmemmap_tail to the rollback walk so only PTEs backed by the shared tail page are restored, while the head PTE remains mapped to vmemmap_head. Add VM_WARN_ON_ONCE() checks for unexpected cases. Link: https://lore.kernel.org/20260525025213.2229628-1-songmuchun@bytedance.com Fixes: c0b495b91a47 ("mm/hugetlb: refactor code around vmemmap_walk") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Kiryl Shutsemau <kas@kernel.org> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/ops-common: call folio_test_lru() after folio_get()SeongJae Park1-2/+2
damon_get_folio() speculatively calls folio_test_lru() before folio_try_get(). The folio can get freed and reallocated to a tail page. In that case, VM_BUG_ON_PGFLAGS() in const_folio_flags() can be triggered. Remove the speculative call. Also mark the folio_test_lru() check right after folio_try_get() success as no longer unlikely. The race should be rare. Also the problem can happen only if the kernel has enabled CONFIG_DEBUG_VM_PGFLAGS. No real world report of this issue has been made so far. This fix is based on only theoretical analysis. That said, a bug is a bug. A similar issue was also fixed via commit 3203b3ab0fcf ("mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()"). I don't expect this change will make a meaningful impact on DAMON performance in the real world, though I will be happy to be corrected by real world reports. The issue was discovered [1] by Sashiko. Link: https://lore.kernel.org/20260525162256.8317-1-sj@kernel.org Link: https://lore.kernel.org/20260517234112.89245-1-sj@kernel.org [1] Fixes: 3f49584b262c ("mm/damon: implement primitives for the virtual memory address spaces") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Fernand Sieber <sieberf@amazon.com> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> # 5.15.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/cma_sysfs: skip inactive CMA areas in sysfsKaitao Cheng1-2/+6
cma_activate_area() can fail after a CMA area has already been added to cma_areas[]. In that case the area is left in the global array, but it does not reach the point where CMA_ACTIVATED is set. cma_sysfs_init() currently walks all cma_area_count entries and creates sysfs files for every area, including ones that failed activation. These areas are not usable CMA areas and should not be exposed to userspace as valid CMA regions. If such an inactive area is exposed, userspace sees a CMA directory whose read-only accounting files report zeros. total_pages and available_pages report zero because the failed activation path clears cma->count and cma->available_count, while the allocation and release counters also stay at zero because the area cannot service CMA allocations. This makes the failed area look like a valid but empty CMA region and can mislead tests, monitoring, and diagnostics. Skip CMA areas that did not reach CMA_ACTIVATED when creating the sysfs objects. Since inactive entries can now be skipped, make the error unwind tolerate entries that never had cma_kobj initialized. Link: https://lore.kernel.org/20260524140420.61864-1-kaitao.cheng@linux.dev Link: https://lore.kernel.org/20260522131434.78532-1-kaitao.cheng@linux.dev Fixes: 43ca106fa8ec ("mm: cma: support sysfs") Signed-off-by: Kaitao Cheng <chengkaitao@kylinos.cn> Reported-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: Muchun Song <songmuchun@bytedance.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Closes: https://lore.kernel.org/linux-mm/55481a8b-dcfc-4bef-ba59-aa0b43dca88b@kernel.org/ Acked-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Dmitry Osipenko <digetx@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam R. 
Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: swap_cgroup: fix NULL deref in lookup_swap_cgroup_id on swapless hostJose Fernandez (Anthropic)1-0/+2
lookup_swap_cgroup_id() passes swap_cgroup_ctrl[type].map to __swap_cgroup_id_lookup() without checking that the type was ever registered via swap_cgroup_swapon(). On a swapless host every ctrl->map is NULL, so __swap_cgroup_id_lookup() dereferences NULL + a scaled swp_offset(). Since commit bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()"), zap_pte_range() -> swap_pte_batch() calls lookup_swap_cgroup_id() on any non-present, non-none PTE that decodes as a real swap entry, without first validating it against swap_info[]. A single PTE corrupted into a type-0 swap entry takes the host down at process exit. We hit this in production on a swapless 6.12.58 host: ~1s of "get_swap_device: Bad swap file entry 3f800204222bb" (do_swap_page() being correctly defensive about the same entry) followed by BUG: unable to handle page fault for address: 000003f800204220 RIP: 0010:lookup_swap_cgroup_id+0x2b/0x60 Call Trace: swap_pte_batch+0xbf/0x230 zap_pte_range+0x4c8/0x780 unmap_page_range+0x190/0x3e0 exit_mmap+0xd9/0x3c0 do_exit+0x20c/0x4b0 syzbot has reported the identical stack. The source of the PTE corruption is a separate bug; this change makes the teardown path as robust as the fault path already is. Every other caller of lookup_swap_cgroup_id() is downstream of a get_swap_device() that has already validated the entry, so the new branch is cold. 
Link: https://lore.kernel.org/20260504-swap-cgroup-fix-7-0-v1-1-f53ff41ee553@linux.dev Fixes: bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()") Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Reported-by: syzbot+e12bd9ca48157add237a@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/69859728.050a0220.3b3015.0033.GAE@google.com Assisted-by: Claude:unspecified Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
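The guard described above can be sketched in user-space C. The `sketch_*` names, the table size, and the choice to return 0 for an unregistered type are assumptions made for this illustration; the additional bounds checks on the type and offset are extras of the sketch and are not claimed to be part of the patch.

```c
#include <stddef.h>

#define SKETCH_MAX_SWAPFILES 4

/* Hypothetical stand-in for the per-swap-type control structure. */
struct sketch_swap_cgroup_ctrl {
	unsigned short *map;	/* stays NULL until the type is registered */
	unsigned long nr_slots;
};

/* Zero-initialized: on a "swapless host" every map is NULL. */
struct sketch_swap_cgroup_ctrl sketch_ctrl[SKETCH_MAX_SWAPFILES];

/* Look up the cgroup id recorded for a swap entry, or 0 if there is none. */
unsigned short sketch_lookup_swap_cgroup_id(unsigned int type,
					    unsigned long offset)
{
	struct sketch_swap_cgroup_ctrl *ctrl;

	if (type >= SKETCH_MAX_SWAPFILES)	/* sketch-only bounds check */
		return 0;

	ctrl = &sketch_ctrl[type];
	if (!ctrl->map)				/* type never registered */
		return 0;
	if (offset >= ctrl->nr_slots)		/* sketch-only bounds check */
		return 0;

	return ctrl->map[offset];
}
```

A lookup for an entry of an unregistered type then returns 0 instead of dereferencing a NULL map plus an offset.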
3 dayskfence: fix KASAN HW tags bypass via runtime sample_interval changeAlexander Potapenko1-0/+5
If a user writes a non-zero value to the sample_interval module parameter at runtime, the missing KASAN HW tags check in the late init path allows KFENCE to be enabled alongside KASAN HW tags, bypassing the boot restriction. This patch adds the missing check to param_set_sample_interval() to reject the parameter change if KASAN HW tags are enabled. Link: https://lore.kernel.org/20260507095237.741017-1-glider@google.com Fixes: 09833d99db36 ("mm/kfence: disable KFENCE upon KASAN HW tags enablement") Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Pimyn Girgis <pimyn@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daystreewide: fix indentation and whitespace in Kconfig filesAnand Moon1-2/+2
Clean up inconsistent indentation (mixing tabs and spaces) and remove extraneous whitespace in several Kconfig files across the tree. This is a purely cosmetic change to improve readability. Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Link: https://lore.kernel.org/20260407053945.14116-1-linux.amoon@gmail.com Signed-off-by: Anand Moon <linux.amoon@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> [fs] Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> [mm] Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> [mm] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/kmemleak: dedupe verbose scan output by allocation backtraceBreno Leitao1-8/+140
Patch series "mm/kmemleak: dedupe verbose scan output", v3. I am starting to run kmemleak with verbose mode enabled at some "probe points" across my employer's fleet so that suspected leaks land in dmesg without needing a separate read of /sys/kernel/debug/kmemleak. The downside is that workloads which leak many objects from a single allocation site flood the console with byte-for-byte identical backtraces. Hundreds of duplicates per scan are common, drowning out distinct leaks and unrelated kernel messages, while adding no signal beyond the first occurrence. This series collapses those duplicates inside kmemleak itself. Each unique stackdepot trace_handle prints once per scan, followed by a short summary line when more than one object shares it:
  kmemleak: unreferenced object 0xff110001083beb00 (size 192):
  kmemleak: comm "modprobe", pid 974, jiffies 4294754196
  kmemleak: ...
  kmemleak: backtrace (crc 6f361828):
  kmemleak: __kmalloc_cache_noprof+0x1af/0x650
  kmemleak: ...
  kmemleak: ... and 71 more object(s) with the same backtrace
The "N new suspected memory leaks" tally and the contents of /sys/kernel/debug/kmemleak are unchanged - the per-object detail is still available on demand, only the verbose (dmesg) output is collapsed. Patch 1 is the kmemleak change. Patch 2 adds a selftest that loads samples/kmemleak's CONFIG_SAMPLE kmemleak-test module to generate ten leaks sharing one call site and checks that the printed count is strictly less than the reported leak total. I am not sure whether Patch 2 is useful; if it is not, it is easy to discard. This patch (of 2): In kmemleak's verbose mode, every unreferenced object found during a scan is logged with its full header, hex dump and 16-frame backtrace. Workloads that leak many objects from a single allocation site flood dmesg with byte-for-byte identical backtraces, drowning out distinct leaks and other kernel messages. 
Dedupe within each scan using stackdepot's trace_handle as the key: for every leaked object with a recorded stack trace, look up the representative kmemleak_object in a per-scan xarray keyed by trace_handle. The first sighting stores the object pointer (with a get_object() reference) and sets object->dup_count to 1; later sightings just bump dup_count on the representative. After the scan, walk the xarray once and emit each unique backtrace, followed by a single summary line when more than one object shares it. Leaks whose trace_handle is 0 (early-boot allocations tracked before kmemleak_init() set up object_cache, or stack_depot_save() failures under memory pressure) cannot be deduped, so they are still printed inline via the same locked OBJECT_ALLOCATED-checked helper. The contents of /sys/kernel/debug/kmemleak are unchanged - only the verbose console output is collapsed. Safety notes: - The xarray store happens outside object->lock: object->lock is a raw spinlock, while xa_store() may grab xa_node slab locks at a higher wait-context level which lockdep flags as invalid. trace_handle is captured under object->lock (which serialises with kmemleak_update_trace()'s writer), so it is safe to use after dropping the lock. - get_object() pins the kmemleak_object metadata across rcu_read_unlock(), but the underlying tracked allocation can still be freed concurrently. The deferred print path therefore re-acquires object->lock and re-checks OBJECT_ALLOCATED via print_leak_locked() before touching object->pointer; __delete_object() clears that flag under the same lock before the user memory goes away. The same helper is used by the trace_handle == 0 and xa_store() failure fallbacks, so every printer in the new path has identical safety guarantees. 
- If get_object() fails after we set OBJECT_REPORTED, the object is already being torn down (use_count hit zero); the leak count is still accurate but the verbose line is dropped, which is correct - the memory was freed concurrently and is no longer a leak. - If xa_store() fails to allocate an xa_node under memory pressure, we fall back to printing inline via print_leak_locked() instead of silently dropping the leak. - The hex dump is skipped for coalesced entries (dup_count > 1): bytes would differ across objects sharing a backtrace anyway, and skipping it removes the only remaining read of object->pointer's contents in the deferred path. The representative's reported size may also differ from the coalesced objects' sizes; the printed trace_handle reflects the representative's current value rather than the value used as the dedup key, which is normally - but not strictly - identical. Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-0-2d36aafc34da@debian.org Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-1-2d36aafc34da@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
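The per-scan dedup described above can be sketched in plain C. This is an illustration only: a linear search over a small array stands in for the per-scan xarray, and `struct rep` and `dedupe_by_handle()` are hypothetical names. Locking, reference counting and the OBJECT_ALLOCATED re-check are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One representative per unique backtrace handle. */
struct rep {
	uint32_t handle;    /* dedup key, always nonzero */
	size_t first_index; /* index of the first object seen with this key */
	size_t dup_count;   /* number of objects sharing the key */
};

/* Collapse handles[] into reps[]. Objects with handle 0, and objects that
 * do not fit into reps[] (the analogue of a failed store), are counted in
 * *printed_inline instead. Returns the number of representatives. */
static size_t dedupe_by_handle(const uint32_t *handles, size_t n,
			       struct rep *reps, size_t cap,
			       size_t *printed_inline)
{
	size_t nr_reps = 0;

	*printed_inline = 0;
	for (size_t i = 0; i < n; i++) {
		size_t j;

		if (handles[i] == 0) {
			(*printed_inline)++; /* cannot be deduped */
			continue;
		}
		for (j = 0; j < nr_reps; j++) {
			if (reps[j].handle == handles[i])
				break;
		}
		if (j < nr_reps)
			reps[j].dup_count++; /* later sighting */
		else if (nr_reps < cap)
			reps[nr_reps++] = (struct rep){ handles[i], i, 1 };
		else
			(*printed_inline)++; /* no room: fall back to inline */
	}
	return nr_reps;
}
```

After the pass, each representative is printed once, and a summary line is only needed for entries whose dup_count is greater than 1.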
3 daysmm/swap: add cond_resched() in swap_reclaim_full_clusters to prevent softlockupZijiang Huang1-0/+1
We hit a real softlockup in an internal stress test environment. The workload was LTP memory/swap stress on a large arm64 machine with 320 CPUs, about 1TB of memory and an 8.6GB swap device. The system was under heavy load and the swap device had a large number of full clusters. The softlockup was triggered after about 3 days of stress testing. So, add periodic cond_resched() calls during large full_clusters reclaim operations to prevent softlockup issues. The detailed call trace follows:
  PID: 3817773 TASK: ffff0883bb28b780 CPU: 48 COMMAND: "kworker/48:7"
  #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
  #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
  #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
  ...
  #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
  #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
  #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
  #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
  #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
  #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
  #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c
Link: https://lore.kernel.org/20260506130919.2298807-1-kerayhuang@tencent.com Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters") Signed-off-by: Zijiang Huang <kerayhuang@tencent.com> Reviewed-by: Kairui Song <kasong@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Reviewed-by: albinwyang <albinwyang@tencent.com> Reviewed-by: Baoquan He <baoquan.he@linux.dev> Acked-by: Chris Li <chrisl@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Youngjun Park <youngjun.park@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/stat: add a parameter for reading kdamond pidSeongJae Park1-0/+39
Patch series "mm/damon/stat: add kdamond_pid parameter". DAMON_STAT doesn't provide the pid of its kdamond, unlike DAMON_RECLAIM and DAMON_LRU_SORT. This makes user-space management of DAMON_STAT unnecessarily complicated. Provide the information via a new parameter, namely kdamond_pid, and document it. This patch (of 2): Knowing the pid of the kdamonds can help user-space management including monitoring of DAMON's system resource consumption. To make it easier, DAMON_SYSFS, DAMON_RECLAIM and DAMON_LRU_SORT provide the pid information. DAMON_STAT is not providing it, though. Expose the pid of DAMON_STAT kdamond via a new read-only module parameter, namely kdamond_pid. This also makes DAMON modules usage more standardized, because DAMON_RECLAIM and DAMON_LRU_SORT also provide the information via their read-only parameters of the same name. Link: https://lore.kernel.org/20260502020505.80822-1-sj@kernel.org Link: https://lore.kernel.org/20260502020505.80822-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
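For user-space tooling, reading such a parameter is a matter of parsing one integer from a text file. A minimal sketch, assuming the parameter follows the usual /sys/module/<module>/parameters/<name> layout; the helper name `read_pid_file()` is hypothetical, and the exact path for DAMON_STAT is an assumption rather than something stated in the message above.

```c
#include <stdio.h>

/* Read one signed integer from a sysfs-style text file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_pid_file(const char *path, long *pid)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%ld", pid) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A caller would pass something like "/sys/module/damon_stat/parameters/kdamond_pid" (assumed path) and can then feed the pid to ordinary process monitoring tools.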
3 daysmm/damon/reclaim: add autotune_monitoring_intervals parameterSeongJae Park1-5/+28
Patch series "mm/damon/reclaim: support monitoring intervals auto-tuning". The monitoring intervals auto-tuning feature of DAMON has proven to be useful in multiple environments. Add a new DAMON_RECLAIM parameter for supporting the feature, and update the document for the new parameter. This patch (of 2): DAMON's monitoring intervals auto-tuning feature has proven to be useful in multiple environments. DAMON_RECLAIM is still asking users to do the manual tuning of the intervals. Add a module parameter for utilizing the auto-tuning feature with the suggested default setup. Note that use of the auto-tuning overrides the manually entered monitoring intervals. Also, note that 'min_age' will be dynamically changed in proportion to the auto-tuned intervals. It is recommended to set 'min_age' short enough and to use it together with coldness threshold auto-tuning features such as 'quota_mem_pressure_us'. Link: https://lore.kernel.org/20260501011740.81988-1-sj@kernel.org Link: https://lore.kernel.org/20260501011740.81988-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hitsfujunjie1-0/+1
A fault that starts synchronous mmap readahead can return VM_FAULT_RETRY after dropping mmap_lock. The retry may then map the folio brought in by that same miss. Do not let this retry decrement mmap_miss. The retry still maps the folio from the page cache; it just does not count as a useful mmap readahead hit. Link: https://lore.kernel.org/tencent_22E6B8849EC1141FE7773C64467E6F1E2C09@qq.com Signed-off-by: fujunjie <fujunjie1@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/filemap: count only the faulting address as a mmap hitfujunjie1-31/+31
Patch series "mm/filemap: tighten mmap_miss hit accounting", v3. mmap_miss is increased when synchronous mmap readahead is needed, and decreased when filemap_map_pages() maps folios that are already in the page cache. The decrease side can over-credit hits in two cases: - fault-around installs nearby PTEs even though the fault only proves that the faulting address was accessed; - after synchronous mmap readahead returns VM_FAULT_RETRY, the retry can find the folio brought in by the same miss and immediately cancel that miss. Current evidence comes from a local KVM/data-disk microbenchmark using mmap_miss_probe, with an 8 GiB guest, 2 vCPUs, 8192 KiB read_ahead_kb, cold page cache before each run, 1% of the file accessed, and medians of 3 runs. mmap_miss_probe mmap()s a prepared file with MADV_NORMAL and then touches one byte at selected base-page offsets. The access order is random, sequential, or a fixed page stride. The harness drops caches before each run and samples /proc/vmstat around that access loop. The 20 GiB case below is a larger-than-memory file case in an 8 GiB guest. No separate memory hog was used. The 4 GiB case uses the same 8 GiB guest but keeps the file fit-in-memory. Each case used a fresh temporary qcow2 data disk, seen by the guest as /dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix. Each result is "pgpgin GiB / elapsed seconds". "pgpgin GiB" is the delta of the guest /proc/vmstat pgpgin counter, converted from KiB to GiB; it is used here as an approximate block input counter, not as resident memory or exact application IO. "Elapsed seconds" is the wall-clock runtime of the whole mmap_miss_probe access pass, not per-access latency. 
For the 20 GiB larger-than-memory case:

  workload     before                  after
  random       223.377 GiB/101.293s    1.010 GiB/4.790s
  stride1021   204.214 GiB/97.557s     204.208 GiB/108.086s
  stride2053   409.584 GiB/193.700s    0.970 GiB/3.685s
  stride4099   406.452 GiB/134.241s    0.975 GiB/3.499s
  sequential   0.212 GiB/0.050s        0.212 GiB/0.057s

For the 4 GiB fit-in-memory case:

  workload     before              after
  random       3.987 GiB/1.960s    0.980 GiB/1.221s
  stride1021   4.002 GiB/1.838s    4.002 GiB/1.851s
  stride2053   3.991 GiB/1.835s    0.811 GiB/0.985s
  stride4099   4.001 GiB/1.836s    0.819 GiB/1.037s
  sequential   0.056 GiB/0.013s    0.056 GiB/0.018s

The 20 GiB setup also has an ablation. P1 is only the faulting-address hit accounting change. P2-only is only the FAULT_FLAG_TRIED retry filter. P1+P2 is the combined accounting change:

  workload     variant    result
  random       baseline   223.377 GiB/101.293s
  random       P1         223.268 GiB/98.481s
  random       P2-only    223.257 GiB/100.091s
  random       P1+P2      1.010 GiB/4.790s
  stride2053   baseline   409.584 GiB/193.700s
  stride2053   P1         409.584 GiB/197.645s
  stride2053   P2-only    15.722 GiB/5.485s
  stride2053   P1+P2      0.970 GiB/3.685s
  sequential   baseline   0.212 GiB/0.050s
  sequential   P1         0.212 GiB/0.046s
  sequential   P2-only    0.212 GiB/0.050s
  sequential   P1+P2      0.212 GiB/0.057s

After the v2 implementation refactor, only the final P1+P2 shape was rerun in the same setup. The numbers stayed in line with the v1 P1+P2 rows above:

  workload     larger-than-memory case    fit-in-memory case
               20 GiB file, 1% access     4 GiB file, 1% access
  random       1.010 GiB/4.383s           0.980 GiB/1.088s
  stride1021   204.216 GiB/105.601s       4.001 GiB/1.783s
  stride2053   0.970 GiB/3.760s           0.810 GiB/0.908s
  stride4099   0.975 GiB/3.410s           0.818 GiB/0.870s
  sequential   0.212 GiB/0.060s           0.056 GiB/0.016s

This does not claim to solve every sparse pattern. The stride1021 rows are intentionally shown as a boundary: with 8192 KiB read_ahead_kb, file->f_ra.ra_pages is 2048 base pages, and synchronous mmap read-around uses a 2048-page window centered around the fault, roughly [index - 1024, index + 1023]. 
stride1021 is 1021 * 4 KiB = 4084 KiB, so the next access lands inside the previous read-around window. About every other access can be a real faulting-address page-cache hit, and the other half can each read about 8 MiB. For about 52k accesses in the 20 GiB/1% run, half of them times 8 MiB is about 205 GiB, matching the observed 204 GiB. This patch (of 2): filemap_map_pages() reduces file->f_ra.mmap_miss when fault-around maps folios that are already present in the page cache. That hit accounting is too generous because fault-around can install PTEs around the faulting address even though the fault only proves that the faulting address was accessed. Move the mmap_miss update back into filemap_map_pages(), drop the mmap_miss argument from the helper functions, and decrement mmap_miss only when the helper return value shows that the faulting address was mapped. Keep the existing workingset-folio behavior unchanged. Link: https://lore.kernel.org/tencent_AA501E9A238337BD167E5C2ACF948A1AF308@qq.com Link: https://lore.kernel.org/tencent_756F151FE66F3D80479A6F982C0AB8569F09@qq.com Signed-off-by: fujunjie <fujunjie1@qq.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
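The stride1021 estimate in the message above can be checked with a few lines of integer arithmetic. A minimal sketch: the helper names are hypothetical, and the "every other access misses and reads one full window" model is the approximation stated in the message, not a measurement.

```c
#include <stdbool.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)
#define GIB (1024ULL * MIB)

/* Number of base pages touched when percent% of the file is accessed. */
static uint64_t nr_accesses(uint64_t file_bytes, uint64_t page_bytes,
			    unsigned int percent)
{
	return file_bytes / page_bytes * percent / 100;
}

/* Does the next access, stride_pages after the fault, still land inside
 * a read-around window of roughly [index - w/2, index + w/2 - 1]? */
static bool lands_in_previous_window(uint64_t stride_pages,
				     uint64_t window_pages)
{
	return stride_pages <= window_pages / 2 - 1;
}

/* Estimated read volume if half of the accesses each read one window. */
static uint64_t estimated_read_bytes(uint64_t accesses, uint64_t window_bytes)
{
	return accesses / 2 * window_bytes;
}
```

For a 20 GiB file, 4 KiB pages and 1% access, nr_accesses() gives 52428; half of those times an 8 MiB window is 209712 MiB, a little under 205 GiB. A stride of 1021 pages stays within the 1023-page forward half of the window, while strides of 2053 and 4099 do not.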
3 daysmm: use zone lock guard in __offline_isolated_pages()Dmitry Ilvokhin1-3/+2
Use spinlock_irqsave zone lock guard in __offline_isolated_pages() to replace the explicit lock/unlock pattern with automatic scope-based cleanup. Link: https://lore.kernel.org/13149be4f8151e18eb5f1eb4f3241ab3cffb373e.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: use zone lock guard in free_pcppages_bulk()Dmitry Ilvokhin1-4/+1
Use spinlock_irqsave zone lock guard in free_pcppages_bulk() to replace the explicit lock/unlock pattern with automatic scope-based cleanup. Link: https://lore.kernel.org/aafc2d660057a91eb40417f8ff4645b0a8c525e2.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: use zone lock guard in put_page_back_buddy()Dmitry Ilvokhin1-8/+4
Use spinlock_irqsave zone lock guard in put_page_back_buddy() to replace the explicit lock/unlock pattern with automatic scope-based cleanup. Link: https://lore.kernel.org/b0fceedca37139da36aa626ac72eb9840b641021.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: use zone lock guard in take_page_off_buddy()Dmitry Ilvokhin1-7/+3
Use spinlock_irqsave zone lock guard in take_page_off_buddy() to replace the explicit lock/unlock pattern with automatic scope-based cleanup. This also allows returning directly from the loop, removing the 'ret' variable. Link: https://lore.kernel.org/a981721632a981f148c63e3f7df3d1116a0c3f6d.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
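The effect of a scope-based guard on a function that returns from inside a loop can be shown with a userspace analogue. This is a sketch under stated assumptions: it relies on the GCC/Clang `cleanup` variable attribute, and `struct toy_lock`, `TOY_GUARD()` and `contains()` are hypothetical names for a single-threaded toy lock, not the kernel's guard API or the zone lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Single-threaded toy lock that only records whether it is held. */
struct toy_lock {
	bool locked;
};

static struct toy_lock *toy_guard_enter(struct toy_lock *l)
{
	assert(!l->locked); /* the toy lock is not recursive */
	l->locked = true;
	return l;
}

/* Called automatically when the guard variable goes out of scope. */
static void toy_guard_exit(struct toy_lock **l)
{
	(*l)->locked = false;
}

#define TOY_GUARD(l) \
	struct toy_lock *toy_guard_ \
		__attribute__((cleanup(toy_guard_exit), unused)) = \
		toy_guard_enter(l)

static struct toy_lock toy_zone_lock;

/* Early return from the loop: no 'ret' variable and no unlock label. */
static bool contains(const int *vals, size_t n, int wanted)
{
	TOY_GUARD(&toy_zone_lock);

	for (size_t i = 0; i < n; i++) {
		if (vals[i] == wanted)
			return true; /* lock released by the cleanup handler */
	}
	return false;
}
```

Both return paths release the lock without any explicit unlock call, which is what makes the 'ret' variable unnecessary.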
3 daysmm: use zone lock guard in set_migratetype_isolate()Dmitry Ilvokhin1-33/+25
Use spinlock_irqsave scoped lock guard in set_migratetype_isolate() to replace the explicit lock/unlock pattern with automatic scope-based cleanup. The scoped variant is used to keep dump_page() outside the locked section to avoid a lockdep splat. Link: https://lore.kernel.org/6883351ad7f74d20875fff30e0e3214a089cea97.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: use zone lock guard in unreserve_highatomic_pageblock()Dmitry Ilvokhin1-6/+2
Use spinlock_irqsave zone lock guard in unreserve_highatomic_pageblock() to replace the explicit lock/unlock pattern with automatic scope-based cleanup. Link: https://lore.kernel.org/69db814cd178915cb5615334a29304678f960963.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: use zone lock guard in unset_migratetype_isolate()Dmitry Ilvokhin1-5/+2
Use spinlock_irqsave zone lock guard in unset_migratetype_isolate() to replace the explicit lock/unlock and goto pattern with automatic scope-based cleanup. Link: https://lore.kernel.org/815c0905ea77828ed32bf56ff0a6d3c6548eb3a2.1777462630.git.d@ilvokhin.com Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm: use zone lock guard in reserve_highatomic_pageblock()Dmitry Ilvokhin1-8/+5
Patch series "mm: use spinlock guards for zone lock", v3. This series uses spinlock guard for zone lock across several mm functions to replace explicit lock/unlock patterns with automatic scope-based cleanup. This simplifies the control flow by removing 'flags' variables, goto labels, and redundant unlock calls. Patches are ordered by decreasing value. The first six patches simplify the control flow by removing gotos, multiple unlock paths, or 'ret' variables. The last two are simpler lock/unlock pair conversions that only remove 'flags' and can be dropped if considered unnecessary churn. Binary size increase is +39 bytes, with Peter Zijlstra's fix for guards [1] applied. This is due to the compiler not being able to deduplicate epilogue and eliminate redundant NULL check. See discussion [2] for more details. I proposed a patch [3] that fixes this, but until it is merged we need to assume +39 bytes will stay (though it is compiler dependent). This patch (of 8): Use the spinlock_irqsave zone lock guard in reserve_highatomic_pageblock() to replace the explicit lock/unlock and goto out_unlock pattern with automatic scope-based cleanup. 
Link: https://lore.kernel.org/cover.1777462630.git.d@ilvokhin.com Link: https://lore.kernel.org/3657e1144e2ffc1ca0eb57d57d89bfec4073d8c6.1777462630.git.d@ilvokhin.com Link: https://lore.kernel.org/all/20260309164516.GE606826@noisy.programming.kicks-ass.net/ [1] Link: https://lore.kernel.org/all/afC5C6fylF4AsITV@shell.ilvokhin.com/ [2] Link: https://lore.kernel.org/all/20260427165037.205337-1-d@ilvokhin.com/ [3] Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/khugepaged: return -EAGAIN for SCAN_PAGE_HAS_PRIVATE in MADV_COLLAPSEVineet Agarwal1-0/+1
MADV_COLLAPSE uses errno values to provide actionable feedback to userspace. Temporary resource constraints are mapped to -EAGAIN so the caller may retry, while intrinsic failures of the specified range are mapped to -EINVAL. collapse_file() returns SCAN_PAGE_HAS_PRIVATE when filemap_release_folio() fails while isolating file-backed folios for collapse. This currently falls through the default case in madvise_collapse_errno() and is reported to userspace as -EINVAL. However, filemap_release_folio() failure commonly reflects temporary folio state rather than a permanently uncollapsible range. For example, ext4 returns false when a folio still has dirty journalled data, btrfs returns false for dirty or writeback folios before extent state release, and NFS may return false while reclaiming filesystem-private folio state. In such cases, retrying MADV_COLLAPSE after writeback, reclaim or journal progress may succeed. This matches the existing -EAGAIN handling for SCAN_PAGE_DIRTY_OR_WRITEBACK and other transient collapse failures more closely than -EINVAL. Therefore, map SCAN_PAGE_HAS_PRIVATE to -EAGAIN so userspace receives retryable feedback for this temporary failure path. Link: https://lore.kernel.org/20260429140434.439456-1-agarwal.vineet2006@gmail.com Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
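The mapping described above can be written out as a reduced model. This is a sketch, not the kernel's madvise_collapse_errno(): only the two scan results named in the message are taken from it, while `SCAN_SUCCEED`, `SCAN_TOY_OTHER_FAILURE` and `toy_collapse_errno()` are assumptions added for illustration.

```c
#include <errno.h>

/* Reduced, partly hypothetical set of scan results. */
enum toy_scan_result {
	SCAN_SUCCEED,                 /* assumed name for the success case */
	SCAN_PAGE_DIRTY_OR_WRITEBACK, /* transient, named in the message */
	SCAN_PAGE_HAS_PRIVATE,        /* transient after this change */
	SCAN_TOY_OTHER_FAILURE,       /* hypothetical catch-all */
};

/* Transient conditions map to -EAGAIN so the caller may retry;
 * everything else falls through to -EINVAL. */
static int toy_collapse_errno(enum toy_scan_result r)
{
	switch (r) {
	case SCAN_SUCCEED:
		return 0;
	case SCAN_PAGE_DIRTY_OR_WRITEBACK:
	case SCAN_PAGE_HAS_PRIVATE:
		return -EAGAIN;
	default:
		return -EINVAL;
	}
}
```

A userspace caller can then treat EAGAIN as "retry after writeback, reclaim or journal progress" and EINVAL as "this range will not collapse".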
3 daysmm/damon/stat: use damon_set_region_system_rams_default()SeongJae Park1-50/+3
damon_stat_set_monitoring_region() is nearly a duplicate of the core function, damon_set_region_system_rams_default(). Use the core implementation. Link: https://lore.kernel.org/20260429041232.90257-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/core: remove damon_set_region_biggest_system_ram_default()SeongJae Park1-64/+0
Now nobody is using damon_set_region_biggest_system_ram_default(). Remove it. Link: https://lore.kernel.org/20260429041232.90257-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/lru_sort: cover all system ramsSeongJae Park1-3/+5
DAMON_LRU_SORT allows users to set the physical address range to monitor and do the work on. When users don't explicitly set the range, the biggest system ram resource of the system is selected as the monitoring target address range. The intention was to reduce the overhead from monitoring non-System RAM areas because monitoring non-System RAM may be meaningless. However, because of the sampling based access check and adaptive regions adjustment, the overhead should be negligible. It makes more sense to just cover all system rams of the system. Do so. Link: https://lore.kernel.org/20260429041232.90257-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon/reclaim: cover all system ramsSeongJae Park1-7/+7
DAMON_RECLAIM allows users to set the physical address range to monitor and do the work on. When users don't explicitly set the range, the biggest System RAM resource of the system is selected as the monitoring target address range. The intention was to reduce the overhead from monitoring non-System RAM areas because monitoring of non-System RAM may be meaningless. However, because of the sampling based access check and adaptive regions adjustment, the overhead should be negligible. It makes more sense to just cover all system rams of the system. Do so. Link: https://lore.kernel.org/20260429041232.90257-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 daysmm/damon: introduce damon_set_region_system_rams_default()SeongJae Park1-5/+72
Patch series "mm/damon/reclaim,lru_sort: monitor all system rams by default". DAMON_RECLAIM and DAMON_LRU_SORT set the biggest 'System RAM' resource of the system as the default monitoring target address range. The main intention behind the design is to minimize the overhead coming from monitoring of non-System RAM areas. This could result in an odd setup when there are multiple discrete System RAMs of considerable sizes. For example, there are System RAMs each having 500 GiB size. In this case, only the first 500 GiB will be set as the monitoring region by default. This is particularly common on NUMA systems. Hence the modules allow users to set the monitoring target address range using the module parameters if the default setup doesn't work for them. In other words, the current design trades ease of setup for lower overhead. However, because DAMON utilizes the sampling based access check and the adaptive regions adjustment mechanisms, the overhead from the monitoring of non-System RAM areas should be negligible in most setups. Meanwhile, the setup complexity is causing real headaches for users who need to run those modules on various types of systems. That is, the current tradeoff is not a good deal. Set the physical address range that can cover all System RAM areas of the system as the default monitoring regions for DAMON_RECLAIM and DAMON_LRU_SORT. Technically speaking, this is changing documented behavior. However, it makes no sense to believe there is a real use case that really depends on the old weird default behavior. If the old default behavior was working for them in the reasonable way, this change will only add a negligible amount of monitoring overhead. If it didn't work, the users may already be using manual monitoring regions setup, and they will not be affected by this change. Patches Sequence ================ Patch 1 introduces a new core function that will be used for the new default monitoring target region setup. 
Patches 2 and 3 update DAMON_RECLAIM and DAMON_LRU_SORT, respectively, to use the new function instead of the old one. Patch 4 removes the old core function that was replaced by the new one, as it has no remaining users. Patch 5 updates DAMON_STAT to use the new one instead of its own nearly duplicate in-house implementation of the functionality. Finally, patches 6 and 7 update the DAMON_RECLAIM and DAMON_LRU_SORT user documentation, respectively, for the new behavior. This patch (of 7): damon_set_region_biggest_system_ram_default() sets the monitoring target region as the caller requested. If the caller didn't specify the region, it finds the biggest System RAM of the system and sets it as the target region. When the system has more than one System RAM resource of considerable size, the default target setup makes no sense. Introduce a variant, namely damon_set_region_system_rams_default(). It sets a physical address range that covers all System RAM resources as the default target region. Link: https://lore.kernel.org/20260429041232.90257-1-sj@kernel.org Link: https://lore.kernel.org/20260429041232.90257-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
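The difference between the old and new defaults can be illustrated with a small user-space sketch. It is not the kernel implementation: `struct ram_range`, `biggest_ram_range()` and `covering_ram_range()` are hypothetical names used only for illustration, and the real function walks the kernel's resource tree rather than a plain array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one 'System RAM' resource: [start, end). */
struct ram_range {
	unsigned long long start;
	unsigned long long end;
};

/* Old default: the single biggest range (the first one wins on a tie). */
bool biggest_ram_range(const struct ram_range *r, size_t nr,
		       struct ram_range *out)
{
	bool found = false;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (r[i].end <= r[i].start)
			continue;	/* skip empty or malformed entries */
		if (!found || r[i].end - r[i].start > out->end - out->start) {
			*out = r[i];
			found = true;
		}
	}
	return found;
}

/* New default: one range from the lowest start to the highest end. */
bool covering_ram_range(const struct ram_range *r, size_t nr,
			struct ram_range *out)
{
	bool found = false;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (r[i].end <= r[i].start)
			continue;	/* skip empty or malformed entries */
		if (!found) {
			*out = r[i];
			found = true;
			continue;
		}
		if (r[i].start < out->start)
			out->start = r[i].start;
		if (r[i].end > out->end)
			out->end = r[i].end;
	}
	return found;
}
```

With two 500 GiB ranges separated by a hole, the first helper reports only the first range, while the second reports a single range spanning both, hole included. The series argues that monitoring such holes costs little because of DAMON's sampling and adaptive region adjustment.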
3 days vmalloc: add __GFP_SKIP_KASAN support Muhammad Usama Anjum 1 -4/+9
Patch series "kasan: hw_tags: Disable tagging for stack and page-tables", v4. Stacks and page tables are always accessed with the match-all tag, so assigning a new random tag at every allocation and setting an invalid tag at deallocation just adds overhead without improving detection. With __GFP_SKIP_KASAN, the page keeps its poison tag in the hardware and KASAN_TAG_KERNEL (the match-all tag) is stored in the page flags. The benefit is that the 256 tag-setting instructions per 4 kB page aren't needed at allocation and deallocation time. Match-all pointers still work, while non-matching tags (other than the poison tag) still fault. __GFP_SKIP_KASAN only skips tagging in KASAN_HW_TAGS mode, so coverage in the other modes is unchanged. Benchmark: The benchmark has two modes. In thread mode, the child process forks and creates N threads. In pgtable mode, the parent maps and faults a specified memory size and then forks repeatedly with children exiting immediately. Thread benchmark, 2000 iterations, 2000 threads: 2.575 s → 2.229 s (~13.4% faster). Pgtable benchmark, 2048 MB, 2000 iterations: 19.08 s → 17.62 s (~7.6% faster). This patch (of 3): For allocations that will be accessed only with match-all pointers (e.g., kernel stacks), setting tags is wasted work. If the caller already set __GFP_SKIP_KASAN, skip tag setting of vmalloc pages. Before this patch, __GFP_SKIP_KASAN wasn't used with the vmalloc APIs, so it wasn't checked. Now it is checked and acted upon. Other KASAN modes are unchanged because __GFP_SKIP_KASAN is ignored for them in the page allocator, and in vmalloc too we ignore this flag for them. This is a preparatory patch for optimizing kernel stack allocations. 
Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com> Co-developed-by: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Dev Jain <dev.jain@arm.com> Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: K Prateek Nayak <kprateek.nayak@amd.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
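The figures in the cover letter can be checked with simple arithmetic. The sketch below assumes a 16-byte tag granule, which is what Arm MTE uses but which the message itself does not state; the helper names are hypothetical. With that granule, a 4 kB page has 4096 / 16 = 256 granules, matching the 256 tag-setting instructions mentioned, and the quoted speedups correspond to the relative reduction in run time.

```c
#include <stddef.h>

/* Tag granules in one page; each granule carries its own tag. */
size_t tag_granules_per_page(size_t page_size, size_t granule_size)
{
	return granule_size ? page_size / granule_size : 0;
}

/* Run-time reduction in percent when going from 'before' to 'after'. */
double runtime_reduction_pct(double before, double after)
{
	return before > 0.0 ? (before - after) / before * 100.0 : 0.0;
}
```

For example, `runtime_reduction_pct(2.575, 2.229)` lands between 13.4 and 13.5, and `runtime_reduction_pct(19.08, 17.62)` between 7.6 and 7.7, in line with the quoted approximations.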
3 days mm/memcontrol: hoist pstatc_pcpu assignment out of CPU loop Hui Zhu 1 -3/+2
In mem_cgroup_alloc(), the assignment of pstatc_pcpu is invariant with respect to the for_each_possible_cpu() loop: both the 'parent' pointer and 'parent->vmstats_percpu' remain constant throughout all iterations. The original code redundantly re-evaluated the 'if (parent)' condition and reassigned pstatc_pcpu on every CPU iteration, then repeated the same ternary check 'parent ? pstatc_pcpu : NULL' when storing into statc->parent_pcpu. Move the single conditional assignment of pstatc_pcpu to before the loop, resolving both the loop-invariant placement issue and the duplicated null check. On systems with a large number of possible CPUs, this eliminates repeated branch evaluation. No functional change intended. Link: https://lore.kernel.org/20260429084216.186238-1-hui.zhu@linux.dev Signed-off-by: Hui Zhu <zhuhui@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
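The shape of this cleanup can be shown with a simplified user-space model. This is only a sketch of the loop-invariant hoisting described in the message, not the actual mem_cgroup_alloc() code: the `sketch_` types, the fixed CPU count and both function names are hypothetical, and a plain array stands in for the per-CPU allocation.

```c
#include <stddef.h>

#define SKETCH_NR_CPUS 4	/* hypothetical fixed number of CPUs */

struct sketch_pcpu_stats {
	struct sketch_pcpu_stats *parent_pcpu;
};

struct sketch_group {
	/* stands in for the per-CPU vmstats_percpu area */
	struct sketch_pcpu_stats percpu[SKETCH_NR_CPUS];
};

/* Before: the invariant check and assignment run on every iteration. */
void link_parent_in_loop(struct sketch_group *g, struct sketch_group *parent)
{
	struct sketch_pcpu_stats *pstatc_pcpu = NULL;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		if (parent)
			pstatc_pcpu = parent->percpu;
		g->percpu[cpu].parent_pcpu = parent ? pstatc_pcpu : NULL;
	}
}

/* After: one conditional assignment, hoisted before the loop. */
void link_parent_hoisted(struct sketch_group *g, struct sketch_group *parent)
{
	struct sketch_pcpu_stats *pstatc_pcpu = parent ? parent->percpu : NULL;
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		g->percpu[cpu].parent_pcpu = pstatc_pcpu;
}
```

Both functions store the same pointer in every slot, whether or not a parent exists, which is what "no functional change" means here. An optimizing compiler may hoist the check on its own, so the gain is mainly in readability.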
3 days mm/migrate: rename PAGE_ migration flags to FOLIO_ Shivank Garg 1 -25/+23
These flags only track folio-specific state during migration and are not used for movable_ops pages. Rename the enum values and the old_page_state variable to match. No functional change. Link: https://lore.kernel.org/20260324190706.964555-4-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Suggested-by: David Hildenbrand <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days mm/damon/tests/core-kunit: test pause commitment SeongJae Park 1 -0/+4
Add a kunit test for commitment of damon_ctx->pause parameter that can be done using damon_commit_ctx(). Link: https://lore.kernel.org/20260427151231.113429-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
3 days mm/damon/sysfs: add pause file under context dir SeongJae Park 1 -0/+31
Add pause DAMON sysfs file under the context directory. It exposes the damon_ctx->pause API parameter to the users so that they can use the pause/resume feature. Link: https://lore.kernel.org/20260427151231.113429-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>