mm/mglru: consolidate common code for retrieving evictable size

Patch series "mm/mglru: improve reclaim loop and dirty folio", v7.

This series cleans up and slightly improves MGLRU's reclaim loop and dirty writeback handling. As a result, we can see an up to ~30% increase in some workloads like MongoDB with YCSB and a huge decrease in file refault, no swap involved. Other common benchmarks have no regression, and LOC is reduced, with fewer unexpected OOMs, too.

Some of the problems were found in our production environment, and others were mostly exposed while stress testing during the development of the LSF/MM/BPF topic on improving MGLRU [1]. This series cleans up the code base and fixes several performance issues, preparing for further work.

MGLRU's reclaim loop is a bit complex, and hence these problems are somewhat related to each other. The aging, scan number calculation, and reclaim loop are coupled together, and the dirty folio handling logic is quite different, making the reclaim loop hard to follow and the dirty flush ineffective.

This series slightly cleans up and improves these issues using a scan budget, calculating the number of folios to scan at the beginning of the loop, and decouples aging from the reclaim calculation helpers. Then, it moves the dirty flush logic inside the reclaim loop so it can kick in more effectively. Because these issues are related, this series handles them together and improves MGLRU reclaim in many ways.

Test results: All tests are done on a 48c96t NUMA machine with 2 nodes and 128G memory, using NVME as storage. Classical (non-MGLRU) LRU numbers are included as "MGLRU disabled" for each benchmark below; see [8] and [9] for the longer write-up.

MongoDB
=======

Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000, threads:32), which does 95% read and 5% update to generate mixed read and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and the WiredTiger cache size is set to 4.5G, using NVME as storage.
This is close to the case we observed regressing in our production environment: mixed read and writeback pressure, so it is a practical case for evaluation. Swap is not used: the intent is to isolate the file LRU writeback path, and enabling swap would only add noise from anonymous reclaim.

MGLRU Before:
  Throughput(ops/sec): 60653.502655
  workingset_refault_file 12904916
  pgpgin 165366622
  pgpgout 5219588

MGLRU After:
  Throughput(ops/sec): 82384.354760 (+35.8%, higher is better)
  workingset_refault_file 7128285 (-44.7%, lower is better)
  pgpgin 113170693 (-31.5%, lower is better)
  pgpgout 5639724

MGLRU Disabled:
  Throughput(ops/sec): 93713.640901
  workingset_refault_file 15013443
  pgpgin 85365614
  pgpgout 5866508

We can see a significant performance improvement after this series. The test is done on NVME, and the performance gap would be even larger for slow devices, such as HDD or network storage. We observed over 100% gain for some workloads with slow IO.

Note that classical LRU is still faster for this benchmark; MGLRU may catch up later with further work [7].

Chrome & Node.js [3]
====================

Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with 2 nodes and 128G memory, using 256G ZRAM as swap, spawning 32 memcgs and 64 workers.

Many memcgs each applying roughly equal pressure exercise the LRU's ability to detect and protect each tenant's working set and to balance reclamation fairly between tenants, which makes this a meaningful test for the reclaim mechanism. Fairness is reported via Jain's fairness index (1.0 means all tenants get exactly equal allocation, lower is worse). Under equal pressure, all memcgs should make roughly equal forward progress. See [8] for the longer rationale and per-memcg breakdown.
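For readers unfamiliar with the metric, Jain's fairness index over n values is (sum x_i)^2 / (n * sum x_i^2). The following is a minimal sketch in C, assuming the inputs are per-worker (or per-memcg) request totals. The function name and input layout are hypothetical and are not taken from the test script in [3].

```c
#include <assert.h>
#include <stddef.h>

/*
 * Jain's fairness index: J = (sum x_i)^2 / (n * sum x_i^2).
 * Equals 1.0 when all entries are equal and drops toward 1/n when a
 * single entry dominates.  x[] is assumed to hold per-worker request
 * totals; this layout is an illustration, not the test script's own.
 */
static double jain_fairness_index(const double *x, size_t n)
{
	double sum = 0.0, sum_sq = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		sum += x[i];
		sum_sq += x[i] * x[i];
	}
	/* All-zero input: report perfectly fair instead of dividing 0 by 0. */
	if (sum_sq == 0.0)
		return 1.0;
	return (sum * sum) / ((double)n * sum_sq);
}
```

With four workers that each complete the same number of requests the index is 1.0, and if one worker completes all requests it is 1/4.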
MGLRU before:
  Total requests:            81898
  Per-worker mean:           1279.7
  Per-worker 95% CI (mean):  [ 1259.0, 1300.4]
  Jain's fairness index:     0.995893 (1.0 = perfectly fair)
  Latency:
    Bucket     Count      Pct    Cumul
    [0,1)s     28392   34.67%   34.67%
    [1,2)s      8022    9.80%   44.46%
    [2,4)s      6130    7.48%   51.95%
    [4,8)s     39354   48.05%  100.00%

MGLRU after:
  Total requests:            82901
  Per-worker mean:           1295.3
  Per-worker 95% CI (mean):  [ 1265.3, 1325.4]
  Jain's fairness index:     0.991607 (1.0 = perfectly fair)
  Latency:
    Bucket     Count      Pct    Cumul
    [0,1)s     28128   33.93%   33.93%
    [1,2)s      8756   10.56%   44.49%
    [2,4)s      7028    8.48%   52.97%
    [4,8)s     38989   47.03%  100.00%

MGLRU disabled:
  Total requests:            62399
  Per-worker mean:           975.0
  Per-worker 95% CI (mean):  [ 941.9, 1008.1]
  Jain's fairness index:     0.982156 (1.0 = perfectly fair)
  Latency:
    Bucket     Count      Pct    Cumul
    [0,1)s     20051   32.13%   32.13%
    [1,2)s      2255    3.61%   35.75%
    [2,4)s      6149    9.85%   45.60%
    [4,8)s     33927   54.37%   99.97%
    [8,16)s       17    0.03%  100.00%

Reclaim is still fair and effective, and the total number of requests is slightly higher.

OOM issue with aging and throttling
===================================

The throttling OOM issue can be easily reproduced using dd and a cgroup limit, as demonstrated and fixed by a later patch in this series.

The aging OOM is a bit tricky. A specific reproducer can be used to simulate what we encountered in our production environment [4]: it spawns multiple workers that keep reading the given file using mmap, pausing for 120ms after each file read batch. It also spawns another set of workers that keep allocating and freeing a given size of anonymous memory. The total memory size exceeds the memory limit (e.g. 14G anon + 8G file, which is 22G vs a 16G memcg limit).

- MGLRU disabled: Finished 128 iterations.
- MGLRU enabled: OOM with the following info after about 10-20 iterations:

  [ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
  [ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
  [ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
  [ 62.640823] Memory cgroup stats for /demo:
  [ 62.641017] anon 10604879872
  [ 62.641941] file 6574858240

  OOM occurs even though there are still evictable file folios.

- MGLRU enabled after this series: Finished 128 iterations.

It is worth noting that another OOM-related issue was reported against V1 of this series; it has been tested and looks OK now [5].

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using ZRAM as swap and the test command:

  sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
    --tables=48 --table-size=2000000 --threads=48 --time=600 run

A 24G InnoDB buffer pool inside a 2G memcg with ZRAM as swap forces aggressive eviction of cached database anon pages, which exercises the LRU's hot page detection and the eviction path under swap pressure. The workload is practical, and the pressure is higher than what we usually see in production, but it is intended to expose the extreme case.

MGLRU before:   17313.688333 tps
MGLRU after:    17286.195000 tps
MGLRU disabled: 16245.330000 tps

This looks like noise-level change only, no regression.

FIO:
====

Testing with the following command, where /mnt/ramdisk is a 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg, 6 test runs each:

  fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
    --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
    --rw=randread --norandommap --time_based \
    --ramp_time=1m --runtime=5m --group_reporting

Random buffered mmap read on a ramdisk strips out storage variance and purely stresses the LRU's ability to evict and recycle the page cache under heavy random read pressure.
MGLRU before:   9033.91 MB/s
MGLRU after:    9065.72 MB/s
MGLRU disabled: 8254.54 MB/s

Again only noise-level changes: no regression, or slightly better.

Build kernel:
=============

Build kernel test using ZRAM as swap, kernel source on tmpfs, in a memcg with memory.max=3G, using make -j96 and defconfig, measuring system time, 6 test runs each.

Building the kernel is a classical mixed anon + file workload (lots of small file reads/writes plus parallel anon allocations from cc/ld) and is representative of many real compilation jobs.

MGLRU before:   2823.13s
MGLRU after:    2801.26s
MGLRU disabled: 5023.50s

Again only noise-level changes: no regression, or very slightly better.

Android:
========

Xinyu reported a performance gain on Android, too, with this series. The test consisted of cold-starting multiple applications sequentially under moderate system load [6]; this is a real Android user-visible scenario, dominated by the LRU's ability to keep the right working set resident and re-fault launch-critical pages quickly.

Before:
  Launch Time Summary (all apps, all runs)
  Mean  868.0ms
  P50   888.0ms
  P90   1274.2ms
  P95   1399.0ms

After:
  Launch Time Summary (all apps, all runs)
  Mean  850.5ms  (-2.07%)
  P50   861.5ms  (-3.04%)
  P90   1179.0ms (-8.05%)
  P95   1228.0ms (-12.2%)

This patch (of 15):

Merge commonly used code for counting evictable folios in a lruvec.

No behavior change.
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-0-02fabb92dc43@tencent.com
Link: https://lore.kernel.org/20260428-mglru-reclaim-v7-1-02fabb92dc43@tencent.com
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6]
Link: https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/ [7]
Link: https://lore.kernel.org/linux-mm/CAMgjq7BzQAPp8u_3-9e3ueXmRCoW=2sydok0hFM=MYL7VC1YYg@mail.gmail.com/ [8]
Link: https://lore.kernel.org/linux-mm/CAMgjq7D+4QmiWe73OPFuH0s+ZKCUJoo+MfcWOdJcV+VO-T2Wmg@mail.gmail.com/ [9]
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Yuanchu Xie <yuanchu@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: David Stevens <stevensd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vernon Yang <vernon2gm@gmail.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yafang <laoar.shao@gmail.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Leno Hou <lenohou@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
author: Kairui Song <kasong@tencent.com> 2026-04-28 02:06:52 +0800
committer: Andrew Morton <akpm@linux-foundation.org> 2026-05-28 21:31:28 -0700
commit: 059f1456a664c96fa8a26d051bf2e2efcc5aa409 (patch)
tree: 2a38353f7a710ed880bcfbf0120a14fa3bd82f6b
parent: aad78963ef575ba45652547e2d43698e896e3f63 (diff)
download: linux-next-history-059f1456a664c96fa8a26d051bf2e2efcc5aa409.tar.gz
1 files changed, 14 insertions, 22 deletions
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76193a84a2afc..5901219dd7fc1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4088,27 +4088,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
 	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
 }
 
-static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+static unsigned long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
 {
 	int gen, type, zone;
-	unsigned long total = 0;
-	int swappiness = get_swappiness(lruvec, sc);
+	unsigned long seq, total = 0;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
 	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
 		for (seq = min_seq[type]; seq <= max_seq; seq++) {
 			gen = lru_gen_from_seq(seq);
-
 			for (zone = 0; zone < MAX_NR_ZONES; zone++)
 				total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
 		}
 	}
 
+	return total;
+}
+
+static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+{
+	unsigned long total;
+	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	total = lruvec_evictable_size(lruvec, swappiness);
+
 	/* whether the size is big enough to be helpful */
 	return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
 }
@@ -4913,9 +4919,6 @@ retry:
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 			     int swappiness, unsigned long *nr_to_scan)
 {
-	int gen, type, zone;
-	unsigned long size = 0;
-	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	DEFINE_MIN_SEQ(lruvec);
 
 	*nr_to_scan = 0;
@@ -4923,18 +4926,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
 
-	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
-		for (seq = min_seq[type]; seq <= max_seq; seq++) {
-			gen = lru_gen_from_seq(seq);
-
-			for (zone = 0; zone < MAX_NR_ZONES; zone++)
-				size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
-		}
-	}
-
-	*nr_to_scan = size;
+	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
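
The consolidation in this diff can be illustrated outside the kernel with a small userspace model. This is only a sketch under stated assumptions: the `model_*` names, the array sizes, and the anon/file index order are hypothetical stand-ins, and the type selection is reduced to a single `scan_anon` flag rather than the kernel's swappiness-based `for_each_evictable_type()`. It shows one helper summing the per-generation, per-zone counters (clamping transiently negative values to zero, as the `max(..., 0L)` in the diff does) and a caller shaped like `lruvec_is_sizable()` reusing it.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the kernel's constants. */
#define MODEL_MAX_NR_GENS	4
#define MODEL_NR_TYPES		2	/* assumed order: 0 = anon, 1 = file */
#define MODEL_MAX_NR_ZONES	3

struct model_lrugen {
	unsigned long max_seq;
	unsigned long min_seq[MODEL_NR_TYPES];
	/* Counters may be transiently negative, hence the signed type. */
	long nr_pages[MODEL_MAX_NR_GENS][MODEL_NR_TYPES][MODEL_MAX_NR_ZONES];
};

/* Single shared helper: sum evictable pages over generations and zones. */
static unsigned long model_evictable_size(const struct model_lrugen *lrugen,
					  int scan_anon)
{
	unsigned long seq, total = 0;
	int type, zone;

	/* Simplification: scan_anon == 0 restricts the walk to file pages. */
	for (type = scan_anon ? 0 : 1; type < MODEL_NR_TYPES; type++) {
		for (seq = lrugen->min_seq[type]; seq <= lrugen->max_seq; seq++) {
			int gen = (int)(seq % MODEL_MAX_NR_GENS);

			for (zone = 0; zone < MODEL_MAX_NR_ZONES; zone++) {
				long nr = lrugen->nr_pages[gen][type][zone];

				/* Clamp negative counters to zero. */
				total += nr > 0 ? (unsigned long)nr : 0;
			}
		}
	}

	return total;
}

/* Caller shaped like lruvec_is_sizable(): scale by priority if online. */
static int model_is_sizable(const struct model_lrugen *lrugen, int scan_anon,
			    int priority, int memcg_online)
{
	unsigned long total = model_evictable_size(lrugen, scan_anon);

	return memcg_online ? (total >> priority) != 0 : total != 0;
}
```

A second caller shaped like `should_run_aging()` would simply assign the helper's return value to its scan count, which is the duplication the patch removes.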