aboutsummaryrefslogtreecommitdiffstats
path: root/fs
diff options
authorBoris Burkov <boris@bur.io>2026-05-08 13:11:26 -0700
committerDavid Sterba <dsterba@suse.com>2026-05-16 03:06:56 +0200
commit975e63c7a8074d83e195577b7f76dadc9a3d14b7 (patch)
treeecdc3818c0ffeb391a4908c79dde7c8f43fa36ae /fs
parent080ecbd05432970a8df1f70586b925a18b5ea6f4 (diff)
downloadlinux-next-history-975e63c7a8074d83e195577b7f76dadc9a3d14b7.tar.gz
btrfs: always drop root->inodes lock before cond_resched()
find_first_inode() and find_first_inode_to_shrink() lock root->inodes, then loop over them, occasionally skipping some inodes. When they skip an inode, they attempt to share the cpu/lock with cond_resched_lock(). However, that has a subtle problem associated with it. cond_resched_lock() only drops the lock if it needs to actually call schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So regardless of HZ, we will run for at least 2.8ms in this loop without dropping the lock, assuming it finds no suitable inodes. If HZ is small enough, it might be even worse as the tick granularity becomes bigger than the timeslice. The knock-on effect of this is that callers to btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab or userspace threads calling evict() will spin on xa_lock(&root->inodes) for 2.8ms, so the extent map shrinker dominates the lock even though ostensibly it is intending to share it. This produces memory pressure as there is only one kswapd and it runs sequentially so it can get stuck in the inode slab shrinking. To fix it, simply replace cond_resched_lock() with an open coded variant which unconditionally does unlock/lock around cond_resched. Sharing the lock is decoupled from sharing the CPU, and all the users of the lock now share it fairly. I was able to reproduce this on test systems by producing a lot of empty files (to make a big root->inodes xarray), then producing memory pressure by reading large files larger than ram, triggering kswapd and the extent_map shrinker. The lock contention is visible with perf or lockstat. This patch also relieved a user-apparent bottleneck on a production system from the original report. Tested-by: Rik van Riel <riel@surriel.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
Diffstat (limited to 'fs')
-rw-r--r--fs/btrfs/extent_map.c4
-rw-r--r--fs/btrfs/inode.c4
2 files changed, 6 insertions, 2 deletions
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 095a561d733f0..fa9d183f4f866 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -1246,7 +1246,9 @@ static struct btrfs_inode *find_first_inode_to_shrink(struct btrfs_root *root,
write_unlock(&tree->lock);
next:
from = btrfs_ino(inode) + 1;
- cond_resched_lock(&root->inodes.xa_lock);
+ xa_unlock(&root->inodes);
+ cond_resched();
+ xa_lock(&root->inodes);
}
xa_unlock(&root->inodes);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 75136a1727101..1ca1cbdf25bcd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10699,7 +10699,9 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino)
break;
from = btrfs_ino(inode) + 1;
- cond_resched_lock(&root->inodes.xa_lock);
+ xa_unlock(&root->inodes);
+ cond_resched();
+ xa_lock(&root->inodes);
}
xa_unlock(&root->inodes);