diff options
| author | Boris Burkov <boris@bur.io> | 2026-05-08 13:11:26 -0700 |
|---|---|---|
| committer | David Sterba <dsterba@suse.com> | 2026-05-16 03:06:56 +0200 |
| commit | 975e63c7a8074d83e195577b7f76dadc9a3d14b7 (patch) | |
| tree | ecdc3818c0ffeb391a4908c79dde7c8f43fa36ae /fs | |
| parent | 080ecbd05432970a8df1f70586b925a18b5ea6f4 (diff) | |
| download | linux-next-history-975e63c7a8074d83e195577b7f76dadc9a3d14b7.tar.gz | |
btrfs: always drop root->inodes lock before cond_resched()
find_first_inode() and find_first_inode_to_shrink() lock root->inodes,
then loop over them, occasionally skipping some inodes. When they skip
an inode, they attempt to share the cpu/lock with cond_resched_lock().
However, that has a subtle problem associated with it.
cond_resched_lock() only drops the lock if it needs to actually call
schedule(). With CONFIG_PREEMPT_NONE, this means the full timeslice as
detected at ticks. With 8+ cpus and default tunables, this is 2.8ms. So
regardless of HZ, we will run for at least 2.8ms in this loop without
dropping the lock, assuming it finds no suitable inodes. If HZ is
small enough, it might be even worse as the tick granularity becomes
bigger than the timeslice.
The knock-on effect of this is that callers to
btrfs_del_inode_from_root() like kswapd trying to shrink the inode slab
or userspace threads calling evict() will spin on xa_lock(&root->inodes)
for 2.8ms, so the extent map shrinker dominates the lock even though
ostensibly it is intending to share it. This produces memory pressure as
there is only one kswapd and it runs sequentially so it can get stuck in
the inode slab shrinking.
To fix it, simply replace cond_resched_lock() with an open coded variant
which unconditionally does unlock/lock around cond_resched. Sharing the
lock is decoupled from sharing the CPU, and all the users of the lock
now share it fairly.
I was able to reproduce this on test systems by producing a lot of empty
files (to make a big root->inodes xarray), then producing memory
pressure by reading large files larger than ram, triggering kswapd and
the extent_map shrinker. The lock contention is visible with perf or
lockstat. This patch also relieved a user-apparent bottleneck on a
production system from the original report.
Tested-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Diffstat (limited to 'fs')
| -rw-r--r-- | fs/btrfs/extent_map.c | 4 | ||||
| -rw-r--r-- | fs/btrfs/inode.c | 4 |
2 files changed, 6 insertions, 2 deletions
diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 095a561d733f0..fa9d183f4f866 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -1246,7 +1246,9 @@ static struct btrfs_inode *find_first_inode_to_shrink(struct btrfs_root *root, write_unlock(&tree->lock); next: from = btrfs_ino(inode) + 1; - cond_resched_lock(&root->inodes.xa_lock); + xa_unlock(&root->inodes); + cond_resched(); + xa_lock(&root->inodes); } xa_unlock(&root->inodes); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 75136a1727101..1ca1cbdf25bcd 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -10699,7 +10699,9 @@ struct btrfs_inode *btrfs_find_first_inode(struct btrfs_root *root, u64 min_ino) break; from = btrfs_ino(inode) + 1; - cond_resched_lock(&root->inodes.xa_lock); + xa_unlock(&root->inodes); + cond_resched(); + xa_lock(&root->inodes); } xa_unlock(&root->inodes); |
