====== Overview ======

In recent Linux kernels, filesystem DAX supports 2 MiB hugepage faults in addition to the standard 4 KiB page faults.  This means that each filesystem DAX page fault can map either 4 KiB or 2 MiB of persistent memory into userspace.

Servicing page faults with 2 MiB hugepage mappings instead of 4 KiB mappings has several advantages.  It will result in fewer page faults (a single 2 MiB hugepage fault instead of 512 page faults at 4 KiB), smaller page tables and less TLB contention.  The end result of using filesystem DAX hugepages is reduced memory usage and increased performance.

However, for filesystem DAX to be able to use 2 MiB hugepages several things have to happen:

  - Our mmap() mapping has to be at least 2 MiB in size.
  - Our filesystem block allocation has to be at least 2 MiB in size.
  - Our filesystem block allocation has to have the same alignment as our mmap().

The first of these, the size of our mmap() region, is the most easily controlled.  The filesystem block allocations, though, are trickier.  Luckily, the two filesystems that support filesystem DAX, ext4 and XFS, each have support for requesting specific filesystem block allocation alignments and sizes.  This feature was introduced in support of RAID, but we can use it equally well for filesystem DAX.
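
The three conditions above can be expressed as simple arithmetic.  The following is a minimal sketch of a simplified model, not the kernel's actual logic: the function name ''pmd_eligible'' and its arguments are hypothetical, and it assumes that "same alignment" means the virtual address, the file offset and the backing physical address all share the same offset within a 2 MiB region.

```shell
#!/bin/bash
# Simplified model of the three conditions above; not the kernel's code.
PMD_SIZE=$((2 * 1024 * 1024))

# pmd_eligible MAP_LEN VIRT_ADDR FILE_OFF PHYS_ADDR   (hypothetical helper)
# Exit status 0 if the mapping is at least 2 MiB long and the virtual
# address, file offset and physical address all have the same offset
# within a 2 MiB region.
pmd_eligible() {
	local len=$(($1)) virt=$(($2)) off=$(($3)) phys=$(($4))
	[ "$len" -ge "$PMD_SIZE" ] || return 1
	[ $((virt % PMD_SIZE)) -eq $((off % PMD_SIZE)) ] || return 1
	[ $((off % PMD_SIZE)) -eq $((phys % PMD_SIZE)) ] || return 1
	return 0
}

# 4 MiB mapping, everything 2 MiB aligned.
pmd_eligible 0x400000 0x10200000 0x0 0x140000000 && echo "2 MiB faults possible"
# Same mapping, but the backing storage is only 1 MiB aligned.
pmd_eligible 0x400000 0x10200000 0x0 0x140100000 || echo "expect 4 KiB fallback"
```

The rest of this page is about arranging the system so that the physical address and file offset terms line up.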

====== System Configuration ======
Here are the steps that I've used to successfully get filesystem DAX PMD (2 MiB) faults:

1. First, make sure that your namespace is in 'fsdax' mode.

	# ndctl list --human
	{
	  "dev":"namespace0.0",
	  "mode":"fsdax",
	  "size":"16.73 GiB (17.96 GB)",
	  "uuid":"179e5b98-96ee-4988-ba9f-ed9383d11598",
	  "sector_size":512,
	  "blockdev":"pmem0",
	  "numa_node":0
	}

2. Next, make sure that our persistent memory block device starts at a 2 MiB aligned physical address.  

This is important because when we ask the filesystem for 2 MiB aligned and sized block allocations it will provide those block allocations relative to the beginning of its block device.  If the filesystem is built on top of a namespace whose data starts at a 1 MiB aligned offset, for example, a block allocation that is 2 MiB aligned from the point of view of the filesystem will still be only 1 MiB aligned from DAX's point of view.  This will cause DAX to fall back to 4 KiB page faults.

We can find the alignment of the persistent memory namespaces by looking at /proc/iomem, among other places:

	# cat /proc/iomem
	...
	140000000-57fdfffff : Persistent Memory
	  140000000-57fdfffff : namespace0.0

Our namespace in this case begins at 5 GiB (0x1 4000 0000), which is 2 MiB (0x20 0000) aligned.
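
The same check can be done with shell arithmetic rather than by eye.  This is a small sketch; the helper name ''is_2mib_aligned'' is made up for illustration, and the address is the namespace0.0 start address taken from the /proc/iomem output above.  Note that /proc/iomem only shows real addresses when read as root.

```shell
#!/bin/bash
# is_2mib_aligned HEX_ADDR   (hypothetical helper)
# Exit status 0 if the address is a multiple of 2 MiB (0x200000).
# Accepts the address with or without a leading "0x".
is_2mib_aligned() {
	[ $(( 0x${1#0x} % 0x200000 )) -eq 0 ]
}

# Start address of namespace0.0 from the /proc/iomem output above.
if is_2mib_aligned 140000000; then
	echo "namespace start is 2 MiB aligned"
else
	echo "namespace start is NOT 2 MiB aligned"
fi
```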

It is recommended to use raw devices and create multiple namespaces if the system configuration calls for persistent memory to be provisioned into smaller volumes.  This is because namespace alignment is enforced at namespace creation time, whereas partitions need to be created by tooling that is careful to align both the start of the namespace and the start of each partition.  Long term, the pmem device partition is scheduled for deprecation in favor of requiring namespaces for all provisioning.

If we do create partitions on top of our PMEM namespace, we must ensure that those partitions are likewise 2 MiB aligned.  By default fdisk will create partitions that are 1 MiB (2048 sector) aligned from the start of the parent block device:

	# fdisk -l /dev/pmem0
	Disk /dev/pmem0: 16.7 GiB, 17964204032 bytes, 35086336 sectors
	Units: sectors of 1 * 512 = 512 bytes
	Sector size (logical/physical): 512 bytes / 4096 bytes
	I/O size (minimum/optimal): 4096 bytes / 4096 bytes
	Disklabel type: dos
	Disk identifier: 0xfd17c8f9
	
	Device       Boot Start      End  Sectors  Size Id Type
	/dev/pmem0p1       2048 35086335 35084288 16.7G 83 Linux

A filesystem built on top of this partition won't be able to provide DAX with 2 MiB aligned block allocations.  We instead need to have our partition begin at a 2 MiB aligned boundary:

	# fdisk -l /dev/pmem0
	Disk /dev/pmem0: 16.7 GiB, 17964204032 bytes, 35086336 sectors
	Units: sectors of 1 * 512 = 512 bytes
	Sector size (logical/physical): 512 bytes / 4096 bytes
	I/O size (minimum/optimal): 4096 bytes / 4096 bytes
	Disklabel type: dos
	Disk identifier: 0xfd17c8f9
	
	Device       Boot Start      End  Sectors  Size Id Type
	/dev/pmem0p1       4096 35086335 35082240 16.7G 83 Linux
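
The difference between the two layouts is just the start sector multiplied by the sector size.  Here is a short sketch of that arithmetic; the helper name ''partition_2mib_aligned'' is hypothetical, and the 512 byte default matches the "Units" line in the fdisk output above.

```shell
#!/bin/bash
# partition_2mib_aligned START_SECTOR [SECTOR_SIZE]   (hypothetical helper)
# Exit status 0 if the partition's byte offset from the start of the
# parent block device is a multiple of 2 MiB.
partition_2mib_aligned() {
	local start=$1 sector_size=${2:-512}
	[ $(( start * sector_size % (2 * 1024 * 1024) )) -eq 0 ]
}

# The two start sectors from the fdisk listings above.
for start in 2048 4096; do
	if partition_2mib_aligned "$start"; then
		echo "start sector $start: 2 MiB aligned"
	else
		echo "start sector $start: not 2 MiB aligned"
	fi
done
```

On a live system the start sector can also be read from sysfs, for example /sys/block/pmem0/pmem0p1/start, which is expressed in 512 byte sectors.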

3. Once we have a block device that starts at a 2 MiB aligned persistent memory address, we then need to create a filesystem on top of it that will give us 2 MiB aligned and sized block allocations.  Here are the commands to do that with either ext4 or XFS:

ext4:
	# mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0

XFS:
	# mkfs.xfs -f -d su=2m,sw=1 -m reflink=0 /dev/pmem0
	# mount /dev/pmem0 /mnt/dax
	# xfs_io -c "extsize 2m" /mnt/dax

Please refer to the man pages for
[[https://linux.die.net/man/8/mkfs.ext4|mkfs.ext4(8)]],
[[https://linux.die.net/man/8/mkfs.xfs|mkfs.xfs(8)]] and
[[https://linux.die.net/man/8/xfs_io|xfs_io(8)]] for more details.
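
The ext4 ''stride'' value is given in filesystem blocks, so 512 in the command above is simply 2 MiB divided by the 4096 byte block size.  The sketch below shows that arithmetic for other block sizes; the helper name ''ext4_stride'' is made up for illustration.

```shell
#!/bin/bash
# ext4_stride BLOCK_SIZE   (hypothetical helper)
# Print the stride, in filesystem blocks, that corresponds to 2 MiB.
ext4_stride() {
	echo $(( 2 * 1024 * 1024 / $1 ))
}

echo "stride for 4096 byte blocks: $(ext4_stride 4096)"
echo "stride for 1024 byte blocks: $(ext4_stride 1024)"
```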

4. Now that we have a filesystem that can give us 2 MiB sized and aligned
block allocations we just need to create a file that will receive those
allocations.  To do this we need to begin with a file that is at least 2 MiB
in size.  We can do this with
[[https://linux.die.net/man/1/truncate|truncate(1)]],
[[https://linux.die.net/man/2/ftruncate|ftruncate(2)]],
[[https://linux.die.net/man/1/fallocate|fallocate(1)]],
[[https://linux.die.net/man/3/posix_fallocate|posix_fallocate(3)]], etc.  For example:

	# fallocate --length 1G /mnt/dax/data
	
or
	
	# truncate --size 1G /mnt/dax/data

====== Verifying Results ======
Once we have a system that is capable of giving us 2 MiB filesystem DAX faults, we probably want to verify that we are actually succeeding in using faults of that size.

The way that I normally do this is by looking at the filesystem DAX tracepoints:

	# cd /sys/kernel/debug/tracing
	# echo 1 > events/fs_dax/dax_pmd_fault_done/enable 
	<run test which faults in filesystem DAX mappings>

We can then look at the **dax_pmd_fault_done** events in
<file>/sys/kernel/debug/tracing/trace</file> and see whether they were successful.  An event that successfully faulted in a filesystem DAX PMD looks like this:
	big-1434  [008] ....  1502.341229: dax_pmd_fault_done: dev 259:0 ino 0xc shared 
	WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 
	0x10700000 pgoff 0x305 max_pgoff 0x1400 NOPAGE
The first thing to look at is the **NOPAGE** return value at the end of the line.  This means that the fault succeeded and didn't return a page cache page, which is expected for DAX.  A 2 MiB fault that failed and fell back to 4 KiB DAX faults will instead look like this:
	small-1431  [008] ....  1499.402672: dax_pmd_fault_done: dev 259:0 ino 0xc shared
	WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end
	0x10500000 pgoff 0x220 max_pgoff 0x3ffff FALLBACK
You can see that this fault resulted in a fallback to 4 KiB faults via the
**FALLBACK** return code at the end of the line.  The rest of the data in this line can help you determine why the fallback happened.  In this case it was because I intentionally created an mmap() area that was smaller than 2 MiB.
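
When a test generates many events, it can help to tally the results and to check the numbers in an event mechanically.  The following is a sketch under stated assumptions: both helper names are hypothetical, each event is assumed to occupy a single line as it does in the real trace file, and the range check assumes that one of the fallback conditions is that the 2 MiB aligned range containing the faulting address must lie entirely inside the VMA.

```shell
#!/bin/bash
PMD_SIZE=$((0x200000))

# count_pmd_results   (hypothetical helper)
# Read trace text on stdin and tally the final field (NOPAGE, FALLBACK,
# ...) of every dax_pmd_fault_done event, one event per line.
count_pmd_results() {
	awk '/dax_pmd_fault_done:/ { n[$NF]++ }
	     END { for (r in n) print r, n[r] }' | sort
}

# pmd_fits_vma ADDRESS VM_START VM_END   (hypothetical helper)
# Exit status 0 if the 2 MiB aligned range containing ADDRESS lies
# entirely inside [VM_START, VM_END).
pmd_fits_vma() {
	local base=$(( $1 / PMD_SIZE * PMD_SIZE ))
	[ "$base" -ge $(($2)) ] && [ $((base + PMD_SIZE)) -le $(($3)) ]
}

# address, vm_start and vm_end from the two events shown above.
pmd_fits_vma 0x10505000 0x10200000 0x10700000 && echo "big: 2 MiB range fits in the VMA"
pmd_fits_vma 0x10420000 0x10200000 0x10500000 || echo "small: 2 MiB range does not fit in the VMA"
```

To tally a real trace, run ''count_pmd_results < /sys/kernel/debug/tracing/trace''.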