==== Persistent Memory ====
----
These pages contain instructions, links and other information related to persistent memory enabling in Linux.
=== Links ===
----
* NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
* DSM Interface (1.6): http://pmem.io/documents/NVDIMM_DSM_Interface-V1.6.pdf
* Driver Writer’s Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
* NVDIMM Kernel Tree: https://git.kernel.org/cgit/linux/kernel/git/nvdimm/nvdimm.git
* NDCTL: https://github.com/pmem/ndctl.git
* linux-nvdimm Mailing List: https://lists.01.org/mailman/listinfo/linux-nvdimm
* linux-nvdimm Patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
* IXPDIMM management software: https://github.com/01org/IXPDIMMSW
* NDCTL man pages online: http://pmem.io/ndctl/
=== Blogs ===
* [[https://www.suse.com/communities/blog/nvdimm-enabling-suse-linux-enterprise-12-service-pack-2/|NVDIMM Enabling in SUSE Linux Enterprise 12, Service Pack 2 - Part 1]]
* [[https://www.suse.com/communities/blog/nvdimm-enabling-part-2-intel/|NVDIMM Enabling in SUSE Linux Enterprise 12, Service Pack 2 - Part 2]]
=== Industry specifications ===
* Advanced Configuration and Power Interface (ACPI) 6.2a: http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
* Unified Extensible Firmware Interface (UEFI) Specification 2.7: http://www.uefi.org/sites/default/files/resources/UEFI_Spec_2_7.pdf
* Byte Addressable Energy Backed Interface (JESD245A): https://www.jedec.org/system/files/docs/JESD245A.pdf
=== Subtopics ===
* [[How to choose the correct memmap kernel parameter for PMEM on your system|How to choose the correct memmap kernel parameter for PMEM on your system]]
* [[2MiB_FS_DAX|How to get 2 MiB filesystem DAX faults]]
=== Quick Setup Guide ===
----
One interesting use of the PMEM driver is to let users begin developing
software using DAX, which was upstreamed in v4.0. On a non-NFIT system this
can be done by using PMEM's memmap kernel command line parameter to manually
create a type 12 memory region.
Here are the additions I made for my system with 32 GiB of RAM:
1) Reserve 16 GiB of memory via the "memmap" kernel parameter in grub's
menu.lst, using PMEM's new "!" specifier:
<code>
memmap=16G!16G
</code>
The documentation for this parameter can be found here:
https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
Also see: [[How to choose the correct memmap kernel parameter for PMEM on your system|How to choose the correct memmap kernel parameter for PMEM on your system]].
2) Set up the correct kernel configuration options for PMEM and DAX in .config.
Options in make menuconfig:
* Device Drivers - NVDIMM (Non-Volatile Memory Device) Support
  * PMEM: Persistent memory block device support
  * BLK: Block data window (aperture) device support
  * BTT: Block Translation Table (atomic sector updates)
* Enable the block layer
  * Block device DAX support <not available in kernel-4.5 due to page cache issues>
* File systems
  * Direct Access (DAX) support
* Processor type and features
  * Support non-standard NVDIMMs and ADR protected memory <if using the memmap kernel parameter>
<code>
CONFIG_BLK_DEV_RAM_DAX=y
CONFIG_FS_DAX=y
CONFIG_X86_PMEM_LEGACY=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=m
CONFIG_ARCH_HAS_PMEM_API=y
</code>
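Before rebuilding or rebooting, it can help to check a kernel config file for the options listed above. The following is a minimal sketch, assuming bash and grep; the function name missing_kernel_options is hypothetical, and the option names are passed in by the caller rather than hard-coded.

```shell
#!/bin/bash
# Hypothetical helper: report which of the listed options are NOT set to
# y or m in a kernel config file. Prints one missing option per line and
# returns non-zero if anything is missing.
missing_kernel_options() {
    local config_file=$1; shift
    local opt missing=0
    for opt in "$@"; do
        # A disabled option appears as "# CONFIG_FOO is not set" or not at all.
        if ! grep -Eq "^${opt}=(y|m)$" "$config_file"; then
            echo "$opt"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, missing_kernel_options .config CONFIG_FS_DAX CONFIG_LIBNVDIMM CONFIG_BLK_DEV_PMEM prints nothing when all three options are enabled. Many distributions also install the running kernel's config under /boot, but that location is an assumption about your system.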
This configuration gave me one pmem device with 16 GiB of space:
<code>
$ fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
</code>
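If no /dev/pmem0 shows up after rebooting, a first thing to check is whether the running kernel actually received the memmap parameter from step 1. The sketch below assumes bash; both function names are hypothetical. It relies only on the documented memmap=nn[KMG]!ss[KMG] form, which marks nn bytes starting at offset ss as protected memory.

```shell
#!/bin/bash
# Hypothetical helper: build a memmap= parameter that reserves SIZE GiB
# starting at physical offset OFFSET GiB.
pmem_memmap_param() {
    local size_gib=$1 offset_gib=$2
    printf 'memmap=%sG!%sG\n' "$size_gib" "$offset_gib"
}

# Hypothetical helper: succeed if a kernel command line contains a memmap=
# entry that uses the "!" specifier. With no argument, the running kernel's
# command line is read from /proc/cmdline.
cmdline_has_pmem_memmap() {
    local cmdline=${1:-$(cat /proc/cmdline)}
    local re='(^|[[:space:]])memmap=[0-9]+[KMG]?![0-9]+[KMG]?'
    [[ $cmdline =~ $re ]]
}
```

Calling pmem_memmap_param 16 16 reproduces the parameter used above, and cmdline_has_pmem_memmap with no arguments checks the running kernel. Depending on the boot loader configuration format, the "!" may need escaping there; that detail is outside this sketch.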
lsblk shows the block devices, including pmem devices. Examples:
<code>
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 16G 0 disk
├─pmem0p1 259:6 0 4G 0 part /mnt/ext4-pmem0
└─pmem0p2 259:7 0 11.9G 0 part /mnt/btrfs-pmem0
pmem1 259:1 0 16G 0 disk /mnt/xfs-pmem1
pmem2 259:2 0 16G 0 disk /mnt/xfs-pmem2
pmem3 259:3 0 16G 0 disk /mnt/xfs-pmem3
</code>
<code>
$ lsblk -t
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
pmem0 0 4096 0 4096 512 0 128 128 0B
pmem1 0 4096 0 4096 512 0 128 128 0B
pmem2 0 4096 0 4096 512 0 128 128 0B
pmem3 0 4096 0 4096 512 0 128 128 0B
</code>
=== Namespaces ===
----
You can divide persistent memory address ranges into namespaces with ndctl. This stores namespace label metadata at the beginning of the persistent memory address range.
ndctl supports four modes:
* raw - present as a /dev/pmemN block device
  * supports filesystems with or without DAX
  * no label metadata on the device
* sector - /dev/pmemNs block device with sector atomicity
  * Block Translation Table (BTT) layer on top of a /dev/pmemN block device
  * supports filesystems without DAX
* memory - /dev/pmemN block device supporting PCIe device DMA
  * requires storing extra "struct page" entries somewhere
    * this requires 64 bytes per 4 KiB of persistent memory
  * struct page storage locations
    * --map=mem = regular system memory
      * adequate for small persistent memory capacities (e.g., 128 MiB for 8 GiB persistent memory)
    * --map=dev = persistent memory
      * intended for large persistent memory capacities (e.g., 16 GiB for 1 TiB persistent memory)
  * supports filesystems with or without DAX
* dax - /dev/daxN.M character device supporting DAX
  * does not support filesystems
  * no interactions with the kernel page cache
  * requires storing extra "struct page" entries in persistent memory
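The "struct page" cost for the memory and dax modes follows directly from the 64 bytes per 4 KiB figure above. As a rough sketch, assuming bash with 64-bit arithmetic (the function names are hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: bytes of "struct page" metadata needed for a
# namespace of the given size, at 64 bytes per 4 KiB page.
struct_page_overhead_bytes() {
    local namespace_bytes=$1
    echo $(( namespace_bytes / 4096 * 64 ))
}

# Same value expressed in whole MiB.
struct_page_overhead_mib() {
    echo $(( $(struct_page_overhead_bytes "$1") / 1048576 ))
}
```

For an 8 GiB namespace, struct_page_overhead_mib 8589934592 prints 128, matching the example in the list. The capacity actually reported by ndctl can be somewhat smaller than this estimate suggests, presumably because of additional metadata and alignment padding.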
Example commands on an 8 GiB NVDIMM with output showing the resulting sizes and /dev/ device names:
<code>
$ ndctl create-namespace --mode raw -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"raw",
"size":8589934592, # this is exactly 8 GiB
"blockdev":"pmem0"
}
$ ndctl create-namespace --mode sector -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"sector",
"size":8580472832, # this is 9240 KiB less than 8 GiB
"uuid":"52b53e55-eccd-40bf-a2aa-9f03ebf30e6b",
"sector_size":4096,
"blockdev":"pmem0s"
}
$ ndctl create-namespace --mode memory --map mem -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"memory",
"size":8587837440, # this is 2 MiB less than 8 GiB
"uuid":"349b7e53-dfbb-4b90-89ed-db80cfdaab0f",
"blockdev":"pmem0"
}
$ ndctl create-namespace --mode memory --map dev -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"memory",
"size":8453619712, # this is 130 MiB less than 8 GiB
"uuid":"03faeca5-226c-48d9-bb47-f71cbc6d322e",
"blockdev":"pmem0"
}
$ sudo ndctl create-namespace --mode dax -e namespace0.0 -f
{
"dev":"namespace0.0",
"mode":"dax",
"size":8453619712, # this is 130 MiB less than 8 GiB
"uuid":"252d7895-91f3-42b7-9eeb-27ffc03e354c",
"daxdevs":[
{
"chardev":"dax0.0",
"size":8453619712 # this is 130 MiB less than 8 GiB
}
]
}
</code>
=== Partitions ===
----
You can divide raw, sector, and memory devices (/dev/pmemN and /dev/pmemNs) into partitions.
In parted, the mkpart subcommand has this syntax:
mkpart [part-type name fs-type] start end
Although mkpart defaults to 1 MiB alignment, you may want to use 2 MiB alignment to support more efficient page mappings - see https://nvdimm.wiki.kernel.org/2mib_fs_dax.
Example carving a 16 GiB /dev/pmem0 into 4 GiB, 8 GiB, and 4 GiB partitions, constrained by 1 MiB alignment at the beginning and end. Note that parted displays its output in SI decimal units, while lsblk uses binary units:
<code>
$ parted -s -a optimal /dev/pmem0 \
mklabel gpt -- \
mkpart primary ext4 1MiB 4GiB \
mkpart primary xfs 4GiB 12GiB \
mkpart primary btrfs 12GiB -1MiB \
print
Model: Unknown (unknown)
Disk /dev/pmem0: 17.2GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 4295MB 4294MB ext4 primary
2 4295MB 12.9GB 8590MB xfs primary
3 12.9GB 17.2GB 4294MB btrfs primary
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 16G 0 disk
├─pmem0p1 259:4 0 4G 0 part
├─pmem0p2 259:5 0 8G 0 part
└─pmem0p3 259:8 0 4G 0 part
$ fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: B334CBC6-1C56-47DF-8981-770C866CEABE
Device Start End Sectors Size Type
/dev/pmem0p1 2048 8388607 8386560 4G Linux filesystem
/dev/pmem0p2 8388608 25165823 16777216 8G Linux filesystem
/dev/pmem0p3 25165824 33552383 8386560 4G Linux filesystem
</code>
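To see whether a partition start is suitable for 2 MiB alignment, the Start column from fdisk can be checked with simple arithmetic. This is a minimal sketch, assuming bash; the function name is hypothetical, and it only checks the partition offset within the device, not the physical alignment of the namespace itself.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a partition starting at START_SECTOR
# (in sectors of SECTOR_BYTES, default 512) begins on a 2 MiB boundary.
is_2mib_aligned() {
    local start_sector=$1 sector_bytes=${2:-512}
    (( start_sector * sector_bytes % (2 * 1024 * 1024) == 0 ))
}
```

With the fdisk output above, is_2mib_aligned 2048 fails (pmem0p1 starts at 1 MiB), while is_2mib_aligned 8388608 and is_2mib_aligned 25165824 succeed (pmem0p2 and pmem0p3 start at 4 GiB and 12 GiB).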
=== Filesystems ===
----
You may place any filesystem (e.g., ext4, xfs, btrfs) on a raw or memory device (e.g., /dev/pmem0), a partition on a raw or memory device (e.g. /dev/pmem0p1), a sector device (e.g., /dev/pmem0s), or a partition on a sector device (e.g., /dev/pmem0sp1).
ext4 and xfs support DAX, which allows applications to perform direct access to persistent memory with mmap(). You may use DAX on raw devices and memory devices, but not on sector devices.
Example creating ext4, xfs, and btrfs filesystems on three partitions and mounting ext4 and xfs with DAX (note: df -h displays sizes in IEC binary units; df -H uses SI decimal units):
<code>
$ mkfs.ext4 -F /dev/pmem0p1
$ mkfs.xfs -f /dev/pmem0p2
$ mkfs.btrfs -f /dev/pmem0p3
$ mount -o dax /dev/pmem0p1 /mnt/ext4-pmem0
$ mount -o dax /dev/pmem0p2 /mnt/xfs-pmem0
$ mount /dev/pmem0p3 /mnt/btrfs-pmem0
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
pmem0 259:0 0 16G 0 disk
├─pmem0p1 259:4 0 4G 0 part /mnt/ext4-pmem0
├─pmem0p2 259:5 0 8G 0 part /mnt/xfs-pmem0
└─pmem0p3 259:8 0 4G 0 part /mnt/btrfs-pmem0
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/pmem0p1 3.9G 8.0M 3.7G 1% /mnt/ext4-pmem0
/dev/pmem0p2 8.0G 33M 8.0G 1% /mnt/xfs-pmem0
/dev/pmem0p3 4.0G 17M 3.8G 1% /mnt/btrfs-pmem0
$ df -H
Filesystem Size Used Avail Use% Mounted on
/dev/pmem0p1 4.2G 8.4M 4.0G 1% /mnt/ext4-pmem0
/dev/pmem0p2 8.6G 34M 8.6G 1% /mnt/xfs-pmem0
/dev/pmem0p3 4.3G 17M 4.1G 1% /mnt/btrfs-pmem0
</code>
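Since lsblk and df do not show mount options, one way to confirm that a filesystem was really mounted with DAX is to look for the option in the mounts table. The sketch below assumes bash and awk; the function name is hypothetical. Treating "dax=always" as equivalent to "dax" is an assumption about how newer kernels report the option.

```shell
#!/bin/bash
# Hypothetical helper: succeed if MOUNTPOINT is listed with the "dax"
# option in a mounts table (default /proc/mounts, whose fields are:
# device mountpoint fstype options dump pass).
mounted_with_dax() {
    local mountpoint=$1 mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "dax" || opts[i] == "dax=always")
                    found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$mounts_file"
}
```

For the mounts in the example, mounted_with_dax /mnt/ext4-pmem0 would be expected to succeed and mounted_with_dax /mnt/btrfs-pmem0 to fail.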
=== iostats ===
----
iostats are disabled by default due to performance overhead (e.g., 12M IOPS dropping 25% to 9M IOPS). However, they can be enabled in sysfs if desired.
As of kernel 4.5, iostats are only collected for the base pmem device, not per-partition. Also, I/Os that go through DAX paths (rw_page, rw_bytes, and direct_access functions) are not counted, so nothing is collected for:
* I/O to files in filesystems mounted with -o dax
* I/O to raw block devices if CONFIG_BLOCK_DAX is enabled
<code>
$ echo 1 > /sys/block/pmem0/queue/iostats
$ echo 1 > /sys/block/pmem1/queue/iostats
$ echo 1 > /sys/block/pmem2/queue/iostats
$ echo 1 > /sys/block/pmem3/queue/iostats
$ iostat -mxy 1
avg-cpu: %user %nice %system %iowait %steal %idle
21.53 0.00 78.47 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
pmem0 0.00 0.00 4706551.00 0.00 18384.95 0.00 8.00 6.00 0.00 0.00 0.00 0.00 113.90
pmem1 0.00 0.00 4701492.00 0.00 18365.20 0.00 8.00 6.01 0.00 0.00 0.00 0.00 119.30
pmem2 0.00 0.00 4701851.00 0.00 18366.60 0.00 8.00 6.37 0.00 0.00 0.00 0.00 108.90
pmem3 0.00 0.00 4688767.00 0.00 18315.50 0.00 8.00 6.43 0.00 0.00 0.00 0.00 117.40
</code>
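The four echo commands above can be replaced by a loop over every pmem device. This is a minimal sketch, assuming bash; the function name is hypothetical, and the optional argument exists only so the loop can be tried against a scratch directory instead of the real /sys/block.

```shell
#!/bin/bash
# Hypothetical helper: enable iostats for every pmem block device found
# under SYSFS_BLOCK (default /sys/block). Writing to the real sysfs
# requires root.
enable_pmem_iostats() {
    local sysfs_block=${1:-/sys/block}
    local knob
    for knob in "$sysfs_block"/pmem*/queue/iostats; do
        [ -e "$knob" ] || continue   # the glob matched nothing
        echo 1 > "$knob"
        echo "enabled: $knob"
    done
}
```

Run as root, enable_pmem_iostats with no arguments covers pmem0 through pmem3 in the example. The setting is not persistent across reboots.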
=== fio ===
----
Example fio script to perform 4 KiB random reads to four pmem devices:
<code>
[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
# Where an option has alternatives, only the uncommented one is active.
#bs=256k # for bandwidth
bs=4k # for IOPS and latency
iodepth=1
runtime=30
time_based=1
group_reporting
thread
#gtod_reduce=0 # for latency
gtod_reduce=1 # IOPS and bandwidth
zero_buffers
## local CPU
#numjobs=9 # for bandwidth
#numjobs=1 # for latency
numjobs=18 # for IOPS
cpus_allowed_policy=split
#rw=randwrite
rw=randread
# CPU affinity based on two 18-core CPUs with QPI snoop configuration of cluster-on-die
[drive_0]
filename=/dev/pmem0
cpus_allowed=0-8,36-44
[drive_1]
filename=/dev/pmem1
cpus_allowed=9-17,45-53
[drive_2]
filename=/dev/pmem2
cpus_allowed=18-26,54-62
[drive_3]
filename=/dev/pmem3
cpus_allowed=27-35,63-71
</code>
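The cpus_allowed values in the job file follow a regular pattern, so they can be generated rather than typed by hand. The sketch below assumes bash and a CPU numbering in which logical CPUs 0-35 are one thread per core and 36-71 are their siblings, with 9 cores per drive; that numbering is an assumption inferred from the example, and the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the cpus_allowed value for drive DRIVE
# (0-based). CORES_PER_DRIVE defaults to 9 and SIBLING_OFFSET to 36,
# matching the two 18-core CPU layout assumed in the job file above.
fio_cpus_allowed() {
    local drive=$1 cores_per_drive=${2:-9} sibling_offset=${3:-36}
    local first=$(( drive * cores_per_drive ))
    local last=$(( first + cores_per_drive - 1 ))
    echo "${first}-${last},$(( first + sibling_offset ))-$(( last + sibling_offset ))"
}
```

For example, fio_cpus_allowed 0 prints 0-8,36-44 and fio_cpus_allowed 3 prints 27-35,63-71, the same ranges used for drive_0 and drive_3. On other systems, check the actual topology (for instance with lscpu) before reusing these defaults.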
When using /dev/dax character devices, you must specify the size, because character devices do not have a size.
Example fio script to perform 4 KiB random reads to four /dev/dax character devices:
<code>
[global]
ioengine=mmap
pre_read=1
norandommap
randrepeat=0
bs=4k
iodepth=1
runtime=60000
time_based=1
group_reporting
thread
gtod_reduce=1 # reduce=1 except for latency test
zero_buffers
size=2G
numjobs=36
cpus_allowed=0-17,36-53
cpus_allowed_policy=split
[drive_0]
filename=/dev/dax0.0
rw=randread
[drive_1]
filename=/dev/dax1.0
rw=randread
[drive_2]
filename=/dev/dax2.0
rw=randread
[drive_3]
filename=/dev/dax3.0
rw=randread
</code>