#+title: notes
#+author: Richard Westhaver
#+email: ellis@rwest.io
#+description: NAS-T Notes
#+setupfile: ../../clean.theme
#+BIBLIOGRAPHY: refs.bib
* File Systems
** BTRFS
#+begin_quote
BTRFS is a Linux filesystem based on copy-on-write, allowing for
efficient snapshots and clones.

It uses B-trees as its main on-disk data structure. The design goal is
to work well for many use cases and workloads. To this end, much
effort has been directed to maintaining even performance as the
filesystem ages, rather than trying to support a particular narrow
benchmark use-case.

Linux filesystems are installed on smartphones as well as enterprise
servers. This entails challenges on many different fronts.

- Scalability :: The filesystem must scale in many dimensions: disk
  space, memory, and CPUs.

- Data integrity :: Losing data is not an option, and much effort is
  expended to safeguard the content. This includes checksums, metadata
  duplication, and RAID support built into the filesystem.

- Disk diversity :: The system should work well with SSDs and hard
  disks. It is also expected to be able to use an array of different
  sized disks, which poses challenges to the RAID and striping
  mechanisms.
#+end_quote

-- [cite/t/f:@btrfs]
*** [2023-08-08 Tue] btrfs performance speculation
- [[https://www.percona.com/blog/taking-a-look-at-btrfs-for-mysql/]]
  - zfs outperforms immensely, but potential misconfiguration on the btrfs side
    (virt+cow still enabled?)
- https://www.ctrl.blog/entry/btrfs-vs-ext4-performance.html
  - see the follow-up comment on this post
- https://www.reddit.com/r/archlinux/comments/o2gc42/is_the_performance_hit_of_btrfs_serious_is_it/
#+begin_quote
I’m the author of OP’s first link. I use BtrFS today. I often shift lots of
de-duplicatable data around, and benefit greatly from file cloning. The data is actually
the same data that caused the slow performance in the article. BtrFS and file cloning
now performs this task quicker than a traditional file system. (Hm. It’s time for a
follow-up article.)

In a laptop with one drive: it doesn’t matter too much unless you do work that benefit
from file cloning or snapshots. This will likely require you to adjust your tooling and
workflow. I’ve had to rewrite the software I use every day to make it take advantage of
the capabilities of a more modern file system. You won’t benefit much from the data
recovery and redundancy features unless you’ve got two storage drives in your laptop and
can setup redundant data copies.

on similar hardware to mine?

It’s not a question about your hardware as much as how you use it. The bad performance I
documented was related to lots and lots of simultaneous random reads and writes. This
might not be representative of how you use your computer.
#+end_quote
- https://dl.acm.org/doi/fullHtml/10.1145/3386362
  - this is about distributed file systems (in this case Ceph) - they argue against
    basing a DFS on on-disk-format filesystems (XFS, ext4) - developed BlueStore as a
    backend, which runs directly on raw storage hardware.
  - this is a good approach, but expensive (2 years in development) and risky
  - a better approach is to take advantage of a powerful enough existing on-disk FS
    format and pair it with supporting modules which abstract away the 'distributed'
    mechanics.
  - the strategy presented here is critical for enterprise-grade hardware where the
    on-disk filesystem becomes the bottleneck that you're looking to optimize
- https://lore.kernel.org/lkml/cover.1676908729.git.dsterba@suse.com/
  - linux 6.3 patch by David Sterba [2023-02-20 Mon]
  - btrfs continues to show improvements in the linux kernel, ironing out the kinks
  - makes it hard to compare benchmarks tho :/
*** MacOS support
- see this WIP kernel extension for macOS: [[https://github.com/relalis/macos-btrfs][macos-btrfs]]
  - maybe we can help out with the VFS/mount support
*** on-disk format
- [[https://btrfs.readthedocs.io/en/latest/dev/On-disk-format.html][on-disk-format]]
- 'btrfs consists entirely of several trees. the trees use copy-on-write.'
- trees are stored in nodes, each of which belongs to a level in the b-tree structure.
- internal nodes ('inodes' here) contain refs to other internal nodes on the /next/
  level, or to leaf nodes when the level reaches 0.
- leaf nodes contain various item types depending on the tree.
- basic structures (byte ranges are offset:size, offsets in hex)
  - 0:8 uint = objectid, each tree has its own set of object IDs
  - 8:1 uint = item type
  - 9:8 uint = offset, depends on type.
  - all fields are little-endian
  - fields are unsigned
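Read as hex offset:size pairs, that is a 17-byte key (u64 objectid, u8 type, u64 offset, all little-endian). A minimal decoding sketch in Rust; the struct and function names are our own, not kernel identifiers:

```rust
use std::convert::TryInto;

/// A btrfs disk key per the layout above: bytes 0..8 objectid (LE u64),
/// byte 8 item type (u8), bytes 9..17 offset (LE u64) - 17 bytes total.
/// Names are our own sketch, not taken from the kernel headers.
#[derive(Debug, PartialEq)]
pub struct DiskKey {
    pub objectid: u64,
    pub item_type: u8,
    pub offset: u64,
}

pub fn parse_disk_key(buf: &[u8; 17]) -> DiskKey {
    DiskKey {
        objectid: u64::from_le_bytes(buf[0..8].try_into().unwrap()),
        item_type: buf[8],
        offset: u64::from_le_bytes(buf[9..17].try_into().unwrap()),
    }
}
```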
- *superblock*
  - the primary superblock is located at 0x10000 (64 KiB)
  - mirror copies of the superblock are located at physical addresses 0x4000000 (64
    MiB) and 0x4000000000 (256 GiB), if valid. copies are updated simultaneously.
  - during mount only the first superblock at 0x10000 is read; an error causes the
    mount to fail.
  - BTRFS only recognizes disks with a valid superblock at 0x10000.
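A quick arithmetic check that the hex addresses match the stated sizes; the constant names are ours, not btrfs's:

```rust
// The three superblock locations noted above, expressed as shifts so the
// hex addresses can be sanity-checked against the stated sizes.
// Constant names are our own, not btrfs identifiers.
pub const SUPER_PRIMARY: u64 = 0x10000; // 64 KiB = 64 << 10
pub const SUPER_MIRROR_1: u64 = 0x4000000; // 64 MiB = 64 << 20
pub const SUPER_MIRROR_2: u64 = 0x4000000000; // 256 GiB = 256 << 30
```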
- *header*
  - stored at the start of every node
  - the data following it depends on whether the node is internal or a leaf.
- *inode* (internal node)
  - node header followed by a number of key pointers
  - 0:11 key
  - 11:8 uint = block number
  - 19:8 uint = generation
- *lnode* (leaf node)
  - node header followed by a number of item pointers
  - 0:11 key
  - 11:4 uint = data offset, relative to the end of the header (0x65 = 101 bytes)
  - 15:4 uint = data size
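The leaf item pointer above (hex offsets: key at 0, u32s at 0x11 and 0x15) decodes the same way as the key. A sketch, assuming the 101-byte (0x65) header size; names are our own:

```rust
use std::convert::TryInto;

/// A leaf item pointer per the layout above: a 17-byte disk key, then two
/// LE u32s at hex offsets 0x11 and 0x15 - 25 bytes total.
/// Names are our own sketch, not kernel identifiers.
#[derive(Debug, PartialEq)]
pub struct LeafItem {
    pub key: [u8; 17],    // raw disk key bytes (layout shown earlier)
    pub data_offset: u32, // relative to the end of the 0x65-byte node header
    pub data_size: u32,
}

pub fn parse_leaf_item(buf: &[u8; 25]) -> LeafItem {
    LeafItem {
        key: buf[0..17].try_into().unwrap(),
        data_offset: u32::from_le_bytes(buf[17..21].try_into().unwrap()),
        data_size: u32::from_le_bytes(buf[21..25].try_into().unwrap()),
    }
}

/// Absolute byte position of the item data within the node, assuming the
/// 0x65 (101-byte) header the offset is relative to.
pub fn data_pos_in_node(item: &LeafItem) -> usize {
    0x65 + item.data_offset as usize
}
```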
- objects
  - ROOT_TREE
    - holds ROOT_ITEMs, ROOT_REFs, and ROOT_BACKREFs for every tree other than itself.
    - used to find the other trees and to determine the subvolume structure.
    - holds items for the 'root tree directory'. its logical address is stored in the
      superblock.
  - objectIDs
    - free ids: BTRFS_FIRST_FREE_OBJECTID=256ULL to BTRFS_LAST_FREE_OBJECTID=-256ULL
    - the rest are reserved for internal use
*** send-stream format
- [[https://btrfs.readthedocs.io/en/latest/dev/dev-send-stream.html][send stream format]]
- The send stream format represents a linear sequence of commands describing actions
  to be performed on the target filesystem (receive side), created on the source
  filesystem (send side).
- The stream is currently used in two ways: to generate a stream representing a
  standalone subvolume (full mode) or a difference between two snapshots of the same
  subvolume (incremental mode).
- The stream can be generated using a set of other subvolumes to look for extent
  references that could lead to a more efficient stream by transferring only the
  references and not full data.
- The stream format is abstracted from on-disk structures (though it may share some
  BTRFS specifics); the stream instructions could be generated by other means than
  the send ioctl.
- it is a checksummed TLV stream
  - command header: u32 len, u16 cmd, u32 crc32c
  - attribute data: type, length, raw data
- the v2 protocol adds support for encoded commands
- the commands are kinda clunky - need to MKFIL/MKDIR then RENAM to create
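The header and TLV shapes above can be sketched directly. Assumptions here: little-endian fields, and u16 type / u16 length for the attribute TLVs; struct and function names are our own, not kernel identifiers:

```rust
use std::convert::TryInto;

/// Send-stream command header per the notes above: LE u32 data length,
/// LE u16 command id, LE u32 crc32c - 10 bytes.
/// Names are our own sketch, not kernel identifiers.
#[derive(Debug, PartialEq)]
pub struct CmdHeader {
    pub len: u32,
    pub cmd: u16,
    pub crc32c: u32,
}

pub fn parse_cmd_header(buf: &[u8; 10]) -> CmdHeader {
    CmdHeader {
        len: u32::from_le_bytes(buf[0..4].try_into().unwrap()),
        cmd: u16::from_le_bytes(buf[4..6].try_into().unwrap()),
        crc32c: u32::from_le_bytes(buf[6..10].try_into().unwrap()),
    }
}

/// One TLV attribute: LE u16 type, LE u16 length, then `length` raw bytes.
/// Returns (type, data, bytes consumed), or None if the buffer is too short.
pub fn parse_tlv(buf: &[u8]) -> Option<(u16, &[u8], usize)> {
    if buf.len() < 4 {
        return None;
    }
    let ty = u16::from_le_bytes(buf[0..2].try_into().unwrap());
    let len = u16::from_le_bytes(buf[2..4].try_into().unwrap()) as usize;
    let data = buf.get(4..4 + len)?;
    Some((ty, data, 4 + len))
}
```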
*** [2023-08-09 Wed] ioctls
- magic#: 0x94
- https://docs.kernel.org/userspace-api/ioctl/ioctl-number.html
- some btrfs ioctls have been lifted into the VFS/generic layer
- see fs/btrfs/ioctl.h and linux/fs.h
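The 0x94 magic slots into the standard asm-generic _IOC encoding: command nr in bits 0-7, magic in 8-15, argument size in 16-29, direction in 30-31. A sketch of the packing; function names are ours, and the worked example assumes BTRFS_IOC_SUBVOL_GETFLAGS is _IOR(0x94, 25, u64):

```rust
/// Linux _IOC encoding (asm-generic): command nr in bits 0-7, magic in
/// 8-15, argument size in 16-29, direction in 30-31.
pub const IOC_WRITE: u32 = 1;
pub const IOC_READ: u32 = 2;

pub fn ioc(dir: u32, magic: u32, nr: u32, size: u32) -> u32 {
    (dir << 30) | (size << 16) | (magic << 8) | nr
}

/// _IOR(0x94, nr, T): a read-direction btrfs ioctl with a T-sized argument.
pub fn btrfs_ior(nr: u32, size: u32) -> u32 {
    ioc(IOC_READ, 0x94, nr, size)
}
```

btrfs_ior(25, 8) packs to 0x80089419, the request number strace shows for BTRFS_IOC_SUBVOL_GETFLAGS (assuming the nr=25 recollection is right).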
** ZFS
-- [cite/t/f:@zfs]

- core component of TrueNAS software
** TMPFS
-- [cite/t/f:@tmpfs]
- in-mem FS
** EXT4
-- [cite/t/f:@ext4]
** XFS
-- [cite/t/f:@xfs]
-- [cite/t/f:@xfs-scalability]
* Storage Mediums
** HDD
-- [cite/t/f:@hd-failure-ml]
** SSD
-- [cite/t/f:@smart-ssd-qp]
-- [cite/t/f:@ssd-perf-opt]

** Flash
-- [cite/t/f:@flash-openssd-systems]
** NVMe
-- [cite/t/f:@nvme-ssd-ux]
-- [[https://nvmexpress.org/specifications/][specifications]]
*** ZNS
-- [cite/t/f:@zns-usenix]
#+begin_quote
Zoned Storage is an open source, standards-based initiative to enable data centers to
scale efficiently for the zettabyte storage capacity era. There are two technologies
behind Zoned Storage, Shingled Magnetic Recording (SMR) in ATA/SCSI HDDs and Zoned
Namespaces (ZNS) in NVMe SSDs.
#+end_quote
-- [[https://zonedstorage.io/][zonedstorage.io]]
-- $465 8tb 2.5"? [[https://www.serversupply.com/SSD/PCI-E/7.68TB/WESTERN%20DIGITAL/WUS4BB076D7P3E3_332270.htm][retail]]
** eMMC
-- [cite/t/f:@emmc-mobile-io]
* Linux
** syscalls
*** ioctl
- [[https://elixir.bootlin.com/linux/latest/source/Documentation/userspace-api/ioctl/ioctl-number.rst][ioctl-numbers]]
* Rust
** crates
*** nix
- [[https://crates.io/crates/nix][crates.io]]
*** memmap2
- [[https://crates.io/crates/memmap2][crates.io]]
*** zstd
- [[https://crates.io/crates/zstd][crates.io]]
*** rocksdb
- [[https://crates.io/crates/rocksdb][crates.io]]
*** tokio :tokio:
- [[https://crates.io/crates/tokio][crates.io]]
*** tracing :tokio:
- [[https://crates.io/crates/tracing][crates.io]]
**** tracing-subscriber
- [[https://crates.io/crates/tracing-subscriber][crates.io]]
*** axum :tokio:
- [[https://crates.io/crates/axum][crates.io]]
*** tower :tokio:
- [[https://crates.io/crates/tower][crates.io]]
*** uuid
- [[https://crates.io/crates/uuid][crates.io]]
** unstable
*** lazy_cell
- [[https://github.com/rust-lang/rust/issues/109736][tracking-issue]]
*** {BTreeMap,BTreeSet}::extract_if
- [[https://github.com/rust-lang/rust/issues/70530][tracking-issue]]
* Lisp
** ASDF
- [[https://gitlab.common-lisp.net/asdf/asdf][gitlab.common-lisp.net]]
- [[https://asdf.common-lisp.dev/][common-lisp.dev]]
- [[https://github.com/fare/asdf/blob/master/doc/best_practices.md][best-practices]]
- includes UIOP
** Reference Projects
*** StumpWM
- [[https://github.com/stumpwm/stumpwm][github]]
*** Nyxt
- [[https://github.com/atlas-engineer/nyxt][github]]
*** Kons-9
- [[https://github.com/kaveh808/kons-9][github]]
*** cl-torrents
- [[https://github.com/vindarel/cl-torrents][github]]
*** Mezzano
- [[https://github.com/froggey/Mezzano][github]]
*** yalo
- [[https://github.com/whily/yalo][github]]
*** cl-ledger
- [[https://github.com/ledger/cl-ledger][github]]
*** Lem
- [[https://github.com/lem-project/lem][github]]
*** kindista
- [[https://github.com/kindista/kindista][github]]
*** lisp-chat
- [[https://github.com/ryukinix/lisp-chat][github]]
* Refs
#+print_bibliography: