Mercurial > core / lisp/ffi/zstd/dict.lisp

changeset 657: 937a6f354047
parent: 7354623e5b54
child: 804b5ee20a46
author: Richard Westhaver <ellis@rwest.io>
date: Wed, 18 Sep 2024 21:48:06 -0400
permissions: -rw-r--r--
description: zstd tests and macros
;;; dict.lisp --- Zstd Dictionary API

;;

;;; Commentary:

;; From zdict.h:
#|
 * Zstd dictionary builder
 *
 * FAQ
 * ===
 * Why should I use a dictionary?
 * ------------------------------
 *
 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 *
 * When is a dictionary useful?
 * ----------------------------
 *
 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.
 *
 * How do I use a dictionary?
 * --------------------------
 *
 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options, see zstd.h for
 * complete documentation.
 *
 * What is a zstd dictionary?
 * --------------------------
 *
 * A zstd dictionary has two pieces: Its header, and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.
 *
 * What is a raw content dictionary?
 * ---------------------------------
 *
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw
 * content dictionary.
 *
 * How do I train a dictionary?
 * ----------------------------
 *
 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary
 * per use case.
 *
 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point. If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
 *
 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed,
 * if it doesn't say too few samples, then a dictionary would not be effective.
 *
 * How large should my dictionary be?
 * ----------------------------------
 *
 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select the smallest size that doesn't
 * hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up
 * compression.
 *
 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------
 *
 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice. Having too few samples
 * can hurt the dictionary's effectiveness. Having more samples will
 * only improve the dictionary's effectiveness. But having too many
 * samples can slow down the dictionary builder.
 *
 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------
 *
 * Simply train a dictionary and try it out. You can use zstd's built-in
 * benchmarking tool to test the dictionary effectiveness.
 *
 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictionary
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 *
 * When should I retrain a dictionary?
 * -----------------------------------
 *
 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.
 *
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------
 *
 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data. It will add the zstd header to the
 * raw content, which contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.
 *
 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------
 *
 * No! You can construct dictionary content however you please, it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.
 *
 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------
 *
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.
 *
 ******************************************************************************/


/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
 *  f=20, and accel=1.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note:  Dictionary training will fail if there are not enough samples to construct a
 *         dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *         If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *         would've been ineffective anyways. If you believe your samples would benefit from a dictionary
 *         please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~100 KB.
 *        It's possible to select a smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that the total size of all samples be about 100 times the target size of the dictionary.
 */
|#
;;; Code:
(in-package :zstd)
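
;; A minimal sketch of the flat sample layout that the commentary above
;; describes for ZDICT_trainFromBuffer(): every sample concatenated into one
;; octet buffer, plus a vector giving the size of each sample, in order.
;; FLATTEN-SAMPLES is a hypothetical helper for illustration only. It is not
;; part of zstd or of these bindings, makes no foreign calls, and assumes
;; that SAMPLES is a list of octet vectors.
(defun flatten-samples (samples)
  "Return two values: the octets of SAMPLES concatenated in order, and a
vector of the individual sample sizes."
  (let* ((sizes (map 'vector #'length samples))
         (flat (make-array (reduce #'+ sizes)
                           :element-type '(unsigned-byte 8)))
         (offset 0))
    ;; Copy each sample to its running offset in the flat buffer.
    (dolist (sample samples (values flat sizes))
      (replace flat sample :start1 offset)
      (incf offset (length sample)))))

;; Usage: (flatten-samples (list #(1 2 3) #(4 5)))
;; returns #(1 2 3 4 5) and #(3 2).
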
(deferror zstd-ddict-error (zstd-alien-error) ())
(deferror zstd-cdict-error (zstd-alien-error)
  ()
  (:report (lambda (c s)
             (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))

(define-alien-enum (zstd-dict-content-type int)
  :auto 0
  :raw-content 1
  :full-dict 2)

(define-alien-enum (zstd-dict-load-method int)
  :by-copy 0
  :by-ref 1)

(define-alien-enum (zstd-force-ignore-checksum int)
  :validate-checksum 0
  :ignore-checksum 1)

(define-alien-enum (zstd-ref-multiple-ddicts int)
  :ref-single-ddict 0
  :ref-multiple-ddicts 1)

(define-alien-enum (zstd-dict-attach-pref int)
  :default-attach 0
  :force-attach 1
  :force-copy 2
  :force-load 3)

(define-alien-enum (zstd-literal-compression-mode int)
  :auto 0
  :huffman 1
  :uncompressed 2)

(define-alien-enum (zstd-param-switch int)
  :auto 0
  :enable 1
  :disable 2)

(define-alien-enum (zstd-frame-type int)
  :frame 0
  :skippable-frame 1)

(define-alien-enum (zstd-sequence-format int)
  :no-block-delimiters 0
  :explicit-block-delimiters 1)

;;; Simple Dictionary API
(define-alien-routine "ZSTD_compress_usingDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_decompress_usingDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t))

;;; Bulk-processing Dictionary API
(define-alien-type zstd-cdict (struct zstd-cdict-s))

(define-alien-routine "ZSTD_createCDict" (* zstd-cdict)
  (dict-buffer (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_compress_usingCDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (cdict (* zstd-cdict)))

(define-alien-type zstd-ddict (struct zstd-ddict-s))

(define-alien-routine "ZSTD_createDDict" (* zstd-ddict)
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))

;; The decompression counterpart of ZSTD_compress_usingCDict; libzstd has
;; no ZSTD_compress_usingDDict symbol.
(define-alien-routine "ZSTD_decompress_usingDDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (ddict (* zstd-ddict)))

;; dictionary utils
(define-alien-routine "ZSTD_getDictID_fromDict" unsigned
  (dict (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_getDictID_fromCDict" unsigned
  (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_getDictID_fromDDict" unsigned
  (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_getDictID_fromFrame" unsigned
  (src (* t))
  (src-size size-t))
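
;; A minimal sketch, in plain Lisp with no foreign calls, of the header
;; lookup that ZSTD_getDictID_fromDict performs. Per the Zstandard format
;; specification, a conformant dictionary starts with the 4-byte
;; little-endian magic number #xEC30A437 followed by a 4-byte little-endian
;; dictionary ID; any other buffer is a raw content dictionary, reported as
;; ID 0. +ZSTD-DICT-MAGIC+, READ-U32LE and PEEK-DICT-ID are hypothetical
;; names used for illustration; they are not part of zstd or this library.
(defconstant +zstd-dict-magic+ #xEC30A437)

(defun read-u32le (octets offset)
  "Read an unsigned 32-bit little-endian integer from OCTETS at OFFSET."
  (logior (aref octets offset)
          (ash (aref octets (+ offset 1)) 8)
          (ash (aref octets (+ offset 2)) 16)
          (ash (aref octets (+ offset 3)) 24)))

(defun peek-dict-id (octets)
  "Return the dictionary ID stored in OCTETS, or 0 if OCTETS does not start
with a zstd dictionary header."
  (if (and (>= (length octets) 8)
           (= (read-u32le octets 0) +zstd-dict-magic+))
      (read-u32le octets 4)
      0))

;; Usage: (peek-dict-id #(#x37 #xA4 #x30 #xEC 1 0 0 0)) returns 1, while
;; (peek-dict-id #(1 2 3)) returns 0 because the buffer has no header.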

;; Declared in the experimental (ZSTD_STATIC_LINKING_ONLY) section of zstd.h.
(define-alien-routine "ZSTD_estimateDDictSize" size-t
  (dict-size size-t)
  (dict-load-method zstd-dict-load-method))

;; BUFFER, SIZE and LEVEL are forms evaluated at run time, not at
;; macroexpansion time. SIZE defaults to (length BUFFER).
(defmacro with-zstd-cdict ((cv &key buffer size (level '(zstd-defaultclevel))) &body body)
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,cv (* zstd-cdict)
                         (zstd-createcdict (cast (octets-to-alien ,buf) (* t))
                                           ,(or size `(length ,buf)) ,level)))
         (unwind-protect (progn ,@body)
           (zstd-freecdict ,cv))))))

(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,dv (* zstd-ddict)
                         (zstd-createddict (cast (octets-to-alien ,buf) (* t))
                                           ,(or size `(length ,buf)))))
         (unwind-protect (progn ,@body)
           (zstd-freeddict ,dv))))))
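
;; A usage sketch for WITH-ZSTD-CDICT that relies only on routines bound in
;; this file. RAW-DICT-ID is a hypothetical name, not part of the library.
;; It assumes libzstd is loaded and that OCTETS is an octet vector accepted
;; by OCTETS-TO-ALIEN. As the commentary above notes, a raw content
;; dictionary has no dictionary ID, so for a buffer of arbitrary bytes this
;; is expected to return 0. :SIZE is passed explicitly so the call does not
;; depend on how the macro computes its default.
(defun raw-dict-id (octets)
  "Load OCTETS as a CDict and return the dictionary ID zstd reports for it."
  (with-zstd-cdict (cd :buffer octets :size (length octets))
    (zstd-getdictid-fromcdict cd)))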