
Mercurial > core / lisp/ffi/zstd/dict.lisp

changeset 698: 96958d3eb5b0
parent: 08621be7e780
author: Richard Westhaver <ellis@rwest.io>
date: Fri, 04 Oct 2024 22:04:59 -0400
permissions: -rw-r--r--
description: fixes
;;; dict.lisp --- Zstd Dictionary API

;;

;;; Commentary:

;; The CDict can be created once and shared across multiple threads since it's
;; read-only.

;; Unclear if DDict is also read-only.

;; From zdict.h:
#|
 * Zstd dictionary builder
 *
 * FAQ
 * ===
 * Why should I use a dictionary?
 * ------------------------------
 *
 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 *
 * When is a dictionary useful?
 * ----------------------------
 *
 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.
 *
 * How do I use a dictionary?
 * --------------------------
 *
 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options, see zstd.h for
 * complete documentation.
 *
 * What is a zstd dictionary?
 * --------------------------
 *
 * A zstd dictionary has two pieces: its header, and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.
 *
 * What is a raw content dictionary?
 * ---------------------------------
 *
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw
 * content dictionary.
 *
 * How do I train a dictionary?
 * ----------------------------
 *
 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary
 * per use case.
 *
 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point. If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
 *
 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed;
 * if they don't say too few samples, then a dictionary would not be effective.
 *
 * How large should my dictionary be?
 * ----------------------------------
 *
 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select the smallest size that doesn't
 * hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up
 * compression.
 *
 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------
 *
 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice. Having too few samples
 * can hurt the dictionary's effectiveness. Having more samples will
 * only improve the dictionary's effectiveness. But having too many
 * samples can slow down the dictionary builder.
 *
 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------
 *
 * Simply train a dictionary and try it out. You can use zstd's built-in
 * benchmarking tool to test the dictionary's effectiveness.
 *
 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictionary
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 *
 * When should I retrain a dictionary?
 * -----------------------------------
 *
 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.
 *
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------
 *
 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data. It will add the zstd header to the
 * raw content, which contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.
 *
 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------
 *
 * No! You can construct dictionary content however you please, it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.
 *
 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------
 *
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.
 *
 ******************************************************************************/


/*! ZDICT_trainFromBuffer():
 * Train a dictionary from an array of samples.
 * Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
 * f=20, and accel=1.
 * Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 * supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 * The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 * or an error code, which can be tested with ZDICT_isError().
 * Note: Dictionary training will fail if there are not enough samples to construct a
 * dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 * If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 * would've been ineffective anyway. If you believe your samples would benefit from a dictionary
 * please open an issue with details, and we can look into it.
 * Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 * Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 * It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 * In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 * It's recommended that total size of all samples be about 100 times the target size of dictionary.
 */
|#
;;; Code:
(in-package :zstd)
(deferror zstd-ddict-error (zstd-alien-error) ())
(deferror zstd-cdict-error (zstd-alien-error)
  ()
  (:report (lambda (c s)
             (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))

(define-alien-enum (zstd-dict-content-type int)
  :auto 0
  :raw-content 1
  :full-dict 2)

(define-alien-enum (zstd-dict-load-method int)
  :by-copy 0
  :by-ref 1)

(define-alien-enum (zstd-force-ignore-checksum int)
  :validate-checksum 0
  :ignore-checksum 1)

(define-alien-enum (zstd-ref-multiple-ddicts int)
  :ref-single-ddict 0
  :ref-multiple-ddicts 1)

(define-alien-enum (zstd-dict-attach-pref int)
  :default-attach 0
  :force-attach 1
  :force-copy 2
  :force-load 3)

(define-alien-enum (zstd-literal-compression-mode int)
  :auto 0
  :huffman 1
  :uncompressed 2)

(define-alien-enum (zstd-param-switch int)
  :auto 0
  :enable 1
  :disable 2)

(define-alien-enum (zstd-frame-type int)
  :frame 0
  :skippable-frame 1)

(define-alien-enum (zstd-sequence-format int)
  :no-block-delimiters 0
  :explicit-block-delimiters 1)

;;; Simple Dictionary API
(define-alien-routine "ZSTD_compress_usingDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_decompress_usingDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t))

;;; Bulk-processing Dictionary API
(define-alien-type zstd-cdict (struct zstd-cdict-s))

(define-alien-routine "ZSTD_createCDict" (* zstd-cdict)
  (dict-buffer (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_compress_usingCDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (cdict (* zstd-cdict)))

(define-alien-type zstd-ddict (struct zstd-ddict-s))

(define-alien-routine "ZSTD_createDDict" (* zstd-ddict)
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_decompress_usingDDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (ddict (* zstd-ddict)))

;; dictionary utils
(define-alien-routine "ZSTD_getDictID_fromDict" unsigned
  (dict (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_getDictID_fromCDict" unsigned
  (cdict (* zstd-cdict)))

;; The argument is a DDict, so it is named DDICT rather than CDICT.
(define-alien-routine "ZSTD_getDictID_fromDDict" unsigned
  (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_getDictID_fromFrame" unsigned
  (src (* t))
  (src-size size-t))
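
;; Illustration (not part of the bindings): a pure-Lisp sketch of the check
;; that ZSTD_getDictID_fromDict makes on a dictionary buffer. It assumes the
;; documented dictionary layout: a 4-byte little-endian magic number
;; (#xEC30A437) followed by a 4-byte little-endian dictionary ID. The names
;; +ZSTD-DICT-MAGIC+, READ-U32-LE and OCTETS-DICT-ID are hypothetical helpers
;; introduced for this example only.
(defconstant +zstd-dict-magic+ #xEC30A437)

(defun read-u32-le (octets offset)
  "Read an unsigned 32-bit little-endian integer from OCTETS at OFFSET."
  (logior (aref octets offset)
          (ash (aref octets (+ offset 1)) 8)
          (ash (aref octets (+ offset 2)) 16)
          (ash (aref octets (+ offset 3)) 24)))

(defun octets-dict-id (octets)
  "Return the dictionary ID stored in OCTETS, or 0 when OCTETS has no zstd
dictionary header and can only serve as a raw content dictionary."
  (if (and (>= (length octets) 8)
           (= (read-u32-le octets 0) +zstd-dict-magic+))
      (read-u32-le octets 4)
      0))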

;; zstd.h names this function ZSTD_estimateDDictSize; the Lisp name is unchanged.
(define-alien-routine "ZSTD_estimateDDictSize" size-t
  (dict-size size-t)
  (dict-load-method zstd-dict-load-method))

(defmacro with-zstd-cdict ((cv &key buffer size (level (zstd-defaultclevel))) &body body)
  "Bind CV to a CDict built from the octets in BUFFER, run BODY, then free it.
BUFFER is evaluated only once."
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,cv (* zstd-cdict)
                         (zstd-createcdict (cast (octets-to-alien ,buf) (* t))
                                           (or ,size (length ,buf))
                                           ,level)))
         (unwind-protect (progn ,@body)
           (zstd-freecdict ,cv))))))

(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  "Bind DV to a DDict built from the octets in BUFFER, run BODY, then free it.
BUFFER is evaluated only once."
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,dv (* zstd-ddict)
                         (zstd-createddict (cast (octets-to-alien ,buf) (* t))
                                           (or ,size (length ,buf)))))
         (unwind-protect (progn ,@body)
           (zstd-freeddict ,dv))))))

;;; zdict.h
;; ZDICT-PARAMS is defined first because ZSTD-COVER-PARAMS embeds it.
(define-alien-type zdict-params
  (struct zdict-params-t
    (compression-level int)
    (notification-level unsigned)
    (dict-id unsigned)))

(define-alien-type zstd-cover-params
  (struct zdict-cover-params
    (k unsigned)
    (d unsigned)
    (steps unsigned)
    (nb-threads unsigned)
    (split-point double)
    (shrink-dict unsigned)
    (shrink-dict-max-regression unsigned)
    (zparams zdict-params)))

(define-alien-routine ("ZDICT_trainFromBuffer" zdict-train-from-buffer) size-t
  (dict-buffer (* t))
  (dict-buffer-capacity size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned))
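
;; Illustration (not part of the bindings): ZDICT_trainFromBuffer expects all
;; samples concatenated in one flat buffer, plus an array giving the size of
;; each sample in order. A minimal pure-Lisp sketch of building that layout
;; from a sequence of octet vectors follows. FLATTEN-SAMPLES is a hypothetical
;; helper; copying its results into alien memory before calling
;; ZDICT-TRAIN-FROM-BUFFER is left out.
(defun flatten-samples (samples)
  "Return two values: a fresh octet vector holding SAMPLES concatenated in
order, and a vector of the individual sample sizes."
  (let* ((sizes (map 'vector #'length samples))
         (flat (make-array (reduce #'+ sizes)
                           :element-type '(unsigned-byte 8)))
         (pos 0))
    ;; Copy each sample to its offset in the flat buffer.
    (map nil (lambda (sample)
               (replace flat sample :start1 pos)
               (incf pos (length sample)))
         samples)
    (values flat sizes)))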

;; NOTE: zdict.h declares ZDICT_finalizeDictionary() with its ZDICT_params_t
;; argument passed by value; the routine bound here takes a pointer instead.

;; This is the ONLY function which uses libzstd-alien.so right now.
(define-alien-routine ("ZDICT_finalizeDictionaryWithParams" zdict-finalize-dictionary) size-t
  (dst-dict-buffer (* t))
  (max-dict-size size-t)
  (dict-content (* t))
  (dict-content-size size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned)
  (parameters (* zdict-params)))

(define-alien-routine ("ZDICT_getDictID" zdict-get-dict-id) unsigned
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine ("ZDICT_getDictHeaderSize" zdict-get-dict-header-size) size-t
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine ("ZDICT_isError" zdict-is-error) unsigned
  (error-code size-t))