Mercurial > core / lisp/ffi/zstd/dict.lisp
changeset 698: 96958d3eb5b0
parent: 08621be7e780
author: Richard Westhaver <ellis@rwest.io>
date: Fri, 04 Oct 2024 22:04:59 -0400
permissions: -rw-r--r--
description: fixes

;;; dict.lisp --- Zstd Dictionary API

;; The CDict can be created once and shared across multiple threads since it's

;; Unclear if DDict is also read-only.

 * Zstd dictionary builder

 * Why should I use a dictionary?
 * ------------------------------

 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.

 * When is a dictionary useful?
 * ----------------------------

 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.

 * How do I use a dictionary?
 * --------------------------

 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options, see zstd.h for
 * complete documentation.

 * What is a zstd dictionary?
 * --------------------------

 * A zstd dictionary has two pieces: Its header, and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.

 * What is a raw content dictionary?
 * ---------------------------------

 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw

 * How do I train a dictionary?
 * ----------------------------

 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary

 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point. If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.

 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed;
 * if it doesn't say too few samples, then a dictionary would not be effective.

 * How large should my dictionary be?
 * ----------------------------------

 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select the smallest size that doesn't
 * hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up

 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------

 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice. Having too few samples
 * can hurt the dictionary's effectiveness. Having more samples will
 * only improve the dictionary's effectiveness. But having too many
 * samples can slow down the dictionary builder.

 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------

 * Simply train a dictionary and try it out. You can use zstd's built-in
 * benchmarking tool to test the dictionary effectiveness.

 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictionary
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary

 * When should I retrain a dictionary?
 * -----------------------------------

 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.

 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------

 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data. It will add the zstd header to the
 * raw content, which contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.

 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------

 * No! You can construct dictionary content however you please; it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.

 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------

 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.

 ******************************************************************************/

/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,

 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note: Dictionary training will fail if there are not enough samples to construct a
 *        dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *        If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *        would've been ineffective anyway. If you believe your samples would benefit from a dictionary
 *        please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that total size of all samples be about 100 times the target size of dictionary.

(deferror zstd-ddict-error (zstd-alien-error) ())
(deferror zstd-cdict-error (zstd-alien-error)

  (:report (lambda (c s)
             (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))

(define-alien-enum (zstd-dict-content-type int)

(define-alien-enum (zstd-dict-load-method int)

(define-alien-enum (zstd-force-ignore-checksum int)

(define-alien-enum (zstd-ref-multiple-ddicts int)

  :ref-multiple-ddicts 1)

(define-alien-enum (zstd-dict-attach-pref int)

(define-alien-enum (zstd-literal-compression-mode int)

(define-alien-enum (zstd-param-switch int)

(define-alien-enum (zstd-frame-type int)

(define-alien-enum (zstd-sequence-format int)
  :no-block-delimiters 0
  :explicit-block-delimiters 1)

;;; Simple Dictionary API
(define-alien-routine "ZSTD_compress_usingDict" size-t

  (dst-capacity size-t)

  (compression-level int))

(define-alien-routine "ZSTD_decompress_usingDict" size-t

  (dst-capacity size-t)

;;; Bulk-processing Dictionary API
(define-alien-type zstd-cdict (struct zstd-cdict-s))

(define-alien-routine "ZSTD_createCDict" (* zstd-cdict)

  (compression-level int))

(define-alien-routine "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_compress_usingCDict" size-t

  (dst-capacity size-t)

  (cdict (* zstd-cdict)))

(define-alien-type zstd-ddict (struct zstd-ddict-s))

(define-alien-routine "ZSTD_createDDict" (* zstd-ddict)

(define-alien-routine "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_decompress_usingDDict" size-t

  (dst-capacity size-t)

  (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_getDictID_fromDict" unsigned

(define-alien-routine "ZSTD_getDictID_fromCDict" unsigned
  (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_getDictID_fromDDict" unsigned
  (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_getDictID_fromFrame" unsigned

(define-alien-routine "ZSTD_estimatedDictSize" size-t (dict-size size-t) (dict-load-method zstd-dict-load-method))

(defmacro with-zstd-cdict ((cv &key buffer size (level (zstd-defaultclevel))) &body body)
  `(with-alien ((,cv (* zstd-cdict) (zstd-createcdict (cast (octets-to-alien ,buffer) (* t))
                                                      (or ,size (length ,buffer))

     (unwind-protect (progn ,@body)
       (zstd-freecdict ,cv))))

(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  `(with-alien ((,dv (* zstd-ddict)
                     (zstd-createddict (cast (octets-to-alien ,buffer) (* t)) (or ,size (length ,buffer)))))
     (unwind-protect (progn ,@body)
       (zstd-freeddict ,dv))))

(define-alien-type zstd-cover-params
  (struct zdict-cover-params

    (nb-threads unsigned)

    (shrink-dict unsigned)
    (shrink-dict-max-regression unsigned)
    (zparams zdict-params)))

(define-alien-routine ("ZDICT_trainFromBuffer" zdict-train-from-buffer) size-t

  (dict-buffer-capacity size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned))

(define-alien-type zdict-params
  (struct zdict-params-t
    (compression-level int)
    (notification-level unsigned)

;; NOTE: Requires returning struct by value

;; This is the ONLY function which uses libzstd-alien.so right now.
(define-alien-routine ("ZDICT_finalizeDictionaryWithParams" zdict-finalize-dictionary) size-t
  (dst-dict-buffer (* t))
  (max-dict-size size-t)

  (dict-content-size size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned)
  (parameters (* zdict-params)))

(define-alien-routine ("ZDICT_getDictID" zdict-get-dict-id) unsigned

(define-alien-routine ("ZDICT_getDictHeaderSize" zdict-get-dict-header-size) size-t

(define-alien-routine ("ZDICT_isError" zdict-is-error) unsigned