Mercurial > core / lisp/ffi/zstd/dict.lisp
changeset 657: 937a6f354047
parent: 7354623e5b54
child: 804b5ee20a46
author: Richard Westhaver <ellis@rwest.io>
date: Wed, 18 Sep 2024 21:48:06 -0400
permissions: -rw-r--r--
description: zstd tests and macros
;;; dict.lisp --- Zstd Dictionary API

 * Zstd dictionary builder
 *
 * Why should I use a dictionary?
 * ------------------------------
 *
 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 *
 * When is a dictionary useful?
 * ----------------------------
 *
 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.
 *
 * How do I use a dictionary?
 * --------------------------
 *
 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options, see zstd.h for
 * complete documentation.
 *
 * What is a zstd dictionary?
 * --------------------------
 *
 * A zstd dictionary has two pieces: its header and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.
 *
 * What is a raw content dictionary?
 * ---------------------------------
 *
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw
 *
 * How do I train a dictionary?
 * ----------------------------
 *
 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary
 *
 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point. If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
 *
 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed,
 * if it doesn't say too few samples, then a dictionary would not be effective.
 *
 * How large should my dictionary be?
 * ----------------------------------
 *
 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select the smallest size that doesn't
 * hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up
 *
 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------
 *
 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice.
 * Having too few samples
 * can hurt the dictionary's effectiveness. Having more samples will
 * only improve the dictionary's effectiveness. But having too many
 * samples can slow down the dictionary builder.
 *
 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------
 *
 * Simply train a dictionary and try it out. You can use zstd's built-in
 * benchmarking tool to test the dictionary's effectiveness.
 *
 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictionary
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 *
 * When should I retrain a dictionary?
 * -----------------------------------
 *
 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.
 *
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------
 *
 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data. It will add the zstd header to the
 * raw content, which contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.
 *
 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------
 *
 * No!
 * You can construct dictionary content however you please, it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.
 *
 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------
 *
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.
 *
 ******************************************************************************/

/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note: Dictionary training will fail if there are not enough samples to construct a
 *        dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *        If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *        would've been ineffective anyways. If you believe your samples would benefit from a dictionary
 *        please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that total size of all samples be about ~x100 times the target size of dictionary.

(deferror zstd-ddict-error (zstd-alien-error) ())
(deferror zstd-cdict-error (zstd-alien-error)
  (:report (lambda (c s)
             (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))

(define-alien-enum (zstd-dict-content-type int)

(define-alien-enum (zstd-dict-load-method int)

(define-alien-enum (zstd-force-ignore-checksum int)

(define-alien-enum (zstd-ref-multiple-ddicts int)
  :ref-multiple-ddicts 1)

(define-alien-enum (zstd-dict-attach-pref int)

(define-alien-enum (zstd-literal-compression-mode int)

(define-alien-enum (zstd-param-switch int)

(define-alien-enum (zstd-frame-type int)

(define-alien-enum (zstd-sequence-format int)
  :no-block-delimiters 0
  :explicit-block-delimiters 1)

;;; Simple Dictionary API
(define-alien-routine "ZSTD_compress_usingDict" size-t
  (dst-capacity size-t)
  (compression-level int))

(define-alien-routine "ZSTD_decompress_usingDict" size-t
  (dst-capacity size-t)

;;; Bulk-processing Dictionary API
(define-alien-type zstd-cdict (struct zstd-cdict-s))

(define-alien-routine "ZSTD_createCDict" (* zstd-cdict)
  (compression-level int))

(define-alien-routine "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_compress_usingCDict" size-t
  (dst-capacity size-t)
  (cdict (* zstd-cdict)))

(define-alien-type zstd-ddict (struct zstd-ddict-s))

(define-alien-routine "ZSTD_createDDict" (* zstd-ddict)

(define-alien-routine "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))

;; A DDict is a decompression dictionary; the zstd entry point that takes one
;; is ZSTD_decompress_usingDDict.
(define-alien-routine "ZSTD_decompress_usingDDict" size-t
  (dst-capacity size-t)
  (ddict (* zstd-ddict)))
(define-alien-routine "ZSTD_getDictID_fromDict" unsigned

(define-alien-routine "ZSTD_getDictID_fromCDict" unsigned
  (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_getDictID_fromDDict" unsigned
  (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_getDictID_fromFrame" unsigned

;; The C symbol is case-sensitive: ZSTD_estimateDDictSize.
(define-alien-routine "ZSTD_estimateDDictSize" size-t (dict-size size-t) (dict-load-method zstd-dict-load-method))

(defmacro with-zstd-cdict ((cv &key buffer size (level (zstd-defaultclevel))) &body body)
  ;; BUFFER is a form, so its length is computed in the expansion (at run
  ;; time), and the form is evaluated only once.
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,cv (* zstd-cdict)
                         (zstd-createcdict (cast (octets-to-alien ,buf) (* t))
                                           ,(or size `(length ,buf))
                                           ,level)))
         (unwind-protect (progn ,@body)
           (zstd-freecdict ,cv))))))

(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  ;; Same treatment as WITH-ZSTD-CDICT: take the length at run time.
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,dv (* zstd-ddict)
                         (zstd-createddict (cast (octets-to-alien ,buf) (* t))
                                           ,(or size `(length ,buf)))))
         (unwind-protect (progn ,@body)
           (zstd-freeddict ,dv))))))
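
;;; Usage sketch (illustrative only, not part of the original changeset).
;;; Shows how the two WITH-ZSTD-*DICT macros above might be combined.
;;; Assumptions: DICT-OCTETS is an octet vector accepted by OCTETS-TO-ALIEN,
;;; the zstd shared library is already loaded, and the Lisp names
;;; ZSTD-GETDICTID-FROMCDICT / ZSTD-GETDICTID-FROMDDICT are the ones
;;; DEFINE-ALIEN-ROUTINE derives from the C names. DICT-IDS-MATCH-P is a
;;; hypothetical helper name. :SIZE is passed explicitly so the dictionary
;;; length is taken from the run-time value of DICT-OCTETS.
(defun dict-ids-match-p (dict-octets)
  "Return T if the CDict and DDict built from DICT-OCTETS report the same
dictionary ID. A raw content dictionary reports ID 0 on both sides."
  (with-zstd-cdict (cd :buffer dict-octets :size (length dict-octets))
    (with-zstd-ddict (dd :buffer dict-octets :size (length dict-octets))
      (= (zstd-getdictid-fromcdict cd)
         (zstd-getdictid-fromddict dd)))))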