core:lisp/ffi/zstd/dict.lisp

changeset 657:	937a6f354047
parent:	7354623e5b54
child:	804b5ee20a46
author:	Richard Westhaver <ellis@rwest.io>
date:	Wed, 18 Sep 2024 21:48:06 -0400
permissions:	-rw-r--r--
description:	zstd tests and macros
;;; dict.lisp --- Zstd Dictionary API

;;

;;; Commentary:

;; From zdict.h:
#|
 * Zstd dictionary builder
 *
 * FAQ
 * ===
 * Why should I use a dictionary?
 * ------------------------------
 *
 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 *
 * When is a dictionary useful?
 * ----------------------------
 *
 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.
 *
 * How do I use a dictionary?
 * --------------------------
 *
 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options, see zstd.h for
 * complete documentation.
 *
 * What is a zstd dictionary?
 * --------------------------
 *
 * A zstd dictionary has two pieces: Its header, and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.
 *
 * What is a raw content dictionary?
 * ---------------------------------
 *
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw
 * content dictionary.
 *
 * How do I train a dictionary?
 * ----------------------------
 *
 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary
 * per use case.
 *
 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point. If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
 *
 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed;
 * if it doesn't say too few samples, then a dictionary would not be effective.
 *
 * How large should my dictionary be?
 * ----------------------------------
 *
 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select the smallest size that doesn't
 * hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up
 * compression.
 *
 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------
 *
 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice. Having too few samples
 * can hurt the dictionary's effectiveness. Having more samples will
 * only improve the dictionary's effectiveness. But having too many
 * samples can slow down the dictionary builder.
 *
 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------
 *
 * Simply train a dictionary and try it out. You can use zstd's built-in
 * benchmarking tool to test the dictionary effectiveness.
 *
 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictionary
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 *
 * When should I retrain a dictionary?
 * -----------------------------------
 *
 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.
 *
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------
 *
 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data. It will add the zstd header to the
 * raw content, which contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.
 *
 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------
 *
 * No! You can construct dictionary content however you please; it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.
 *
 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------
 *
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.
 *
 ******************************************************************************/


/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
 *  f=20, and accel=1.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note:  Dictionary training will fail if there are not enough samples to construct a
 *         dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *         If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *         would've been ineffective anyway. If you believe your samples would benefit from a dictionary
 *         please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that total size of all samples be about 100 times the target size of dictionary.
 */
|#
;;; Code:
(in-package :zstd)
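
;; The ZDICT_trainFromBuffer() notes quoted above ask for all samples
;; concatenated in one flat buffer plus an array giving the size of each
;; sample, in order. This file does not bind that function, so what follows is
;; only a sketch of how those two arguments could be prepared from a list of
;; octet vectors. FLATTEN-SAMPLES and SAMPLE-VOLUME-OK-P are hypothetical
;; names, and the factor of 100 is the rule of thumb from the FAQ, not a limit
;; enforced by zstd.
(defun flatten-samples (samples)
  "Return (values BUFFER SIZES) for SAMPLES, a list of octet vectors."
  (let* ((sizes (map 'vector #'length samples))
         (buffer (make-array (reduce #'+ sizes) :element-type '(unsigned-byte 8)))
         (start 0))
    ;; Copy each sample behind the previous one.
    (loop for sample in samples
          do (replace buffer sample :start1 start)
             (incf start (length sample)))
    (values buffer sizes)))

(defun sample-volume-ok-p (sizes dict-capacity)
  "True if the samples total at least 100 times DICT-CAPACITY bytes."
  (>= (reduce #'+ sizes) (* 100 dict-capacity)))
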
(deferror zstd-ddict-error (zstd-alien-error) ())
(deferror zstd-cdict-error (zstd-alien-error)
    ()
    (:report (lambda (c s)
               (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))

(define-alien-enum (zstd-dict-content-type int)
                   :auto 0
                   :raw-content 1
                   :full-dict 2)

(define-alien-enum (zstd-dict-load-method int)
                   :by-copy 0
                   :by-ref 1)

(define-alien-enum (zstd-force-ignore-checksum int)
                   :validate-checksum 0
                   :ignore-checksum 1)

(define-alien-enum (zstd-ref-multiple-ddicts int)
                   :ref-single-ddict 0
                   :ref-multiple-ddicts 1)

(define-alien-enum (zstd-dict-attach-pref int)
                   :default-attach 0
                   :force-attach 1
                   :force-copy 2
                   :force-load 3)

(define-alien-enum (zstd-literal-compression-mode int)
                   :auto 0
                   :huffman 1
                   :uncompressed 2)

(define-alien-enum (zstd-param-switch int)
                   :auto 0
                   :enable 1
                   :disable 2)

(define-alien-enum (zstd-frame-type int)
                   :frame 0
                   :skippable-frame 1)

(define-alien-enum (zstd-sequence-format int)
                   :no-block-delimiters 0
                   :explicit-block-delimiters 1)

;;; Simple Dictionary API
(define-alien-routine "ZSTD_compress_usingDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_decompress_usingDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t))

;;; Bulk-processing Dictionary API
(define-alien-type zstd-cdict (struct zstd-cdict-s))

(define-alien-routine "ZSTD_createCDict" (* zstd-cdict)
  (dict-buffer (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_compress_usingCDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (cdict (* zstd-cdict)))

(define-alien-type zstd-ddict (struct zstd-ddict-s))

(define-alien-routine "ZSTD_createDDict" (* zstd-ddict)
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))

;; A DDict is a decompression dictionary, so the C entry point is
;; ZSTD_decompress_usingDDict().
(define-alien-routine "ZSTD_decompress_usingDDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (ddict (* zstd-ddict)))

;; dictionary utils
(define-alien-routine "ZSTD_getDictID_fromDict" unsigned
  (dict (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_getDictID_fromCDict" unsigned
  (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_getDictID_fromDDict" unsigned
  (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_getDictID_fromFrame" unsigned
  (src (* t))
  (src-size size-t))
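
;; The commentary says a zstd dictionary header starts with a magic number
;; followed by the dictionary ID. As a minimal sketch, assuming the layout in
;; the zstd format documentation (4-byte little-endian magic #xEC30A437, then
;; a 4-byte little-endian ID), the same lookup can be done on a Lisp octet
;; vector without calling into libzstd. +ZSTD-MAGIC-DICTIONARY+, READ-LE32 and
;; DICT-ID-FROM-OCTETS are hypothetical helpers, not part of these bindings.
(defconstant +zstd-magic-dictionary+ #xEC30A437)

(defun read-le32 (octets offset)
  "Read an unsigned 32-bit little-endian integer from OCTETS at OFFSET."
  (logior (aref octets offset)
          (ash (aref octets (+ offset 1)) 8)
          (ash (aref octets (+ offset 2)) 16)
          (ash (aref octets (+ offset 3)) 24)))

(defun dict-id-from-octets (octets)
  "Return the dictionary ID stored in OCTETS, or 0 if OCTETS has no zstd
dictionary header (that is, it can only be a raw content dictionary)."
  (if (and (>= (length octets) 8)
           (= (read-le32 octets 0) +zstd-magic-dictionary+))
      (read-le32 octets 4)
      0))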

;; zstd.h names this function ZSTD_estimateDDictSize().
(define-alien-routine "ZSTD_estimateDDictSize" size-t
  (dict-size size-t)
  (dict-load-method zstd-dict-load-method))

;; BUFFER, SIZE and LEVEL are forms evaluated at run time. BUFFER is bound once
;; so SIZE can default to its length without evaluating BUFFER twice.
(defmacro with-zstd-cdict ((cv &key buffer size (level '(zstd-defaultclevel))) &body body)
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,cv (* zstd-cdict)
                         (zstd-createcdict (cast (octets-to-alien ,buf) (* t))
                                           ,(or size `(length ,buf))
                                           ,level)))
         (unwind-protect (progn ,@body)
           (zstd-freecdict ,cv))))))

(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,dv (* zstd-ddict)
                         (zstd-createddict (cast (octets-to-alien ,buf) (* t))
                                           ,(or size `(length ,buf)))))
         (unwind-protect (progn ,@body)
           (zstd-freeddict ,dv))))))
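
;; Usage sketch for the two macros above. It assumes the alien routines are
;; exposed under lower-cased Lisp names in the same way ZSTD_createCDict
;; appears as ZSTD-CREATECDICT in the macros, so ZSTD-GETDICTID-FROMCDICT and
;; ZSTD-GETDICTID-FROMDDICT are inferred names. DICT-IDS-AGREE-P is a
;; hypothetical helper. It needs libzstd loaded and does not check for a null
;; pointer from the create calls, which a real caller probably should.
(defun dict-ids-agree-p (dict)
  "True if the CDict and DDict built from the octet vector DICT report the
same dictionary ID (0 for a raw content dictionary)."
  (with-zstd-cdict (cd :buffer dict :size (length dict))
    (with-zstd-ddict (dd :buffer dict :size (length dict))
      (= (zstd-getdictid-fromcdict cd)
         (zstd-getdictid-fromddict dd)))))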