core:lisp/ffi/zstd/dict.lisp


changeset 698:	96958d3eb5b0
parent:	08621be7e780
author:	Richard Westhaver <ellis@rwest.io>
date:	Fri, 04 Oct 2024 22:04:59 -0400
permissions:	-rw-r--r--
description:	fixes
;;; dict.lisp --- Zstd Dictionary API

;;

;;; Commentary:

;; The CDict can be created once and shared across multiple threads since it's
;; read-only.

;; Unclear if DDict is also read-only.

;; From zdict.h:
#|
 * Zstd dictionary builder
 *
 * FAQ
 * ===
 * Why should I use a dictionary?
 * ------------------------------
 *
 * Zstd can use dictionaries to improve compression ratio of small data.
 * Traditionally small files don't compress well because there is very little
 * repetition in a single sample, since it is small. But, if you are compressing
 * many similar files, like a bunch of JSON records that share the same
 * structure, you can train a dictionary ahead of time on some samples of
 * these files. Then, zstd can use the dictionary to find repetitions that are
 * present across samples. This can vastly improve compression ratio.
 *
 * When is a dictionary useful?
 * ----------------------------
 *
 * Dictionaries are useful when compressing many small files that are similar.
 * The larger a file is, the less benefit a dictionary will have. Generally,
 * we don't expect dictionary compression to be effective past 100KB. And the
 * smaller a file is, the more we would expect the dictionary to help.
 *
 * How do I use a dictionary?
 * --------------------------
 *
 * Simply pass the dictionary to the zstd compressor with
 * `ZSTD_CCtx_loadDictionary()`. The same dictionary must then be passed to
 * the decompressor, using `ZSTD_DCtx_loadDictionary()`. There are other
 * more advanced functions that allow selecting some options, see zstd.h for
 * complete documentation.
 *
 * What is a zstd dictionary?
 * --------------------------
 *
 * A zstd dictionary has two pieces: Its header, and its content. The header
 * contains a magic number, the dictionary ID, and entropy tables. These
 * entropy tables allow zstd to save on header costs in the compressed file,
 * which really matters for small data. The content is just bytes, which are
 * repeated content that is common across many samples.
 *
 * What is a raw content dictionary?
 * ---------------------------------
 *
 * A raw content dictionary is just bytes. It doesn't have a zstd dictionary
 * header, a dictionary ID, or entropy tables. Any buffer is a valid raw
 * content dictionary.
 *
 * How do I train a dictionary?
 * ----------------------------
 *
 * Gather samples from your use case. These samples should be similar to each
 * other. If you have several use cases, you could try to train one dictionary
 * per use case.
 *
 * Pass those samples to `ZDICT_trainFromBuffer()` and that will train your
 * dictionary. There are a few advanced versions of this function, but this
 * is a great starting point. If you want to further tune your dictionary
 * you could try `ZDICT_optimizeTrainFromBuffer_cover()`. If that is too slow
 * you can try `ZDICT_optimizeTrainFromBuffer_fastCover()`.
 *
 * If the dictionary training function fails, that is likely because you
 * either passed too few samples, or a dictionary would not be effective
 * for your data. Look at the messages that the dictionary trainer printed,
 * if it doesn't say too few samples, then a dictionary would not be effective.
 *
 * How large should my dictionary be?
 * ----------------------------------
 *
 * A reasonable dictionary size, the `dictBufferCapacity`, is about 100KB.
 * The zstd CLI defaults to a 110KB dictionary. You likely don't need a
 * dictionary larger than that. But, most use cases can get away with a
 * smaller dictionary. The advanced dictionary builders can automatically
 * shrink the dictionary for you, and select the smallest size that doesn't
 * hurt compression ratio too much. See the `shrinkDict` parameter.
 * A smaller dictionary can save memory, and potentially speed up
 * compression.
 *
 * How many samples should I provide to the dictionary builder?
 * ------------------------------------------------------------
 *
 * We generally recommend passing ~100x the size of the dictionary
 * in samples. A few thousand should suffice. Having too few samples
 * can hurt the dictionary's effectiveness. Having more samples will
 * only improve the dictionary's effectiveness. But having too many
 * samples can slow down the dictionary builder.
 *
 * How do I determine if a dictionary will be effective?
 * -----------------------------------------------------
 *
 * Simply train a dictionary and try it out. You can use zstd's built in
 * benchmarking tool to test the dictionary effectiveness.
 *
 *   # Benchmark levels 1-3 without a dictionary
 *   zstd -b1e3 -r /path/to/my/files
 *   # Benchmark levels 1-3 with a dictionary
 *   zstd -b1e3 -r /path/to/my/files -D /path/to/my/dictionary
 *
 * When should I retrain a dictionary?
 * -----------------------------------
 *
 * You should retrain a dictionary when its effectiveness drops. Dictionary
 * effectiveness drops as the data you are compressing changes. Generally, we do
 * expect dictionaries to "decay" over time, as your data changes, but the rate
 * at which they decay depends on your use case. Internally, we regularly
 * retrain dictionaries, and if the new dictionary performs significantly
 * better than the old dictionary, we will ship the new dictionary.
 *
 * I have a raw content dictionary, how do I turn it into a zstd dictionary?
 * -------------------------------------------------------------------------
 *
 * If you have a raw content dictionary, e.g. by manually constructing it, or
 * using a third-party dictionary builder, you can turn it into a zstd
 * dictionary by using `ZDICT_finalizeDictionary()`. You'll also have to
 * provide some samples of the data. It will add the zstd header to the
 * raw content, which contains a dictionary ID and entropy tables, which
 * will improve compression ratio, and allow zstd to write the dictionary ID
 * into the frame, if you so choose.
 *
 * Do I have to use zstd's dictionary builder?
 * -------------------------------------------
 *
 * No! You can construct dictionary content however you please, it is just
 * bytes. It will always be valid as a raw content dictionary. If you want
 * a zstd dictionary, which can improve compression ratio, use
 * `ZDICT_finalizeDictionary()`.
 *
 * What is the attack surface of a zstd dictionary?
 * ------------------------------------------------
 *
 * Zstd is heavily fuzz tested, including loading fuzzed dictionaries, so
 * zstd should never crash, or access out-of-bounds memory no matter what
 * the dictionary is. However, if an attacker can control the dictionary
 * during decompression, they can cause zstd to generate arbitrary bytes,
 * just like if they controlled the compressed data.
 *
 ******************************************************************************/


/*! ZDICT_trainFromBuffer():
 *  Train a dictionary from an array of samples.
 *  Redirect towards ZDICT_optimizeTrainFromBuffer_fastCover() single-threaded, with d=8, steps=4,
 *  f=20, and accel=1.
 *  Samples must be stored concatenated in a single flat buffer `samplesBuffer`,
 *  supplied with an array of sizes `samplesSizes`, providing the size of each sample, in order.
 *  The resulting dictionary will be saved into `dictBuffer`.
 * @return: size of dictionary stored into `dictBuffer` (<= `dictBufferCapacity`)
 *          or an error code, which can be tested with ZDICT_isError().
 *  Note:  Dictionary training will fail if there are not enough samples to construct a
 *         dictionary, or if most of the samples are too small (< 8 bytes being the lower limit).
 *         If dictionary training fails, you should use zstd without a dictionary, as the dictionary
 *         would've been ineffective anyways. If you believe your samples would benefit from a dictionary
 *         please open an issue with details, and we can look into it.
 *  Note: ZDICT_trainFromBuffer()'s memory usage is about 6 MB.
 *  Tips: In general, a reasonable dictionary has a size of ~ 100 KB.
 *        It's possible to select smaller or larger size, just by specifying `dictBufferCapacity`.
 *        In general, it's recommended to provide a few thousand samples, though this can vary a lot.
 *        It's recommended that total size of all samples be about 100 times the target size of dictionary.
 */
|#
;;; Code:
(in-package :zstd)
(deferror zstd-ddict-error (zstd-alien-error) ())
(deferror zstd-cdict-error (zstd-alien-error)
    ()
    (:report (lambda (c s)
               (format s "ZSTD CDict signalled error: ~A" (zstd-errorcode* (zstd-error-code c))))))

(define-alien-enum (zstd-dict-content-type int)
                   :auto 0
                   :raw-content 1
                   :full-dict 2)

(define-alien-enum (zstd-dict-load-method int)
                   :by-copy 0
                   :by-ref 1)

(define-alien-enum (zstd-force-ignore-checksum int)
                   :validate-checksum 0
                   :ignore-checksum 1)

(define-alien-enum (zstd-ref-multiple-ddicts int)
                   :ref-single-ddict 0
                   :ref-multiple-ddicts 1)

(define-alien-enum (zstd-dict-attach-pref int)
                   :default-attach 0
                   :force-attach 1
                   :force-copy 2
                   :force-load 3)

(define-alien-enum (zstd-literal-compression-mode int)
                   :auto 0
                   :huffman 1
                   :uncompressed 2)

(define-alien-enum (zstd-param-switch int)
                   :auto 0
                   :enable 1
                   :disable 2)

(define-alien-enum (zstd-frame-type int)
                   :frame 0
                   :skippable-frame 1)

(define-alien-enum (zstd-sequence-format int)
                   :no-block-delimiters 0
                   :explicit-block-delimiters 1)

;;; Simple Dictionary API
(define-alien-routine "ZSTD_compress_usingDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_decompress_usingDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (dict (* t))
  (dict-size size-t))

;;; Bulk-processing Dictionary API
(define-alien-type zstd-cdict (struct zstd-cdict-s))

(define-alien-routine "ZSTD_createCDict" (* zstd-cdict)
  (dict-buffer (* t))
  (dict-size size-t)
  (compression-level int))

(define-alien-routine "ZSTD_freeCDict" size-t (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_compress_usingCDict" size-t
  (cctx (* zstd-cctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (cdict (* zstd-cdict)))

(define-alien-type zstd-ddict (struct zstd-ddict-s))

(define-alien-routine "ZSTD_createDDict" (* zstd-ddict)
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_freeDDict" size-t (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_decompress_usingDDict" size-t
  (dctx (* zstd-dctx))
  (dst (* t))
  (dst-capacity size-t)
  (src (* t))
  (src-size size-t)
  (ddict (* zstd-ddict)))

;; dictionary utils
(define-alien-routine "ZSTD_getDictID_fromDict" unsigned
  (dict (* t))
  (dict-size size-t))

(define-alien-routine "ZSTD_getDictID_fromCDict" unsigned
  (cdict (* zstd-cdict)))

(define-alien-routine "ZSTD_getDictID_fromDDict" unsigned
  (ddict (* zstd-ddict)))

(define-alien-routine "ZSTD_getDictID_fromFrame" unsigned
  (src (* t))
  (src-size size-t))

;; zstd.h spells this ZSTD_estimateDDictSize; the derived Lisp name is unchanged.
(define-alien-routine "ZSTD_estimateDDictSize" size-t (dict-size size-t) (dict-load-method zstd-dict-load-method))

;; BUFFER is evaluated once, and the default LEVEL form is evaluated at run
;; time rather than at macroexpansion time.
(defmacro with-zstd-cdict ((cv &key buffer size (level '(zstd-defaultclevel))) &body body)
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,cv (* zstd-cdict) (zstd-createcdict (cast (octets-to-alien ,buf) (* t))
                                                           (or ,size (length ,buf))
                                                           ,level)))
         (unwind-protect (progn ,@body)
           (zstd-freecdict ,cv))))))

(defmacro with-zstd-ddict ((dv &key buffer size) &body body)
  (let ((buf (gensym "BUFFER")))
    `(let ((,buf ,buffer))
       (with-alien ((,dv (* zstd-ddict)
                         (zstd-createddict (cast (octets-to-alien ,buf) (* t)) (or ,size (length ,buf)))))
         (unwind-protect (progn ,@body)
           (zstd-freeddict ,dv))))))

;;; zdict.h
;; ZDICT-PARAMS must be defined before ZSTD-COVER-PARAMS, which embeds it.
(define-alien-type zdict-params
  (struct zdict-params-t
          (compression-level int)
          (notification-level unsigned)
          (dict-id unsigned)))

(define-alien-type zstd-cover-params
    (struct zdict-cover-params
            (k unsigned)
            (d unsigned)
            (steps unsigned)
            (nb-threads unsigned)
            (split-point double)
            (shrink-dict unsigned)
            (shrink-dict-max-regression unsigned)
            (zparams zdict-params)))

(define-alien-routine ("ZDICT_trainFromBuffer" zdict-train-from-buffer) size-t
  (dict-buffer (* t))
  (dict-buffer-capacity size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned))
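
;; A minimal sketch of the input layout ZDICT_trainFromBuffer expects: all
;; samples concatenated into one flat octet vector, plus the per-sample sizes
;; in the same order. FLATTEN-SAMPLES is a hypothetical helper, not part of
;; zstd or of this library; copying the results into alien memory before
;; calling ZDICT-TRAIN-FROM-BUFFER is left to the caller.
(defun flatten-samples (samples)
  "Concatenate SAMPLES, a list of octet vectors, into one fresh octet vector.
Return the flat vector and a list of the individual sample sizes, in order."
  (let* ((sizes (mapcar #'length samples))
         (flat (make-array (reduce #'+ sizes) :element-type '(unsigned-byte 8)))
         (offset 0))
    (loop for sample in samples
          for size in sizes
          do (replace flat sample :start1 offset)
             (incf offset size))
    (values flat sizes)))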

;; NOTE: The upstream ZDICT_finalizeDictionary() takes its ZDICT_params_t
;; argument by value, so this binding goes through a wrapper that takes a
;; pointer to the parameters instead.

;; This is the ONLY function which uses libzstd-alien.so right now.
(define-alien-routine ("ZDICT_finalizeDictionaryWithParams" zdict-finalize-dictionary) size-t
  (dst-dict-buffer (* t))
  (max-dict-size size-t)
  (dict-content (* t))
  (dict-content-size size-t)
  (samples-buffer (* t))
  (samples-sizes (* size-t))
  (nb-samples unsigned)
  (parameters (* zdict-params)))

(define-alien-routine ("ZDICT_getDictID" zdict-get-dict-id) unsigned
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine ("ZDICT_getDictHeaderSize" zdict-get-dict-header-size) size-t
  (dict-buffer (* t))
  (dict-size size-t))

(define-alien-routine ("ZDICT_isError" zdict-is-error) unsigned
  (error-code size-t))
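
;; A minimal sketch, relying on the documented convention that ZDICT functions
;; return either a size or an error code testable with ZDICT_isError().
;; ZDICT-CHECK is a hypothetical helper, not part of zstd or of this library;
;; it signals a plain ERROR rather than one of the conditions defined above.
(defun zdict-check (result)
  "Return RESULT unchanged unless ZDICT-IS-ERROR flags it, in which case signal an error."
  (if (zerop (zdict-is-error result))
      result
      (error "ZDICT call failed with error code ~D" result)))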