Mercurial > org > notes / parquet-parsing.org

changeset 8: 6ac37a61456a
child: 4839b0675118
author: Richard Westhaver <ellis@rwest.io>
date: Sat, 27 Jul 2024 02:45:34 -0400
description: bump
* DAT/PARQUET
https://github.com/apache/parquet-format
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
https://github.com/apache/parquet-testing
https://github.com/apache/parquet-java
** glossary
- block :: an HDFS block; the term means the same thing here as in HDFS
- file :: an HDFS file that must include the file metadata; it does
  not need to contain the data itself
- row-group :: a logical horizontal partitioning of the data into
  rows. no physical structure is guaranteed for a row-group
- column-chunk :: a chunk of the data for a particular column
- page :: column chunks are divided into pages. a page is conceptually
  an indivisible unit in terms of compression and encoding. multiple
  page types can be interleaved in a column chunk.

A file consists of one or more row-groups. A row-group contains exactly
one column chunk per column. Column chunks contain one or more pages.

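The hierarchy above can be sketched as nested records. This is only an
illustrative model, not the layout used by any Parquet library: the
class names, the =kind= labels and =check_layout= are hypothetical, and
the check assumes column names are unique within a file.

#+begin_src python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    kind: str              # illustrative label, e.g. "data" or "dictionary"
    payload: bytes = b""   # already-encoded bytes; opaque to this sketch


@dataclass
class ColumnChunk:
    column: str                                     # column this chunk belongs to
    pages: List[Page] = field(default_factory=list)


@dataclass
class RowGroup:
    chunks: List[ColumnChunk] = field(default_factory=list)


@dataclass
class ParquetFile:
    columns: List[str]                              # assumed unique
    row_groups: List[RowGroup] = field(default_factory=list)


def check_layout(f: ParquetFile) -> None:
    """Raise ValueError if the structural rules from the notes are violated."""
    if not f.row_groups:
        raise ValueError("file has no row-groups")
    for i, rg in enumerate(f.row_groups):
        names = [c.column for c in rg.chunks]
        # exactly one column chunk per column in every row-group
        if sorted(names) != sorted(f.columns):
            raise ValueError(f"row-group {i}: expected one chunk per column")
        for c in rg.chunks:
            # every column chunk holds at least one page
            if not c.pages:
                raise ValueError(f"row-group {i}, column {c.column}: no pages")
#+end_src

For example, a file with columns =a= and =b= passes the check only if
every row-group carries one chunk for =a= and one for =b=, each with at
least one page.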
** format summary
#+begin_example
    4-byte magic number "PAR1"
    <Column 1 Chunk 1>
    <Column 2 Chunk 1>
    ...
    <Column N Chunk 1>
    <Column 1 Chunk 2>
    <Column 2 Chunk 2>
    ...
    <Column N Chunk 2>
    ...
    <Column 1 Chunk M>
    <Column 2 Chunk M>
    ...
    <Column N Chunk M>
    File Metadata
    4-byte length in bytes of file metadata (little endian)
    4-byte magic number "PAR1"
#+end_example
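
Because the metadata length and the trailing magic sit at the very end,
a reader can find the file metadata by working backwards from the last
8 bytes. A minimal sketch of that step, assuming the whole file is
already in memory as =bytes=; =locate_footer= is a hypothetical helper,
and decoding the metadata itself (the Thrift structures in
parquet.thrift) is not attempted here.

#+begin_src python
import struct

MAGIC = b"PAR1"


def locate_footer(buf: bytes):
    """Return (offset, length) of the file metadata inside buf."""
    # Smallest possible layout: leading magic + 4-byte length + trailing magic.
    if len(buf) < 12:
        raise ValueError("too short to be a Parquet file")
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the trailing magic hold the metadata length,
    # stored as a little-endian unsigned integer.
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    meta_start = len(buf) - 8 - meta_len
    # The metadata may not overlap the leading magic number.
    if meta_start < 4:
        raise ValueError("metadata length exceeds file size")
    return meta_start, meta_len
#+end_src

The returned pair can be used to slice the metadata out of the buffer
as =buf[offset:offset + length]=. For files on disk, the same arithmetic
applies after seeking to the last 8 bytes instead of loading everything.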