Mercurial > org > notes / parquet-parsing.org

changeset 16: a63dfd1affed
parent: 4839b0675118
author: Richard Westhaver <ellis@rwest.io>
date: Sun, 08 Sep 2024 12:25:12 -0400
permissions: -rw-r--r--
description: updates
* DAT/PARQUET
:PROPERTIES:
:ID: 657a645b-0fad-4f95-a022-cd837ce188d6
:END:
https://github.com/apache/parquet-format
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
https://github.com/apache/parquet-testing
https://github.com/apache/parquet-java
** glossary
:PROPERTIES:
:ID: e71f388c-9ed1-4862-8890-7f74271e8df0
:END:
- block :: same as an HDFS block
- file :: must contain the file metadata; it does not need to contain
  any data
- row-group :: a logical horizontal partitioning of the data into
  rows. no physical structure is guaranteed for a row-group
- column-chunk :: a chunk of the data for a particular column
- page :: column chunks are divided into pages. a page is conceptually
  indivisible in terms of compression/encoding. multiple page types
  can be interleaved in a column chunk.

A file consists of one or more row-groups. A row-group contains exactly
one column chunk per column. A column chunk contains one or more pages.

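The containment rules above can be written down as a small data model.
This is only a sketch for keeping the terms straight: the class names,
the =columns= field and the =check= function are made up for these notes
and are not the struct names from parquet.thrift. The one-or-more
row-group rule is taken from the sentence above as written.

#+begin_src python
from dataclasses import dataclass


@dataclass
class Page:
    data: bytes  # encoded/compressed bytes, treated as opaque here


@dataclass
class ColumnChunk:
    column: str        # name of the column this chunk belongs to
    pages: list[Page]  # one or more pages


@dataclass
class RowGroup:
    chunks: list[ColumnChunk]  # exactly one chunk per column


@dataclass
class File:
    columns: list[str]          # hypothetical stand-in for the schema
    row_groups: list[RowGroup]  # one or more row groups


def check(f: File) -> None:
    """Raise ValueError if f breaks one of the containment rules."""
    if not f.row_groups:
        raise ValueError("file has no row groups")
    for i, rg in enumerate(f.row_groups):
        # Comparing sorted names catches both missing and duplicate chunks.
        if sorted(c.column for c in rg.chunks) != sorted(f.columns):
            raise ValueError(f"row group {i}: need exactly one chunk per column")
        for c in rg.chunks:
            if not c.pages:
                raise ValueError(f"row group {i}, column {c.column}: no pages")
#+end_src

=check= returns nothing for a well-formed =File= and raises for a row
group that is missing a column or for a chunk with an empty page list.
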
** format summary
:PROPERTIES:
:ID: ae54516c-c8a8-49f8-aac6-a95c18f5de8e
:END:
#+begin_example
    4-byte magic number "PAR1"
    <Column 1 Chunk 1>
    <Column 2 Chunk 1>
    ...
    <Column N Chunk 1>
    <Column 1 Chunk 2>
    <Column 2 Chunk 2>
    ...
    <Column N Chunk 2>
    ...
    <Column 1 Chunk M>
    <Column 2 Chunk M>
    ...
    <Column N Chunk M>
    File Metadata
    4-byte length in bytes of file metadata (little endian)
    4-byte magic number "PAR1"
#+end_example
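
Since the metadata length sits just before the trailing magic, a reader
can locate the file metadata by working backwards from the end of the
file. A minimal sketch of that step, assuming the whole file is already
in memory as =bytes=; =split_footer= is a name made up for these notes,
and it returns the metadata still serialized (decoding it against
parquet.thrift is not attempted here).

#+begin_src python
import struct

MAGIC = b"PAR1"


def split_footer(buf: bytes) -> bytes:
    """Return the raw, still-serialized file metadata bytes from buf."""
    # Smallest possible layout: leading magic + 4-byte length + trailing magic.
    if len(buf) < 12:
        raise ValueError("too short to be a Parquet file")
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    # The 4 bytes before the trailing magic hold the metadata length,
    # little endian.
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    meta_start = len(buf) - 8 - meta_len
    # The metadata must not overlap the leading magic.
    if meta_start < 4:
        raise ValueError("metadata length runs past start of file")
    return buf[meta_start:-8]
#+end_src

For large files the same arithmetic works with a seek to 8 bytes before
the end, followed by a second read of =meta_len= bytes, so the column
chunks never need to be loaded just to find the metadata.
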