
Mercurial > org > notes / annotate parquet-parsing.org

changeset 9: 4839b0675118
parent: 6ac37a61456a
author: Richard Westhaver <ellis@rwest.io>
date: Sun, 11 Aug 2024 14:46:59 -0400
permissions: -rw-r--r--
description: ids
* DAT/PARQUET
:PROPERTIES:
:ID:       657a645b-0fad-4f95-a022-cd837ce188d6
:END:
https://github.com/apache/parquet-format
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
https://github.com/apache/parquet-testing
https://github.com/apache/parquet-java
** glossary
:PROPERTIES:
:ID:       e71f388c-9ed1-4862-8890-7f74271e8df0
:END:
- block :: same as an HDFS block
- file :: file metadata is required, data is not
- row-group :: a logical horizontal partitioning of the data into
  rows. no physical representation is guaranteed for a row-group
- column-chunk :: a chunk of the data for a particular column
- page :: column chunks are divided into pages. a page is conceptually
  indivisible in terms of compression/encoding. multiple page types
  can be interleaved in a column chunk.

A file consists of 1+ row-groups. A row-group contains exactly one
column chunk per column. Column chunks contain one or more pages.

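The file / row-group / column-chunk / page hierarchy can be sketched as
plain data types. This is only an illustrative model, not the real
Thrift schema: the class names, the ~page_type~ labels and the
~columns~ field are all assumptions made for the example. The
constructor checks enforce the three rules from the paragraph above.

#+begin_src python
from dataclasses import dataclass


@dataclass
class Page:
    page_type: str          # hypothetical label, e.g. "data" or "dictionary"
    payload: bytes = b""    # opaque bytes; never split for compression/encoding


@dataclass
class ColumnChunk:
    column: str
    pages: list[Page]

    def __post_init__(self):
        # Column chunks contain one or more pages.
        if not self.pages:
            raise ValueError("a column chunk needs at least one page")


@dataclass
class RowGroup:
    chunks: list[ColumnChunk]


@dataclass
class ParquetFile:
    columns: list[str]          # hypothetical: column names in schema order
    row_groups: list[RowGroup]

    def __post_init__(self):
        # A file consists of 1+ row-groups.
        if not self.row_groups:
            raise ValueError("a file needs at least one row-group")
        # A row-group contains exactly one column chunk per column.
        for rg in self.row_groups:
            if [c.column for c in rg.chunks] != self.columns:
                raise ValueError("expected exactly one column chunk per column")


example = ParquetFile(
    columns=["id", "name"],
    row_groups=[
        RowGroup([
            ColumnChunk("id", [Page("data")]),
            ColumnChunk("name", [Page("dictionary"), Page("data")]),
        ])
    ],
)
#+end_src

A row-group with a missing or duplicated column chunk raises
~ValueError~ at construction time, so the invariants are visible
without reading any real file.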
** format summary
:PROPERTIES:
:ID:       ae54516c-c8a8-49f8-aac6-a95c18f5de8e
:END:
#+begin_example
  4-byte magic number "PAR1"
  <Column 1 Chunk 1>
  <Column 2 Chunk 1>
  ...
  <Column N Chunk 1>
  <Column 1 Chunk 2>
  <Column 2 Chunk 2>
  ...
  <Column N Chunk 2>
  ...
  <Column 1 Chunk M>
  <Column 2 Chunk M>
  ...
  <Column N Chunk M>
  File Metadata
  4-byte length in bytes of file metadata (little endian)
  4-byte magic number "PAR1"
#+end_example
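Given that layout, a reader can find the file metadata by working
backwards from the end of the file: check the trailing magic, read the
4-byte little-endian length, then step back that many bytes. A minimal
sketch, assuming the whole file is already in memory. The function
names and the synthetic test buffer are made up for illustration, and
the metadata bytes here are placeholders; decoding the real
Thrift-encoded metadata is out of scope.

#+begin_src python
import struct

MAGIC = b"PAR1"


def locate_file_metadata(buf: bytes) -> tuple[int, int]:
    """Return (offset, length) of the file metadata bytes inside buf."""
    # Smallest possible layout: leading magic + 4-byte length + trailing magic.
    if len(buf) < 12:
        raise ValueError("buffer too short to hold the magic numbers and length")
    if buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the trailing magic hold the metadata length (little endian).
    (length,) = struct.unpack("<I", buf[-8:-4])
    offset = len(buf) - 8 - length
    if offset < 4:
        raise ValueError("metadata length exceeds the available bytes")
    return offset, length


def build_fake_file(chunks: bytes, metadata: bytes) -> bytes:
    """Assemble a synthetic buffer with the same outer layout (test helper)."""
    return MAGIC + chunks + metadata + struct.pack("<I", len(metadata)) + MAGIC


fake = build_fake_file(chunks=b"\x00" * 10, metadata=b"not-real-thrift")
offset, length = locate_file_metadata(fake)
metadata_bytes = fake[offset:offset + length]
#+end_src

For an on-disk file the same steps would normally be done with a seek
to the last 8 bytes rather than reading everything, since the column
chunks can be large.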