Posted to commits@gobblin.apache.org by ab...@apache.org on 2018/03/21 08:30:40 UTC

[22/50] incubator-gobblin git commit: [GOBBLIN-351] Add ParquetHdfsDataWriter docs

[GOBBLIN-351] Add ParquetHdfsDataWriter docs

[GOBBLIN-351] Add more info about builder and
dictionary encoding

Closes #2220 from tilakpatidar/parquet_hdfs_docs


Project: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/commit/3598d10e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/tree/3598d10e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-gobblin/diff/3598d10e

Branch: refs/heads/0.12.0
Commit: 3598d10eb0ea0d01244a93ff1506a563afeca9ed
Parents: 3094fe5
Author: tilakpatidar <ti...@gmail.com>
Authored: Mon Feb 5 12:03:31 2018 -0800
Committer: Abhishek Tiwari <ab...@gmail.com>
Committed: Mon Feb 5 12:03:31 2018 -0800

----------------------------------------------------------------------
 gobblin-docs/sinks/ParquetHdfsDataWriter.md | 25 ++++++++++++++++++++++++
 mkdocs.yml                                  |  1 +
 2 files changed, 26 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-gobblin/blob/3598d10e/gobblin-docs/sinks/ParquetHdfsDataWriter.md
----------------------------------------------------------------------
diff --git a/gobblin-docs/sinks/ParquetHdfsDataWriter.md b/gobblin-docs/sinks/ParquetHdfsDataWriter.md
new file mode 100644
index 0000000..f3ad0da
--- /dev/null
+++ b/gobblin-docs/sinks/ParquetHdfsDataWriter.md
@@ -0,0 +1,25 @@
+# Description
+
+An extension to [`FsDataWriter`](https://github.com/apache/incubator-gobblin/blob/master/gobblin-core/src/main/java/org/apache/gobblin/writer/FsDataWriter.java) that writes records of type [`Group`](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/example/data/Group.java) in Parquet format. This implementation lets users specify the CodecFactory to use through the configuration property [`writer.codec.type`](https://gobblin.readthedocs.io/en/latest/user-guide/Configuration-Properties-Glossary/#writercodectype). By default, the deflate codec is used.
+
+# Usage
+```
+writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
+writer.destination.type=HDFS
+writer.output.format=PARQUET
+```
+For more information, see
+[`ParquetHdfsDataWriter`](https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-parquet/src/main/java/org/apache/gobblin/writer/ParquetHdfsDataWriter.java)
+and
+[`ParquetDataWriterBuilder`](https://github.com/apache/incubator-gobblin/blob/master/gobblin-modules/gobblin-parquet/src/main/java/org/apache/gobblin/writer/ParquetDataWriterBuilder.java).
+
+
+# Configuration
+
+| Key                    | Description | Default Value | Required |
+|------------------------|-------------|---------------|----------|
+| writer.parquet.page.size | The page size threshold. | 1048576 | No |
+| writer.parquet.dictionary.page.size | The block size threshold for the dictionary pages. | 134217728 | No |
+| writer.parquet.dictionary | Turns dictionary encoding on. Parquet provides dictionary encoding for data with a small number of unique values (< 10^5), which significantly improves compression and processing speed. | true | No |
+| writer.parquet.validate | Turns on validation against the schema. This validation is done by [`ParquetWriter`](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java), not by Gobblin. | false | No |
+| writer.parquet.version | Version of the Parquet writer to use. Available versions are v1 and v2. | v1 | No |
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-gobblin/blob/3598d10e/mkdocs.yml
----------------------------------------------------------------------
diff --git a/mkdocs.yml b/mkdocs.yml
index f3486b4..7152bd0 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -64,6 +64,7 @@ pages:
         - Wikipedia: sources/WikipediaSource.md
     - Record Sinks:
         - Avro HDFS: sinks/AvroHdfsDataWriter.md
+        - Parquet HDFS: sinks/ParquetHdfsDataWriter.md
         - HDFS Byte array: sinks/SimpleBytesWriter.md
         - Console: sinks/ConsoleWriter.md
         - Couchbase: sinks/Couchbase-Writer.md