Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2019/09/05 14:39:24 UTC

[GitHub] [arrow-site] hatemhelal commented on a change in pull request #19: ARROW-6419: [Website] Blog post about Parquet C++ read performance improvements in Arrow 0.15

URL: https://github.com/apache/arrow-site/pull/19#discussion_r321293620
 
 

 ##########
 File path: _posts/2019-09-03-faster-strings-cpp-parquet.md
 ##########
 @@ -0,0 +1,238 @@
+---
+layout: post
+title: "Faster C++ Apache Parquet performance on dictionary-encoded string data coming in Apache Arrow 0.15"
+date: "2019-09-05 00:00:00 -0600"
+author: Wes McKinney
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+We have been implementing a series of optimizations in the Apache Parquet C++
+internals to improve read and write efficiency (both performance and memory
+use) for Arrow columnar binary and string data, with new "native" support for
+Arrow's dictionary types. This should have a big impact on users of the C++,
+MATLAB, Python, R, and Ruby interfaces to Parquet files.
+
+This post reviews work that was done and shows benchmarks comparing Arrow
+0.12.1 with the current development version (to be released soon as Arrow
+0.15.0).
+
+# Summary of work
+
+One of the largest and most complex optimizations involves encoding and
+decoding Parquet files' internal dictionary-encoded data streams to and from
+Arrow's in-memory dictionary-encoded `DictionaryArray`
+representation. Dictionary encoding is a compression strategy in Parquet, and
+there is no formal "dictionary" or "categorical" type. I will go into more
+detail about this below.
+
+Some of the particular JIRA issues related to this work include:
+
+- Vectorize comparators for computing statistics ([PARQUET-1523][1])
+- Read binary data directly into DictionaryBuilder<T>
+  ([ARROW-3769][2])
+- Writing Parquet's dictionary indices directly into DictionaryBuilder<T>
+  ([ARROW-3772][3])
+- Write dense (non-dictionary) Arrow arrays directly into Parquet data encoders
+  ([ARROW-6152][4])
+- Direct writing of arrow::DictionaryArray to Parquet column writers ([ARROW-3246][5])
+- Supporting changing dictionaries ([ARROW-3144][6])
+- Internal IO optimizations and improved raw `BYTE_ARRAY` encoding performance
+  ([ARROW-4398][7])
+
+One of the challenges of developing the Parquet C++ library is that we
+maintain low-level read and write APIs that do not involve the Arrow columnar
+data structures. So we have had to take care to do Arrow-related optimizations
 
 Review comment:
   ```suggestion
   data structures. So we have had to take care to implement Arrow-related optimizations
   ```
