Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/06 12:31:48 UTC

[GitHub] [arrow] marklit opened a new issue, #15220: Speed up Parquet Writing?

marklit opened a new issue, #15220:
URL: https://github.com/apache/arrow/issues/15220

   ### Describe the enhancement requested
   
   The following was run on Ubuntu 20 on a `e2-highcpu-32` GCP VM with 32 GB of RAM and 32 vCPUs.
   
   I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints, converted it to JSONL, and then converted the JSONL into Parquet with PyArrow; I also attempted the same with fastparquet.
   
   ```bash
   $ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
       | jq -c '.properties * {geom: .geometry|tostring}' \
       > California.jsonl
   $ head -n1 California.jsonl | jq .
   ```
   
   ```json
   {
     "release": 1,
     "capture_dates_range": "",
     "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
   }
   ```
   
   PyArrow is able to produce a 794 MB Parquet file in 49.86 seconds.
   
   ```bash
   /usr/bin/time -v \
       python3 -c "import pandas as pd; pd.read_json('California.jsonl', lines=True).to_parquet('pandas.pyarrow.snappy.pq', row_group_size=37738, engine='pyarrow')"
   ```
   
   With ClickHouse I'm able to complete the same task in 18.35 seconds.
   
   ```bash
   $ /usr/bin/time -v \
       clickhouse local \
             --input-format JSONEachRow \
             -q "SELECT *
                 FROM table
                 FORMAT Parquet" \
       < California.jsonl \
       > ch.snappy.pq
   ```
   
   The resulting PyArrow Parquet file matches the ClickHouse one in row group count, and both use Snappy compression.
   
   ```
   <pyarrow._parquet.FileMetaData object at 0x7f8edab544f0>
     created_by: parquet-cpp-arrow version 10.0.1
     num_columns: 3
     num_rows: 11542912
     num_row_groups: 306
     format_version: 2.6
     serialized_size: 228114
   ```
   
   ```
   <pyarrow._parquet.FileMetaData object at 0x7f0926d54860>
     created_by: parquet-cpp version 1.5.1-SNAPSHOT
     num_columns: 3
     num_rows: 11542912
     num_row_groups: 306
     format_version: 1.0
     serialized_size: 228389
   ```
   
   The ClickHouse-produced Parquet file is 19,979 bytes larger than the PyArrow-produced file.
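   
   (For reference, metadata dumps like the ones above can be produced with something along these lines; a minimal sketch, assuming the output file names used earlier:)
   
   ```python
   import pyarrow.parquet as pq
   
   # Print the FileMetaData footer summary for both files.
   for path in ('pandas.pyarrow.snappy.pq', 'ch.snappy.pq'):
       print(pq.read_metadata(path))
   ```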
   
   These are the versions of software involved:
   
   * pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
   * pyarrow-10.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
   * ClickHouse 22.13.1.1361 (official build)
   
   Below is a flame graph from PyArrow's execution.
   
   ![parquet pyarrow snappy](https://user-images.githubusercontent.com/359316/211012988-6dc96d97-fea8-445a-b6f2-9ba648301ceb.svg)
   
   ### Component(s)
   
   Parquet




[GitHub] [arrow] marklit commented on issue #15220: Speed up Parquet Writing?

marklit commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1374390909

   I did run it a few times, with little variation in the durations.
   
   Could you give me a Python code example that you'd like to see run, please? I can execute it multiple times and do the same with ClickHouse and return the durations list to you.
   
   If you want to separate the reading from the writing, I can run it in IPython and use the timing magic to get more granular timings.




[GitHub] [arrow] westonpace commented on issue #15220: Speed up Parquet Writing?

westonpace commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1379250418

   I would encourage you to experiment with row group sizes.  For example, from a quick experiment (writing ~50M rows) I see about a 1.5x hit using these prohibitively small row groups:
   
   ```
   >>> timeit.timeit(lambda: pq.write_table(tab, "/tmp/foo.parquet"), number=5)
   95.93511083399972
   >>> timeit.timeit(lambda: pq.write_table(tab, "/tmp/foo.parquet", row_group_size=37738), number=5)
   149.49872442699962
   ```
   
   Beyond writing, a row group size that is too small is likely to have an even larger negative effect on read performance.
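   
   If it helps to reproduce that comparison, a minimal sketch of such a sweep (assuming `tab` is a comparable ~50M-row `pyarrow.Table`, as in the timings above):
   
   ```python
   import timeit
   
   import pyarrow.parquet as pq
   
   # `tab` is assumed to be an existing pyarrow.Table; time writes at a
   # few candidate row group sizes and compare.
   for rg in (37_738, 1_000_000, 64 * 1024 * 1024):
       secs = timeit.timeit(
           lambda: pq.write_table(tab, "/tmp/foo.parquet", row_group_size=rg),
           number=5)
       print(rg, secs)
   ```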




[GitHub] [arrow] marklit commented on issue #15220: Speed up Parquet Writing?

marklit commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1378222568

   I talked about writing PQ out into separate parallel files in my [blog post](https://tech.marksblogg.com/postgresql-to-bigquery.html) yesterday; it does speed things up a lot if your system's CPU core count is high enough. I'd need to dig into ClickHouse's command-line switches to see what exactly can be tuned in its PQ writer, and then see whether those settings have a material impact on its performance as well as on this project's.




[GitHub] [arrow] westonpace commented on issue #15220: Speed up Parquet Writing?

westonpace commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1373981370

   Can you try a similar benchmark without involving pandas?  That test will, I believe, go JSONL -> Arrow -> pandas DataFrame -> Arrow -> Parquet.  The conversion to/from pandas probably does not explain the entire time, but it may help to remove it.
   
   Also, is it possible to separate the timing of the reading and the writing?  In other words, is the bottleneck on the read half or the write half?
   
   Finally, just to confirm, did you run each benchmark multiple times in a row?  Any benchmark involving file I/O is extremely sensitive to filesystem caching effects.




[GitHub] [arrow] marklit commented on issue #15220: Speed up Parquet Writing?

marklit commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1435922148

   I don't have any way of adjusting ClickHouse's compression settings. 
   
   If I produce a 1-row PQ file with PyArrow and again with ClickHouse, I can see the headers are different (PyArrow's are much longer), and the PQ files produced above differed in size by a few MB. I'm not sure it is possible to produce byte-identical PQ files with both tools.
   
   With that said, I'm not convinced it comes down to Snappy being well-optimised in ClickHouse and unoptimised in PyArrow; Snappy accounts for 4.8% of the time in the flame graph.
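   
   To pin down what each writer actually emitted, the writer string and per-column codec can be read from the footers; a minimal sketch, assuming the file names from the original benchmark (not something I've re-run here):
   
   ```python
   import pyarrow.parquet as pq
   
   # Compare the writer identification string and the codec recorded for
   # the first column chunk of the first row group in each file.
   for path in ('pandas.pyarrow.snappy.pq', 'ch.snappy.pq'):
       md = pq.read_metadata(path)
       print(path, md.created_by, md.row_group(0).column(0).compression)
   ```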




[GitHub] [arrow] westonpace commented on issue #15220: Speed up Parquet Writing?

westonpace commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1376594167

   Reading json and writing parquet, purely in pyarrow, should look something like:
   
   ```python
   import pyarrow.parquet
   import pyarrow.json
   table = pyarrow.json.read_json('/tmp/foo.jsonl')
   pyarrow.parquet.write_table(table, '/tmp/foo.parquet')
   ```
   
   Separating the disk write from the encoding would be trickier (there is an in-memory filesystem in C++, but I don't think we expose it in Python).
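   
   One rough approximation from Python, assuming an in-memory sink is an acceptable stand-in for skipping the disk, might be `pyarrow.BufferOutputStream` (a sketch, not verified against this benchmark):
   
   ```python
   import pyarrow as pa
   import pyarrow.json
   import pyarrow.parquet as pq
   
   table = pyarrow.json.read_json('/tmp/foo.jsonl')
   
   # Encode to an in-memory buffer so the timing excludes filesystem
   # writes (buffer allocation and copies still count).
   sink = pa.BufferOutputStream()
   pq.write_table(table, sink)
   encoded = sink.getvalue()  # pyarrow.Buffer holding the Parquet bytes
   print(encoded.size)
   ```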




[GitHub] [arrow] jiggunjer commented on issue #15220: Speed up Parquet Writing?

jiggunjer commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1435533468

   To rule out differences in compression levels, you could test with uncompressed writes? Perhaps another thing to control for is the schema: is it inferred in your tests? I'm guessing both writers should already be multithreaded by default?
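   
   A sketch of both controls on the PyArrow side, assuming the field names from the sample record earlier in the thread:
   
   ```python
   import pyarrow as pa
   import pyarrow.json
   import pyarrow.parquet as pq
   
   # Supply an explicit schema so the JSON reader does not infer types;
   # field names and types are guessed from the sample record shown above.
   schema = pa.schema([
       ('release', pa.int64()),
       ('capture_dates_range', pa.string()),
       ('geom', pa.string()),
   ])
   table = pyarrow.json.read_json(
       'California.jsonl',
       parse_options=pyarrow.json.ParseOptions(explicit_schema=schema))
   
   # Write without compression to take codec differences out of the picture.
   pq.write_table(table, 'pyarrow.uncompressed.pq', compression='NONE')
   ```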




[GitHub] [arrow] marklit commented on issue #15220: Speed up Parquet Writing?

marklit commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1376871039

   I ran the following on my MBP this morning. PyArrow managed to get within 1.38x of ClickHouse, which is a pretty good speed-up.
   
   ```bash
   $ time \
       ~/Downloads/ch/clickhouse local \
             --input-format JSONEachRow \
             -q "SELECT *
                 FROM table
                 FORMAT Parquet" \
       < California.jsonl \
       > ch.snappy.pq
   ```
   
   The above took 36.316 seconds.
   
   
   ```python
   import pyarrow.parquet
   import pyarrow.json
   
   In [3]: %time table = pyarrow.json.read_json('California.jsonl')
   CPU times: user 26.6 s, sys: 16.5 s, total: 43.1 s
   Wall time: 21.6 s
   
   In [4]: %time pyarrow.parquet.write_table(table, 'pyarrow.snappy.pq', row_group_size=37738)
   CPU times: user 8.98 s, sys: 8.37 s, total: 17.3 s
   Wall time: 28.6 s
   ```
   
   Any suggestions on how I can improve on that number would be greatly appreciated.




[GitHub] [arrow] westonpace commented on issue #15220: Speed up Parquet Writing?

westonpace commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1377929093

   That row group size is pretty small.  Note that we default to 64Mi rows per group.  I've found 1Mi rows per group to be reasonable.  Much less than that and performance starts to take a hit, because the data:metadata ratio starts to get skewed.  Did you try with larger row groups?  Though I can see that small size is what you get from ClickHouse.
   
   Otherwise I don't have any suggestions off the top of my head.  If ClickHouse supports parallel column encoding, that might explain it.  One way to test that would be to split the table into N pieces and write them to N different files in parallel.
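   
   A minimal sketch of that test (hypothetical output names; assumes `table` is the pyarrow.Table read from the JSONL above):
   
   ```python
   import concurrent.futures
   
   import pyarrow.parquet as pq
   
   # Split the table into up to N contiguous slices and write each slice
   # to its own file from a separate thread; the C++ writer should release
   # the GIL for most of the work, so the writes can overlap.
   N = 8
   rows_per_piece = -(-table.num_rows // N)  # ceiling division
   pieces = [table.slice(start, rows_per_piece)
             for start in range(0, table.num_rows, rows_per_piece)]
   
   def write_piece(i, piece):
       pq.write_table(piece, f'piece-{i}.parquet')
   
   with concurrent.futures.ThreadPoolExecutor(max_workers=N) as pool:
       futures = [pool.submit(write_piece, i, piece)
                  for i, piece in enumerate(pieces)]
       for f in futures:
           f.result()  # propagate any write errors
   ```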
   
   I have not done any profiling of the pyarrow JSON reader.  It's entirely possible all the missing seconds are in the JSON reader.

