You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/21 10:41:15 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #1718: Support encoding a single parquet file using multiple threads

alamb opened a new issue, #1718:
URL: https://github.com/apache/arrow-rs/issues/1718

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   The encoding / compression is  most often the bottleneck for increasing the throughput of writing parquet files. Even though the actual writing of bytes must be done serially, the encoding could be done in parallel (into memory buffers) before the actual write
   
   **Describe the solution you'd like**
   I would like a way (either an explicit API or an example) that allows using multiple cores to write `ArrowRecord` batches to a file.  
   
   Note that trying to parallelize writes today results in corrupted parquet files, see https://github.com/apache/arrow-rs/issues/1717
   
   **Describe alternatives you've considered**
   There is a high level description of parallel decoding in @jorgecarleitao 's parquet2 https://github.com/jorgecarleitao/parquet2#higher-parallelism (focused on reading)
   
   **Additional context**
   Mailing list https://lists.apache.org/thread/rbhfwcpd6qfk52rtzm2t6mo3fhvdpc91
   
   
   Also, https://github.com/apache/arrow-rs/issues/1711 is possibly related


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1712481734

   > I had presumed it was a non-starter given https://github.com/apache/arrow-rs/issues/3871
   
   Indeed -- we would have to limit the row group size (in rows) to control the buffer needed, thus resulting in a tradeoff between compression and buffering. 
   
   I think we should parallelize writing sorted data at a higher level (aka write multiple files) at first, and treat parallelizing the write of a single file as a follow on project


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1732344650

   I definitely would prefer to keep arrow-rs runtime agnostic as much as possible, i.e. not making concurrency decisions in arrow-rs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1734611359

   > Option 1 is likely the most tractable, ArrowWriter already encodes columns to separate memory regions and then stitches the encoded column chunks together. I could conceive doing something similar for a parallel writer.
   
   @tustvold your intial intuition was spot on! I reworked the datafusion parallel parquet writer to primarily use column wise parallelization. It is around 20% faster and 90% lower memory overhead vs. the previous attempt.
   
   PRs open with more details for this new approach:
   - https://github.com/apache/arrow-rs/pull/4859
   - https://github.com/apache/arrow-datafusion/pull/7655


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1709219407

   The other thing that might be a challenge if we go the concatenation route is that DataFusion writers preserve input ordering (in the sense that if you run "COPY (select * from my_table order by my_col) to my_file.parquet" then my_file.parquet should be sorted  according to the input query). 
   
   If we construct multiple parquet files/concat, they would need to be deserialized and sorted again which would defeat any possibility of a speed up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1707250598

   Maybe we could add a new API for encoding that took ArrayRefs rather than RecordBatches
   
   The existing API is like: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html#method.write
   
   Maybe a new one could look something like:
   
   
   ```rust
   let writer: ArrowWriter = ...;
   
   // Create some new structure  for writing each row group
   let row_group_writer = writer.new_row_group();
   // could also call row_group_writer.write_column() for multiple columns concurrently
   for col in record_batch.col() {
     row_group_writer.write(col)?
   }
   row_group_writer.finish()?
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1734643791

   > A naive solution might be to just spawn tokio tasks for each column of each batch, but this will have very poor thread locality, high per-batch overheads, and in general feels a little off.
   
   Regarding this point, https://github.com/apache/arrow-datafusion/pull/7655 only spawns 1 task for each column for each row group (not each record batch). Each record batch is sent via a channel to the parallel tasks. Once max_row_group_size is reached, the parallel tasks are joined and new ones spawned in their place.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support encoding a single parquet file using multiple threads [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold closed issue #1718: Support encoding a single parquet file using multiple threads
URL: https://github.com/apache/arrow-rs/issues/1718


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support encoding a single parquet file using multiple threads [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1982979221

   I think that would be a question for the maintainers of whichever parquet writer those systems use


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1707251389

   Though encoding different RowGroups and using `append_column` is an interesting idea 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1707141716

   I am interested in working on this. Does anyone know if there are existing parallelized parquet write implementations in other languages we could reference? I am particularly interested in what the best approach is between:
   
   1. Serialize multiple columns in a single row group in parallel
   2. Serialize multiple row groups in parallel
   3. A combination of 1 and 2
   
   Number 2 could be a challenge if we don't know up front how many total row groups we want in the file. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1709460116

   > preserve input ordering
   
   The DF partitioning logic is smart enough to not destroy sort orders, the flipside is writing an ordered parquet file will not be parallelized, but given such a query has a sort which will likely dominate execution time, perhaps that doesn't matter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1712463922

   > The buffering might not be the best idea but I think it would be possible
   
   I had presumed it was a non-starter given #3871


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1707250287

   One option might be for systems like DataFusion with a notion of partitioning to simply write each partition to separate memory regions, and then later stitch these together using the [append_column](https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column) APIs :thinking: 
   
   This would allow DataFusion to remain in control of the threading, and should be possible with the existing APIs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1709213923

   > One option might be for systems like DataFusion with a notion of partitioning to simply write each partition to separate memory regions, and then later stitch these together using the [append_column](https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column) APIs 🤔
   
   I like the idea of DataFusion and other arrow-rs users having control over how this is parallelized in terms of threads vs. tokio. My original thought was creating a Send+Sync parquet::ArrowWriter , but I think serializing to independent memory regions and concatenating sounds slick. 
   
   I found https://github.com/apache/arrow-rs/blob/master/parquet/src/bin/parquet-concat.rs. The only thing that it doesn't do we would want in DataFusion is support for bloom filters. I could see how merging bloom filters could be a bit complicated though.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1707240666

   Option 1 is likely the most tractable, ArrowWriter already encodes columns to separate memory regions and then stitches the encoded column chunks together. I could conceive doing something similar for a parallel writer.
   
   I think the biggest question in my mind is the mechanics of parallelism. A naive solution might be to just spawn tokio tasks for each column of each batch, but this will have very poor thread locality, high per-batch overheads, and in general feels a little off... I don't have a good solution here, typically we have avoided adding notions of threading into this crate... 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1720375088

   > One option might be for systems like DataFusion with a notion of partitioning to simply write each partition to separate memory regions, and then later stitch these together using the [append_column](https://docs.rs/parquet/latest/parquet/file/writer/struct.SerializedRowGroupWriter.html#method.append_column) APIs 🤔
   
   @alamb @tustvold I took a stab at this approach in https://github.com/apache/arrow-datafusion/pull/7562. Any feedback is appreciated (especially ideas to reduce the memory requirements).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1732308092

   Also, I think we should be able to get further improvement for parquet files with many columns (i.e. most of them) by providing a way to parallelize the loop in this method: https://github.com/apache/arrow-rs/blob/8465ed4729cf4de8a5aa31d811170c0968c1bc59/parquet/src/arrow/arrow_writer/mod.rs#L389-L397
   
   From looking through the code, it appears that serializing each column is already independent so it should not be too difficult to parallelize.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support encoding a single parquet file using multiple threads [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1986349891

   For the curious, we plan to make a higher level API for encoding parquet files with multiple-threads available in DataFusion -- see https://github.com/apache/arrow-datafusion/issues/9493


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1721062895

   > @alamb @tustvold I took a stab at this approach in https://github.com/apache/arrow-datafusion/pull/7562. Any feedback is appreciated (especially ideas to reduce the memory requirements).
   
   This sounds amazing @devinjdangelo  -- I will take a look today


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "devinjdangelo (via GitHub)" <gi...@apache.org>.

devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1732296454

   @alamb @tustvold  I took another pass at improving on the first implementation. This time I needed to make a few arrow-rs level changes. 
   
   https://github.com/apache/arrow-datafusion/pull/7632
   https://github.com/apache/arrow-rs/pull/4850


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support encoding a single parquet file using multiple threads [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.

tustvold commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1776099418

   Closed by #4871


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #1718: Support encoding a single parquet file using multiple threads

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1710472723

   > The DF partitioning logic is smart enough to not destroy sort orders, the flipside is writing an ordered parquet file will not be parallelized, but given such a query has a sort which will likely dominate execution time, perhaps that doesn't matter
   
   I don't understand why we couldn't write row groups in parallel (we would have to buffer enough data to encode in each row group prior to starting the next one). The buffering might not be the best idea but I think it would be possible


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Support encoding a single parquet file using multiple threads [arrow-rs]

Posted by "hbpeng0115 (via GitHub)" <gi...@apache.org>.

hbpeng0115 commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1982757896

   Is it possible to implement in Java language? Due to we're using Java/Scala Flink to write parquet files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org