Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/29 09:50:55 UTC

[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1500: Add support for PartitionBy functionality

andrei-ionescu commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1002509571

@Dandandan Thanks for the message. I didn't have any other way of reading data and writing it partitioned to storage. If you have such an example, please point me to it.

My use case is this: read parquet data (multiple files), re-partition it, and write it out as OSS Delta. In my case I read the data into memory, collect it partitioned, and then write each partition as a separate file.
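
To make the use case concrete, here is a minimal sketch in plain Rust (standard library only, no DataFusion or Delta APIs). It only plans which rows go to which file; it does not write anything. The `Row` type, the `plan_partition_files` function, the `part-00000.parquet` file name, and the `column=value/` directory layout are all assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical row: (value of the partition column, payload).
type Row = (String, i64);

/// Groups rows by the value of the partition column and assigns each
/// group one target file path of the form `column=value/part-00000.parquet`.
/// Nothing is written here; the result is only a write plan.
fn plan_partition_files(column: &str, rows: Vec<Row>) -> BTreeMap<String, Vec<Row>> {
    let mut files: BTreeMap<String, Vec<Row>> = BTreeMap::new();
    for row in rows {
        // Assumed layout: one directory per distinct partition value.
        let path = format!("{}={}/part-00000.parquet", column, row.0);
        files.entry(path).or_default().push(row);
    }
    files
}
```

Each entry of the returned map would then be handed to whatever writer is available, one file per partition.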

Can you give me an example of writing data as parquet using the DataFrame API? Or is there a place where I can implement the Delta writer, or any other writer?

I do not agree that this is useful only at write time, nor do I think that going towards grouping and aggregating is a good approach. My reasons are the following:

- `group by` requires an aggregation function, which is not what is needed here. We don't aggregate anything; we just shuffle the data in a specific way.
- I've seen that the shuffling process is implemented nicely using channels (for the other repartitioning modes), and I think the same approach should be used in this case too.
- Parallel processing is still needed after partitioning the data in this way. Summarising metrics or doing aggregations over data partitioned like this is easily parallelizable, which is a benefit.
- Using aggregation instead of shuffling will incur a larger performance penalty, because it needs to apply the aggregation and keep some state.
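
The difference between shuffling and aggregating can be sketched with standard-library channels. This is not DataFusion's implementation: it uses `std::sync::mpsc` and a single routing loop, and `Row` and `shuffle_by_key` are hypothetical names. The point is only that rows are moved to the partition chosen by their key value, and no aggregate function or aggregate state is involved.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical row: (partition key, payload).
type Row = (String, i64);

/// Sends every row from every input partition through a channel and
/// routes it to the output partition named by its key value.
/// Rows are only moved, never combined.
fn shuffle_by_key(inputs: Vec<Vec<Row>>) -> HashMap<String, Vec<Row>> {
    let (tx, rx) = mpsc::channel::<Row>();

    // One producer thread per input partition.
    let producers: Vec<_> = inputs
        .into_iter()
        .map(|partition| {
            let tx = tx.clone();
            thread::spawn(move || {
                for row in partition {
                    tx.send(row).expect("receiver is still alive");
                }
            })
        })
        .collect();
    // Drop the original sender so the receive loop ends once all producers finish.
    drop(tx);

    // Router: one output partition per distinct key value.
    let mut outputs: HashMap<String, Vec<Row>> = HashMap::new();
    for row in rx {
        outputs.entry(row.0.clone()).or_default().push(row);
    }

    for producer in producers {
        producer.join().expect("producer thread panicked");
    }
    outputs
}
```

The order of rows inside each output partition depends on how the producer threads interleave, so callers should not rely on it. Each output partition can afterwards be processed in parallel, independently of the others.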

This partitioning brings other benefits too. In terms of cons, I cannot think of many, and none is a blocker (though I have limited knowledge of DataFusion at the moment). The ones that come to mind are:
- In some cases the partitioning is more unbalanced, which is OK. Unbalanced partitioning happens with `Partitioning::Hash` too, so I don't know if this is really a downside.
- It is a bit slower than `Partitioning::Hash` due to locking.
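
The skew point can be illustrated with a small sketch that counts partition sizes under both schemes. It uses the standard library's `DefaultHasher`, not the hash function DataFusion uses, and both function names are hypothetical. Because all rows with the same key hash to the same bucket, a dominant key produces an oversized partition in either scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Sizes of `n` hash partitions: a key goes to bucket `hash(key) % n`.
fn hash_partition_sizes(keys: &[&str], n: usize) -> Vec<usize> {
    assert!(n > 0, "need at least one partition");
    let mut sizes = vec![0usize; n];
    for key in keys {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        sizes[(hasher.finish() % n as u64) as usize] += 1;
    }
    sizes
}

/// Sizes of by-value partitions: one partition per distinct key.
fn value_partition_sizes<'a>(keys: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut sizes = HashMap::new();
    for key in keys {
        *sizes.entry(*key).or_insert(0) += 1;
    }
    sizes
}
```

For example, with keys `["us", "us", "us", "us", "de", "fr"]` and four hash partitions, the bucket that receives `"us"` holds at least four of the six rows, while by-value partitioning yields three partitions, the `"us"` one holding exactly four rows.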

If there are any other cons that I'm not aware of, please explain them to me. That will improve my knowledge of DataFusion.

If the name I gave to this functionality is not good (say, because the term `PartitionBy` is used for writing and is therefore misleading), then we can call it something else, like `ShuffleByExpression`.
