You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/28 15:09:53 UTC

[GitHub] [arrow-datafusion] andrei-ionescu opened a new pull request #1500: Add support for PartitionBy functionality

andrei-ionescu opened a new pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500


   # Which issue does this PR close?
   
   Closes #1404.
   
    # Rationale for this change
   
   DataFusion lacks support for partition by operation. 
   
   The most used example is" given a dataset, we need to write it down on the storage as a partitioned set of files (ie: Parquet dataset partitioned by year/month/day, etc). We need to write it as paths like this:
   
   ```
   /dataset/day=2021-12-28/fuel=Gas/
   /dataset/day=2021-12-28/fuel=Diesel/
   /dataset/day=2021-12-27/fuel=Electric/
   ...
   ```
   
   # What changes are included in this PR?
   
   This PR adds a new partitioning method: `Partitioning::PartitionBy`. 
   
   It has two parameters:
   - the expression by which the partitioning will take place
   - an optional value to specify the number of partition to output:
     - the best would be to know the exact number of distinct values by which the data will be partitioned, fact that will be beneficial in terms of performance
     - if it Is bigger than the  exact number of distinct values for partitioning it will return the number of partitions found in the data
     - if is smaller it will return the first n number of partitions dropping the other
     - if not specified, it will start from `i16::MAX`
   
   # Are there any user-facing changes?
   
   There is a new partitioning option available as  part of the API.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1054615462


   Thanks @andrei-ionescu  -- I am sorry I missed that this PR was waiting on feedback from us; Thank you @houqp  for taking that on!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1040676594


   Closing stale PRs. Please reopen (or open a new one) if you plan to keep working on this feature. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1050572054


   > I can help take a look into this next week and answer some of your questions if @Dandandan is busy with other things
   
   Thanks @houqp !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1086635402


   I believe that PR is the way forward and addresses my concerns I raised here earlier.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1002310491


   Thanks for showing interest in this functionality!
   
   I believe this functionality shouldn't be a part of the `RepartitionExec`. The partitions (confusingly) mentioned there are a unit of parallelism and distributing data, not a way to aggregate data like in the example for writing to different partitions. The partitions in `PartitionExec` are equivalent to Spark partitions and mirror that functionality / design.
   
   We already have some pieces that we could use to implement the write support:
   * hash aggregation, i.e. `group by`. This could be used to group the same rows and append them to the output directories / files. `HashAggregateExec` has different modes, local and global. For writing, only local mode is needed, as we can write the rows to multiple files within the same directory (just like Spark does). 
   * `RepartitionExec::Hash`, allowing the parallelize the operations on more CPUs and nodes in Ballista. This is already utilized by using the hash aggregate code.
   
   The missing piece here is the support/wiring to *write* the final partitions to directories / files using a partitioning scheme.
   
   I don't have to much time ATM to write up a proposal myself, but if you see which direction I am going to, I support to write a design for this to collect feedback on it.
   
   It would be wise to look after some similar engines, like Spark/Hive/Trino/etc. for some inspiration.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1002509571


   @Dandandan Thanks for the message. I didn't have any other way of reading data and writing it partitioned to the storage. If you have such example please guide me to it.
   
   My use case is this: read parquet data (multiple files) re-partition them and write them as OSS Delta. In my case I read the data in memory and collect it as partitioned then write each partition as separate file.
   
   Can you give me an example using DataFrame API of writing a data as parquet? Or if there is a section where I can implement the Delta writer or any other writer?
   
   I do not agree with the fact that this useful only for the write moment and neither going towards grouping and aggregating is a good way. I'm thinking of the following reasons:
   
   - When using `group by` it requires an aggregation function and this is not what it is needed. We don't aggregate anything we just shuffle the data in a specific way. 
   - I've seen that the shuffling process is implemented nicely using channels (for the other repartitioning modes) and I think that the same approach has to be used in this case too. 
   - Parallel processing is still needed after partitioning the data in this way. Summarising metrics or doing some aggregations  over such partitioned data is easy parallelizable which is a benefit. 
   - Using aggregation instead of shuffling will incur more performance penalty due to the fact that it needs to apply aggregation and keep some state.
   
   There are other benefits or pros that this partitioning brings in but in terms of cons I cannot think of too many and no blocker (I have limited knowledge of DataFusion at this moment). Some that come to my mind are these:
   - In some cases there is a more unbalanced partitioning trend which is ok. In the case of `Partitioning::Hash` the unbalanced partitioning happens too. So I don't know if this is really a downside.
   - A bit slower than the `Partitioning::Hash` due to locking.
   
   If there are any other cons that I'm not aware of, please explain them to me. Those will improve my knowledge on DataFusion.
   
   If the name the I did give to this functionality is not good - let's say that `PartitionBy` term is used for writing and that is misleading - then we can say something else like `ShuffleByExpression` or something else.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1046705564


   @alamb How can I reopen this PR? I don't see any re-open PR button. This is not a stale PR is just waiting feedback from your side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1026192366


   Marking PRs without activity in the last month as stale. I'll plan to close it in another month or so without activity, though feel free to reopen it when you have time to work on it)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb closed pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1050570899


   I can help take a look into this next week and answer some of your questions if @Dandandan is busy with other things


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andrei-ionescu commented on pull request #1500: Add support for PartitionBy functionality

Posted by GitBox <gi...@apache.org>.
andrei-ionescu commented on pull request #1500:
URL: https://github.com/apache/arrow-datafusion/pull/1500#issuecomment-1050238261


   @alamb, @houqp: I don't know why this PR was considered stalled and closed when the last comment was mine and this PR is waiting feedback from your side. I don't know how to reopen the PR as I see no button to do that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org