Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/08 05:03:29 UTC

[GitHub] [arrow-datafusion] jorgecarleitao opened a new issue #523: Number of output record batches for small datasets is large

jorgecarleitao opened a new issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523


   When running group-bys on small datasets, we are emitting too many record batches. This is a regression relative to ref 2423ff0d.
   
   This is causing the tests for the Python bindings to fail when upgrading to 321fda40.
   
   For example,
   
   ```
   batch = pyarrow.RecordBatch.from_arrays(
       [pyarrow.array([1, 2, 3]), pyarrow.array([4, 4, 6])],
       names=["a", "b"],
   )
   return ctx.create_dataframe([[batch]])
   ```
   
   with 
   
   ```
   df = df.aggregate([f.col("b")], [udaf(f.col("a"))])
   
   result = df.collect()
   ```
   
   is returning 4 record batches. I can't see a valid reason for a record batch with 3 slots to be split into 4 record batches.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #523: Number of output record batches for small datasets is large

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523#issuecomment-856475864


   I agree that the Python tests should just merge them. I was still a bit surprised that we split them even with such a low number of entries: it seems odd to me.
   
   





[GitHub] [arrow-datafusion] Dandandan commented on issue #523: Number of output record batches for small datasets is large

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523#issuecomment-856452504


   > When running group bys on small datasets, we are emitting too many record batches. This is a regression over ref [2423ff0](https://github.com/apache/arrow-datafusion/commit/2423ff0dd1fe9c0932c1cb8d1776efa3acd69554) .
   > 
   > This is causing the tests for the Python bindings to fail when upgrading to [321fda4](https://github.com/apache/arrow-datafusion/commit/321fda40a47bcc494c5d2511b6e8b02c9ea975b4).
   > 
   > For example,
   > 
   > ```
   > batch = pyarrow.RecordBatch.from_arrays(
   >     [pyarrow.array([1, 2, 3]), pyarrow.array([4, 4, 6])],
   >     names=["a", "b"],
   > )
   > return ctx.create_dataframe([[batch]])
   > ```
   > 
   > with
   > 
   > ```
   > df = df.aggregate([f.col("b")], [udaf(f.col("a"))])
   > 
   > result = df.collect()
   > ```
   > 
   > is returning 4 record batches. I can't see a valid reason for a record batch of 3 slots to be split in 4 record batches.
   
   I think that is due to (hash) partitioning in group by / join, which splits the rows into multiple partitions/batches.
   Generally, I think the unit tests should not rely on the number of batches, or on the order of rows when unsorted, as DataFusion should be free to reorder however it wants.
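
   The routing can be sketched in a few lines: each row's group key is hashed modulo the target partition count, so even a 3-row input can fan out into several non-empty partitions, each of which becomes its own output batch. (A toy model only; DataFusion's actual hash function and partition count differ.)

```
NUM_PARTITIONS = 4  # stand-in for the configured target partition count

rows = [(4, 1), (4, 2), (6, 3)]  # (group key "b", value "a") from the example

# Route each row to a partition by hashing its group key.
partitions = {i: [] for i in range(NUM_PARTITIONS)}
for key, value in rows:
    partitions[hash(key) % NUM_PARTITIONS].append((key, value))

# Every non-empty partition becomes its own output record batch,
# which is why a tiny input can still produce multiple batches.
non_empty = [p for p in partitions.values() if p]
assert sum(len(p) for p in non_empty) == len(rows)
```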





[GitHub] [arrow-datafusion] Dandandan commented on issue #523: Number of output record batches for small datasets is large

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #523:
URL: https://github.com/apache/arrow-datafusion/issues/523#issuecomment-856483448


   > I agree that Python just should merge them on the tests. I was a bit surprised that even in such a low number of entries we are splitting them: seems odd to me.
   
   There could be some heuristics / optimizations to not apply partitioning for small datasets (when the size is known upfront). For example, with a hash join this can be beneficial when the left side is very small compared to the right side: hash partitioning the right side in that case could be slower than building the left side in a single thread / worker.
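
   Such a heuristic could look roughly like the following; the threshold, names, and shape of the check are made up here for illustration and are not DataFusion's API:

```
SMALL_INPUT_THRESHOLD = 1024  # made-up cutoff for "small enough to skip"

def should_repartition(num_rows, target_partitions):
    """Repartition only when the extra parallelism can plausibly pay off."""
    if num_rows is None:
        # Size unknown upfront: keep the current behavior.
        return target_partitions > 1
    return num_rows > SMALL_INPUT_THRESHOLD and target_partitions > 1

assert should_repartition(3, 4) is False   # tiny input: stay in one partition
assert should_repartition(1_000_000, 4) is True
assert should_repartition(None, 4) is True  # unknown size: repartition as today
```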

