Posted to github@arrow.apache.org by "giacomorebecchi (via GitHub)" <gi...@apache.org> on 2023/12/12 10:26:36 UTC

[I] ARRAY_AGG of column of type list ORDER BY column of type non-list [arrow-datafusion]

giacomorebecchi opened a new issue, #8512:
URL: https://github.com/apache/arrow-datafusion/issues/8512

   ### Describe the bug
   
   In version 33.0.0, I encountered the following bug, which is not present in version 32.0.0.
   Running `ARRAY_AGG()` over a column of list type with an `ORDER BY` on a column of non-list type fails with the following error:
   `Execution error: Expects values arguments and/or ordering_values arguments to have same size`
   
   ### To Reproduce
   
   Here is a minimal reproducible example (MRE) in Python:
   `pip install "pyarrow==14.0.0" "datafusion==33.0.0"`
   
   ```python
   import datetime
   import random
   
   import datafusion
   import pyarrow as pa
   import pyarrow.dataset as pda
   
   N_ROWS = 10_000
   N_CARDS = 1_000
   N_PRODUCTS = 50
   
   ta = pa.Table.from_pydict(
       {
           "Card.Id": random.choices([str(i) for i in range(N_CARDS)], k=N_ROWS),
           "Date": (datetime.date(2023, (i % 12) + 1, (i % 28) + 1) for i in range(N_ROWS)),
           "Product.Ids": [random.choices([i for i in range(N_PRODUCTS)], k=2) for i in range(N_ROWS)]
       }
   )
   
   query = """
   SELECT
       "Card.Id"
       , FIRST_VALUE("Product.Ids" ORDER BY "Date")
       , LAST_VALUE("Product.Ids" ORDER BY "Date")
       , ARRAY_AGG("Product.Ids" ORDER BY "Date")
   FROM "table"
   GROUP BY "Card.Id"
   """
   
   ctx = datafusion.SessionContext()
   ctx.register_dataset(name="table",
                        dataset=pda.dataset(ta))
   df = ctx.sql(query)
   compute_ta = pa.Table.from_batches(df.collect())
   ```
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_
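
Since the report leaves the expected behavior blank, here is a minimal plain-Python sketch of what the query would be expected to return under standard SQL semantics: one array per `"Card.Id"`, holding that card's `"Product.Ids"` lists ordered by `"Date"`. This is a reference model only, not DataFusion code; the function name `array_agg_ordered` and the tuple layout of `rows` are assumptions made for illustration.

```python
import datetime
from collections import defaultdict


def array_agg_ordered(rows):
    """Model ARRAY_AGG(product_ids ORDER BY date) ... GROUP BY card_id.

    `rows` is an iterable of (card_id, date, product_ids) tuples
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for card_id, date, product_ids in rows:
        groups[card_id].append((date, product_ids))
    # sorted() is stable, so rows with equal dates keep their input order;
    # SQL itself leaves the relative order of such ties unspecified.
    return {
        card_id: [ids for _, ids in sorted(pairs, key=lambda pair: pair[0])]
        for card_id, pairs in groups.items()
    }


rows = [
    ("a", datetime.date(2023, 3, 1), [5, 6]),
    ("a", datetime.date(2023, 1, 1), [1, 2]),
    ("b", datetime.date(2023, 2, 1), [3, 4]),
]
result = array_agg_ordered(rows)
print(result)
```

In this model, `FIRST_VALUE(... ORDER BY "Date")` and `LAST_VALUE(... ORDER BY "Date")` correspond to the first and last element of each group's array.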


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] ARRAY_AGG of column of type list ORDER BY column of type non-list [arrow-datafusion]

Posted by "giacomorebecchi (via GitHub)" <gi...@apache.org>.
giacomorebecchi commented on issue #8512:
URL: https://github.com/apache/arrow-datafusion/issues/8512#issuecomment-1853664588

   Great, thanks! Maybe it could be worth adding a test?




Re: [I] ARRAY_AGG of column of type list ORDER BY column of type non-list [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8512:
URL: https://github.com/apache/arrow-datafusion/issues/8512#issuecomment-1953645749

   @giacomorebecchi I think it is resolved. Feel free to reopen the issue if it fails in release version 36 




Re: [I] ARRAY_AGG of column of type list ORDER BY column of type non-list [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8512:
URL: https://github.com/apache/arrow-datafusion/issues/8512#issuecomment-1956617939

   > @giacomorebecchi I think it is resolved. Feel free to reopen the issue if it fails in release version 36
   
   It seems to be version 37 rather than 36.




Re: [I] ARRAY_AGG of column of type list ORDER BY column of type non-list [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 closed issue #8512: ARRAY_AGG of column of type list ORDER BY column of type non-list
URL: https://github.com/apache/arrow-datafusion/issues/8512




Re: [I] ARRAY_AGG of column of type list ORDER BY column of type non-list [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8512:
URL: https://github.com/apache/arrow-datafusion/issues/8512#issuecomment-1853872555

   I think it will still be an issue in DataFusion 34.
   
   I found that the array we get in `merge_batch` is `ListArray[ListArray[I64Array[1,2], I64Array[3,4], I64Array[5,6]]]`, so when we convert it to a `scalar_vec` (via `convert_array_to_scalar_vec`), we get an unexpected result. I'm not sure whether we need to fix the `scalar_vec` conversion or the values we receive before `merge_batch`.
   
   Currently, no test covers array cases that go through `merge_batch`. I also think such a test is hard to add, since we have no way to create an array column in a CSV table, which is the kind of table that goes through `merge_batch`. With a normal SQL table, the plan is optimized and does not follow the same workflow as a CSV table (it goes through `update_batch`).
   
   I will look into fixing this issue and try to find a way to add this kind of test if possible.
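
To make the shape problem above concrete, here is a plain-Python sketch of one way a nested list state could lead to a size mismatch between values and ordering values. It is an illustration under assumptions, not DataFusion's implementation: the comment only says the conversion gave an unexpected result, so the over-flattening shown here is a guess at the mechanism, and `unwrap_one_level`, `flatten_fully`, and `check_same_size` are hypothetical names.

```python
def unwrap_one_level(state):
    """Strip only the outer state wrapper, keeping each aggregated list whole."""
    return [value for row in state for value in row]


def flatten_fully(state):
    """Strip the wrapper AND the per-value lists (one level too many)."""
    return [x for row in state for value in row for x in value]


def check_same_size(values, ordering_values):
    """Hypothetical stand-in for the size check behind the reported error."""
    if len(values) != len(ordering_values):
        raise ValueError(
            f"values ({len(values)}) and ordering values "
            f"({len(ordering_values)}) differ in size"
        )


# Partial state for one group, mirroring the shape quoted above:
# one outer row wrapping three aggregated values, each itself a list.
values_state = [[[1, 2], [3, 4], [5, 6]]]
# Assumed ordering state: one ordering value per aggregated value.
ordering_state = [["2023-01-01", "2023-02-01", "2023-03-01"]]

ordering_values = unwrap_one_level(ordering_state)

good = unwrap_one_level(values_state)
check_same_size(good, ordering_values)  # 3 values vs 3 ordering values

bad = flatten_fully(values_state)
try:
    check_same_size(bad, ordering_values)  # 6 values vs 3 ordering values
    mismatch = False
except ValueError:
    mismatch = True
```

The point of the sketch is only that, when the aggregated values are themselves lists, the number of nesting levels removed during conversion determines whether the value count still lines up with the ordering-value count.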




Re: [I] ARRAY_AGG of column of type list ORDER BY column of type non-list [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8512:
URL: https://github.com/apache/arrow-datafusion/issues/8512#issuecomment-1852817577

   Thank you for the report @giacomorebecchi 🙏 
   
   @jayzhan211, @Veeupup, and @Weijun-H have done significant work on various array functions in 34.0.0 (about to be released), so this issue may already be fixed.

