Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/09 09:46:04 UTC

[GitHub] [arrow] svjack opened a new issue #9146: How to use cast time-string to timestamp[ms] in expression in filters of _ParquetDatasetV2 in pyarrow

svjack opened a new issue #9146:
URL: https://github.com/apache/arrow/issues/9146


   I reviewed the differences between ParquetDataset and _ParquetDatasetV2 in the source code;
   they use different logic to perform partition filtering.
   The former simply uses _filters, while the latter combines _filters with the result of _filters_to_expression.
   I think the latter's design is more useful, because an expression can carry field transformations
   (such as cast and other column transformations) that are applied before the actual filter.
   My problem is that I can't cast from string to timestamp in a ChunkedArray (the format in which a field
   or column is actually stored in a table), so I can't use this to simplify some of the logic in filters.
   For example, with filters like
   [[("backup_time", ">", pd.to_datetime("2020-01-01"))]]
   where "backup_time" is a partition whose time string is (not) well formatted,
   I want to override the _filters_to_expression function and use field.cast to convert the field type from
   string to timestamp before performing the filter.
   I could even register more complex functions in pyarrow.compute to define calculations
   on partitions as custom functions in an expression; this is all I need.
   With the help of expressions, I want to extend _filters_to_expression from plain
   field-to-value comparisons to func(field) comparisons, or even field -> field_object comparisons;
   think of it as applying something like a Spark UDF to the field, udf(field).
   How can I do this gracefully? It would help read performance when partitions have a complex
   format.
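
   To make the func(field) idea concrete, here is a minimal sketch in plain Python.
   It is not pyarrow's API and does not touch _ParquetDatasetV2: the helpers
   parse_backup_time, filters_to_predicate and partition_values, the accepted time
   formats and the hive-style paths are all hypothetical and exist only to
   illustrate applying a transform to a partition value before comparing it.

```python
from datetime import datetime
import operator

# Comparison operators accepted in (column, op, value) filter tuples.
OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}


def parse_backup_time(raw):
    """Hypothetical transform: parse a loosely formatted time string."""
    for fmt in ("%Y-%m-%d", "%Y%m%d", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized time string: {raw!r}")


def filters_to_predicate(filters, transforms=None):
    """Turn DNF filters (a list of AND-groups, OR-ed together) into a
    predicate over a dict of partition values. transforms[column], if
    given, is applied to the raw value before the comparison."""
    transforms = transforms or {}

    def predicate(partition):
        def holds(column, op, value):
            transform = transforms.get(column, lambda x: x)
            return OPS[op](transform(partition[column]), value)

        return any(all(holds(*cond) for cond in group) for group in filters)

    return predicate


def partition_values(path):
    """Extract hive-style key=value segments from a path (assumed layout)."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


# Made-up partition paths with inconsistently formatted time strings.
paths = [
    "data/backup_time=2019-12-31/part-0.parquet",
    "data/backup_time=20200102/part-0.parquet",
    "data/backup_time=2020-03-01 08:00:00/part-0.parquet",
]
filters = [[("backup_time", ">", datetime(2020, 1, 1))]]
keep = filters_to_predicate(filters, {"backup_time": parse_backup_time})
selected = [p for p in paths if keep(partition_values(p))]
print(selected)
```

   In this sketch the filter tuples stay in the usual (column, op, value) form and
   the transform is looked up per column, so only the partition string is parsed
   and no data file has to be opened to decide which paths to read.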


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] svjack removed a comment on issue #9146: How to use cast time-string to timestamp[ms] in expression in filters of _ParquetDatasetV2 in pyarrow

Posted by GitBox <gi...@apache.org>.
svjack removed a comment on issue #9146:
URL: https://github.com/apache/arrow/issues/9146#issuecomment-757690864


   I found that the functionality I asked for above seems to be implemented by the gandiva module in pyarrow.
   I searched for material about it; Gandiva's main interfaces seem to be Java and C++, and
   I found only one Python example, which registers a Gandiva function as a dataframe_accessor for a pandas DataFrame (to speed up query execution).
   Could you point me to some examples of Gandiva usage in Python and pyarrow?
   Thanks.





[GitHub] [arrow] svjack commented on issue #9146: How to use cast time-string to timestamp[ms] in expression in filters of _ParquetDatasetV2 in pyarrow

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9146:
URL: https://github.com/apache/arrow/issues/9146#issuecomment-757690864







[GitHub] [arrow] wesm closed issue #9146: How to use cast time-string to timestamp[ms] in expression in filters of _ParquetDatasetV2 in pyarrow

Posted by GitBox <gi...@apache.org>.
wesm closed issue #9146:
URL: https://github.com/apache/arrow/issues/9146


   





[GitHub] [arrow] svjack commented on issue #9146: How to use cast time-string to timestamp[ms] in expression in filters of _ParquetDatasetV2 in pyarrow

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9146:
URL: https://github.com/apache/arrow/issues/9146#issuecomment-757691006


   I found that the functionality I asked for seems to be implemented by the gandiva module in pyarrow.
   I searched for material about it; Gandiva's main interfaces seem to be Java and C++, and
   I found only one Python example, which registers a Gandiva function as a dataframe_accessor for a pandas DataFrame (to speed up query execution).
   Could you point me to some examples of Gandiva usage in Python and pyarrow?
   Thanks.





[GitHub] [arrow] wesm commented on issue #9146: How to use cast time-string to timestamp[ms] in expression in filters of _ParquetDatasetV2 in pyarrow

Posted by GitBox <gi...@apache.org>.
wesm commented on issue #9146:
URL: https://github.com/apache/arrow/issues/9146#issuecomment-769951602


   We don't handle user questions on GitHub issues -- please write to user@arrow.apache.org

