Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/12 04:15:14 UTC

[GitHub] [arrow] svjack opened a new issue #9172: Is Expression have decomposition methods ?

svjack opened a new issue #9172:
URL: https://github.com/apache/arrow/issues/9172


   pyarrow.dataset.Expression seems to support __invert__, and, and or.
   These are all useful for constructing a new expression from existing ones.
   But what about decomposing an expression into its components?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] svjack commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-759156383


   So if you don't care about this, could you expose more of the Cython API to Python (or make more of the C++ API visible to pure-Python users)? That way Python developers would not be limited to being mere users with no access to the underlying logic.
   I understand this may sacrifice some performance.





[GitHub] [arrow] svjack commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-761539643


   > > But because the methods I can apply to Expression are limited,
   >
   > It's certainly planned to expand what you can express in the filters (basically all compute kernels should be possible). The functionality already exists in the C++ library, but needs to be exposed in Python.
   >
   > > 1. So I think some boolean simplification should be supported, such as:
   > > (ExpressionA or ExpressionB) and ExpressionA -> ExpressionA
   >
   > Functionality to simplify expressions is not exposed (but it will be done under the hood though, when passing such an expression as a filter). Feel free to open JIRA issues with very specific feature requests.

   I created a repository about using casts in filter expressions:
   https://github.com/svjack/PyArrowExpressionCastToolkit
   Does PyArrow support this feature?





[GitHub] [arrow] wesm closed issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
wesm closed issue #9172:
URL: https://github.com/apache/arrow/issues/9172


   





[GitHub] [arrow] svjack commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-766864484


   > Can you give an example of a cast that does not work? The filter expressions already support casting, and should normally be cast to the type of the column in the dataset's schema.

   Did you mean that
   ("time_two_pos", ">", pd.to_datetime("1970-01-01 00:24:01.200000001"))
   can be passed as a filter argument (or the same with pd replaced by pa.array().cast())?








[GitHub] [arrow] jorisvandenbossche commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-766845978


   Can you give an example of a cast that does not work? The filter expressions already support casting, and should normally be cast to the type of the column in the dataset's schema.





[GitHub] [arrow] jorisvandenbossche commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-766938020


   When using the legacy dataset implementation, it seems to work (although it does a string-based comparison rather than comparing actual datetimes), but indeed not with the new implementation (I actually get a segfault). I reported this as https://issues.apache.org/jira/browse/ARROW-11379
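
To illustrate why a string-based comparison can differ from a comparison of actual datetimes, here is a minimal sketch in plain Python. It does not use pyarrow at all; the variable names and the two sample values are assumptions chosen for illustration, not taken from the dataset discussed above.

```python
from datetime import datetime

# Hypothetical values: the same instant written in two textual forms.
stored = "2012-01-05 00:00:00"
bound = "2012-01-05"

# String comparison is lexicographic, so the longer string sorts after its prefix.
as_strings = stored > bound

# Datetime comparison parses both values first and sees the same instant.
as_datetimes = datetime.fromisoformat(stored) > datetime.fromisoformat(bound)

print(as_strings, as_datetimes)
```

The two comparisons disagree for this pair, which is the kind of mismatch a string-based filter on a datetime column could produce.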





[GitHub] [arrow] svjack commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-766881382


   > Can you provide a reproducible example?
   > 
   > The following small example works for me:
   > 
   > ```
   > In [43]: df = pd.DataFrame({"a": pd.date_range("2012-01-01", periods=10, freq="D")})
   > 
   > In [44]: df.to_parquet("test_filter_datetimes.parquet")
   > 
   > In [45]: import pyarrow.parquet as pq
   > 
   > In [46]: pq.read_table("test_filter_datetimes.parquet").to_pandas()
   > Out[46]: 
   >            a
   > 0 2012-01-01
   > 1 2012-01-02
   > 2 2012-01-03
   > 3 2012-01-04
   > 4 2012-01-05
   > 5 2012-01-06
   > 6 2012-01-07
   > 7 2012-01-08
   > 8 2012-01-09
   > 9 2012-01-10
   > 
   > In [47]: pq.read_table("test_filter_datetimes.parquet", filters=[("a", ">", pd.Timestamp("2012-01-05"))]).to_pandas()
   > Out[47]: 
   >            a
   > 0 2012-01-06
   > 1 2012-01-07
   > 2 2012-01-08
   > 3 2012-01-09
   > 4 2012-01-10
   > ```
   
   If you save df to Parquet partitioned by a, does it also work?





[GitHub] [arrow] svjack commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-759153642


   > @svjack because the Expression APIs are somewhat provisional, we didn't expose much functionality to inspect / interact with it, for now.
   >
   > Do you have a specific use case for this?

   I searched for usages of Expression in the pyarrow project.
   It seems that each element of dataset.pieces, i.e. a piece (Fragment),
   has partition_expression as an attribute.
   Suppose I want to filter some pieces by their partition_expression. The officially supported method is to use the filters (Expression) argument of the DatasetV2 constructor.
   But because the methods I can apply to Expression are limited,
   I replaced the officially supported method with a custom filter on pieces, and I want to use partition_expression as the formal representation of a piece (partition Fragment).
   1. So I think some boolean simplification should be supported, such as:
   (ExpressionA or ExpressionB) and ExpressionA -> ExpressionA
   2. I did not dive into the underlying logic of how Expression filters are executed. But consider the case below:
   	total_expression = reduce(lambda pe_a, pe_b: pe_a.__or__(pe_b), map(lambda piece: piece.partition_expression, dataset.pieces))
   total_expression can be seen as a trivial expression for the union of all pieces. If the underlying logic simplifies total_expression first and then executes the simplified expression, I think this may be faster than performing a lot of __or__ (union) operations over many fragments.

   So I would like complex Expressions to have a logical simplification method and some kind of pre-simplification at execution time.
   I think this depends on a function that can retrieve the minimal logical units from total_expression (this is about the elements). As for the "op" ("=", "in" and so on in _filters_to_expression), there should be a formal inverse method to transform an Expression back into filters (constructed from nested Python collections).
   With the help of these functions, Expression would be complete in both the algebraic (mathematical) and the programming sense.
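
The decomposition and the inverse transformation requested above can be sketched with a toy expression tree in plain Python. This is only an illustration under stated assumptions: Leaf, And, Or, leaves and to_filters are hypothetical names invented here, not part of pyarrow, and the output format merely mimics the nested list-of-lists-of-tuples (DNF) form that the filters argument accepts.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    """A single comparison such as ("year", "=", 2020). Hypothetical type."""
    column: str
    op: str
    value: object


@dataclass(frozen=True)
class And:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Or:
    left: "Node"
    right: "Node"


Node = Union[Leaf, And, Or]


def leaves(node: Node) -> List[Leaf]:
    """Decompose an expression into its minimal units (the comparisons)."""
    if isinstance(node, Leaf):
        return [node]
    return leaves(node.left) + leaves(node.right)


def to_filters(node: Node) -> List[List[Tuple[str, str, object]]]:
    """Convert a tree to DNF: an outer list of OR-ed groups of AND-ed tuples."""
    if isinstance(node, Leaf):
        return [[(node.column, node.op, node.value)]]
    left, right = to_filters(node.left), to_filters(node.right)
    if isinstance(node, Or):
        return left + right
    # And: distribute over the OR-ed groups of both sides.
    return [l + r for l in left for r in right]


expr = And(
    Or(Leaf("year", "=", 2020), Leaf("year", "=", 2021)),
    Leaf("month", "in", (1, 2)),
)
print(leaves(expr))
print(to_filters(expr))
```

Distributing And over Or can grow the result quickly for deeply nested inputs, so a real implementation would probably want to simplify the tree before converting it.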





[GitHub] [arrow] jorisvandenbossche commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-758646081


   @svjack because the Expression APIs are somewhat provisional, we didn't expose much functionality to inspect / interact with it, for now. 
   
   Do you have a specific use case for this?





[GitHub] [arrow] jorisvandenbossche commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-760877747


   > But because the methods I can apply to Expression are limited,

   It's certainly planned to expand what you can express in the filters (basically all compute kernels should be possible). The functionality already exists in the C++ library, but needs to be exposed in Python.

   > 1. So I think some boolean simplification should be supported, such as:
   > (ExpressionA or ExpressionB) and ExpressionA -> ExpressionA

   Functionality to simplify expressions is not exposed (but it will be done under the hood though, when passing such an expression as a filter). Feel free to open JIRA issues with very specific feature requests.
   
   





[GitHub] [arrow] svjack commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
svjack commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-759220949


   Without a logical simplification method, "equals" cannot return true for ((ExpressionA or ExpressionB) and ExpressionA) == ExpressionA.
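
The difference between the two notions of equality can be shown with a small plain-Python sketch. It uses ordinary booleans and nested tuples as stand-ins for expressions; nothing here is pyarrow API, and the names absorbed, tree_long and tree_short are assumptions made for the example.

```python
from itertools import product


def absorbed(a: bool, b: bool) -> bool:
    """The left-hand side of the absorption law: (A or B) and A."""
    return (a or b) and a


# Semantic check: the two forms agree on every assignment of truth values.
equivalent = all(
    absorbed(a, b) == a for a, b in product([False, True], repeat=2)
)

# Structural check: written as nested tuples, the two forms are different trees.
tree_long = ("and", ("or", "A", "B"), "A")
tree_short = "A"
structurally_equal = tree_long == tree_short

print(equivalent, structurally_equal)
```

The two expressions are logically equivalent but structurally different, so a purely structural comparison would report them as unequal unless a simplification step runs first.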








[GitHub] [arrow] jorisvandenbossche commented on issue #9172: Is Expression have decomposition methods ?

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on issue #9172:
URL: https://github.com/apache/arrow/issues/9172#issuecomment-766877382


   Can you provide a reproducible example?
   
   The following small example works for me:
   
   ```
   In [43]: df = pd.DataFrame({"a": pd.date_range("2012-01-01", periods=10, freq="D")})
   
   In [44]: df.to_parquet("test_filter_datetimes.parquet")
   
   In [45]: import pyarrow.parquet as pq
   
   In [46]: pq.read_table("test_filter_datetimes.parquet").to_pandas()
   Out[46]: 
              a
   0 2012-01-01
   1 2012-01-02
   2 2012-01-03
   3 2012-01-04
   4 2012-01-05
   5 2012-01-06
   6 2012-01-07
   7 2012-01-08
   8 2012-01-09
   9 2012-01-10
   
   In [47]: pq.read_table("test_filter_datetimes.parquet", filters=[("a", ">", pd.Timestamp("2012-01-05"))]).to_pandas()
   Out[47]: 
              a
   0 2012-01-06
   1 2012-01-07
   2 2012-01-08
   3 2012-01-09
   4 2012-01-10
   ```

