You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "daddywantssugar (via GitHub)" <gi...@apache.org> on 2023/06/25 07:15:42 UTC

[GitHub] [arrow] daddywantssugar opened a new issue, #36283: parquet pushdown predicate dataset.field.isin() much slower than or '|'

daddywantssugar opened a new issue, #36283:
URL: https://github.com/apache/arrow/issues/36283

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   apologies if this isn't a bug but I assert still surprising behaviour: When pushing predicates down to parquet reads, using the "or" | syntax returns in pretty fast as expected in 5s. but the read with equivalent isin() predicate takes over an order of magnitude longer- 140s.
   
   ```
   
   import pandas as pd
   import pyarrow.dataset as ds
   import s3fs
   from contexttimer import Timer #pip install or use your own timer
   
   fs = s3fs.S3FileSystem(anon=True) # doesn't actually require s3, network share will exhibit this as well
   rawpath = f'nyc-taxi-test/weather_sorted.parquet' # sorted by longitude to accentuate the issue
   filters = [
       (ds.field("longitude") == -10.8) | (ds.field("longitude") == -11.4), # 5s
       ds.field("longitude").isin([-11.4, -10.8]), # 143s
       
       (ds.field("longitude") == 10.2) | (ds.field("longitude") == 10.5), # 9s
       ds.field("longitude").isin([10.2, 10.5]), # 135s
       
       None, # no filter baseline #150s
   ]
   for filter in filters:
       with Timer() as t:
           with fs.open(rawpath, 'rb') as f:
               df = pd.read_parquet(f, filters=filter)
       print('time: ', t, 'size: ', len(df))
   
   """
   time:  5.372 size:  26353
   time:  143.137 size:  26353
   time:  9.685 size:  59565
   time:  135.809 size:  59565 
   time:  153.935 size:  12736802  
   """
   
   ```
   
   tested on Windows 10
   python 3.10
   pandas 2.0.2
   pyarrow 12.0.1
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #36283: parquet pushdown predicate dataset.field.isin() much slower than or '|'

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36283:
URL: https://github.com/apache/arrow/issues/36283#issuecomment-1631536946

   Off the top of my head I think it would be easier to add support for `is_in` to the simplification logic.  However, rewriting to an or chain should work too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] daddywantssugar commented on issue #36283: parquet pushdown predicate dataset.field.isin() much slower than or '|'

Posted by "daddywantssugar (via GitHub)" <gi...@apache.org>.
daddywantssugar commented on issue #36283:
URL: https://github.com/apache/arrow/issues/36283#issuecomment-1605910313

   Here is a similar example where the problem is not as pronounced but on a more popular nyc-taxi dataset sorted by passenger count:
   
   ```
   import pandas as pd
   import pyarrow.dataset as ds
   import s3fs
   from contexttimer import Timer #pip install or use your own timer
   
   fs = s3fs.S3FileSystem(anon=True) # doesn't actually require s3, network share will exhibit this as well
   rawpath = f'nyc-taxi-test/taxi_sorted.parquet' # sorted by passenger_count to accentuate the issue
   filters = [
       (ds.field("passenger_count") == 10) | (ds.field("passenger_count") == 6), #1.5s
       ds.field("passenger_count").isin([10, 6]), #6s
       None, # 58s  no filter baseline
   ]
   for filter in filters:
       with Timer() as t:
           with fs.open(rawpath, 'rb') as f:
               df = pd.read_parquet(f, filters=filter)
       print('time: ', t, 'size: ', len(df))
   
   """
   time:  1.079 size:  63817
   time:  6.445 size:  63817
   time:  58.441 size:  14092413
   """
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #36283: parquet pushdown predicate dataset.field.isin() much slower than or '|'

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36283:
URL: https://github.com/apache/arrow/issues/36283#issuecomment-1613227675

   This is expected but could be improved.  Parquet predicate pushdown works like so:
   
    * Extract a row group guarantee from parquet statistics (e.g. `30 < x < 70 && 0 < y < 100`)
    * Call `SimplifyWithGuarantee` on the filter, given the above guarantee
      * For example, a filter `x == 100 && z < 20` would simplify to `false`.
   
   The `SimplifyWithGuarantee` method does not understand `isin`.  It could be improved to do so if someone were interested.  The place to make the change would be here I think: https://github.com/apache/arrow/blob/apache-arrow-12.0.1/cpp/src/arrow/compute/expression.cc#L1230
   
   First we "extract known values" (places in the guarantee where we have something like x == 7).  This usually wouldn't apply because equality guarantees come from partitioning and not from parquet statistics.
   
   Second, we consider inequalities in the guarantee.  This is the part that is critical for parquet predicate pushdown.  We then call Inequality::Simplify which looks for places in the filter that are:
   
    * calls to is_valid or is_null (these might be simplified by an inequality)
    * comparisons (these might also be simplified by an inequality)
   
   I think the point you are making is that `isin` is another function that may be simplified by an inequality.  If we know that x > 100 and the filter is `isin(0, 7, 12)` then we can simplify this to `literal(false)`.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on issue #36283: parquet pushdown predicate dataset.field.isin() much slower than or '|'

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #36283:
URL: https://github.com/apache/arrow/issues/36283#issuecomment-1615828090

   @westonpace I'd like to fix this. Would I change `is_in` to `or(eq(e, expr1), eq(e, expr2), ...)` or using other methods?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org