Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/15 06:48:24 UTC

[GitHub] [arrow] vibhatha opened a new pull request, #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

vibhatha opened a new pull request, #14641:
URL: https://github.com/apache/arrow/pull/14641

   Updated the `parse` function in `Partitioning` to accept a list of paths and produce a single combined expression.
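
   A hypothetical usage sketch of the proposed behaviour (this change was never merged; the partitioning schema and paths below are illustrative assumptions):

   ```
   import pyarrow as pa
   import pyarrow.dataset as ds

   part = ds.partitioning(pa.schema([("month_id", pa.int32())]), flavor="hive")

   # Existing behaviour: one path yields one expression.
   expr = part.parse("/month_id=202105")  # -> (month_id == 202105)

   # Proposed behaviour in this PR: a list of paths would yield a single
   # OR-combined expression.
   # expr = part.parse(["/month_id=202105", "/month_id=202106"])
   # -> ((month_id == 202105) or (month_id == 202106))
   ```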




[GitHub] [arrow] vibhatha commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1318085476

   @ldacey if there are new requirements adjacent to this work, or entirely new work, let's discuss them over an email thread or a JIRA.
   
   I am closing this.
   
   cc @westonpace @jorisvandenbossche 




[GitHub] [arrow] vibhatha commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1315200552

   > See also my last comment on the JIRA. Since this is essentially doing `functools.reduce(operator.or_, [partitioning.parse(file) for file in paths])` (as there is no further simplification), I am not sure it is worth adding this capability to `parse()` instead of having the user do this reduce manually.
   
   I agree with you. If you take a look at my examples in the thread itself, I show the same intent of letting the user do it themselves. I only added this piece as a helper, and I have doubts about whether it is actually worth it.
   
   cc @westonpace WDYT?




[GitHub] [arrow] westonpace commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1316471121

   > I am actually not fully sure if the code to evaluate pushdown filters would actually understand an isin kernel. I think this is handled in SimplifyWithGuarantee:
   
   I'm pretty sure this is correct. The pushdown filtering is more primitive and only understands `<`, `==`, `>` (and maybe `!=` and `is_valid` / `is_null`?).
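
   For illustration, a minimal sketch of the kind of filter the pushdown path does understand (the dataset path and column name are assumptions): an OR chain of simple comparisons on a partition field, which can be simplified against each fragment's partition guarantee.

   ```
   import pyarrow.dataset as ds

   dataset = ds.dataset("path/to/data", partitioning="hive")
   filt = (ds.field("month_id") == 202105) | (ds.field("month_id") == 202106)
   # Fragments whose partition guarantee (e.g. month_id == 202107) contradicts
   # the filter are pruned before any data is read.
   table = dataset.to_table(filter=filt)
   ```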




[GitHub] [arrow] github-actions[bot] commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1314850659

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] vibhatha closed pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
vibhatha closed pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters
URL: https://github.com/apache/arrow/pull/14641




[GitHub] [arrow] ldacey commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
ldacey commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1317139268

   > I am actually not fully sure if the code to evaluate pushdown filters would actually understand an `isin` kernel. I think this is handled in `SimplifyWithGuarantee`:
   
   I have historically been consolidating filters into a single `.isin()` expression when my dataset has one partition column, since I assumed it would be faster/better.
   
   ```
   Consolidated dataset filter: {'month_id': [202105, 202106, 202107]}
   
   <pyarrow.compute.Expression is_in(month_id, {value_set=int64:[
     202105,
     202106,
     202107
   ], skip_nulls=false})>
   ```
   
   But if it is fine to use OR (and potentially have duplicate filters in cases where multiple fragments were saved to one partition), then I am fine with that. It sounds like a chain of OR filters might be used anyway.
   
   I think we can close this then. The reduce method above will cut out a lot of my custom, messy code. I currently have several functions preparing the expression for me (the `consolidate_dictionary()` helper removes the duplicate partitions returned by `ds._get_partition_keys`).
   
   ```
   # Context: these are methods of a larger helper class (not shown), and
   # `consolidate_dictionary` is a separate user-defined helper that merges
   # the parsed partition dicts into a single {column: [values]} dict.
   from typing import Any

   import pyarrow.dataset as ds

       def consolidate_expressions(
           self, expressions: list[ds.Expression], partition_count: int
       ) -> dict | list[dict]:
           """Consolidates the values of a multiple filters into a single list
   
           Args:
               expressions: Partitioning expressions
               partition_count: Number of partition columns
   
           Returns:
               Consolidate dictionary filter such as {'date_id': [20220507, 20220514]}
           """
           filters = [ds._get_partition_keys(exp) for exp in expressions]
           if partition_count == 1:
               filters = consolidate_dictionary(filters)
           return filters
   
       @staticmethod
       def multiple_partition_filter(
           partitions: list[dict],
       ) -> list[list[tuple[str, str, Any]]]:
           """Consolidates filters into a single list which has lists of tuples such as
               [[("date_id", "==", 20201231)], [("date_id", "==", 20210217)]]
   
           Args:
               partitions: Dataset partition keys
           """
           filters = []
           for part in partitions:
               element = [(k, "==", v) for k, v in part.items()]
               if element not in filters:
                   filters.append(element)
           return filters
   
       @staticmethod
       def single_partition_filter(
           partitions: dict,
       ) -> list[tuple[str, str, Any]]:
           """Consolidates filters into a single list of a tuple such as
               [("date_id", "in", [20201231, 20200101, 20200102, 20200103])]
   
           Args:
               partitions: Dataset partition keys
           """
           return [(k, "in", v) for k, v in partitions.items()]
   ```
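
   For context, a hypothetical sketch of how the tuple-style filters produced above would be consumed (the path and values are assumptions); this is the `filters=` format accepted by `pyarrow.parquet`:

   ```
   import pyarrow.parquet as pq

   # Single partition column: one "in" tuple covers all requested partitions.
   filters = [("date_id", "in", [20201231, 20200101, 20200102])]
   table = pq.read_table("path/to/data", filters=filters)
   ```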




[GitHub] [arrow] jorisvandenbossche commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1315047402

   See also my last comment on the JIRA. Since this is essentially doing `functools.reduce(operator.or_, [partitioning.parse(file) for file in paths])` (as there is no further simplification), I am not sure it is worth adding this capability to `parse()` instead of having the user do this reduce manually.
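
   Spelled out as a self-contained sketch (the partitioning schema and paths are assumptions):

   ```
   import functools
   import operator

   import pyarrow as pa
   import pyarrow.dataset as ds

   partitioning = ds.partitioning(
       pa.schema([("month_id", pa.int32())]), flavor="hive"
   )
   paths = [
       "path/to/data/month_id=202105/part-0.parquet",
       "path/to/data/month_id=202106/part-0.parquet",
   ]
   # Each parse() yields e.g. (month_id == 202105); reduce ORs them together.
   expr = functools.reduce(
       operator.or_, [partitioning.parse(path) for path in paths]
   )
   ```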




[GitHub] [arrow] jorisvandenbossche commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1315460858

   > Do I need to worry about there being multiple filters for 202107 in that expression? In my example, there are 3 fragments in that partition.
   
   I don't think you really have to worry about that (it won't change behaviour), although it might make it a bit less efficient to apply the filter (not sure off the top of my head).
   
   > 2. There is no benefit in using `.isin()` for a single partition compared to `or`, right? In terms of performance/efficiency they are the same?
   
   In general, I think an `isin` filter will certainly be more efficient than the equivalent with multiple boolean comparisons (that's one of the goals for having `isin`). 
   But that's when talking about applying such a filter to actual, materialized data. In the specific case where this filter only applies to partitioning fields, I suppose the situation is different. I am actually not fully sure whether the code that evaluates pushdown filters understands an `isin` kernel. I _think_ this is handled in `SimplifyWithGuarantee`:
   
   https://github.com/apache/arrow/blob/058d4f697a06477539e7f9ccf3e7c035f8cfbc5e/cpp/src/arrow/compute/exec/expression.cc#L1144-L1188
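
   For illustration, the two filter forms being compared (the column name is an assumption; this is a sketch, not a benchmark):

   ```
   import pyarrow.dataset as ds

   # One is_in kernel call over the column...
   isin_expr = ds.field("month_id").isin([202105, 202106, 202107])

   # ...versus three comparisons combined with boolean ORs.
   or_expr = (
       (ds.field("month_id") == 202105)
       | (ds.field("month_id") == 202106)
       | (ds.field("month_id") == 202107)
   )
   # Both select the same rows; on materialized data the is_in form avoids
   # building intermediate boolean arrays, which is the efficiency point above.
   ```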
   




[GitHub] [arrow] ldacey commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
ldacey commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1315331226

   This works for me; I can run the functools reduce (it simplifies what I have).
   
   Some things I want to make sure of:
   
   ```
   paths = [
        'path/to/data/month_id=202105/part-0.parquet',
        'path/to/data/month_id=202106/part-0.parquet',
        'path/to/data/month_id=202107/part-0.parquet',
        'path/to/data/month_id=202107/part-1.parquet',
        'path/to/data/month_id=202107/part-2.parquet',
   ]
   reduce(operator.or_, [partitioning.parse(file) for file in paths])
   
   <pyarrow.compute.Expression (((((month_id == 202105) or (month_id == 202106)) or 
   (month_id == 202107)) or (month_id == 202107)) or (month_id == 202107))>
   ```
   
   
   1. Do I need to worry about there being multiple filters for 202107 in that expression? In my example, there are 3 fragments in that partition.
   2. There is no benefit in using `.isin()` for a single partition compared to `or`, right? In terms of performance/efficiency they are the same?
   
   ```
   <pyarrow.compute.Expression is_in(month_id, {value_set=int64:[
     202105,
     202106,
     202107
   ], skip_nulls=false})>
   ```




[GitHub] [arrow] github-actions[bot] commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1314850629

   https://issues.apache.org/jira/browse/ARROW-15716




[GitHub] [arrow] vibhatha commented on pull request #14641: ARROW-15716: [Dataset][Python] Parse a list of fragment paths to gather filters

Posted by GitBox <gi...@apache.org>.
vibhatha commented on PR #14641:
URL: https://github.com/apache/arrow/pull/14641#issuecomment-1316485067

   Should we close this PR? I also think handling this externally using the existing APIs is fine for the job.

