Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/23 11:23:18 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

jorisvandenbossche opened a new pull request #7523:
URL: https://github.com/apache/arrow/pull/7523


   Not a polished PR, just a quick try (in Cython, since that's faster for me) to expose the RowGroupInfo statistics in Python and to convert the expression into min/max information. It's more food for discussion for now.
   
   What this enables:
   
   ```
   In [1]: import pyarrow.dataset as ds 
      ...: dataset = ds.parquet_dataset("test_parquet_dask/_metadata", partitioning="hive") 
      ...: fragment = list(dataset.get_fragments())[0]  
      ...: rg = fragment.row_groups[0]  
   
   In [2]: ds.get_min_max_statistics(rg.statistics) 
   Out[2]: 
   [{'col': {'min': -1.43563008497128}},
    {'col': {'max': 1.2929964609736964}},
    {'index': {'min': 335}},
    {'index': {'max': 359}}]
   ```
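
   For comparison, the same per-row-group min/max information can already be read from the Parquet footer via the existing `pyarrow.parquet` metadata API. A rough sketch (the file name and data are made up for illustration):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Write a small file with two row groups of two rows each.
   table = pa.table({"col": [1.5, -0.5, 2.0, 0.25],
                     "index": [335, 340, 350, 359]})
   pq.write_table(table, "example.parquet", row_group_size=2)

   # Walk the footer metadata and collect min/max per column chunk.
   meta = pq.ParquetFile("example.parquet").metadata
   stats = []
   for rg in range(meta.num_row_groups):
       row_group = meta.row_group(rg)
       for c in range(row_group.num_columns):
           col = row_group.column(c)
           if col.statistics is not None and col.statistics.has_min_max:
               stats.append({col.path_in_schema:
                             {"min": col.statistics.min,
                              "max": col.statistics.max}})
   print(stats)
   ```

   The proposed `get_min_max_statistics` would give the same information without re-opening each file, since the statistics are already attached to the fragment's row groups.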


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-649644745


   @rjzamora indeed, something like that. I am not sure you need to keep track of the path as well, unless perhaps to make this work with the existing functions that determine sorted columns from these statistics (but that's more something to discuss on the Dask issue/PR).
   
   In the meantime, @bkietz has done a more complete implementation -> https://github.com/apache/arrow/pull/7546
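   
   As an aside, determining "sorted columns" from such per-row-group statistics reduces to checking that consecutive [min, max] ranges do not overlap. A hypothetical sketch (the list-of-dicts shape is assumed for illustration, not part of this PR):
   
   ```python
   def is_globally_sorted(stats):
       """True if each row group's min is >= the previous row group's max,
       i.e. consecutive [min, max] ranges appear in non-overlapping order."""
       return all(
           stats[i]["min"] >= stats[i - 1]["max"]
           for i in range(1, len(stats))
       )

   print(is_globally_sorted([{"min": 0, "max": 9}, {"min": 10, "max": 19}]))
   print(is_globally_sorted([{"min": 0, "max": 12}, {"min": 10, "max": 19}]))
   ```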





[GitHub] [arrow] rjzamora edited a comment on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

Posted by GitBox <gi...@apache.org>.
rjzamora edited a comment on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-648269136


   Thanks for working on this @jorisvandenbossche !
   
   This does seem like the functionality needed by Dask.  To test my understanding (and for the sake of discussion), I am imagining something (roughly) like the following in Dask to collect row-group statistics (note that I am using pyarrow-0.17.1 from conda, so the `get_row_group_fragments` call would be replaced):
   
   ```python
   from collections import defaultdict
   import json
   import pandas as pd
   import pyarrow.dataset as pds
   from string import ascii_lowercase as letters
   
   path = "simple.pdf"
   df0 = pd.DataFrame(
       {"x": range(26), "myindex": list(letters)}
   ).set_index("myindex")
   df0.to_parquet(path, engine="pyarrow", row_group_size=10)
   ds = pds.dataset(path)
   
   # Need index_cols to be specified by the user or encoded in
   # the "pandas" metadata. Otherwise, we will not bother
   # to infer an index column (and won't need statistics).
   index_cols = json.loads(
       ds.schema.metadata[b"pandas"].decode("utf8")
   )["index_columns"]
   filter = None # Some user-defined filter
   
   # Collect path and statistics for each row-group
   metadata = defaultdict(list)
   for file_frag in ds.get_fragments(filter=filter):
       for rg_frag in file_frag.get_row_group_fragments():
           for rg in rg_frag.row_groups:
                stats = pds.get_min_max_statistics(rg.statistics)
               metadata[rg_frag.path].append((<rg-index>, stats))
   ```
   
   In this case, the resulting `metadata` object would be something like:
   ```
   defaultdict(list,
               {'simple.pdf': [(0, <stats-for-rg-0>),
                 (1, <stats-for-rg-1>),
                 (2, <stats-for-rg-2>)]})
   ```
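   
   Building on that shape, pruning row groups against a simple range filter becomes a plain scan over the collected metadata. A hypothetical sketch (each stats entry is flattened here to one `{'min': ..., 'max': ...}` dict per column, slightly simpler than the output shown above):
   
   ```python
   def select_row_groups(metadata, column, lower, upper):
       """Keep row groups whose [min, max] range for `column` can
       overlap the closed interval [lower, upper]."""
       selected = {}
       for path, row_groups in metadata.items():
           keep = [rg_index for rg_index, stats in row_groups
                   if stats[column]["min"] <= upper
                   and stats[column]["max"] >= lower]
           if keep:
               selected[path] = keep
       return selected

   metadata = {
       "simple.pdf": [
           (0, {"myindex": {"min": "a", "max": "j"}}),
           (1, {"myindex": {"min": "k", "max": "t"}}),
           (2, {"myindex": {"min": "u", "max": "z"}}),
       ],
   }
   print(select_row_groups(metadata, "myindex", "h", "m"))
   ```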





[GitHub] [arrow] jorisvandenbossche closed pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed pull request #7523:
URL: https://github.com/apache/arrow/pull/7523


   





[GitHub] [arrow] github-actions[bot] commented on pull request #7523: ARROW-8733: [Python][Dataset] Expose statistics of ParquetFileFragment::RowGroupInfo

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7523:
URL: https://github.com/apache/arrow/pull/7523#issuecomment-648087751


   https://issues.apache.org/jira/browse/ARROW-8733

