Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/25 15:35:36 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche opened a new pull request #7545:
URL: https://github.com/apache/arrow/pull/7545


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658728008


   The existing tests are already failing (the reproducible snippets above were based on them), *if* dictionary encoding gets enabled.
   But I can write a `pyarrow.dataset`-specific test that captures the failure as well. Opened https://issues.apache.org/jira/browse/ARROW-9476 for the bug.
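
   A minimal sketch of what such a test could look like, based on the reproducers elsewhere in this thread (the test name and use of pytest's `tmpdir` fixture are illustrative):

   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq


   def test_hive_partitioning_dictionary_keys(tmpdir):
       # two partition levels with differing key sets (the failing case)
       table = pa.table({
           'foo': np.array([0, 1, 3]).repeat(10),
           'bar': np.tile(np.tile(np.array(['a', 'b', 'c'], dtype=object), 5), 2),
           'values': np.random.randn(30),
       })
       base_path = str(tmpdir)
       pq.write_to_dataset(table, base_path, partition_cols=['bar', 'foo'])

       part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
       # currently raises ArrowInvalid (ARROW-9476); should succeed once fixed
       ds.dataset(base_path, partitioning=part).to_table()
   ```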


----------------------------------------------------------------



[GitHub] [arrow] wesm commented on a change in pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

wesm commented on a change in pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#discussion_r454456081



##########
File path: python/pyarrow/parquet.py
##########
@@ -1116,6 +1111,11 @@ class ParquetDataset:
     dataset metadata. Increasing this is helpful to read partitioned
     datasets.
 {0}
+use_legacy_dataset : bool, default True

Review comment:
       I see, thanks




----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on a change in pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#discussion_r450744941



##########
File path: python/pyarrow/fs.py
##########
@@ -63,6 +63,31 @@ def __getattr__(name):
     )
 
 
+def _ensure_filesystem(filesystem, use_mmap=False):
+    if isinstance(filesystem, FileSystem):
+        return filesystem
+
+    # handle fsspec-compatible filesystems
+    try:
+        import fsspec
+        if isinstance(filesystem, fsspec.AbstractFileSystem):

Review comment:
       @martindurant is this the best way to check if a given object is an fsspec filesystem?
   
   (I suppose that in theory "fsspec-compatible" doesn't mean it needs to subclass from fsspec, since it's the spec (the methods) that needs to be implemented. But I don't think we can detect "duck-typed" filesystems.)




----------------------------------------------------------------



[GitHub] [arrow] martindurant commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

martindurant commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658917048


   (please please do ensure that dict encoding does happen, at least for str)


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649631989


   Still some work:
   
   Need to add tests for the different filesystems that can be passed.
   
   There are still some skipped tests:
   
   * `ARROW:schema` is not yet removed from the metadata -> ARROW-9009
   * Partition fields as dictionary keys
   * Specifying `metadata` object (not very important IMO)
   
   One of the `large_memory` tests is also failing (`test_binary_array_overflow_to_chunked`):
   
   ```
   $ pytest python/pyarrow/tests/test_parquet.py -v -r s -m large_memory --enable-large_memory
   =============================================================================================== test session starts ===============================================================================================
   platform linux -- Python 3.7.3, pytest-5.2.1, py-1.8.0, pluggy-0.12.0 -- /home/joris/miniconda3/envs/arrow-dev/bin/python
   cachedir: .pytest_cache
   hypothesis profile 'dev' -> max_examples=10, database=DirectoryBasedExampleDatabase('/home/joris/scipy/repos/arrow/.hypothesis/examples')
   rootdir: /home/joris/scipy/repos/arrow/python, inifile: setup.cfg
   plugins: hypothesis-4.47.5, lazy-fixture-0.6.1
   collected 277 items / 273 deselected / 4 selected                                                                                                                                                                 
   
   python/pyarrow/tests/test_parquet.py::test_large_table_int32_overflow PASSED                                                                                                                                [ 25%]
   python/pyarrow/tests/test_parquet.py::test_byte_array_exactly_2gb PASSED                                                                                                                                    [ 50%]
   python/pyarrow/tests/test_parquet.py::test_binary_array_overflow_to_chunked FAILED                                                                                                                          [ 75%]
   python/pyarrow/tests/test_parquet.py::test_list_of_binary_large_cell PASSED                                                                                                                                 [100%]
   
   ==================================================================================================== FAILURES =====================================================================================================
   ______________________________________________________________________________________ test_binary_array_overflow_to_chunked ______________________________________________________________________________________
   
               assert t.equals(result)
       
       
       @pytest.mark.pandas
       @pytest.mark.large_memory
       def test_binary_array_overflow_to_chunked():
           # ARROW-3762
       
           # 2^31 + 1 bytes
           values = [b'x'] + [
               b'x' * (1 << 20)
           ] * 2 * (1 << 10)
   >       df = pd.DataFrame({'byte_col': values})
   
   python/pyarrow/tests/test_parquet.py:3043: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   python/pyarrow/tests/test_parquet.py:3010: in _simple_table_roundtrip
       stream = pa.BufferOutputStream()
   python/pyarrow/tests/test_parquet.py:82: in _read_table
       return pq.read_table(*args, **kwargs)
   python/pyarrow/parquet.py:1555: in read_table
       raise ValueError(
   python/pyarrow/parquet.py:1468: in read
       use_threads=use_threads
   pyarrow/_dataset.pyx:403: in pyarrow._dataset.Dataset.to_table
       ???
   pyarrow/_dataset.pyx:1893: in pyarrow._dataset.Scanner.to_table
       ???
   pyarrow/error.pxi:122: in pyarrow.lib.pyarrow_internal_check_status
       ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   pyarrow.lib.ArrowNotImplementedError: This class cannot yet iterate chunked arrays
   
   pyarrow/error.pxi:105: ArrowNotImplementedError
   ============================================================================= 1 failed, 3 passed, 273 deselected in 512.87s (0:08:32) =============================================================================
   ```


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658396423


   Note that the new option with the datasets API only dictionary encodes string partition fields, not integer partition fields. So it would still not keep exactly the same behaviour as we had before.


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658004350


   > I'm not sure what to say about the handling of the partition fields. Is it ok to accept things as they are for now?
   
   I think that might be OK, yes (it means there is a change in behaviour: partition fields are no longer returned as dictionary type). But maybe we should at least provide the option to get back the old behaviour? With `pyarrow.dataset` this is possible, but with the current PR it is not possible in `read_table`.
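
   For illustration, a minimal sketch of how the old behaviour can be recovered through `pyarrow.dataset` (using the `max_partition_dictionary_size` option from the reproducers later in this thread, and modulo the ARROW-9476 bug; the path is hypothetical):

   ```python
   import pyarrow.dataset as ds

   # -1 disables the size limit, so discovered partition keys are
   # always loaded as dictionary type
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   dataset = ds.dataset("path/to/partitioned_dataset", partitioning=part)
   table = dataset.to_table()  # partition fields come back as dictionary type
   ```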


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-652875909


   Additional question: do we need to care about the case where the dataset module is not built? (which actually is the case on the Ursabot builds)
   
   Right now, you can use `pq.read_table` without the `pyarrow.dataset` module being built. With this PR, that will stop working. However, for single files without advanced features (e.g. `filters` specified), we could simply fall back to `ParquetFile(..).read()` to keep `pq.read_table` working for this simple case without `pyarrow.dataset` being available.
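
   A rough sketch of what that fallback could look like (illustrative only; the actual dispatch in `parquet.py` would differ):

   ```python
   import pyarrow.parquet as pq

   def read_table(source, columns=None, filters=None, **kwargs):
       try:
           import pyarrow.dataset  # noqa: F401
           have_dataset = True
       except ImportError:
           have_dataset = False

       if not have_dataset:
           if filters is not None:
               raise ValueError(
                   "the 'filters' keyword requires the pyarrow.dataset "
                   "module to be built")
           # single file, no advanced features: plain ParquetFile path
           return pq.ParquetFile(source).read(columns=columns)
       ...  # datasets-based code path
   ```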
   
   cc @wesm 


----------------------------------------------------------------



[GitHub] [arrow] nealrichardson commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

nealrichardson commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658507782


   > I think the rationale is that the memory and performance savings related to materializing the partition columns are most significant with string data. So it's definitely beneficial to return them as dictionary types.
   
   Right, my understanding from Joris's last comment was that this was already converting strings to dictionaries, which seems like a reasonable (though not mandatory) choice, and that the hangup was whether it was essential to also do that for ints.
   
   I guess the other workaround, if people aren't happy with the choice here, is to set `use_legacy_dataset=True`, so I agree that it's not the end of the world if the choice we make about dictionaries today turns out not to be optimal. But we should merge this so that the default is to use the datasets API, and we can learn where exactly we were mistaken.
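
   For reference, that workaround is a one-liner (the path is hypothetical):

   ```python
   import pyarrow.parquet as pq

   # opt back into the old, non-datasets code path
   table = pq.read_table("path/to/dataset", use_legacy_dataset=True)
   ```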


----------------------------------------------------------------



[GitHub] [arrow] bkietz closed pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

bkietz closed pull request #7545:
URL: https://github.com/apache/arrow/pull/7545


   


----------------------------------------------------------------



[GitHub] [arrow] wesm commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

wesm commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658256892


   I'm fine with whatever you decide but it would be good to merge this today


----------------------------------------------------------------



[GitHub] [arrow] github-actions[bot] commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

github-actions[bot] commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-649636228


   https://issues.apache.org/jira/browse/ARROW-9139


----------------------------------------------------------------



[GitHub] [arrow] nealrichardson commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

nealrichardson commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658440946


   Short version: I agree with Wes: do what you think best but let's merge something.
   
   Longer thoughts: 
   
   If I understand correctly, the difference is that integer partition fields are encoded as dictionaries by the old function but come out as plain ints in the new one? Semantically, they should behave the same though, right?
   
   It is possible to cast an integer column to dictionary (there is a `dictionary_encode` function), so if someone felt strongly about it, they could make their partition fields dictionaries after the fact, as sketched below.
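
   A minimal sketch of that after-the-fact cast (the column name is made up):

   ```python
   import pyarrow as pa

   table = pa.table({'part': [0, 0, 1, 1, 2],
                     'values': [1.0, 2.0, 3.0, 4.0, 5.0]})
   idx = table.schema.get_field_index('part')
   encoded = table.set_column(idx, 'part',
                              table.column('part').dictionary_encode())
   print(encoded.column('part').type)
   # dictionary<values=int64, indices=int32, ordered=0>
   ```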
   
   I caveat this by confessing that I've never personally used this function, but on face value I don't see why partition fields have to be dictionary type, particularly if we're only talking about integers. So if the main reason for adding something to force them to be dictionaries is to conform exactly with an old API, I'm not sure it's worth it. Maybe that's the wrong call, but since there is a workaround, why not try it without coercing everything to dictionary and see what reaction we get? 


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-652971061


   I opened https://issues.apache.org/jira/browse/ARROW-9297 for the large_memory issue (BinaryArray overflow to chunked array works for parquet.read_table, but not in dataset scanner)


----------------------------------------------------------------



[GitHub] [arrow] martindurant commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

martindurant commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658919780


   Thank you!


----------------------------------------------------------------



[GitHub] [arrow] martindurant commented on a change in pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

martindurant commented on a change in pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#discussion_r450835383



##########
File path: python/pyarrow/fs.py
##########
@@ -63,6 +63,31 @@ def __getattr__(name):
     )
 
 
+def _ensure_filesystem(filesystem, use_mmap=False):
+    if isinstance(filesystem, FileSystem):
+        return filesystem
+
+    # handle fsspec-compatible filesystems
+    try:
+        import fsspec
+        if isinstance(filesystem, fsspec.AbstractFileSystem):

Review comment:
       Agree completely. So far, all implementations I know about are subclasses, but it's not required. I have not yet got around to implementing an entrypoints way to declare implementations.
   So, you could try to look for some specific methods, but I'm not sure what we could guarantee was specific to fsspec.
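
   For illustration, such a duck-typed check might look like the sketch below (a hypothetical `_is_fsspec_like` helper; as noted, none of these method names is guaranteed to be unique to fsspec):

   ```python
   def _is_fsspec_like(filesystem):
       """Heuristic duck-type check for an fsspec-style filesystem."""
       # methods from the fsspec AbstractFileSystem interface
       required = ('open', 'ls', 'isfile', 'isdir', 'mkdir')
       return all(hasattr(filesystem, attr) for attr in required)
   ```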




----------------------------------------------------------------



[GitHub] [arrow] wesm commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

wesm commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658465856


   I think the rationale is that the memory and performance savings related to materializing the partition columns are mostly related to string data. So it's definitely beneficial to return them as dictionary types.
   
   IMHO if there is a change from dictionary/dense required post-1.0.0 it is not the end of the world, so I'm OK either with merging this as is or changing all partition types to be dictionary. 


----------------------------------------------------------------



[GitHub] [arrow] pitrou commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

pitrou commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658721203


   @jorisvandenbossche Can you at least add minimal failing tests in the PR?


----------------------------------------------------------------



[GitHub] [arrow] wesm commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

wesm commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-657140613


   Where does this PR stand? It needs a rebase


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-657186524


   Rebased. 
   
   This depends on https://github.com/apache/arrow/pull/7704 (ARROW-9297) for fixing the large_memory failure noted above (https://github.com/apache/arrow/pull/7545#issuecomment-649631989).
   
   In addition, we should probably also decide whether we want to use dictionary type for the (string) partition fields or not. Right now we do (actually not only for strings, but also for integers), whereas the default with the datasets API is to use the plain (string or int) type. We could specify an option to keep the existing behaviour for `parquet.read_table` (although that also creates an inconsistency between `pyarrow.dataset` and the datasets-backed `pyarrow.parquet`).


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658720185


   A slightly simplified example:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds 
   
   foo_keys = np.array([0, 1, 3])
   bar_keys = np.array(['a', 'b', 'c'], dtype=object)
   N = 30
   
   table = pa.table({
       'foo': foo_keys.repeat(10),
       'bar': np.tile(np.tile(bar_keys, 5), 2),
       'values': np.random.randn(N)
   })
   
   base_path = "test_partition_directories3"
   pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])
   
   # works
   ds.dataset(base_path, partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset(base_path, partitioning=part)
   ```
   
   This also fails, with "ArrowInvalid: No dictionary provided for dictionary field bar: dictionary<values=string, indices=int32, ordered=0>" (so a slightly different error message).
   
   From playing with different keys for foo/bar, it seems that it might be trying to use the dictionary of the first field to parse the values of the second field (this might be a bug in my fix for HivePartitioning).
   
   Because replacing the keys with:
   
   ```python
   foo_keys = np.array(['a', 'b', 'c'], dtype=object)
   bar_keys = np.array(['a', 'b', 'c'], dtype=object)
   ```
   
   works, while this
   
   ```python
   foo_keys = np.array(['a', 'b', 'c'], dtype=object) 
   bar_keys = np.array(['e', 'f', 'g'], dtype=object) 
   ```
   
   fails with "Dictionary supplied for field bar: dictionary<values=string, indices=int32, ordered=0> does not contain 'e'"


----------------------------------------------------------------



[GitHub] [arrow] wesm edited a comment on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

wesm edited a comment on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658465856


   I think the rationale is that the memory and performance savings related to materializing the partition columns are most significant with string data. So it's definitely beneficial to return them as dictionary types.
   
   IMHO if there is a change from dictionary/dense required post-1.0.0 it is not the end of the world, so I'm OK either with merging this as is or changing all partition types to be dictionary. 


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658714201


   When enabling dictionary encoding for string partition fields, there are actually a bunch of failing tests.
   
   E.g. this one (based on `test_read_partitioned_directory`):
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   foo_keys = [0, 1]
   bar_keys = ['a', 'b', 'c']
   partition_spec = [
       ['foo', foo_keys],
       ['bar', bar_keys]
   ]
   N = 30
   
   df = pd.DataFrame({
       'index': np.arange(N),
       'foo': np.array(foo_keys, dtype='i4').repeat(15),
       'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
       'values': np.random.randn(N)
   }, columns=['index', 'foo', 'bar', 'values'])
   
   from pyarrow.tests.test_parquet import _generate_partition_directories
   fs = pa.filesystem.LocalFileSystem()
   _generate_partition_directories(fs, "test_partition_directories", partition_spec, df)
   
   # works
   ds.dataset("test_partition_directories/", partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset("test_partition_directories/", partitioning=part)
   ```
   
   fails with 
   
   ```
   ArrowInvalid: Dictionary supplied for field bar: dictionary<values=string, indices=int32, ordered=0> does not contain 'c'
   In ../src/arrow/dataset/partition.cc, line 55, code: (_error_or_value13).status()
   In ../src/arrow/dataset/discovery.cc, line 243, code: (_error_or_value16).status()
   ```
   
   Another reproducible example (based on `test_write_to_dataset_with_partitions`) giving a similar error:
   
   ```python
   import numpy as np
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                             'group2': list('eefeffgeee'),
                             'num': list(range(10)),
                             'nan': [np.nan] * 10,
                             'date': np.arange('2017-01-01', '2017-01-11',
                                               dtype='datetime64[D]')})
   cols = output_df.columns.tolist()
   partition_by = ['group1', 'group2']
   output_table = pa.Table.from_pandas(output_df, safe=False,
                                       preserve_index=False)
   filesystem = pa.filesystem.LocalFileSystem() 
   base_path = "test_partition_directories2/"
   pq.write_to_dataset(output_table, base_path, partition_by,
                       filesystem=filesystem)
   
   # works
   ds.dataset("test_partition_directories2/", partitioning="hive")
   # fails
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset("test_partition_directories2/", partitioning=part)
   ```
   
   I couldn't yet figure out why it is failing in those cases, though.
   
   
   I should have tested the dictionary encoding feature more thoroughly earlier, sorry about that.
   But with the current state (unless someone can fix it today, but I don't have much time), it seems the choice is quite simple: merge as is without dictionary encoding, or delay until after 1.0.
   


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658921634


   -> https://github.com/apache/arrow/pull/7777
   
   (Note that this will only impact direct users of `read_table` with partitioned datasets, which e.g. does not include dask, but I suppose it might impact pandas users.)
   
   @bkietz will you merge this?


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on a change in pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#discussion_r454140244



##########
File path: python/pyarrow/parquet.py
##########
@@ -1116,6 +1111,11 @@ class ParquetDataset:
     dataset metadata. Increasing this is helpful to read partitioned
     datasets.
 {0}
+use_legacy_dataset : bool, default True

Review comment:
       This is the docstring of ParquetDataset, for which this PR still keeps the old default. It's only for `read_table` that the default is switching, since there the impact on the API is minimal (it simply returns a table), while for ParquetDataset that would not be the case.




----------------------------------------------------------------



[GitHub] [arrow] martindurant commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

martindurant commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658254501


   I would advocate for yes, keeping dict encoding for the partitioning columns: I think it's the obvious mapping, and of course it saves a lot of memory/processing. Part of the reason for partitioning is that the dict-like view matches the data well.


----------------------------------------------------------------



[GitHub] [arrow] wesm commented on a change in pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

wesm commented on a change in pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#discussion_r454001248



##########
File path: python/pyarrow/parquet.py
##########
@@ -1116,6 +1111,11 @@ class ParquetDataset:
     dataset metadata. Increasing this is helpful to read partitioned
     datasets.
 {0}
+use_legacy_dataset : bool, default True

Review comment:
       default False

##########
File path: python/pyarrow/parquet.py
##########
@@ -1116,6 +1111,11 @@ class ParquetDataset:
     dataset metadata. Increasing this is helpful to read partitioned
     datasets.
 {0}
+use_legacy_dataset : bool, default True
+    Set to False to enable the new code path (experimental, using the
+    new Arrow Dataset API). Among other things, this allows to pass
+    `filters` for all columns and not only the partition keys, enables
+    different partitioning schemes, etc.

Review comment:
       Seems somewhat out of date




----------------------------------------------------------------



[GitHub] [arrow] github-actions[bot] commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

github-actions[bot] commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658807841


   Revision: 5d25c02ae678657c149fa307010339c43656eff6
   
   Submitted crossbow builds: [ursa-labs/crossbow @ actions-422](https://github.com/ursa-labs/crossbow/branches/all?query=actions-422)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.6-pandas-0.23|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-422-github-test-conda-python-3.6-pandas-0.23)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-422-github-test-conda-python-3.6-pandas-0.23)|
   |test-conda-python-3.7-dask-latest|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-422-github-test-conda-python-3.7-dask-latest)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-422-github-test-conda-python-3.7-dask-latest)|
   |test-conda-python-3.7-kartothek-latest|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-422-github-test-conda-python-3.7-kartothek-latest)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-422-github-test-conda-python-3.7-kartothek-latest)|
   |test-conda-python-3.7-kartothek-master|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-422-github-test-conda-python-3.7-kartothek-master)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-422-github-test-conda-python-3.7-kartothek-master)|
   |test-conda-python-3.7-pandas-master|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-422-github-test-conda-python-3.7-pandas-master)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-422-github-test-conda-python-3.7-pandas-master)|
   |test-conda-python-3.8-pandas-latest|[![Github Actions](https://github.com/ursa-labs/crossbow/workflows/Crossbow/badge.svg?branch=actions-422-github-test-conda-python-3.8-pandas-latest)](https://github.com/ursa-labs/crossbow/actions?query=branch:actions-422-github-test-conda-python-3.8-pandas-latest)|


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658806394


   @github-actions crossbow submit test-conda-python-3.7-pandas-master test-conda-python-3.7-kartothek-master test-conda-python-3.7-kartothek-latest test-conda-python-3.7-dask-latest test-conda-python-3.6-pandas-0.23 test-conda-python-3.8-pandas-latest


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658919336


   @martindurant I am doing the follow-up PR as we speak


----------------------------------------------------------------



[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658662358


   To clarify:
   
   - The current PR doesn't use dictionary encoding for any type of partition field, so not for strings either
   - For strings I could rather easily add it (it's an option in the datasets API that can be set)
   - For ints it's not actually possible as long as the datasets API doesn't support it (dictionary encoding the ints after reading is possible, but won't necessarily give you all unique values in the dictionary if you applied a filter; see the small illustration below)
   
   I will at least quickly experiment with enabling the dictionary encoding, or providing an option for it.
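
   A small illustration of that caveat (the values are made up):

   ```python
   import pyarrow as pa

   # pretend this is the 'foo' partition column after filtering out foo == 0:
   # only the values that survive the filter end up in the dictionary
   col = pa.chunked_array([pa.array([1, 1, 3, 3])])
   encoded = col.dictionary_encode()
   print(encoded.chunk(0).dictionary)  # [1, 3] -- the 0 partition is absent
   ```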


----------------------------------------------------------------