Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/15 19:24:54 UTC

[GitHub] [arrow] wyzhao opened a new pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

wyzhao opened a new pull request #8672:
URL: https://github.com/apache/arrow/pull/8672


   I would like to enhance the partition filters in methods such as:
   
   pyarrow.parquet.ParquetDataset(path, filters)
   
   I am proposing the enhancements below:
   
   1. For the operators "in" and "not in", the value should be any iterable that is also a container. Currently only set is supported, while other iterables such as list and tuple do not work correctly. I would like to change this to accept any such iterable.
   
   2. Enhance the documentation of the partition filters.
   
   3. When no partition satisfies the filters, raise an exception with a meaningful error message.
   
   I see there is a new implementation, _ParquetDatasetV2, which already passed my tests with an iterable for "in" and "not in", so the documentation update applies to the new version as well.
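The first enhancement can be sketched without pyarrow. The helper name below is hypothetical and only mirrors the proposed check (the real change lives in `python/pyarrow/parquet.py`):

```python
from collections.abc import Container, Iterable

def check_in_filter_value(op, f_value):
    # Hypothetical helper illustrating point 1: any iterable container
    # (list, tuple, set, frozenset, ...) is accepted, not just set.
    if op in ('in', 'not in'):
        if not (isinstance(f_value, Container) and
                isinstance(f_value, Iterable)):
            raise TypeError(
                "Op '%s' requires an iterable such as list, set or tuple"
                % op)
        if not f_value:
            raise ValueError("Cannot use empty iterable as filter value")
    return f_value

# lists, tuples and sets are all accepted
check_in_filter_value('in', [1])
check_in_filter_value('in', ('a', 'b'))
check_in_filter_value('not in', {False})
```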


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#issuecomment-742543077


   @wyzhao thanks for the update! I am waiting until https://github.com/apache/arrow/pull/8816 gets merged, and then will get back to this PR (this will give a merge conflict, but I can handle that)





[GitHub] [arrow] github-actions[bot] commented on pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#issuecomment-727623661


   https://issues.apache.org/jira/browse/ARROW-10574





[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r529065886



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2008,6 +2008,19 @@ def test_filters_inclusive_set(tempdir, use_legacy_dataset):
     assert 'c' not in result_df['string'].values
     assert False not in result_df['boolean'].values
 
+    dataset = pq.ParquetDataset(
+        base_path, filesystem=fs,
+        filters=[('integer', 'in', [1]), ('string', 'in', ('a', 'b')),
+                 ('boolean', 'not in', {False})],
+        use_legacy_dataset=use_legacy_dataset
+    )
+    table = dataset.read()
+    result_df = (table.to_pandas().reset_index(drop=True))
+
+    assert 0 not in result_df['integer'].values
+    assert 'c' not in result_df['string'].values
+    assert False not in result_df['boolean'].values
+

Review comment:
       There is a test case: `[('integers', 'in', 3)]`. It raises a TypeError, as in the new dataset API.







[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r529065377



##########
File path: python/pyarrow/parquet.py
##########
@@ -871,21 +889,34 @@ def filter_accepts_partition(self, part_key, filter, level):
         if p_column != f_column:
             return True
 
-        f_type = type(f_value)
+        p_value = self.levels[level].dictionary[p_value_index].as_py()
 
-        if isinstance(f_value, set):
+        if op in {'in', 'not in'}:
+            if not (isinstance(f_value, Container) and
+                    isinstance(f_value, Iterable)):

Review comment:
       Now `('part', 'in', 'abc')` will be True for 'a', 'b', 'ab' or 'abc', since string membership matches substrings. There is a new test case.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r527031173



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2008,6 +2008,19 @@ def test_filters_inclusive_set(tempdir, use_legacy_dataset):
     assert 'c' not in result_df['string'].values
     assert False not in result_df['boolean'].values
 
+    dataset = pq.ParquetDataset(
+        base_path, filesystem=fs,
+        filters=[('integer', 'in', [1]), ('string', 'in', ('a', 'b')),
+                 ('boolean', 'not in', {False})],
+        use_legacy_dataset=use_legacy_dataset
+    )
+    table = dataset.read()
+    result_df = (table.to_pandas().reset_index(drop=True))
+
+    assert 0 not in result_df['integer'].values
+    assert 'c' not in result_df['string'].values
+    assert False not in result_df['boolean'].values
+

Review comment:
       Can you also add a test to assert the error you get when passing an invalid "in" filter (like `("col", "in", 4)`)







[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r529066887



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -1997,8 +1997,20 @@ def test_filters_inclusive_set(tempdir, use_legacy_dataset):
 
     dataset = pq.ParquetDataset(
         base_path, filesystem=fs,
-        filters=[('integer', 'in', {1}), ('string', 'in', {'a', 'b'}),
-                 ('boolean', 'in', {True})],
+        filters=[('string', 'in', 'ab')],

Review comment:
       `('string', 'in', 'ab')` checks whether the partition value is in the string 'ab', so 'ab' itself is accepted (substring membership).
   `('string', 'in', {'a', 'b'})` checks whether the partition value is in the set {'a', 'b'}, so 'ab' is not accepted.
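The difference comes from plain Python membership semantics, which this self-contained check illustrates:

```python
# Membership against a string matches substrings; membership against
# a set matches whole elements.
assert 'ab' in 'abc'            # substring: accepted
assert 'ab' not in {'a', 'b'}   # not a set element: rejected
assert 'a' in 'abc'
assert 'a' in {'a', 'b'}
```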







[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r527096473



##########
File path: python/pyarrow/parquet.py
##########
@@ -1237,6 +1264,8 @@ def validate_schemas(self):
             if self.common_metadata is not None:
                 self.schema = self.common_metadata.schema
             else:
+                if not self.pieces:
+                    raise ValueError('No partition is found.')

Review comment:
       If no partition satisfies the filters, the next line throws an IndexError:
               self.schema = self.pieces[0].get_metadata().schema
   I added the two lines to make the error more meaningful.
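A stripped-down reproduction of the failure mode and the proposed guard (`pieces` stubbed as a plain list; names are illustrative):

```python
def validate_schemas(pieces):
    # Proposed guard: fail early with a readable message instead of
    # letting the subscript on the next line raise a bare IndexError.
    if not pieces:
        raise ValueError('No partition is found.')
    return pieces[0]  # stands in for self.pieces[0].get_metadata().schema

try:
    validate_schemas([])
except ValueError as exc:
    assert str(exc) == 'No partition is found.'
```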
   







[GitHub] [arrow] jorisvandenbossche closed pull request #8672: ARROW-10574: [Python][Parquet] Allow collections for 'in' / 'not in' filter (in addition to sets)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed pull request #8672:
URL: https://github.com/apache/arrow/pull/8672


   





[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r532378480



##########
File path: python/pyarrow/parquet.py
##########
@@ -877,17 +894,22 @@ def filter_accepts_partition(self, part_key, filter, level):
 
         f_type = type(f_value)
 
-        if isinstance(f_value, set):
+        if op in {'in', 'not in'}:
+            if not (isinstance(f_value, Container) and
+                    isinstance(f_value, Iterable)):

Review comment:
       `Collection` is not supported in Python 3.5, so I rolled back to `Container` and `Iterable`.







[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r529077937



##########
File path: python/pyarrow/parquet.py
##########
@@ -877,17 +894,22 @@ def filter_accepts_partition(self, part_key, filter, level):
 
         f_type = type(f_value)
 
-        if isinstance(f_value, set):
+        if op in {'in', 'not in'}:
+            if not (isinstance(f_value, Container) and
+                    isinstance(f_value, Iterable)):

Review comment:
       Hi Joris, I changed it to use `Collection`, but the build failed for Python 3.5, which does not support `Collection`.
   What is your take on this?
   Thanks.







[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r529069224



##########
File path: python/pyarrow/parquet.py
##########
@@ -877,17 +894,22 @@ def filter_accepts_partition(self, part_key, filter, level):
 
         f_type = type(f_value)
 
-        if isinstance(f_value, set):
+        if op in {'in', 'not in'}:
+            if not (isinstance(f_value, Container) and
+                    isinstance(f_value, Iterable)):

Review comment:
       Good idea. 







[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r529067072



##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2009,6 +2021,37 @@ def test_filters_inclusive_set(tempdir, use_legacy_dataset):
     assert False not in result_df['boolean'].values
 
 
+@pytest.mark.pandas
+@parametrize_legacy_dataset
+def test_filters_no_partition_found(tempdir, use_legacy_dataset):
+    fs = LocalFileSystem._get_instance()
+    base_path = tempdir
+
+    integer_keys = [0, 1]
+    string_keys = ['a', 'b', 'c']
+    boolean_keys = [True, False]
+    partition_spec = [
+        ['integer', integer_keys],
+        ['string', string_keys],
+        ['boolean', boolean_keys]
+    ]
+
+    df = pd.DataFrame({
+        'integer': np.array(integer_keys, dtype='i4').repeat(15),
+        'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2),
+        'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5),
+                           3),
+    }, columns=['integer', 'string', 'boolean'])
+
+    _generate_partition_directories(fs, base_path, partition_spec, df)
+
+    with pytest.raises(ValueError, match=r"No partition is found."):

Review comment:
       I removed this change; it will get a separate JIRA.







[GitHub] [arrow] jorisvandenbossche commented on pull request #8672: ARROW-10574: [Python][Parquet] Allow collections for 'in' / 'not in' filter (in addition to sets)

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#issuecomment-749071136


   Thanks @wyzhao !





[GitHub] [arrow] wyzhao commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r529060879



##########
File path: python/pyarrow/parquet.py
##########
@@ -1237,6 +1264,8 @@ def validate_schemas(self):
             if self.common_metadata is not None:
                 self.schema = self.common_metadata.schema
             else:
+                if not self.pieces:
+                    raise ValueError('No partition is found.')

Review comment:
       I have removed this change from this JIRA/pull request. I will create another pull request for this particular issue.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r528586857



##########
File path: python/pyarrow/parquet.py
##########
@@ -877,17 +894,22 @@ def filter_accepts_partition(self, part_key, filter, level):
 
         f_type = type(f_value)
 
-        if isinstance(f_value, set):
+        if op in {'in', 'not in'}:
+            if not (isinstance(f_value, Container) and
+                    isinstance(f_value, Iterable)):

Review comment:
       Could maybe also check for `Collection` instead of the combination of Container and Iterable? 

##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -2009,6 +2021,37 @@ def test_filters_inclusive_set(tempdir, use_legacy_dataset):
     assert False not in result_df['boolean'].values
 
 
+@pytest.mark.pandas
+@parametrize_legacy_dataset
+def test_filters_no_partition_found(tempdir, use_legacy_dataset):
+    fs = LocalFileSystem._get_instance()
+    base_path = tempdir
+
+    integer_keys = [0, 1]
+    string_keys = ['a', 'b', 'c']
+    boolean_keys = [True, False]
+    partition_spec = [
+        ['integer', integer_keys],
+        ['string', string_keys],
+        ['boolean', boolean_keys]
+    ]
+
+    df = pd.DataFrame({
+        'integer': np.array(integer_keys, dtype='i4').repeat(15),
+        'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2),
+        'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5),
+                           3),
+    }, columns=['integer', 'string', 'boolean'])
+
+    _generate_partition_directories(fs, base_path, partition_spec, df)
+
+    with pytest.raises(ValueError, match=r"No partition is found."):

Review comment:
       This only raises an error for the legacy dataset; the new implementation returns an empty table. So you will need to add an `if use_legacy_dataset: ...` here (see a similar case in the `test_filters_invalid_pred_op` test just below)
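The branching described above can be sketched without pyarrow; `read_with_filters` is a stand-in for `dataset.read()`, and in the real test the legacy branch would use `pytest.raises`:

```python
def read_with_filters(matching_pieces, use_legacy_dataset):
    # Stand-in for dataset.read(): the legacy implementation raises when
    # no partition matches, while the new one returns an empty table.
    if not matching_pieces:
        if use_legacy_dataset:
            raise ValueError('No partition is found.')
        return []  # empty table
    return matching_pieces

# legacy path: the test must expect the error
try:
    read_with_filters([], use_legacy_dataset=True)
except ValueError:
    pass

# new path: the test must expect an empty result
assert read_with_filters([], use_legacy_dataset=False) == []
```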

##########
File path: python/pyarrow/tests/test_parquet.py
##########
@@ -1997,8 +1997,20 @@ def test_filters_inclusive_set(tempdir, use_legacy_dataset):
 
     dataset = pq.ParquetDataset(
         base_path, filesystem=fs,
-        filters=[('integer', 'in', {1}), ('string', 'in', {'a', 'b'}),
-                 ('boolean', 'in', {True})],
+        filters=[('string', 'in', 'ab')],

Review comment:
       I am wondering, should we raise an error here instead? Because now this is interpreted as `("string", "in", {"a", "b"})`, which I am not sure we should "silently" do (it's not very explicit for the user)







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r527139269



##########
File path: python/pyarrow/parquet.py
##########
@@ -1237,6 +1264,8 @@ def validate_schemas(self):
             if self.common_metadata is not None:
                 self.schema = self.common_metadata.schema
             else:
+                if not self.pieces:
+                    raise ValueError('No partition is found.')

Review comment:
       Is there a test that covers this?







[GitHub] [arrow] wyzhao commented on pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
wyzhao commented on pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#issuecomment-739440882


   Hi Joris,
   
   Thank you for reviewing my work. I saw that Python 3.5 is no longer supported, so I changed back to using "Collection" as you suggested. I believe everything is taken care of. Please review.
   Thanks and best regards,
   
   Weiyang 





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8672: ARROW-10574: [Python][Parquet] Enhance hive partition filtering.

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8672:
URL: https://github.com/apache/arrow/pull/8672#discussion_r527019353



##########
File path: python/pyarrow/parquet.py
##########
@@ -114,7 +115,24 @@ def _check_filters(filters, check_null_strings=True):
 
     Predicates may also be passed as List[Tuple]. This form is interpreted
     as a single conjunction. To express OR in predicates, one must
-    use the (preferred) List[List[Tuple]] notation."""
+    use the (preferred) List[List[Tuple]] notation.
+
+    Each tuple has format: (``key``, ``op``, ``value``) and compares the
+    ``key`` with the ``value``.
+    The supported ``op`` are:  ``=`` or ``==``, ``!=``, ``<``, ``>``, ``<=``,
+    ``>=``, ``in`` and ``not in``. If the ``op`` is ``in`` or ``not in``, the
+    ``value`` must be an iterable such as a ``list``, a ``set`` or a ``tuple``.
+
+    Examples:
+
+    .. code-block:: python
+
+        ('x', '=', 0)
+        ('y', 'in', ['a', 'b', 'c'])
+        ('z', 'not in', {'a','b'})
+
+

Review comment:
       A small nit: you can remove a blank line here

##########
File path: python/pyarrow/parquet.py
##########
@@ -871,21 +889,34 @@ def filter_accepts_partition(self, part_key, filter, level):
         if p_column != f_column:
             return True
 
-        f_type = type(f_value)
+        p_value = self.levels[level].dictionary[p_value_index].as_py()
 
-        if isinstance(f_value, set):
+        if op in {'in', 'not in'}:
+            if not (isinstance(f_value, Container) and
+                    isinstance(f_value, Iterable)):

Review comment:
       Can you use the abstract types from `collections.abc`?
   Also, we need to ensure we don't accidentally allow a single string value this way (`("part", "in", "abc")` would then be interpreted like `("part", "in", ["a", "b", "c"])`, which I don't think we should do)
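The second point is easy to trip over: `str` satisfies all of these abstract checks, so an explicit guard is needed. A sketch of one way to exclude strings (illustrative only, not necessarily what the PR ended up doing):

```python
from collections.abc import Collection

# A plain string passes the abstract check...
assert isinstance("abc", Collection)

# ...so an ABC-only test would treat "abc" as a container of substrings.
# An explicit guard rules that out:
def is_valid_in_value(f_value):
    return isinstance(f_value, Collection) and not isinstance(f_value, str)

assert not is_valid_in_value("abc")
assert is_valid_in_value(["a", "b", "c"])
```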

##########
File path: python/pyarrow/parquet.py
##########
@@ -871,21 +889,34 @@ def filter_accepts_partition(self, part_key, filter, level):
         if p_column != f_column:
             return True
 
-        f_type = type(f_value)
+        p_value = self.levels[level].dictionary[p_value_index].as_py()
 
-        if isinstance(f_value, set):
+        if op in {'in', 'not in'}:
+            if not (isinstance(f_value, Container) and
+                    isinstance(f_value, Iterable)):
+                raise ValueError("Op '%s' requires an iterable such as list,"
+                                 " set or tuple", op)
             if not f_value:
-                raise ValueError("Cannot use empty set as filter value")
-            if op not in {'in', 'not in'}:
-                raise ValueError("Op '%s' not supported with set value",
-                                 op)
+                raise ValueError("Cannot use empty iterable as filter value")
+
             if len({type(item) for item in f_value}) != 1:
-                raise ValueError("All elements of set '%s' must be of"
+                raise ValueError("All elements of the iterable '%s' must be of"
                                  " same type", f_value)
             f_type = type(next(iter(f_value)))
 
-        p_value = f_type(self.levels[level]
-                         .dictionary[p_value_index].as_py())
+            p_value = f_type(p_value)
+
+            if op == 'in':
+                return p_value in f_value
+            else:
+                return p_value not in f_value

Review comment:
       Could you only do the `f_value` checking and `p_value` preparation in this `if` branch, but keep the actual check below?

##########
File path: python/pyarrow/parquet.py
##########
@@ -1237,6 +1264,8 @@ def validate_schemas(self):
             if self.common_metadata is not None:
                 self.schema = self.common_metadata.schema
             else:
+                if not self.pieces:
+                    raise ValueError('No partition is found.')

Review comment:
       Is this a related change? Why was it needed?



