Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/06 11:44:29 UTC

[GitHub] [arrow] jorisvandenbossche opened a new pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

jorisvandenbossche opened a new pull request #10661:
URL: https://github.com/apache/arrow/pull/10661


   (this is a draft PR to see if this is a reasonable approach, still needs tests etc)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#issuecomment-876422802


   Updated the PR according to the comments + added tests and docstrings.
   
   >  Another approach could be to expose the dictionaries as part of the partitioning factory output. So the partitioning factory could have a `dictionaries()` property that is set after `Finish` has been called. This puts a bit more complexity on python since it hides the partitioning factory from the user but it could track the dictionaries and expose them where needed.
   
   In the end, I am fine with either way of achieving it, and I suppose this could also work (in the Python factory functions we pass the PartitioningFactory before calling `Finish` on the dataset factory, so we have access to this object, could extract the dictionaries from it, and attach them to the Dataset class at the Cython level only).
   If that is preferred, I can take a look at it. However, I think the changes on the C++ side in the current implementation are limited and not invasive, and they make this easier on the Python side. I also don't know to what extent there would be interest in this on the R side (it could make a read CSV -> write Parquet flow easier for the case where you want to preserve the exact partitioning).
    
   





[GitHub] [arrow] jorisvandenbossche commented on pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

jorisvandenbossche commented on pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#issuecomment-874672918


   See the JIRA for more details and motivating use cases, but a small demo of what this exposes:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   df = pd.DataFrame({"year": [2020, 2020, 2021, 2021], "month": [1, 2, 1, 2], "values": [1, 2, 3, 4]})
   df.to_parquet("test_partitioned", partition_cols=["year", "month"], engine="pyarrow")
   ```
   
   A discovered dataset now stores the Partitioning object that was created/discovered in the FileSystemDatasetFactory, so you can still access this information on the dataset:
   
   ```python
   >>> dataset = ds.dataset("test_partitioned/", partitioning="hive")
   >>> dataset
   <pyarrow._dataset.FileSystemDataset at 0x7fe481b6e270>
   >>> dataset.partitioning
   <pyarrow._dataset.HivePartitioning at 0x7fe48162c7f0>
   
   # the schema (field names and types) of the partitioning
   >>> dataset.partitioning.schema 
   year: int32
   month: int32
   >>> dataset.partitioning.schema.names
   ['year', 'month']
   # and all partition field values discovered by the factory
   >>> dataset.partitioning.dictionaries
   [<pyarrow.lib.Int32Array object at 0x7fe480fd9fa0>
    [
      2020,
      2021
    ],
    <pyarrow.lib.Int32Array object at 0x7fe480fd9b80>
    [
      1,
      2
    ]]
   ```
   
   When inferring a dictionary type for the partitioning, the schema of the partitioning object indeed has dictionary types:
   
   ```python
   >>> dataset2 = ds.dataset("test_partitioned/", partitioning=ds.HivePartitioning.discover(infer_dictionary=True))
   >>> dataset2.partitioning.schema
   year: dictionary<values=int32, indices=int32, ordered=0>
   month: dictionary<values=int32, indices=int32, ordered=0>
   -- schema metadata --
   pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 352
   ```
   
   When the FileSystemDataset is discovered without a partitioning, or was not created through discovery at all, the `partitioning` attribute is None:
   
   ```python
   >>> dataset3 = ds.dataset("test_partitioned/")
   # TODO this is still something to fix, right now this is a "default" partitioning (which isn't exposed in Python)
   >>> dataset3.partitioning
   ...
   ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Partitioning.wrap()
   TypeError: default
   
   >>> dataset4 = ds.FileSystemDataset(list(dataset.get_fragments()), dataset.schema, dataset.format, dataset.filesystem)
   >>> dataset4.partitioning is None
   True
   ```





[GitHub] [arrow] bkietz commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

bkietz commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r664849553



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -172,6 +172,8 @@ class ARROW_DS_EXPORT KeyValuePartitioning : public Partitioning {
 
   Result<std::string> Format(const compute::Expression& expr) const override;
 
+  const ArrayVector dictionaries() const { return dictionaries_; }

Review comment:
       ```suggestion
     const ArrayVector& dictionaries() const { return dictionaries_; }
   ```

##########
File path: cpp/src/arrow/dataset/discovery.cc
##########
@@ -274,7 +274,8 @@ Result<std::shared_ptr<Dataset>> FileSystemDatasetFactory::Finish(FinishOptions
     fragments.push_back(fragment);
   }
 
-  return FileSystemDataset::Make(schema, root_partition_, format_, fs_, fragments);
+  return FileSystemDataset::Make(schema, root_partition_, format_, fs_, fragments,
+                                 partitioning);

Review comment:
       ```suggestion
     return FileSystemDataset::Make(std::move(schema), root_partition_, format_, fs_,
                                    std::move(fragments), std::move(partitioning));
   ```

##########
File path: cpp/src/arrow/dataset/file_base.h
##########
@@ -238,6 +238,12 @@ class ARROW_DS_EXPORT FileSystemDataset : public Dataset {
       std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
       std::vector<std::shared_ptr<FileFragment>> fragments);
 
+  static Result<std::shared_ptr<FileSystemDataset>> Make(
+      std::shared_ptr<Schema> schema, compute::Expression root_partition,
+      std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
+      std::vector<std::shared_ptr<FileFragment>> fragments,
+      std::shared_ptr<Partitioning> partitioning);

Review comment:
       In the interest of limiting available constructors, please combine these overloads:
   ```suggestion
         std::vector<std::shared_ptr<FileFragment>> fragments,
         std::shared_ptr<Partitioning> partitioning = NULLPTR);
   ```







[GitHub] [arrow] westonpace commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

westonpace commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r665260101



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2214,6 +2230,15 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
+    @property
+    def dictionaries(self):

Review comment:
       Same comment here.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2083,6 +2090,15 @@ cdef class DirectoryPartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CDirectoryPartitioning.MakeFactory(c_field_names, c_options))
 
+    @property
+    def dictionaries(self):

Review comment:
       Perhaps add a docstring / comment here explaining that this property might be empty or null if the partitioning was not created through some kind of discovery process?  For example:
   
   ```
   # partitioning.dictionaries will be empty here
   partitioning = HivePartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8())]))
   ```







[GitHub] [arrow] bkietz commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

bkietz commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r666340416



##########
File path: cpp/src/arrow/dataset/file_base.h
##########
@@ -236,7 +236,8 @@ class ARROW_DS_EXPORT FileSystemDataset : public Dataset {
   static Result<std::shared_ptr<FileSystemDataset>> Make(
       std::shared_ptr<Schema> schema, compute::Expression root_partition,
       std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
-      std::vector<std::shared_ptr<FileFragment>> fragments);
+      std::vector<std::shared_ptr<FileFragment>> fragments,
+      std::shared_ptr<Partitioning> partitioning = NULLPTR);

Review comment:
       This parameter needs a doccomment https://github.com/apache/arrow/pull/10661/checks?check_run_id=3019338484#step:8:2650







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

jorisvandenbossche commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r666150986



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2083,6 +2090,15 @@ cdef class DirectoryPartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CDirectoryPartitioning.MakeFactory(c_field_names, c_options))
 
+    @property
+    def dictionaries(self):

Review comment:
       Yes, added some more docstrings now







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

jorisvandenbossche commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r670338982



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2083,6 +2102,27 @@ cdef class DirectoryPartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CDirectoryPartitioning.MakeFactory(c_field_names, c_options))
 
+    @property
+    def dictionaries(self):
+        """
+        The unique values for each partition field, if available.
+
+        Those values are only available if the Partitioning object was
+        created through dataset discovery from a PartitioningFactory, or
+        if the dictionaries were manually specified in the constructor.
+        If not available, this returns None.
+        """
+        cdef vector[shared_ptr[CArray]] c_arrays
+        c_arrays = self.directory_partitioning.dictionaries()
+        res = []
+        for arr in c_arrays:
+            if arr.get() == nullptr:
+                # Partitioning object has not been created through
+                # inspected Factory
+                return None
+            res.append(pyarrow_wrap_array(arr))

Review comment:
    `pyarrow_wrap_array` will in that case raise an error (it checks for `arr` being NULL), so at least it won't segfault, I think







[GitHub] [arrow] github-actions[bot] commented on pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

github-actions[bot] commented on pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#issuecomment-874665379


   https://issues.apache.org/jira/browse/ARROW-8655





[GitHub] [arrow] bkietz commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

bkietz commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r667002584



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -2083,6 +2102,27 @@ cdef class DirectoryPartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CDirectoryPartitioning.MakeFactory(c_field_names, c_options))
 
+    @property
+    def dictionaries(self):
+        """
+        The unique values for each partition field, if available.
+
+        Those values are only available if the Partitioning object was
+        created through dataset discovery from a PartitioningFactory, or
+        if the dictionaries were manually specified in the constructor.
+        If not available, this returns None.
+        """
+        cdef vector[shared_ptr[CArray]] c_arrays
+        c_arrays = self.directory_partitioning.dictionaries()
+        res = []
+        for arr in c_arrays:
+            if arr.get() == nullptr:
+                # Partitioning object has not been created through
+                # inspected Factory
+                return None
+            res.append(pyarrow_wrap_array(arr))

Review comment:
       Probably not worth worrying about here (will not occur in any KeyValuePartitioning constructed during discovery) but it *is* possible for a KeyValuePartitioning to have dictionaries for some fields and not others







[GitHub] [arrow] bkietz closed pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

bkietz closed pull request #10661:
URL: https://github.com/apache/arrow/pull/10661


   





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

jorisvandenbossche commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r666166080



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1999,39 +2003,46 @@ def test_scan_iterator(use_threads, use_async):
             scanner.to_table()
 
 
-@pytest.mark.parquet
-def test_open_dataset_partitioned_directory(tempdir, dataset_reader):
+def _create_partitioned_dataset(basedir):
     import pyarrow.parquet as pq
     table = pa.table({'a': range(9), 'b': [0.] * 4 + [1.] * 5})
 
-    path = tempdir / "dataset"
+    path = basedir / "dataset-partitioned"
     path.mkdir()
 
-    for part in range(3):
-        part = path / "part={}".format(part)
+    for i in range(3):
+        part = path / "part={}".format(i)
         part.mkdir()
-        pq.write_table(table, part / "test.parquet")
+        pq.write_table(table.slice(3*i, 3), part / "test.parquet")
+
+    full_table = table.append_column(
+        "part", pa.array(np.repeat([0, 1, 2], 3), type=pa.int32()))
+
+    return full_table, path
+
+
+@pytest.mark.parquet
+def test_open_dataset_partitioned_directory(tempdir, dataset_reader):
+    full_table, path = _create_partitioned_dataset(tempdir)

Review comment:
    These changes are unrelated to the PR itself; I edited the test a bit to factor the test-data creation out into a `_create_partitioned_dataset` helper function for reuse.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

jorisvandenbossche commented on a change in pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#discussion_r666102851



##########
File path: cpp/src/arrow/dataset/file_base.h
##########
@@ -238,6 +238,12 @@ class ARROW_DS_EXPORT FileSystemDataset : public Dataset {
       std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
       std::vector<std::shared_ptr<FileFragment>> fragments);
 
+  static Result<std::shared_ptr<FileSystemDataset>> Make(
+      std::shared_ptr<Schema> schema, compute::Expression root_partition,
+      std::shared_ptr<FileFormat> format, std::shared_ptr<fs::FileSystem> filesystem,
+      std::vector<std::shared_ptr<FileFragment>> fragments,
+      std::shared_ptr<Partitioning> partitioning);

Review comment:
    Yes, I still wanted to ask whether I could use a default null argument for that ;) Will edit this







[GitHub] [arrow] westonpace commented on pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

westonpace commented on pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#issuecomment-875505735


   I'll add my comment from the JIRA here as you suggested.  Another approach could be to expose the dictionaries as part of the partitioning factory output.  So the partitioning factory could have a `dictionaries()` property that is set after `Finish` has been called.  This puts a bit more complexity on python since it hides the partitioning factory from the user but it could track the dictionaries and expose them where needed.
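   That proposal can be mocked up in plain Python to show the intended contract (a hypothetical sketch only; `MockPartitioningFactory` and its methods are made up and are not the pyarrow API): the `dictionaries` property stays `None` until `finish()` has run, and afterwards exposes the per-field unique values.

```python
class MockPartitioningFactory:
    """Hypothetical sketch of a partitioning factory that records the
    dictionaries it discovers, mirroring the proposed `dictionaries()`
    property that is only set after `Finish` has been called."""

    def __init__(self):
        self._dictionaries = None  # unset until finish() runs

    def finish(self, observed_values):
        # observed_values: mapping of partition field name -> all values
        # seen during discovery; deduplicate into per-field dictionaries
        self._dictionaries = {
            name: sorted(set(values))
            for name, values in observed_values.items()
        }

    @property
    def dictionaries(self):
        # None before finish(), so callers can tell discovery hasn't run
        return self._dictionaries


factory = MockPartitioningFactory()
factory.finish({"year": [2020, 2021, 2020], "month": [1, 2, 1, 2]})
# factory.dictionaries → {"year": [2020, 2021], "month": [1, 2]}
```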

