
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/03 11:49:02 UTC

[GitHub] [arrow] kszucs opened a new pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

kszucs opened a new pull request #7631:
URL: https://github.com/apache/arrow/pull/7631


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jorisvandenbossche commented on pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#issuecomment-654334653


   I opened https://issues.apache.org/jira/browse/ARROW-9332 for the parquet statistics



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r450390113



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -612,6 +613,83 @@ def test_make_fragment(multisourcefs):
         assert row_group_fragment.row_groups == [ds.RowGroupInfo(0)]
 
 
+def test_make_csv_fragment_from_buffer():
+    content = textwrap.dedent("""
+        alpha,num,animal
+        a,12,dog
+        b,11,cat
+        c,10,rabbit
+    """)
+    buffer = pa.py_buffer(content.encode('utf-8'))
+
+    csv_format = ds.CsvFileFormat()
+    fragment = csv_format.make_fragment(buffer)
+
+    expected = pa.table([['a', 'b', 'c'],
+                         [12, 11, 10],
+                         ['dog', 'cat', 'rabbit']],
+                        names=['alpha', 'num', 'animal'])
+    assert fragment.to_table().equals(expected)
+
+    pickled = pickle.loads(pickle.dumps(fragment))
+    assert pickled.to_table().equals(fragment.to_table())
+
+
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    cases = [
+        (
+            pa.table(
+                [
+                    ['a', 'b', 'c'],
+                    [12, 11, 10],
+                    ['dog', 'cat', 'rabbit']

Review comment:
       You could save some vertical space here by defining only the list of arrays in each case, and calling `table = pa.table(arrays, names=['alpha', 'num', 'animal'])` inside the loop.





[GitHub] [arrow] kszucs commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449562966



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -773,6 +789,14 @@ cdef class FileFragment(Fragment):
         Fragment.init(self, sp)
         self.file_fragment = <CFileFragment*> sp.get()
 
+    def __reduce__(self):
+        buffer = self.buffer
+        return self.format.make_fragment, (

Review comment:
       Adding a case.





[GitHub] [arrow] kszucs commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r450361126



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -635,6 +635,37 @@ def test_make_fragment_from_buffer():
     assert pickled.to_table().equals(fragment.to_table())
 
 
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    table = pa.table([['a', 'b', 'c'],
+                      [12, 11, 10],
+                      ['dog', 'cat', 'rabbit']],
+                     names=['alpha', 'num', 'animal'])
+
+    out = pa.BufferOutputStream()
+    pq.write_table(table, out)
+
+    buffer = out.getvalue()
+
+    formats = [
+        ds.ParquetFileFormat(),
+        ds.ParquetFileFormat(
+            read_options=ds.ParquetReadOptions(
+                use_buffered_stream=True,
+                buffer_size=4096,

Review comment:
       @jorisvandenbossche updated.





[GitHub] [arrow] github-actions[bot] commented on pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#issuecomment-653516244


   https://issues.apache.org/jira/browse/ARROW-8651



[GitHub] [arrow] jorisvandenbossche closed pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche closed pull request #7631:
URL: https://github.com/apache/arrow/pull/7631


   



[GitHub] [arrow] kszucs commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449564759



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -773,6 +789,14 @@ cdef class FileFragment(Fragment):
         Fragment.init(self, sp)
         self.file_fragment = <CFileFragment*> sp.get()
 
+    def __reduce__(self):
+        buffer = self.buffer
+        return self.format.make_fragment, (

Review comment:
       > By specifying here the method on a format object (`format.make_fragment`), it also automatically pickles the `format` _instance_ ?
   
   Apparently.
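
The behavior discussed here can be checked without pyarrow. The following is a minimal sketch using hypothetical stand-in classes (`StandInFormat` and `StandInFragment` are illustrations, not pyarrow's classes): when `__reduce__` returns a bound method as its callable, Python 3 pickles that method as a lookup on the object it is bound to, so the format instance and its options should travel with the fragment.

```python
import pickle


class StandInFormat:
    """Hypothetical stand-in for a file format object (not pyarrow's class)."""

    def __init__(self, dictionary_columns=()):
        self.dictionary_columns = tuple(dictionary_columns)

    def make_fragment(self, source):
        return StandInFragment(self, source)


class StandInFragment:
    """Hypothetical stand-in for a fragment that remembers its format."""

    def __init__(self, fmt, source):
        self.format = fmt
        self.source = source

    def __reduce__(self):
        # The callable is a bound method, so pickle also has to serialize
        # the object it is bound to (the format instance).
        return self.format.make_fragment, (self.source,)


fmt = StandInFormat(dictionary_columns=["alpha"])
fragment = fmt.make_fragment("data.parquet")
restored = pickle.loads(pickle.dumps(fragment))

# The restored fragment carries a copy of the format, options included.
print(restored.format.dictionary_columns)
print(restored.format is fmt)
```

The restored format is a separate copy rather than the original object, which is why the options have to be picklable themselves.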





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449556844



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -773,6 +789,14 @@ cdef class FileFragment(Fragment):
         Fragment.init(self, sp)
         self.file_fragment = <CFileFragment*> sp.get()
 
+    def __reduce__(self):
+        buffer = self.buffer
+        return self.format.make_fragment, (

Review comment:
       By specifying here the method on a format object (`format.make_fragment`), it also automatically pickles the `format` *instance* ?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -887,6 +911,14 @@ cdef class ParquetFileFragment(FileFragment):
         FileFragment.init(self, sp)
         self.parquet_file_fragment = <CParquetFileFragment*> sp.get()
 
+    def __reduce__(self):
+        return self.format.make_fragment, (
+            self.path,

Review comment:
       I suppose you might need to do the same `self.path if buffer is None else buffer,` here as you did for `FileFragment` ?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -773,6 +789,14 @@ cdef class FileFragment(Fragment):
         Fragment.init(self, sp)
         self.file_fragment = <CFileFragment*> sp.get()
 
+    def __reduce__(self):
+        buffer = self.buffer
+        return self.format.make_fragment, (

Review comment:
       We should maybe verify this by testing a pickling roundtrip for a case that specifies read options in the ParquetFileFormat object
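
As a rough illustration of the `self.path if buffer is None else buffer` suggestion above, here is a minimal sketch with hypothetical stand-in classes (not pyarrow's `FileFragment` or `ParquetFileFragment`). It assumes a fragment exposes either a path or an in-memory buffer, and that `make_fragment` accepts both; with the fallback in `__reduce__`, both kinds of fragment should survive a pickle roundtrip.

```python
import pickle


class StandInFormat:
    """Hypothetical stand-in for a file format object."""

    def make_fragment(self, source):
        return StandInFragment(self, source)


class StandInFragment:
    """Hypothetical fragment backed by either a path or an in-memory buffer."""

    def __init__(self, fmt, source):
        self.format = fmt
        # Assumption for this sketch: bytes means an in-memory source,
        # anything else is treated as a file path.
        if isinstance(source, bytes):
            self.path, self.buffer = None, source
        else:
            self.path, self.buffer = source, None

    def __reduce__(self):
        buffer = self.buffer
        # Rebuild from the buffer when there is one, otherwise from the path.
        return self.format.make_fragment, (
            self.path if buffer is None else buffer,
        )


fmt = StandInFormat()
from_path = pickle.loads(pickle.dumps(fmt.make_fragment("data.parquet")))
from_buffer = pickle.loads(pickle.dumps(fmt.make_fragment(b"alpha,num\na,12\n")))

print(from_path.path, from_path.buffer)
print(from_buffer.path, from_buffer.buffer)
```

Without the fallback, a buffer-backed fragment in this sketch would be rebuilt from `None`, which is the kind of failure a buffer-based test is meant to catch.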





[GitHub] [arrow] kszucs commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449562766



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -887,6 +911,14 @@ cdef class ParquetFileFragment(FileFragment):
         FileFragment.init(self, sp)
         self.parquet_file_fragment = <CParquetFileFragment*> sp.get()
 
+    def __reduce__(self):
+        return self.format.make_fragment, (
+            self.path,

Review comment:
       Fixed it, also added a test.





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

jorisvandenbossche commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r449567025



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -635,6 +635,37 @@ def test_make_fragment_from_buffer():
     assert pickled.to_table().equals(fragment.to_table())
 
 
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    table = pa.table([['a', 'b', 'c'],
+                      [12, 11, 10],
+                      ['dog', 'cat', 'rabbit']],
+                     names=['alpha', 'num', 'animal'])
+
+    out = pa.BufferOutputStream()
+    pq.write_table(table, out)
+
+    buffer = out.getvalue()
+
+    formats = [
+        ds.ParquetFileFormat(),
+        ds.ParquetFileFormat(
+            read_options=ds.ParquetReadOptions(
+                use_buffered_stream=True,
+                buffer_size=4096,

Review comment:
       We probably need to use an option that actually alters the output to be able to catch a failure, e.g. `dictionary_columns`
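
To make the point concrete, here is a minimal sketch with hypothetical stand-in classes (not pyarrow's API). `LossyFragment` is deliberately buggy: its `__reduce__` drops the configured format. A roundtrip check using an option that only tunes I/O (modelled here as `buffer_size`) still passes, while one using an option that changes the result (modelled as `dictionary_columns`) exposes the bug.

```python
import pickle


class StandInFormat:
    """Hypothetical format with one I/O-only and one output-changing option."""

    def __init__(self, buffer_size=0, dictionary_columns=()):
        self.buffer_size = buffer_size
        self.dictionary_columns = tuple(dictionary_columns)

    def read(self, rows):
        # buffer_size never shows up in the result; dictionary_columns
        # changes how the listed columns are represented.
        return [
            {
                key: (("dict", value) if key in self.dictionary_columns else value)
                for key, value in row.items()
            }
            for row in rows
        ]


class LossyFragment:
    """Deliberately buggy: pickling rebuilds it with a default format."""

    def __init__(self, fmt, rows):
        self.format = fmt
        self.rows = rows

    def to_table(self):
        return self.format.read(self.rows)

    def __reduce__(self):
        # Bug on purpose: the configured format is replaced by a default one.
        return LossyFragment, (StandInFormat(), self.rows)


rows = [{"alpha": "a", "num": 12}, {"alpha": "b", "num": 11}]


def roundtrip_matches(fmt):
    fragment = LossyFragment(fmt, rows)
    restored = pickle.loads(pickle.dumps(fragment))
    return restored.to_table() == fragment.to_table()


# The I/O-only option hides the bug; the output-changing option reveals it.
masked = roundtrip_matches(StandInFormat(buffer_size=4096))
caught = roundtrip_matches(StandInFormat(dictionary_columns=["alpha"]))
print(masked, caught)
```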





[GitHub] [arrow] kszucs commented on a change in pull request #7631: ARROW-8651: [Python][Dataset] Support pickling of Dataset objects

Posted by GitBox <gi...@apache.org>.

kszucs commented on a change in pull request #7631:
URL: https://github.com/apache/arrow/pull/7631#discussion_r450396810



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -612,6 +613,83 @@ def test_make_fragment(multisourcefs):
         assert row_group_fragment.row_groups == [ds.RowGroupInfo(0)]
 
 
+def test_make_csv_fragment_from_buffer():
+    content = textwrap.dedent("""
+        alpha,num,animal
+        a,12,dog
+        b,11,cat
+        c,10,rabbit
+    """)
+    buffer = pa.py_buffer(content.encode('utf-8'))
+
+    csv_format = ds.CsvFileFormat()
+    fragment = csv_format.make_fragment(buffer)
+
+    expected = pa.table([['a', 'b', 'c'],
+                         [12, 11, 10],
+                         ['dog', 'cat', 'rabbit']],
+                        names=['alpha', 'num', 'animal'])
+    assert fragment.to_table().equals(expected)
+
+    pickled = pickle.loads(pickle.dumps(fragment))
+    assert pickled.to_table().equals(fragment.to_table())
+
+
+@pytest.mark.parquet
+def test_make_parquet_fragment_from_buffer():
+    import pyarrow.parquet as pq
+
+    cases = [
+        (
+            pa.table(
+                [
+                    ['a', 'b', 'c'],
+                    [12, 11, 10],
+                    ['dog', 'cat', 'rabbit']

Review comment:
       Done.



