Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/09/14 21:06:59 UTC

[GitHub] [arrow] bkietz opened a new pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

bkietz opened a new pull request #8188:
URL: https://github.com/apache/arrow/pull/8188


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-692852532


   In terms of benchmarking, it also strikes me that it may be faster (especially on machines with many cores, e.g. 16- or 20-core servers) to read a 2-file dataset (or even an n-file dataset, where n is less than the number of cores on the machine) by reading the files one at a time rather than using the datasets API. How many files do you have to have before the performance issue goes away? This would be good to quantify in a collection of benchmarks.
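   
   For illustration, a minimal sketch of the comparison being suggested (the file names and 2-file layout are hypothetical; assumes pyarrow with the dataset module built):
   
   ```python
   import time
   
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   paths = ["part-0.parquet", "part-1.parquet"]  # hypothetical 2-file dataset
   
   # Plain Parquet reader, one file at a time
   # (per-column parallelism within each file).
   start = time.perf_counter()
   tables = [pq.read_table(p) for p in paths]
   print("one file at a time:", time.perf_counter() - start)
   
   # The same files through the datasets API
   # (parallelism across files instead).
   start = time.perf_counter()
   table = ds.dataset(paths, format="parquet").to_table()
   print("datasets API:", time.perf_counter() - start)
   ```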



[GitHub] [arrow] bkietz commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r495353791



##########
File path: ci/scripts/python_test.sh
##########
@@ -29,4 +29,4 @@ export LD_LIBRARY_PATH=${ARROW_HOME}/lib:${LD_LIBRARY_PATH}
 # Enable some checks inside Python itself
 export PYTHONDEVMODE=1
 
-pytest -r s --pyargs pyarrow
+pytest -r s -vvv --pyargs pyarrow

Review comment:
       I think the verbosity is more helpful when hunting down a failed test, but we can discuss that in a follow-up.





[GitHub] [arrow] wesm closed pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #8188:
URL: https://github.com/apache/arrow/pull/8188


   



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r492271433



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1013,27 +1013,38 @@ cdef class ParquetReadOptions(_Weakrefable):
     dictionary_columns : list of string, default None
         Names of columns which should be dictionary encoded as
         they are read.
+    enable_parallel_column_conversion : bool, default False
+        EXPERIMENTAL: Parallelize conversion across columns. This option is
+        ignored if a scan is already parallelized across input files to avoid
+        thread contention. This option will be removed after support is added
+        for simultaneous parallelization across files and columns.

Review comment:
       I was going to comment: is it necessary to add this option to the *public* dataset interface? We can enable/disable it under the hood depending on whether we are reading a single file or not, and we do not plan to keep this keyword long term.
   
   But currently we implement this logic in parquet.py, which of course needs to be able to pass it through to the dataset API, so it needs to be a keyword this way.
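   
   For reference, the pass-through looks roughly like this in parquet.py (a sketch based on the diff in this PR, using the keyword it adds):
   
   ```python
   import pyarrow.dataset as ds
   
   # Sketch: parquet.py builds the format's read options as a dict and
   # forwards the experimental keyword added in this PR to the dataset layer.
   read_options = {"enable_parallel_column_conversion": True}
   parquet_format = ds.ParquetFileFormat(read_options=read_options)
   ```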

##########
File path: python/pyarrow/parquet.py
##########
@@ -1400,38 +1400,55 @@ def __init__(self, path_or_paths, filesystem=None, filters=None,
                                 buffer_size=buffer_size)
         if read_dictionary is not None:
             read_options.update(dictionary_columns=read_dictionary)
-        parquet_format = ds.ParquetFileFormat(read_options=read_options)
 
         # map filters to Expressions
         self._filters = filters
         self._filter_expression = filters and _filters_to_expression(filters)
 
-        # check for single NativeFile dataset
-        if not isinstance(path_or_paths, list):
-            if not _is_path_like(path_or_paths):
-                fragment = parquet_format.make_fragment(path_or_paths)
-                self._dataset = ds.FileSystemDataset(
-                    [fragment], schema=fragment.physical_schema,
-                    format=parquet_format,
-                    filesystem=fragment.filesystem
-                )
-                return
+        # map old filesystems to new one
+        if filesystem is not None:
+            filesystem = _ensure_filesystem(
+                filesystem, use_mmap=memory_map)
+        else:
+            # assume local file system
+            filesystem = LocalFileSystem(use_mmap=memory_map)

Review comment:
       I am not sure it is correct to *always* assume the local filesystem if no `filesystem` is specified. E.g. for a URI this will not work (but this is clearly not tested...)
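   
   As a sketch of what URI handling could look like here (`FileSystem.from_uri` exists in pyarrow.fs; the helper name is hypothetical):
   
   ```python
   from pyarrow.fs import FileSystem, LocalFileSystem
   
   def _resolve_filesystem(path, filesystem=None, use_mmap=False):
       # Hypothetical helper: prefer an explicit filesystem, then try to
       # infer one from a URI, and only then fall back to the local one.
       if filesystem is not None:
           return filesystem, path
       if isinstance(path, str) and "://" in path:
           return FileSystem.from_uri(path)  # returns (filesystem, path)
       return LocalFileSystem(use_mmap=use_mmap), path
   ```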





[GitHub] [arrow] wesm commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r495998246



##########
File path: cpp/src/arrow/dataset/scanner.h
##########
@@ -36,6 +36,8 @@
 namespace arrow {
 namespace dataset {
 
+constexpr int64_t kDefualtBatchSize = 1 << 20;

Review comment:
       typo: Default





[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-696688370


   It seems the crashing test is:
   
   https://github.com/apache/arrow/blob/40d64756dc3b2c51489b48362d0f04ee3e2a7388/python/pyarrow/tests/test_parquet.py#L3389-L3414
   
   where it passes for `use_legacy_dataset=True` but then crashes in the next test, which is normally the same test with `use_legacy_dataset=False`.
   
   I don't immediately see how this would be related to the changes here, but maybe something in the decimal conversion code is not thread-safe? Although there are multiple columns, so it should also run in parallel with the legacy code...
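   
   If that suspicion is right, a minimal repro might look like this (hypothetical schema; several decimal columns so the per-column conversion runs on multiple threads):
   
   ```python
   import decimal
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Hypothetical table with several decimal columns, exercising the
   # parallel per-column conversion enabled by this PR.
   table = pa.table({
       "dec%d" % i: pa.array([decimal.Decimal("1.23")] * 1000,
                             type=pa.decimal128(10, 3))
       for i in range(8)
   })
   pq.write_table(table, "decimals.parquet")
   pq.read_table("decimals.parquet", use_legacy_dataset=False)
   ```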



[GitHub] [arrow] bkietz commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r495353620



##########
File path: cpp/src/arrow/dataset/scanner.h
##########
@@ -73,7 +73,7 @@ class ARROW_DS_EXPORT ScanOptions {
   RecordBatchProjector projector;
 
   // Maximum row count for scanned batches.
-  int64_t batch_size = 1 << 15;
+  int64_t batch_size = 1 << 20;

Review comment:
       Okay





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r488672303



##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
                 ]
                 columns = columns + list(set(index_columns) - set(columns))
 
+        if len(list(self._dataset.get_fragments())) <= 1:
+            # Allow per-column parallelism; would otherwise cause contention
+            # in the presence of per-file parallelism.
+            use_threads = False

Review comment:
       An alternative would be to handle this within the `to_table` method?

##########
File path: python/pyarrow/parquet.py
##########
@@ -1479,6 +1479,11 @@ def read(self, columns=None, use_threads=True, use_pandas_metadata=False):
                 ]
                 columns = columns + list(set(index_columns) - set(columns))
 
+        if len(list(self._dataset.get_fragments())) <= 1:
+            # Allow per-column parallelism; would otherwise cause contention
+            # in the presence of per-file parallelism.
+            use_threads = False

Review comment:
       We should probably check that `use_threads` is actually True when enabling this per-column parallelism.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1070,6 +1070,7 @@ cdef class ParquetFileFormat(FileFormat):
         options = &(wrapped.get().reader_options)
         options.use_buffered_stream = read_options.use_buffered_stream
         options.buffer_size = read_options.buffer_size
+        options.enable_parallel_column_conversion = True

Review comment:
       Should the default be False here? (and then be set to True if we only have a single file)
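   
   Combining the three points above, the intended logic would be roughly as follows (a sketch; the helper name is hypothetical, and the option defaults to False):
   
   ```python
   def _plan_parallelism(dataset, use_threads):
       # Hypothetical helper: per-column conversion is only enabled for a
       # single-fragment scan with threading requested, and per-file
       # threading is then turned off to avoid contention.
       single_fragment = len(list(dataset.get_fragments())) <= 1
       if use_threads and single_fragment:
           return False, True  # (use_threads, enable_parallel_column_conversion)
       return use_threads, False
   ```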





[GitHub] [arrow] bkietz commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-693622612


   For a range of file and column counts, the time to read is as follows:
   ```
   nfiles  ncolumns  legacy_time  default_time  regression
        1         1     0.490398      0.401345   -0.181592
        1         2     0.642569      0.523074   -0.185964
        1         4     0.988469      0.945871   -0.043095
        1         8     1.541519      1.602061    0.039274
        2         1     1.078602      0.622690   -0.422688
        2         2     1.275463      0.922737   -0.276548
        2         4     1.601820      2.001778    0.249689
        2         8     2.847058      4.283226    0.504439
        4         1     2.116808      0.760073   -0.640935
        4         2     2.458016      1.472731   -0.400846
        4         4     3.975070      2.648561   -0.333707
        4         8     6.531598      6.030903   -0.076657
   ```
   (times in seconds, regression computed as (default_time - legacy_time)/legacy_time)
   
   `$ python -m pyperf system show`
   <details>
   <pre>
   System state
   ============
   
   CPU: use 8 logical CPUs: 0-7
   Perf event: Maximum sample rate: 1 per second
   ASLR: Full randomization
   Linux scheduler: No CPU is isolated
   CPU Frequency: 0-7=min=max=1800 MHz
   CPU scaling governor (intel_pstate): performance
   Turbo Boost (intel_pstate): Turbo Boost disabled
   IRQ affinity: irqbalance service: inactive
   IRQ affinity: Default IRQ affinity: CPU 0-7
   IRQ affinity: IRQ affinity: IRQ 0-17,51,120-127,129-130,138-139,146,155-158=CPU 0-7; IRQ 128=CPU 0; IRQ 131=CPU 1; IRQ 132=CPU 2; IRQ 133=CPU 3; IRQ 134=CPU 4; IRQ 135=CPU 5; IRQ 136=CPU 6; IRQ 137=CPU 7
   Power supply: the power cable is plugged
   </pre>
   </details>
   
   We mostly see a performance improvement with the new defaults, including a moderate improvement in single-file reading time. Note the significant regressions when reading two files with 4 or 8 columns; this is to be expected, since the legacy reader can divide that work across 4 or 8 threads instead of only 2.
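   
   For reproducibility, a harness along these lines could generate the table above (the data shape, row count, and timing method are assumptions, not the actual script used):
   
   ```python
   import time
   
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   def write_files(nfiles, ncolumns, nrows=1_000_000):
       # Hypothetical data shape: int64 columns, one fragment per file.
       paths = []
       for f in range(nfiles):
           table = pa.table({"c%d" % i: pa.array(range(nrows), type=pa.int64())
                             for i in range(ncolumns)})
           paths.append("bench-%d.parquet" % f)
           pq.write_table(table, paths[-1])
       return paths
   
   def read_seconds(paths, legacy):
       start = time.perf_counter()
       pq.ParquetDataset(paths, use_legacy_dataset=legacy).read()
       return time.perf_counter() - start
   
   for nfiles in (1, 2, 4):
       for ncolumns in (1, 2, 4, 8):
           paths = write_files(nfiles, ncolumns)
           legacy_time = read_seconds(paths, legacy=True)
           default_time = read_seconds(paths, legacy=False)
           print(nfiles, ncolumns, legacy_time, default_time,
                 (default_time - legacy_time) / legacy_time)
   ```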



[GitHub] [arrow] bkietz commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-698590377


   Passing Travis: https://travis-ci.org/github/bkietz/arrow/builds/730077738



[GitHub] [arrow] wesm commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r490454944



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1013,27 +1013,36 @@ cdef class ParquetReadOptions(_Weakrefable):
     dictionary_columns : list of string, default None
         Names of columns which should be dictionary encoded as
         they are read.
+    enable_parallel_column_conversion : bool, default False
+        Whether single files may be read in parallel (ignored when reading
+        multiple files to avoid thread contention).

Review comment:
       Add a note that this is experimental. Once we fix our nested parallelism problem, this shouldn't be needed.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1899,7 +1912,7 @@ cdef shared_ptr[CScanContext] _build_scan_context(bint use_threads=True,
 
 cdef void _populate_builder(const shared_ptr[CScannerBuilder]& ptr,
                             list columns=None, Expression filter=None,
-                            int batch_size=32*2**10) except *:
+                            int batch_size=2**20) except *:

Review comment:
       Can this be put in a global variable (`DEFAULT_BATCH_SIZE`) rather than hard-coded?

##########
File path: cpp/src/arrow/dataset/file_parquet.h
##########
@@ -91,6 +91,10 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
     /// @{
     std::unordered_set<std::string> dict_columns;
     /// @}
+
+    /// Parallelize conversion across columns. This option is ignored if a scan is already
+    /// parallelized across input files.

Review comment:
       Add a note that this is experimental. Once we fix our nested parallelism problem, this shouldn't be needed.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1992,7 +2005,7 @@ cdef class Scanner(_Weakrefable):
     def from_fragment(Fragment fragment not None, Schema schema=None,
                       bint use_threads=True, MemoryPool memory_pool=None,
                       list columns=None, Expression filter=None,
-                      int batch_size=32*2**10):
+                      int batch_size=2**20):

Review comment:
       use common variable
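   
   i.e., something like this at module scope in `_dataset.pyx` (a sketch; the Python-side name is an assumption):
   
   ```python
   # At module scope in _dataset.pyx (sketch): one shared default,
   # matching the C++ kDefaultBatchSize of 1 << 20 rows.
   _DEFAULT_BATCH_SIZE = 2**20
   
   # Both signatures would then default to it, e.g.:
   #   def from_fragment(..., int batch_size=_DEFAULT_BATCH_SIZE): ...
   ```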




[GitHub] [arrow] wesm commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-700063001


   +1


[GitHub] [arrow] github-actions[bot] commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-692326434


   https://issues.apache.org/jira/browse/ARROW-9924



[GitHub] [arrow] jorisvandenbossche commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-696688370


   It seems the crashing test is:
   
   https://github.com/apache/arrow/blob/40d64756dc3b2c51489b48362d0f04ee3e2a7388/python/pyarrow/tests/test_parquet.py#L3389-L3414
   
   where it passes for `use_legacy_dataset=True` but then crashes in the next test, which is normally the same test with `use_legacy_dataset=False`.
   
   



[GitHub] [arrow] bkietz commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r490997910



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1899,7 +1912,7 @@ cdef shared_ptr[CScanContext] _build_scan_context(bint use_threads=True,
 
 cdef void _populate_builder(const shared_ptr[CScannerBuilder]& ptr,
                             list columns=None, Expression filter=None,
-                            int batch_size=32*2**10) except *:
+                            int batch_size=2**20) except *:

Review comment:
       will do

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1013,27 +1013,36 @@ cdef class ParquetReadOptions(_Weakrefable):
     dictionary_columns : list of string, default None
         Names of columns which should be dictionary encoded as
         they are read.
+    enable_parallel_column_conversion : bool, default False
+        Whether single files may be read in parallel (ignored when reading
+        multiple files to avoid thread contention).

Review comment:
       alright





[GitHub] [arrow] bkietz commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-698887227


   @jorisvandenbossche PTAL
   



[GitHub] [arrow] bkietz commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r495353478



##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -856,18 +856,32 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
     return Status::OK();
   }
 
+  int64_t num_rows = 0;
+  for (int row_group : row_groups) {
+    num_rows += parquet_reader()->metadata()->RowGroup(row_group)->num_rows();
+  }

Review comment:
       I don't think `metadata->RowGroupRowCount(row_group)` is an improvement on `metadata->RowGroup(row_group)->num_rows()`





[GitHub] [arrow] nealrichardson commented on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-692848891


   Can you run some kind of benchmark or ad-hoc performance check, including with the file/shape in question here, to see the effects? I see two changes (the batch size and the single-file behavior change), so we should be clear about the effect of each.


[GitHub] [arrow] wesm commented on a change in pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r495174760



##########
File path: c_glib/test/dataset/test-scan-options.rb
##########
@@ -28,7 +28,7 @@ def test_schema
   end
 
   def test_batch_size
-    assert_equal(1<<15,
+    assert_equal(1<<20,

Review comment:
       Should probably be a constant, but doesn't need to be fixed here

##########
File path: cpp/src/arrow/dataset/scanner.h
##########
@@ -73,7 +73,7 @@ class ARROW_DS_EXPORT ScanOptions {
   RecordBatchProjector projector;
 
   // Maximum row count for scanned batches.
-  int64_t batch_size = 1 << 15;
+  int64_t batch_size = 1 << 20;

Review comment:
       Should this be a constant `kDefaultBatchSize`?

##########
File path: ci/scripts/python_test.sh
##########
@@ -29,4 +29,4 @@ export LD_LIBRARY_PATH=${ARROW_HOME}/lib:${LD_LIBRARY_PATH}
 # Enable some checks inside Python itself
 export PYTHONDEVMODE=1
 
-pytest -r s --pyargs pyarrow
+pytest -r s -vvv --pyargs pyarrow

Review comment:
       Revert this?

##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -856,18 +856,32 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
     return Status::OK();
   }
 
+  int64_t num_rows = 0;
+  for (int row_group : row_groups) {
+    num_rows += parquet_reader()->metadata()->RowGroup(row_group)->num_rows();
+  }

Review comment:
       Should this be a helper method on the FileMetaData?
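   
   For comparison, the same sum is already a one-liner from Python via `FileMetaData` (a sketch; the file name and row-group indices are placeholders):
   
   ```python
   import pyarrow.parquet as pq
   
   md = pq.ParquetFile("example.parquet").metadata
   row_groups = [0, 1]  # placeholder indices
   num_rows = sum(md.row_group(i).num_rows for i in row_groups)
   ```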




[GitHub] [arrow] bkietz removed a comment on pull request #8188: ARROW-9924: [C++][Dataset] Enable per-column parallelism for single ParquetFileFragment scans

Posted by GitBox <gi...@apache.org>.
bkietz removed a comment on pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#issuecomment-698590377


   Passing Travis: https://travis-ci.org/github/bkietz/arrow/builds/730077738

