Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/12 21:43:20 UTC

[GitHub] [arrow] wesm commented on a change in pull request #7534: ARROW-8729: [C++][Dataset] Ensure non-empty batches when only virtual columns are projected

wesm commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r453366215



##########
File path: rust/datafusion/README.md
##########
@@ -69,5 +69,15 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] Parquet primitive types
 - [ ] Parquet nested types
 
-# Examples
+# Supported SQL
 
+This library currently supports the following SQL constructs:
+
+* `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's location
+* `SELECT ... FROM ...` together with any expression
+* `ALIAS` to name an expression
+* `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
+* Most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`
+* `WHERE` to filter
+* `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
+* `ORDER BY` together with an expression and optional `DESC`

Review comment:
       This shouldn't be here
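
A query exercising several of the constructs listed in the README diff above might look like the following (illustrative only; the table and column names are made up, not from the PR):

```sql
CREATE EXTERNAL TABLE sales STORED AS PARQUET LOCATION '/data/sales';

SELECT region AS r,
       MIN(amount),
       MAX(amount),
       AVG(CAST(amount AS float))
FROM sales
WHERE sqrt(amount) >= 2.0
GROUP BY region
ORDER BY region DESC;
```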

##########
File path: cpp/src/parquet/arrow/reader.h
##########
@@ -151,8 +151,10 @@ class PARQUET_EXPORT FileReader {
       int i, std::shared_ptr<::arrow::ChunkedArray>* out) = 0;
 
   /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
-  ///    ordering in row_group_indices matters.
-  /// \returns error Status if row_group_indices contains invalid index
+  ///     ordering in row_group_indices matters. Each row group will be held in memory
+  ///     until all of its slices have been yielded as RecordBatches, at which point the
+  ///     next row group will be loaded. FileReaders must outlive their RecordBatchReaders.

Review comment:
       Probably want to add a TODO here that we want to fix this (this is a problem for memory use)
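
The lifetime contract described in that doc comment can be illustrated with a std-only sketch. All names here (`FileLikeReader`, `BatchReader`) are hypothetical stand-ins, not the real `parquet::arrow` API: the point is only that the batch reader holds a non-owning pointer back to its parent, so the parent must stay alive while batches are drawn.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Batch {
  int num_rows;
};

class FileLikeReader;  // stands in for parquet::arrow::FileReader

// Stands in for ::arrow::RecordBatchReader. It does not own its parent.
class BatchReader {
 public:
  explicit BatchReader(const FileLikeReader* parent) : parent_(parent) {}
  // Returns false once every row group has been yielded.
  bool Next(Batch* out);

 private:
  const FileLikeReader* parent_;  // non-owning: parent must outlive us
  std::size_t pos_ = 0;
};

class FileLikeReader {
 public:
  explicit FileLikeReader(std::vector<int> row_group_sizes)
      : sizes_(std::move(row_group_sizes)) {}
  BatchReader GetBatchReader() const { return BatchReader(this); }
  const std::vector<int>& sizes() const { return sizes_; }

 private:
  std::vector<int> sizes_;
};

bool BatchReader::Next(Batch* out) {
  if (pos_ >= parent_->sizes().size()) return false;
  *out = Batch{parent_->sizes()[pos_++]};
  return true;
}

// Correct usage: the FileLikeReader outlives the BatchReader it hands out.
int TotalRows(const FileLikeReader& reader) {
  BatchReader batches = reader.GetBatchReader();
  int total = 0;
  Batch b;
  while (batches.Next(&b)) total += b.num_rows;
  return total;
}
```

Returning `BatchReader` by value while the caller keeps the `FileLikeReader` on the stack mirrors the documented requirement; destroying the parent first would leave a dangling pointer.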

##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -794,8 +773,35 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_in
     END_PARQUET_CATCH_EXCEPTIONS
   }
 
-  return RowGroupRecordBatchReader::Make(row_group_indices, column_indices, this,
-                                         reader_properties_.batch_size(), out);
+  using ::arrow::RecordBatchIterator;
+
+  // NB: This lambda will be invoked lazily whenever a new row group must be
+  // scanned, so it must capture `column_indices` by value (it will not
+  // otherwise outlive the scope of `GetRecordBatchReader()`). `this` is a non-owning
+  // pointer so we are relying on the parent FileReader outliving this RecordBatchReader.
+  auto row_group_index_to_batch_iterator =
+      [column_indices, this](const int* i) -> ::arrow::Result<RecordBatchIterator> {
+    std::shared_ptr<::arrow::Table> table;
+    RETURN_NOT_OK(RowGroup(*i)->ReadTable(column_indices, &table));

Review comment:
       It isn't great to be reading the whole row group into memory before returning it in chunks. Is there an issue about materializing only a chunk at a time (or a background actor-style reader that asynchronously materializes the next chunk in between calls to `Next`)?
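
The shape of the code under discussion, an outer iterator over row-group indices where each index is lazily mapped to an inner iterator of batches, can be sketched with standard-library types only. All names here are hypothetical, not the Arrow API; the `loads` counter makes the laziness observable: a row group is only "materialized" when its first batch is requested, never at construction time.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <vector>

struct Batch {
  int row_group;
  int slice;
};

// Stands in for ::arrow::RecordBatchIterator.
using BatchIterator = std::function<std::optional<Batch>()>;

// Analogous to the lambda in the diff: maps a row-group index to an
// iterator over that row group's batches. Loading (here, bumping the
// counter) happens only when this function is actually invoked.
BatchIterator MakeRowGroupBatchIterator(int row_group, int num_slices,
                                        int* loads) {
  ++*loads;  // "materialization" happens here, once per row group
  auto slice = std::make_shared<int>(0);
  return [=]() -> std::optional<Batch> {
    if (*slice >= num_slices) return std::nullopt;
    return Batch{row_group, (*slice)++};
  };
}

// Flattens the iterator-of-iterators, dropping each exhausted row group
// before loading the next one.
class LazyBatchReader {
 public:
  LazyBatchReader(std::vector<int> row_groups, int slices_per_group,
                  int* loads)
      : row_groups_(std::move(row_groups)),
        slices_per_group_(slices_per_group),
        loads_(loads) {}

  std::optional<Batch> Next() {
    while (true) {
      if (current_) {
        if (auto b = (*current_)()) return b;
        current_.reset();  // row group exhausted; release before loading next
      }
      if (pos_ >= row_groups_.size()) return std::nullopt;
      current_ = MakeRowGroupBatchIterator(row_groups_[pos_++],
                                           slices_per_group_, loads_);
    }
  }

 private:
  std::vector<int> row_groups_;
  int slices_per_group_;
  int* loads_;
  std::optional<BatchIterator> current_;
  std::size_t pos_ = 0;
};
```

Even in this sketch, one whole row group is resident at a time, which is exactly the memory concern raised above; materializing a single slice at a time would require pushing the laziness one level deeper, into `MakeRowGroupBatchIterator` itself.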

##########
File path: cpp/src/parquet/arrow/reader.h
##########
@@ -163,8 +165,11 @@ class PARQUET_EXPORT FileReader {
   /// \brief Return a RecordBatchReader of row groups selected from
   ///     row_group_indices, whose columns are selected by column_indices. The
   ///     ordering in row_group_indices and column_indices matter.
+  ///     Each row group will be held in memory until all of its slices have been
+  ///     yielded as RecordBatches, at which point the next row group will be loaded.
+  ///     FileReaders must outlive their RecordBatchReaders.

Review comment:
       Add a TODO that this is considered to be a problem




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org