Posted to issues@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/03/23 20:34:29 UTC

[GitHub] [arrow] wjones127 opened a new issue, #34712: [C++] Utilities to estimate average (de)serialized row size

wjones127 opened a new issue, #34712:
URL: https://github.com/apache/arrow/issues/34712

   ### Describe the enhancement requested
   
   We often parameterize things by number of rows, but what we would often rather do is set the batch size in bytes. This is frequently the case when reading/writing files or IPC streams. One solution would be to provide utilities to estimate the average row size. For example, the Velox project's file readers provide an `estimatedRowSize()` method (although I'm not sure how often it is used):
   
   https://github.com/facebookincubator/velox/blob/33c40fda3a7654891c506bf23d078c0da0cd4f0d/velox/dwio/common/Reader.h#L71
   
   This interface for the Parquet reader might be something like:
   
   ```cpp
   /// \brief Provides the average bytes per row as in-memory Arrow data.
   ///
   /// \param sample_rows if true, read a sample of rows to estimate the average
   /// size of variable-length columns
   Result<int64_t> EstimateDeserializedRowSize(
       parquet::arrow::FileReader& file_reader,
       const std::vector<int64_t>& column_indices,
       bool sample_rows = false);
   ```
   
   It would then be used something like this:
   
   ```cpp
   // Open the file and create a Parquet Arrow FileReader
   ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open("path/to/file"));
   std::unique_ptr<parquet::arrow::FileReader> file_reader;
   ARROW_RETURN_NOT_OK(
       parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &file_reader));
   
   std::vector<int64_t> columns = {1, 3, 5};
   ARROW_ASSIGN_OR_RAISE(int64_t row_size,
                         EstimateDeserializedRowSize(*file_reader, columns));
   
   // Configure Arrow-specific Parquet reader settings
   int64_t batch_size_bytes = 64 * 1024 * 1024;
   auto arrow_reader_props = parquet::ArrowReaderProperties();
   arrow_reader_props.set_batch_size(batch_size_bytes / row_size);
   
   // Then use the properties to get a RecordBatchReader...
   ```
   
   
   Similarly, when writing IPC we might want something like:
   
   ```cpp
   /// \brief Provides the average bytes per row in the serialized IPC message
   Result<int64_t> EstimateSerializedRowSize(
       const RecordBatch& batch,
       const IpcWriteOptions& write_options);
   ```
   
   So we can use this when writing to a Flight stream:
   
   ```cpp
   std::shared_ptr<Table> table = ...;
   
   // Materialize a batch to estimate from (in practice one would probably
   // sample a slice rather than combining the whole table)
   ARROW_ASSIGN_OR_RAISE(auto sample_batch, table->CombineChunksToBatch());
   ARROW_ASSIGN_OR_RAISE(int64_t row_size,
                         EstimateSerializedRowSize(*sample_batch, write_options));
   
   int64_t batch_size_bytes = 10 * 1024 * 1024;
   TableBatchReader reader(*table);
   reader.set_chunksize(batch_size_bytes / row_size);
   
   // Pass batches to DoPut
   ```
   
   ### Component(s)
   
   C++


[GitHub] [arrow] wjones127 commented on issue #34712: [C++] Utilities to estimate average (de)serialized row size

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #34712:
URL: https://github.com/apache/arrow/issues/34712#issuecomment-1481936921

   Alternatively, instead of returning a single value, it may make more sense to return a value per column. For fixed-size columns the value can always be present, and for variable-width ones it can be optional.
   
   Note that this kind of metadata is really important for cost-based plan optimizers.
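
   A rough sketch of what that per-column variant could look like (the name, return type, and parameters here are hypothetical, mirroring the single-value signature sketched above):
   
   ```cpp
   /// \brief Hypothetical per-column variant: average in-memory bytes per row
   /// for each requested column. Fixed-width columns always have a value;
   /// variable-width columns are std::nullopt unless sample_rows is true.
   Result<std::vector<std::optional<int64_t>>> EstimateDeserializedColumnSizes(
       parquet::arrow::FileReader& file_reader,
       const std::vector<int64_t>& column_indices,
       bool sample_rows = false);
   ```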


[GitHub] [arrow] westonpace commented on issue #34712: [C++] Utilities to estimate average (de)serialized row size

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34712:
URL: https://github.com/apache/arrow/issues/34712#issuecomment-1483045067

   I think the approach might be different for writing and for reading.  For example, for writing, if you wanted your output batches to be a certain size (in bytes) then you need to do one of the following:
   
    * Use the decoded & uncompressed size (no guessing required)
    * Write to a buffer and then to the file (no guessing required)
    * Guess how much the encodings & compression will reduce the data
   
   However, when reading, your options are more limited.  Typically you want to read a batch that has X bytes.  You can't use the decoded & uncompressed size (unless that is written in the statistics / metadata somewhere).  You can't read-twice in the same way you can write-twice.  You are then left with guessing.
   
   However, there is one other approach you can take when reading.  Instead of asking your column decoder for X pages or X row groups worth of data, you can ask it for X bytes worth of data.  The decoder can then advance through as many pages as it needs to deliver X bytes of data.  This is a bit tricky because, if you are reading a batch, you might get a different number of rows from each decoder.  However, that can be addressed as well.
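
   For the write side, the "write to a buffer and then to the file" option is already doable with existing IPC APIs; a minimal sketch (the helper name is just illustrative):
   
   ```cpp
   #include <arrow/api.h>
   #include <arrow/io/memory.h>
   #include <arrow/ipc/api.h>
   
   // Serialize one batch into memory with the intended write options (including
   // any compression) and measure the exact encoded size, instead of guessing.
   // Note the result also includes the stream schema message and EOS marker.
   arrow::Result<int64_t> MeasureSerializedBatchSize(
       const arrow::RecordBatch& batch, const arrow::ipc::IpcWriteOptions& options) {
     ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
     ARROW_ASSIGN_OR_RAISE(
         auto writer, arrow::ipc::MakeStreamWriter(sink, batch.schema(), options));
     ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(batch));
     ARROW_RETURN_NOT_OK(writer->Close());
     ARROW_ASSIGN_OR_RAISE(auto buffer, sink->Finish());
     return buffer->size();
   }
   ```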


[GitHub] [arrow] mapleFU commented on issue #34712: [C++] Utilities to estimate average (de)serialized row size

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on issue #34712:
URL: https://github.com/apache/arrow/issues/34712#issuecomment-1483864875

   The utility would be great; however, it's a bit tricky. I've implemented a similar size hint in our system, and here are some problems I ran into:
   
   1. Null values. In an Arrow Array, null values still occupy space, but the
       raw field size cannot account for that (see the sketch after this list).
   2. Size of FLBA/ByteArray. Its size should be either the sum of the
       variable-length values, or that sum plus sizeof(ByteArray) * value-count.
   3. Sometimes the Arrow data is not identical to the Parquet data, e.g. Decimal
       stored as int32 or int64.
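
   For example, for point 1: a fixed-width Arrow column costs the same in memory whether or not values are null, because each slot is allocated regardless and nulls are tracked in the validity bitmap. A tiny sketch of that per-row cost (assuming the caller has already checked that the type is fixed-width):
   
   ```cpp
   #include <arrow/api.h>
   
   // In-memory bits per row for a fixed-width Arrow column: the value slot plus
   // one validity-bitmap bit. This does not shrink when values are null, whereas
   // a "raw field size" computed from only the non-null Parquet values would.
   int64_t FixedWidthBitsPerRow(const arrow::DataType& type) {
     const auto& fixed = static_cast<const arrow::FixedWidthType&>(type);
     return fixed.bit_width() + 1;
   }
   ```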
   
   Hope that helps.


[GitHub] [arrow] wjones127 commented on issue #34712: [C++] Utilities to estimate average (de)serialized row size

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #34712:
URL: https://github.com/apache/arrow/issues/34712#issuecomment-1483105168

   Or if this gets traction, we might not have to guess at all (for Parquet): https://lists.apache.org/thread/3sm9n6tgjxsb0k6j1b6dr2nv3zx68bjy


[GitHub] [arrow] wgtmac commented on issue #34712: [C++] Utilities to estimate average (de)serialized row size

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on issue #34712:
URL: https://github.com/apache/arrow/issues/34712#issuecomment-1483841925

   > Alternatively, instead of returning a single value, it may make more sense to return a value per column. For fixed-size columns the value can always be present, and for variable-width ones it can be optional.
   > 
   > Note that this kind of metadata is really important for cost-based plan optimizers.
   
   Just want to add that things are different depending on whether we are talking about in-memory size or on-disk raw size (decoded & decompressed), especially when there are substantial null values.
   
   BTW, it is tricky to support this in file formats. We always have to deal with legacy files that do not have these metadata fields.
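
   That said, even legacy files carry `total_uncompressed_size` for every column chunk, so a coarse fallback is possible from metadata alone. A rough sketch (keeping in mind that this is the decompressed but still-encoded size, so it generally underestimates the in-memory Arrow size):
   
   ```cpp
   #include <parquet/metadata.h>
   
   // Rough average bytes per row derived from metadata present in every
   // Parquet file, legacy or not. total_uncompressed_size() is the
   // decompressed-but-still-encoded size of each column chunk.
   int64_t EstimateRowSizeFromMetadata(const parquet::FileMetaData& metadata) {
     int64_t total_uncompressed = 0;
     for (int rg = 0; rg < metadata.num_row_groups(); ++rg) {
       auto row_group = metadata.RowGroup(rg);
       for (int col = 0; col < row_group->num_columns(); ++col) {
         total_uncompressed += row_group->ColumnChunk(col)->total_uncompressed_size();
       }
     }
     return metadata.num_rows() > 0 ? total_uncompressed / metadata.num_rows() : 0;
   }
   ```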

