Posted to github@arrow.apache.org by "msalib (via GitHub)" <gi...@apache.org> on 2023/04/05 20:01:24 UTC

[GitHub] [arrow-rs] msalib opened a new issue, #4023: `ParquetRecordBatchStream` is inconsistent about schemas

msalib opened a new issue, #4023:
URL: https://github.com/apache/arrow-rs/issues/4023

   **Describe the bug**
   
   Let's say you're trying to read a Parquet file on S3 asynchronously, and that file has metadata (like "created by"). There's an inconsistency:
   
   `ParquetRecordBatchStream::schema` will produce a `Schema` object that includes that metadata.
   But `ParquetRecordBatchStream` will yield `RecordBatch`es that have schema objects that don't have the metadata.
   
   The problem is that if you create an `ArrowWriter` using the first schema and then try to write batches from the stream to it, the schemas won't match: the writer expects metadata, but each batch has a schema without it.
   
   **Expected behavior**
   
   I'd expect that either:
   * `ParquetRecordBatchStream::schema` produces a `Schema` without metadata, or
   * the `RecordBatch`es produced by `ParquetRecordBatchStream` have the exact same schema as what `::schema` returns, or
   * `ArrowWriter` should tolerate its supplied schema differing from the batch schemas provided to `write()` in metadata
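
To make the mismatch concrete, here is a minimal std-only sketch. `ToySchema`, `without_metadata`, and `same_fields` are hypothetical names invented for illustration; they are not arrow-rs types or methods. The sketch only assumes that a schema is a list of fields plus key/value metadata and that the writer compares schemas with strict equality.

```rust
use std::collections::HashMap;

/// Toy stand-in for an Arrow schema: (name, type) pairs plus key/value metadata.
/// Hypothetical type for illustration only; this is not the arrow-rs `Schema`.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    fields: Vec<(String, String)>,
    metadata: HashMap<String, String>,
}

impl ToySchema {
    /// Copy of this schema with the metadata dropped (the first option above).
    fn without_metadata(&self) -> ToySchema {
        ToySchema {
            fields: self.fields.clone(),
            metadata: HashMap::new(),
        }
    }

    /// Metadata-tolerant comparison (the third option above): only fields must agree.
    fn same_fields(&self, other: &ToySchema) -> bool {
        self.fields == other.fields
    }
}

fn main() {
    // What `::schema` reports in this issue: fields plus file-level metadata.
    let stream_schema = ToySchema {
        fields: vec![("id".to_string(), "Int64".to_string())],
        metadata: HashMap::from([("created_by".to_string(), "example".to_string())]),
    };
    // What each yielded batch carries: the same fields, but no metadata.
    let batch_schema = stream_schema.without_metadata();

    // A strict equality check rejects the batch schema.
    assert_ne!(stream_schema, batch_schema);
    // A field-only check accepts it.
    assert!(stream_schema.same_fields(&batch_schema));
    // Stripping the metadata up front also makes the two schemas equal.
    assert_eq!(stream_schema.without_metadata(), batch_schema);
}
```

In this toy model, either dropping the metadata before constructing the writer or relaxing the comparison to fields only makes the two schemas compatible.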
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] `ParquetRecordBatchStream` Should Return the Projected Schema [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #4023: `ParquetRecordBatchStream` Should Return the Projected Schema
URL: https://github.com/apache/arrow-rs/issues/4023


Re: [I] `ParquetRecordBatchStream` Should Return the Projected Schema [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4023:
URL: https://github.com/apache/arrow-rs/issues/4023#issuecomment-1830664665

   Updated based on https://github.com/apache/arrow-rs/pull/5135#issuecomment-1830663668
   
   I agree this is a bug


[GitHub] [arrow-rs] tustvold commented on issue #4023: `ParquetRecordBatchStream` is inconsistent about schemas

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4023:
URL: https://github.com/apache/arrow-rs/issues/4023#issuecomment-1499487074

   > ArrowWriter should tolerate its supplied schema differing from the batch schemas provided to write() in metadata
   
   https://github.com/apache/arrow-rs/pull/4027 relaxes the check on ArrowWriter
   
   > ParquetRecordBatchStream::schema produces a Schema without metadata, or
   
   I think this would be consistent with `ParquetRecordBatchReader`, and would ensure it matches the `RecordBatch`es that are returned. `ParquetRecordBatchStreamBuilder::schema` should return the schema with metadata.

