Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/03 21:49:38 UTC

[GitHub] [arrow-rs] NGA-TRAN opened a new issue #252: schema: missing field `metadata` when writing to parquet file

NGA-TRAN opened a new issue #252:
URL: https://github.com/apache/arrow-rs/issues/252


   **Describe the bug**
   [Influxdb_iox](https://github.com/influxdata/influxdb_iox) invokes [arrow_writer::write](https://github.com/apache/arrow-rs/blob/8f030db53d9eda901c82db9daf94339fc447d0db/parquet/src/arrow/arrow_writer.rs#L83) to save a `RecordBatch` to a parquet file. When reading the data back, we see that `metadata` is empty.
   
   **To Reproduce**
   Exploring the [arrow_writer::write](https://github.com/apache/arrow-rs/blob/8f030db53d9eda901c82db9daf94339fc447d0db/parquet/src/arrow/arrow_writer.rs#L83) function, we see that `metadata` is not handled anywhere in it. Looking further into the function's unit tests, `metadata` is always created as the empty default, hence the expected and actual values always match and the tests cannot detect the loss.
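
A minimal sketch of why such a test cannot catch the problem. It does not use the real arrow-rs API: `Schema`, `lossy_round_trip`, `round_trip_preserves` and `schema_with_metadata` are hypothetical stand-ins invented for illustration, and the "round trip" simply drops metadata to mimic the behavior described above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow schema: field names plus key/value metadata.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

/// Hypothetical write-then-read round trip that keeps the fields but
/// drops the metadata, mimicking the behavior reported in this issue.
fn lossy_round_trip(schema: &Schema) -> Schema {
    Schema {
        fields: schema.fields.clone(),
        metadata: HashMap::new(),
    }
}

/// Returns true when the round trip gives back exactly the input schema.
fn round_trip_preserves(schema: &Schema) -> bool {
    lossy_round_trip(schema) == *schema
}

/// Builds a one-field schema carrying the given metadata pairs.
fn schema_with_metadata(pairs: &[(&str, &str)]) -> Schema {
    Schema {
        fields: vec!["a".to_string()],
        metadata: pairs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect(),
    }
}
```

With empty metadata the comparison passes even though the round trip is lossy; only a test that sets at least one metadata entry exposes the difference.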
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb edited a comment on issue #252: schema: missing field `metadata` when writing to parquet file

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-834652617


   Is this the behavior we want from the parquet writer?
   
   1. The RecordBatch metadata, `RecordBatch::metadata`, is written by the parquet writer as `FileMetaData::key_value_metadata` https://github.com/sunchao/parquet-format-rs/blob/master/parquet.thrift#L1009
   2. The Arrow field level metadata, `Field::metadata`, is written by the parquet writer as `ColumnMetaData::key_value_metadata` https://github.com/sunchao/parquet-format-rs/blob/master/parquet.thrift#L730
   
   What is the desired behavior when:
   1. Different `RecordBatch`es are written to the parquet writer with different metadata?
   2. Rows from a particular `RecordBatch` span multiple parquet row groups? (and thus there could be different field level metadata)
    
   Perhaps we could make the simplifying assumption and say "the arrow schema is supposed to be the same for all record batches, and thus we assume the metadata that applies to all the rows should be the same as well"?
   
   This would also mean the reader's semantics are straightforward: the `RecordBatch` and `Field` metadata for all data returned from the arrow parquet reader would be the same -- the file level metadata attached to the RecordBatch and the field level metadata attached to the Field.
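
The proposed mapping can be sketched roughly as follows. This is an illustration under the simplifying assumption above (one set of metadata for the whole file), not the arrow-rs or parquet API: `ArrowField`, `ArrowSchema`, `ColumnMeta`, `FileMeta`, `to_file_meta`, `from_file_meta` and `kv` are hypothetical names, and the real `ColumnMetaData` exists once per column chunk in each row group rather than once per column.

```rust
use std::collections::BTreeMap;

type KeyValue = BTreeMap<String, String>;

/// Hypothetical Arrow-side types: metadata at the schema and field level.
#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    metadata: KeyValue,
}

#[derive(Debug, Clone, PartialEq)]
struct ArrowSchema {
    fields: Vec<ArrowField>,
    metadata: KeyValue,
}

/// Hypothetical Parquet-side types, reduced to their key/value metadata.
#[derive(Debug, Clone, PartialEq)]
struct ColumnMeta {
    path: String,
    key_value_metadata: KeyValue,
}

#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    key_value_metadata: KeyValue,
    columns: Vec<ColumnMeta>,
}

/// Writer direction: schema metadata goes to the file level,
/// field metadata goes to the matching column.
fn to_file_meta(schema: &ArrowSchema) -> FileMeta {
    FileMeta {
        key_value_metadata: schema.metadata.clone(),
        columns: schema
            .fields
            .iter()
            .map(|f| ColumnMeta {
                path: f.name.clone(),
                key_value_metadata: f.metadata.clone(),
            })
            .collect(),
    }
}

/// Reader direction: the inverse mapping, so every batch read back
/// would carry the same schema and field metadata.
fn from_file_meta(meta: &FileMeta) -> ArrowSchema {
    ArrowSchema {
        fields: meta
            .columns
            .iter()
            .map(|c| ArrowField {
                name: c.path.clone(),
                metadata: c.key_value_metadata.clone(),
            })
            .collect(),
        metadata: meta.key_value_metadata.clone(),
    }
}

/// Small helper to build a key/value map from string pairs.
fn kv(pairs: &[(&str, &str)]) -> KeyValue {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

Under this assumption the two directions are exact inverses, which is what makes the reader's semantics simple.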





[GitHub] [arrow-rs] nevi-me commented on issue #252: schema: missing field `metadata` when writing to parquet file

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-831682725


   A bit embarrassing to say this, but I forgot about the metadata completely.
   
   From https://github.com/sunchao/parquet-format-rs/blob/master/parquet.thrift, there are 2 levels where we can store `key_value_metadata`: at the file level, and per column in a column chunk.
   
   This probably maps well to Arrow's metadata, which can be at the schema or field level.





[GitHub] [arrow-rs] alamb commented on issue #252: schema: missing field `metadata` when writing to parquet file

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-831560293


   Assigning to @NGA-TRAN at her request





[GitHub] [arrow-rs] alamb commented on issue #252: schema: missing field `metadata` when writing to parquet file

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-831559967


   FYI @nevi-me 





[GitHub] [arrow-rs] alamb edited a comment on issue #252: schema: missing field `metadata` when writing to parquet file

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-834652617


   Is this the behavior we want from the parquet writer?
   
   1. The RecordBatch metadata, `RecordBatch::metadata`, is written by the parquet writer as `FileMetaData::key_value_metadata` https://github.com/sunchao/parquet-format-rs/blob/master/parquet.thrift#L1009
   2. The Arrow field level metadata, `Field::metadata`, is written by the parquet writer as `ColumnMetaData::key_value_metadata` https://github.com/sunchao/parquet-format-rs/blob/master/parquet.thrift#L730
   
   What is the desired behavior when:
   1. Different `RecordBatch`es are written to the parquet writer with different metadata?
   2. Rows from a particular `RecordBatch` span multiple parquet row groups? (and thus there could be different field level metadata)
    
   Perhaps we could make the simplifying assumption and say "the arrow schema is supposed to be the same for all record batches, and thus we assume the metadata that applies to all the rows should be the same as well"?
   
   This would also mean the reader's semantics are straightforward: the `RecordBatch` and `Field` metadata for all data returned from the arrow parquet reader would be the same -- the `FileMetaData` attached to the RecordBatch and the field level metadata attached to the ColumnMetaData.





[GitHub] [arrow-rs] alamb commented on issue #252: schema: missing field `metadata` when writing to parquet file

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-837443209


   I can imagine a case where the individual row groups are sorted differently, or where someone adds some sort of custom per-row-group statistics to the metadata.
   
   However, if someone is doing that level of fanciness, they would probably be better off using the parquet library directly 🤔








[GitHub] [arrow-rs] nevi-me commented on issue #252: schema: missing field `metadata` when writing to parquet file

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-836925868


   > Perhaps we could make the simplifying assumption and say "the arrow schema is supposed to be the same for all record batches, and thus we assume the metadata that applies to all the rows should be the same as well"?
   
   If I think about how the IPC format works, we send the schema first, and then send batches after. The batches don't have a copy of the schema, but would have just the buffers making up the data.
   
   So, my thinking is that we:
   
   * Write the schema of the Arrow data to `FileMetaData`
   * Write the schema of each field to `ColumnMetaData`
   * Use the schema that's provided in the write function, and not the ones from each `ArrowWriter::write(batch: &RecordBatch)`.
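
The three points above can be sketched as a writer that fixes its schema up front. This is a hedged illustration, not the actual `ArrowWriter`: `Schema`, `Batch`, `SketchWriter` and `make_schema` are hypothetical, and the choice to silently ignore per-batch metadata (rather than reject a mismatch) is an assumption made for the example.

```rust
use std::collections::BTreeMap;

type KeyValue = BTreeMap<String, String>;

/// Hypothetical schema: field names plus schema-level metadata.
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    field_names: Vec<String>,
    metadata: KeyValue,
}

/// Hypothetical batch: carries its own schema copy and a row count.
struct Batch {
    schema: Schema,
    num_rows: usize,
}

/// Hypothetical writer that remembers the schema it was created with.
struct SketchWriter {
    schema: Schema,
    rows_written: usize,
}

impl SketchWriter {
    /// The schema (including metadata) is fixed here, once.
    fn new(schema: Schema) -> Self {
        SketchWriter { schema, rows_written: 0 }
    }

    /// Only the field layout of the batch is checked; any metadata on
    /// the batch's schema is ignored (assumption of this sketch).
    fn write(&mut self, batch: &Batch) -> Result<(), String> {
        if batch.schema.field_names != self.schema.field_names {
            return Err("batch fields do not match the writer schema".to_string());
        }
        self.rows_written += batch.num_rows;
        Ok(())
    }

    /// Returns the metadata that would go to the file level,
    /// together with the number of rows accepted.
    fn close(self) -> (KeyValue, usize) {
        (self.schema.metadata, self.rows_written)
    }
}

/// Helper building a schema whose metadata records where it came from.
fn make_schema(fields: &[&str], origin: &str) -> Schema {
    let mut metadata = KeyValue::new();
    metadata.insert("origin".to_string(), origin.to_string());
    Schema {
        field_names: fields.iter().map(|f| f.to_string()).collect(),
        metadata,
    }
}
```

Whatever metadata individual batches carry, the file-level metadata is the one supplied at construction, which mirrors how the IPC format sends the schema once before the batches.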
   
   I can't think of a valid use case where we would expect the metadata of a stream of Arrow data (at the schema or field level) to change mid-stream. I don't think we'd even be able to communicate such a scenario with `arrow-flight`.
   
   I wonder though, if Parquet ordinarily handles a scenario where the metadata per file is different 🤔

