You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/13 12:42:48 UTC
[GitHub] [arrow-rs] tustvold opened a new pull request, #1558: Add option to skip decoding arrow metadata from parquet (#1459)
tustvold opened a new pull request, #1558:
URL: https://github.com/apache/arrow-rs/pull/1558
# Which issue does this PR close?
Closes #1459
Closes #1557
Part of #1556
# Rationale for this change
See tickets
# What changes are included in this PR?
See tickets
# Are there any user-facing changes?
Yes
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r849437999
##########
parquet/src/arrow/schema.rs:
##########
@@ -690,6 +690,8 @@ impl ParquetTypeConverter<'_> {
t.unit
))),
},
+ // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null
+ (Some(LogicalType::UNKNOWN(_)), _) => Ok(DataType::Null),
Review Comment:
This is the fix for #1557
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#issuecomment-1099100182
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#1558](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (54755f5) into [master](https://codecov.io/gh/apache/arrow-rs/commit/c9549bba3b03e2c79f4cbcf98060f8ad82566c40?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c9549bb) will **increase** coverage by `0.02%`.
> The diff coverage is `98.46%`.
```diff
@@ Coverage Diff @@
## master #1558 +/- ##
==========================================
+ Coverage 82.83% 82.85% +0.02%
==========================================
Files 190 190
Lines 54957 55042 +85
==========================================
+ Hits 45521 45606 +85
Misses 9436 9436
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [parquet/src/file/serialized\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9zZXJpYWxpemVkX3JlYWRlci5ycw==) | `95.65% <ø> (ø)` | |
| [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `94.48% <98.30%> (+0.69%)` | :arrow_up: |
| [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `87.07% <100.00%> (ø)` | |
| [parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hLnJz) | `85.71% <100.00%> (+0.03%)` | :arrow_up: |
| [parquet/src/file/metadata.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9tZXRhZGF0YS5ycw==) | `94.42% <100.00%> (ø)` | |
| [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.37% <0.00%> (-0.19%)` | :arrow_down: |
| [arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2pzb24vcmVhZGVyLnJz) | `84.48% <0.00%> (+0.02%)` | :arrow_up: |
| [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `54.10% <0.00%> (+0.30%)` | :arrow_up: |
| [arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy9yZWFkZXIucnM=) | `88.28% <0.00%> (+0.48%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [c9549bb...54755f5](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r849437690
##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -154,14 +175,41 @@ impl ArrowReader for ParquetFileArrowReader {
}
impl ParquetFileArrowReader {
+ /// Create a new [`ParquetFileArrowReader`]
pub fn new(file_reader: Arc<dyn FileReader>) -> Self {
- Self { file_reader }
+ Self {
+ file_reader,
+ options: Default::default(),
+ }
+ }
+
+ /// Create a new [`ParquetFileArrowReader`] with the provided [`ArrowReaderOptions`]
+ pub fn new_with_options(
+ file_reader: Arc<dyn FileReader>,
+ options: ArrowReaderOptions,
+ ) -> Self {
+ Self {
+ file_reader,
+ options,
+ }
}
- // Expose the reader metadata
+ /// Expose the reader metadata
pub fn get_metadata(&mut self) -> ParquetMetaData {
self.file_reader.metadata().clone()
}
+
+ /// Returns the key value metadata, returns `None` if [`ArrowReaderOptions::skip_arrow_metadata`]
+ fn get_kv_metadata(&self) -> Option<&Vec<KeyValue>> {
+ if self.options.skip_arrow_metadata {
+ return None;
Review Comment:
This can't be written without the change in #1556 to move away from `&Option<T>`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold merged pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)
Posted by GitBox <gi...@apache.org>.
tustvold merged PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)
Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r850341604
##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -78,19 +78,40 @@ pub trait ArrowReader {
T: IntoIterator<Item = usize>;
}
+#[derive(Debug, Clone, Default)]
+pub struct ArrowReaderOptions {
+ skip_arrow_metadata: bool,
+}
+
+impl ArrowReaderOptions {
+ /// Create a new [`ArrowReaderOptions`] with the default settings
+ fn new() -> Self {
+ Self::default()
+ }
+
+ /// Parquet files generated by some writers may contain embedded arrow
+ /// schema and metadata. This may not be correct or compatible with your system
+ ///
Review Comment:
```suggestion
/// schema and metadata. This may not be correct or compatible with your system.
///
/// For example:[ARROW-16184](https://issues.apache.org/jira/browse/ARROW-16184)
///
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r849439609
##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -1238,4 +1290,97 @@ mod tests {
let val = list.value(0);
assert_eq!(val.len(), 0);
}
+
+ #[test]
+ fn test_null_schema_inference() {
+ let testdata = arrow::util::test_util::parquet_test_data();
+ let path = format!("{}/null_list.parquet", testdata);
+ let reader =
+ Arc::new(SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap());
+
+ let arrow_field = Field::new(
+ "emptylist",
+ ArrowDataType::List(Box::new(Field::new("item", ArrowDataType::Null, true))),
+ true,
+ );
+
+ let options = ArrowReaderOptions::default().with_skip_arrow_metadata(true);
+ let mut arrow_reader = ParquetFileArrowReader::new_with_options(reader, options);
+ let schema = arrow_reader.get_schema().unwrap();
+ assert_eq!(schema.fields().len(), 1);
+ assert_eq!(schema.field(0), &arrow_field);
+ }
+
+ #[test]
Review Comment:
This is the closest I could get to a test of #1459 as we always write and decode the LogicalType, even when technically PARQUET_1_0 doesn't support it. The nature of thrift means this isn't actually a bug I don't think
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org