You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/13 12:42:48 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #1558: Add option to skip decoding arrow metadata from parquet (#1459)

tustvold opened a new pull request, #1558:
URL: https://github.com/apache/arrow-rs/pull/1558

   # Which issue does this PR close?
   
   Closes #1459 
   Closes #1557
   Part of #1556 
   
   # Rationale for this change
    
   See tickets
   
   # What changes are included in this PR?
   
   See tickets
   
   # Are there any user-facing changes?
   
   Yes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r849437999


##########
parquet/src/arrow/schema.rs:
##########
@@ -690,6 +690,8 @@ impl ParquetTypeConverter<'_> {
                     t.unit
                 ))),
             },
+            // https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#unknown-always-null
+            (Some(LogicalType::UNKNOWN(_)), _) => Ok(DataType::Null),

Review Comment:
   This is the fix for #1557 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#issuecomment-1099100182

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1558](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (54755f5) into [master](https://codecov.io/gh/apache/arrow-rs/commit/c9549bba3b03e2c79f4cbcf98060f8ad82566c40?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c9549bb) will **increase** coverage by `0.02%`.
   > The diff coverage is `98.46%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1558      +/-   ##
   ==========================================
   + Coverage   82.83%   82.85%   +0.02%     
   ==========================================
     Files         190      190              
     Lines       54957    55042      +85     
   ==========================================
   + Hits        45521    45606      +85     
     Misses       9436     9436              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/file/serialized\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9zZXJpYWxpemVkX3JlYWRlci5ycw==) | `95.65% <ø> (ø)` | |
   | [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `94.48% <98.30%> (+0.69%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `87.07% <100.00%> (ø)` | |
   | [parquet/src/arrow/schema.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvc2NoZW1hLnJz) | `85.71% <100.00%> (+0.03%)` | :arrow_up: |
   | [parquet/src/file/metadata.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9tZXRhZGF0YS5ycw==) | `94.42% <100.00%> (ø)` | |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.37% <0.00%> (-0.19%)` | :arrow_down: |
   | [arrow/src/json/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2pzb24vcmVhZGVyLnJz) | `84.48% <0.00%> (+0.02%)` | :arrow_up: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `54.10% <0.00%> (+0.30%)` | :arrow_up: |
   | [arrow/src/ipc/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1558/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2lwYy9yZWFkZXIucnM=) | `88.28% <0.00%> (+0.48%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [c9549bb...54755f5](https://codecov.io/gh/apache/arrow-rs/pull/1558?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r849437690


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -154,14 +175,41 @@ impl ArrowReader for ParquetFileArrowReader {
 }
 
 impl ParquetFileArrowReader {
+    /// Create a new [`ParquetFileArrowReader`]
     pub fn new(file_reader: Arc<dyn FileReader>) -> Self {
-        Self { file_reader }
+        Self {
+            file_reader,
+            options: Default::default(),
+        }
+    }
+
+    /// Create a new [`ParquetFileArrowReader`] with the provided [`ArrowReaderOptions`]
+    pub fn new_with_options(
+        file_reader: Arc<dyn FileReader>,
+        options: ArrowReaderOptions,
+    ) -> Self {
+        Self {
+            file_reader,
+            options,
+        }
     }
 
-    // Expose the reader metadata
+    /// Expose the reader metadata
     pub fn get_metadata(&mut self) -> ParquetMetaData {
         self.file_reader.metadata().clone()
     }
+
+    /// Returns the key value metadata, returns `None` if [`ArrowReaderOptions::skip_arrow_metadata`]
+    fn get_kv_metadata(&self) -> Option<&Vec<KeyValue>> {
+        if self.options.skip_arrow_metadata {
+            return None;

Review Comment:
   This can't be written without the change in #1556 to move away from `&Option<T>`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r850341604


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -78,19 +78,40 @@ pub trait ArrowReader {
         T: IntoIterator<Item = usize>;
 }
 
+#[derive(Debug, Clone, Default)]
+pub struct ArrowReaderOptions {
+    skip_arrow_metadata: bool,
+}
+
+impl ArrowReaderOptions {
+    /// Create a new [`ArrowReaderOptions`] with the default settings
+    fn new() -> Self {
+        Self::default()
+    }
+
+    /// Parquet files generated by some writers may contain embedded arrow
+    /// schema and metadata. This may not be correct or compatible with your system
+    ///

Review Comment:
   ```suggestion
       /// schema and metadata. This may not be correct or compatible with your system.
       /// 
       /// For example:[ARROW-16184](https://issues.apache.org/jira/browse/ARROW-16184)
       ///
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1558: Add option to skip decoding arrow metadata from parquet (#1459)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1558:
URL: https://github.com/apache/arrow-rs/pull/1558#discussion_r849439609


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -1238,4 +1290,97 @@ mod tests {
         let val = list.value(0);
         assert_eq!(val.len(), 0);
     }
+
+    #[test]
+    fn test_null_schema_inference() {
+        let testdata = arrow::util::test_util::parquet_test_data();
+        let path = format!("{}/null_list.parquet", testdata);
+        let reader =
+            Arc::new(SerializedFileReader::try_from(File::open(&path).unwrap()).unwrap());
+
+        let arrow_field = Field::new(
+            "emptylist",
+            ArrowDataType::List(Box::new(Field::new("item", ArrowDataType::Null, true))),
+            true,
+        );
+
+        let options = ArrowReaderOptions::default().with_skip_arrow_metadata(true);
+        let mut arrow_reader = ParquetFileArrowReader::new_with_options(reader, options);
+        let schema = arrow_reader.get_schema().unwrap();
+        assert_eq!(schema.fields().len(), 1);
+        assert_eq!(schema.field(0), &arrow_field);
+    }
+
+    #[test]

Review Comment:
   This is the closest I could get to a test of #1459 as we always write and decode the LogicalType, even when technically PARQUET_1_0 doesn't support it. The nature of thrift means this isn't actually a bug I don't think



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org