You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/07 08:02:21 UTC

[GitHub] [arrow-rs] Ted-Jiang opened a new pull request, #2012: Add page index reader test for all types and support empty index.

Ted-Jiang opened a new pull request, #2012:
URL: https://github.com/apache/arrow-rs/pull/2012

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-rs/issues/2010.
   
   # Rationale for this change
    
   After https://github.com/apache/parquet-testing/pull/25 merged, there will be standard page index test file in parquet-testing.All types support page index in parquet-format has add test in this pr.
   
   Another change:
   I found there is a situation one col has pageLocation but without min_max index(😂 i missed it). So add EMPTY_ARRAY in enum Index.
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1177148363

   @alamb @tustvold PTAL 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r914564868


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -1098,11 +1105,292 @@ mod tests {
         let offset_indexes = metadata.offset_indexes().unwrap();
         // only one row group
         assert_eq!(offset_indexes.len(), 1);
-        let offset_index = offset_indexes.get(0).unwrap();
-        let page_offset = offset_index.get(0).unwrap();
+        let offset_index = &offset_indexes[0];
+        let page_offset = &offset_index[0][0];
 
         assert_eq!(4, page_offset.offset);
         assert_eq!(152, page_offset.compressed_page_size);
         assert_eq!(0, page_offset.first_row_index);
     }
+
+    #[test]
+    fn test_page_index_reader_all_type() {
+        let test_file = get_test_file("alltypes_tiny_pages_plain.parquet");
+        let builder = ReadOptionsBuilder::new();
+        //enable read page index
+        let options = builder.with_page_index().build();
+        let reader_result = SerializedFileReader::new_with_options(test_file, options);
+        let reader = reader_result.unwrap();
+
+        // Test contents in Parquet metadata
+        let metadata = reader.metadata();
+        assert_eq!(metadata.num_row_groups(), 1);
+
+        let page_indexes = metadata.page_indexes().unwrap();
+        let row_group_offset_indexes = &metadata.offset_indexes().unwrap()[0];
+
+        // only one row group
+        assert_eq!(page_indexes.len(), 1);
+        let row_group_metadata = metadata.row_group(0);
+
+        //col0->id: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 7299, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][0] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Unordered,
+            );
+            assert_eq!(row_group_offset_indexes[0].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col1->bool_col:BOOLEAN UNCOMPRESSED DO:0 FPO:37329 SZ:3022/3022/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
+        if let Index::BOOLEAN(index) = &page_indexes[0][1] {
+            assert_eq!(index.indexes.len(), 82);
+            assert_eq!(row_group_offset_indexes[1].len(), 82);
+        } else {
+            unreachable!()
+        };
+        //col2->tinyint_col: INT32 UNCOMPRESSED DO:0 FPO:40351 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][2] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[2].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col4->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][3] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[3].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col5->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][4] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[4].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col6->bigint_col: INT64 UNCOMPRESSED DO:0 FPO:152326 SZ:71598/71598/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 90, num_nulls: 0]
+        if let Index::INT64(index) = &page_indexes[0][5] {
+            //Todo row_group_metadata.column(0).statistics().unwrap().min_bytes() only return 4 bytes
+            check_native_page_index(

Review Comment:
   i try to use
   ```
   row_group_metadata
                       .column(0)
                       .statistics()
                       .unwrap()
                       .min_bytes(),
   ```
   get min values from one column chunk metadata in type In64, but it return only 4 bytes...
   I think this is a bug.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916720635


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    EMPTY_ARRAY(),

Review Comment:
   I'm suggesting renaming the variant not using option



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r915853677


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    EMPTY_ARRAY(),

Review Comment:
   Could we also get a comment as to why this is necessary. I also wonder if `None` might be more self-explanatory than `EMPTY_ARRAY`??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r914564868


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -1098,11 +1105,292 @@ mod tests {
         let offset_indexes = metadata.offset_indexes().unwrap();
         // only one row group
         assert_eq!(offset_indexes.len(), 1);
-        let offset_index = offset_indexes.get(0).unwrap();
-        let page_offset = offset_index.get(0).unwrap();
+        let offset_index = &offset_indexes[0];
+        let page_offset = &offset_index[0][0];
 
         assert_eq!(4, page_offset.offset);
         assert_eq!(152, page_offset.compressed_page_size);
         assert_eq!(0, page_offset.first_row_index);
     }
+
+    #[test]
+    fn test_page_index_reader_all_type() {
+        let test_file = get_test_file("alltypes_tiny_pages_plain.parquet");
+        let builder = ReadOptionsBuilder::new();
+        //enable read page index
+        let options = builder.with_page_index().build();
+        let reader_result = SerializedFileReader::new_with_options(test_file, options);
+        let reader = reader_result.unwrap();
+
+        // Test contents in Parquet metadata
+        let metadata = reader.metadata();
+        assert_eq!(metadata.num_row_groups(), 1);
+
+        let page_indexes = metadata.page_indexes().unwrap();
+        let row_group_offset_indexes = &metadata.offset_indexes().unwrap()[0];
+
+        // only one row group
+        assert_eq!(page_indexes.len(), 1);
+        let row_group_metadata = metadata.row_group(0);
+
+        //col0->id: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 7299, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][0] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Unordered,
+            );
+            assert_eq!(row_group_offset_indexes[0].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col1->bool_col:BOOLEAN UNCOMPRESSED DO:0 FPO:37329 SZ:3022/3022/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
+        if let Index::BOOLEAN(index) = &page_indexes[0][1] {
+            assert_eq!(index.indexes.len(), 82);
+            assert_eq!(row_group_offset_indexes[1].len(), 82);
+        } else {
+            unreachable!()
+        };
+        //col2->tinyint_col: INT32 UNCOMPRESSED DO:0 FPO:40351 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][2] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[2].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col4->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][3] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[3].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col5->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][4] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[4].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col6->bigint_col: INT64 UNCOMPRESSED DO:0 FPO:152326 SZ:71598/71598/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 90, num_nulls: 0]
+        if let Index::INT64(index) = &page_indexes[0][5] {
+            //Todo row_group_metadata.column(0).statistics().unwrap().min_bytes() only return 4 bytes
+            check_native_page_index(

Review Comment:
   i try to use
   ```
   row_group_metadata
                       .column(0)
                       .statistics()
                       .unwrap()
                       .min_bytes(),
   ```
   get min values from one column chunk metadata in type In64, but it return only 4 bytes...
   I think this is a bud.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1178551744

   @tustvold do you know this error in IT
   ```
   ==========================================================
   Testing file duration
   ==========================================================
   Traceback (most recent call last):
     File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case
       run_binaries(producer, consumer, test_case)
     File "/arrow/dev/archery/archery/integration/runner.py", line 286, in _produce_consume
       consumer.stream_to_file(producer_stream_path, consumer_file_path)
     File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file
       self.run_shell_command(cmd)
     File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command
       subprocess.check_call(cmd, shell=True)
     File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call
       raise CalledProcessError(retcode, cmd)
   subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmp3okdpdw5/124c7ca4_generated_datetime.consumer_stream_as_file < /tmp/tmp3okdpdw5/124c7ca4_generated_datetime.producer_file_as_stream' returned non-zero exit status 1.
   ################# FAILURES #################
   FAILED TEST: datetime Java producing,  C# consuming
   1 failures
   1
   Error: `docker-compose --file /home/runner/work/arrow-rs/arrow-rs/docker-compose.yml run --rm -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration` exited with a non-zero exit code 1, see the process log above.
   The docker-compose command was invoked with the following parameters:
   Defaults defined in .env:
     ALMALINUX: 8
     ARCH: amd64
     ARCH_ALIAS: x86_64
     ARCH_SHORT: amd64
     ARROW_R_DEV: TRUE
     BUILDKIT_INLINE_CACHE: 1
     CLANG_TOOLS: 12
     COMPOSE_DOCKER_CLI_BUILD: 1
     CONAN: gcc10
     CUDA: 9.1
     DASK: latest
     DEBIAN: 11
     DEVTOOLSET_VERSION: -1
     DOCKER_BUILDKIT: 1
     DOCKER_VOLUME_PREFIX: 
     DOTNET: 6.0
     FEDORA: 35
     GCC_VERSION: 
     GO: 1.16
     HDFS: 3.2.1
     JDK: 8
     KARTOTHEK: latest
     LLVM: 13
     MAVEN: 3.5.4
     NODE: 16
     NUMPY: latest
     PANDAS: latest
     PYTHON: 3.8
     PYTHON_WHEEL_WINDOWS_IMAGE_REVISION: 2022-06-12
     R: 4.2
     REPO: apache/arrow-dev
     R_CUSTOM_CCACHE: false
     R_IMAGE: ubuntu-gcc-release
     R_ORG: rhub
     R_PRUNE_DEPS: FALSE
     R_TAG: latest
     SPARK: master
     TURBODBC: latest
     TZ: UTC
     UBUNTU: 20.04
     ULIMIT_CORE: -1
     VCPKG: 38bb87c
   Archery was called with:
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916416883


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    EMPTY_ARRAY(),

Review Comment:
   I think change `Index` to `Option<Index>` need write a lot useless code. Use `None` will change a lot of func to `.....()->Option<Index>`. 
   I think add `EMPTY_ARRAY ` is more convenient😊



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916931053


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    /// Sometimes reading page index from parquet file
+    /// will only return pageLocations without min_max index,
+    /// Use `EMPTY_ARRAY` representing None will be more convenient.
+    EMPTY_ARRAY,

Review Comment:
   make sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1177286737

   @alamb @tustvold PTAL 😄 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r915853432


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    EMPTY_ARRAY(),

Review Comment:
   ```suggestion
       EMPTY_ARRAY,
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916717329


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    /// Sometimes reading page index from parquet file
+    /// will only return pageLocations without min_max index,
+    /// Use `EMPTY_ARRAY` representing None will be more convenient.
+    EMPTY_ARRAY,

Review Comment:
   I think I would prefer None to indicate absence of index information



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r914564868


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -1098,11 +1105,292 @@ mod tests {
         let offset_indexes = metadata.offset_indexes().unwrap();
         // only one row group
         assert_eq!(offset_indexes.len(), 1);
-        let offset_index = offset_indexes.get(0).unwrap();
-        let page_offset = offset_index.get(0).unwrap();
+        let offset_index = &offset_indexes[0];
+        let page_offset = &offset_index[0][0];
 
         assert_eq!(4, page_offset.offset);
         assert_eq!(152, page_offset.compressed_page_size);
         assert_eq!(0, page_offset.first_row_index);
     }
+
+    #[test]
+    fn test_page_index_reader_all_type() {
+        let test_file = get_test_file("alltypes_tiny_pages_plain.parquet");
+        let builder = ReadOptionsBuilder::new();
+        //enable read page index
+        let options = builder.with_page_index().build();
+        let reader_result = SerializedFileReader::new_with_options(test_file, options);
+        let reader = reader_result.unwrap();
+
+        // Test contents in Parquet metadata
+        let metadata = reader.metadata();
+        assert_eq!(metadata.num_row_groups(), 1);
+
+        let page_indexes = metadata.page_indexes().unwrap();
+        let row_group_offset_indexes = &metadata.offset_indexes().unwrap()[0];
+
+        // only one row group
+        assert_eq!(page_indexes.len(), 1);
+        let row_group_metadata = metadata.row_group(0);
+
+        //col0->id: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 7299, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][0] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Unordered,
+            );
+            assert_eq!(row_group_offset_indexes[0].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col1->bool_col:BOOLEAN UNCOMPRESSED DO:0 FPO:37329 SZ:3022/3022/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
+        if let Index::BOOLEAN(index) = &page_indexes[0][1] {
+            assert_eq!(index.indexes.len(), 82);
+            assert_eq!(row_group_offset_indexes[1].len(), 82);
+        } else {
+            unreachable!()
+        };
+        //col2->tinyint_col: INT32 UNCOMPRESSED DO:0 FPO:40351 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][2] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[2].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col4->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][3] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[3].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col5->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+        if let Index::INT32(index) = &page_indexes[0][4] {
+            check_native_page_index(
+                index,
+                325,
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .min_bytes(),
+                row_group_metadata
+                    .column(0)
+                    .statistics()
+                    .unwrap()
+                    .max_bytes(),
+                BoundaryOrder::Ascending,
+            );
+            assert_eq!(row_group_offset_indexes[4].len(), 325);
+        } else {
+            unreachable!()
+        };
+        //col6->bigint_col: INT64 UNCOMPRESSED DO:0 FPO:152326 SZ:71598/71598/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 90, num_nulls: 0]
+        if let Index::INT64(index) = &page_indexes[0][5] {
+            //Todo row_group_metadata.column(0).statistics().unwrap().min_bytes() only return 4 bytes
+            check_native_page_index(

Review Comment:
   i try to use
   ```
   row_group_metadata
                       .column(0)
                       .statistics()
                       .unwrap()
                       .min_bytes(),
   ```
   get min values from one column chunk metadata in type In64, but it return only 4 bytes...
   I think this is a bug.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916717329


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    /// Sometimes reading page index from parquet file
+    /// will only return pageLocations without min_max index,
+    /// Use `EMPTY_ARRAY` representing None will be more convenient.
+    EMPTY_ARRAY,

Review Comment:
   I think I would prefer None to indicate absence of index information. EMPTY_ARRAY makes me think of an empty byte array or something



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916718724


##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
     DOUBLE(NativeIndex<f64>),
     BYTE_ARRAY(ByteArrayIndex),
     FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+    /// Sometimes reading page index from parquet file
+    /// will only return pageLocations without min_max index,
+    /// Use `EMPTY_ARRAY` representing None will be more convenient.

Review Comment:
   ```suggestion
       /// `None` represents this lack of index information
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1177207934

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2012](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9270235) into [master](https://codecov.io/gh/apache/arrow-rs/commit/230915703471e9bf1403af657a447fcab971d530?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2309157) will **increase** coverage by `0.12%`.
   > The diff coverage is `91.66%`.
   
   > :exclamation: Current head 9270235 differs from pull request most recent head 1270878. Consider uploading reports for the commit 1270878 to get more accurate results
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2012      +/-   ##
   ==========================================
   + Coverage   83.58%   83.71%   +0.12%     
   ==========================================
     Files         222      222              
     Lines       57529    57749     +220     
   ==========================================
   + Hits        48088    48342     +254     
   + Misses       9441     9407      -34     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/file/page\_index/index.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9wYWdlX2luZGV4L2luZGV4LnJz) | `69.23% <ø> (+46.15%)` | :arrow_up: |
   | [parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvcmVjb3JkL2FwaS5ycw==) | `96.82% <ø> (ø)` | |
   | [arrow/src/compute/kernels/arity.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9hcml0eS5ycw==) | `81.05% <78.08%> (-9.86%)` | :arrow_down: |
   | [...row/src/array/builder/fixed\_size\_binary\_builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIvZml4ZWRfc2l6ZV9iaW5hcnlfYnVpbGRlci5ycw==) | `85.52% <94.44%> (+21.42%)` | :arrow_up: |
   | [parquet/src/file/serialized\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9zZXJpYWxpemVkX3JlYWRlci5ycw==) | `95.65% <97.77%> (+0.70%)` | :arrow_up: |
   | [arrow/src/compute/kernels/arithmetic.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9hcml0aG1ldGljLnJz) | `92.35% <100.00%> (+0.40%)` | :arrow_up: |
   | [parquet/src/file/metadata.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9tZXRhZGF0YS5ycw==) | `95.39% <100.00%> (ø)` | |
   | [parquet/src/file/page\_index/index\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9wYWdlX2luZGV4L2luZGV4X3JlYWRlci5ycw==) | `90.90% <100.00%> (+10.60%)` | :arrow_up: |
   | [arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2JpbmFyeS5ycw==) | `93.18% <0.00%> (-1.11%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.43% <0.00%> (-0.20%)` | :arrow_down: |
   | ... and [7 more](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [2309157...1270878](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.
URL: https://github.com/apache/arrow-rs/pull/2012


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1178861924

   See https://github.com/apache/arrow-rs/issues/1931


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.
URL: https://github.com/apache/arrow-rs/pull/2012


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org