You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/06 08:28:30 UTC
[GitHub] [arrow-rs] Ted-Jiang opened a new pull request, #2012: Add page index reader test and support empty index.
Ted-Jiang opened a new pull request, #2012:
URL: https://github.com/apache/arrow-rs/pull/2012
# Which issue does this PR close?
Closes https://github.com/apache/arrow-rs/issues/2010.
# Rationale for this change
After https://github.com/apache/parquet-testing/pull/25 merged, there will be standard page index test file in parquet-testing.All types support page index in parquet-format has add test in this pr.
Another change:
I found there is a situation one col has pageLocation but without min_max index(😂 i missed it). So add EMPTY_ARRAY in enum Index.
# What changes are included in this PR?
<!---
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
-->
# Are there any user-facing changes?
<!---
If there are user-facing changes then we may require documentation to be updated before approving the PR.
-->
<!---
If there are any breaking changes to public APIs, please add the `breaking change` label.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1177148363
@alamb @tustvold PTAL 😄
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r914564868
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -1098,11 +1105,292 @@ mod tests {
let offset_indexes = metadata.offset_indexes().unwrap();
// only one row group
assert_eq!(offset_indexes.len(), 1);
- let offset_index = offset_indexes.get(0).unwrap();
- let page_offset = offset_index.get(0).unwrap();
+ let offset_index = &offset_indexes[0];
+ let page_offset = &offset_index[0][0];
assert_eq!(4, page_offset.offset);
assert_eq!(152, page_offset.compressed_page_size);
assert_eq!(0, page_offset.first_row_index);
}
+
+ #[test]
+ fn test_page_index_reader_all_type() {
+ let test_file = get_test_file("alltypes_tiny_pages_plain.parquet");
+ let builder = ReadOptionsBuilder::new();
+ //enable read page index
+ let options = builder.with_page_index().build();
+ let reader_result = SerializedFileReader::new_with_options(test_file, options);
+ let reader = reader_result.unwrap();
+
+ // Test contents in Parquet metadata
+ let metadata = reader.metadata();
+ assert_eq!(metadata.num_row_groups(), 1);
+
+ let page_indexes = metadata.page_indexes().unwrap();
+ let row_group_offset_indexes = &metadata.offset_indexes().unwrap()[0];
+
+ // only one row group
+ assert_eq!(page_indexes.len(), 1);
+ let row_group_metadata = metadata.row_group(0);
+
+ //col0->id: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 7299, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][0] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Unordered,
+ );
+ assert_eq!(row_group_offset_indexes[0].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col1->bool_col:BOOLEAN UNCOMPRESSED DO:0 FPO:37329 SZ:3022/3022/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
+ if let Index::BOOLEAN(index) = &page_indexes[0][1] {
+ assert_eq!(index.indexes.len(), 82);
+ assert_eq!(row_group_offset_indexes[1].len(), 82);
+ } else {
+ unreachable!()
+ };
+ //col2->tinyint_col: INT32 UNCOMPRESSED DO:0 FPO:40351 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][2] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[2].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col4->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][3] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[3].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col5->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][4] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[4].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col6->bigint_col: INT64 UNCOMPRESSED DO:0 FPO:152326 SZ:71598/71598/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 90, num_nulls: 0]
+ if let Index::INT64(index) = &page_indexes[0][5] {
+ //Todo row_group_metadata.column(0).statistics().unwrap().min_bytes() only return 4 bytes
+ check_native_page_index(
Review Comment:
i try to use
```
row_group_metadata
.column(0)
.statistics()
.unwrap()
.min_bytes(),
```
get min values from one column chunk metadata in type In64, but it return only 4 bytes...
I think this is a bug.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916720635
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ EMPTY_ARRAY(),
Review Comment:
I'm suggesting renaming the variant not using option
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r915853677
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ EMPTY_ARRAY(),
Review Comment:
Could we also get a comment as to why this is necessary. I also wonder if `None` might be more self-explanatory than `EMPTY_ARRAY`??
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r914564868
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -1098,11 +1105,292 @@ mod tests {
let offset_indexes = metadata.offset_indexes().unwrap();
// only one row group
assert_eq!(offset_indexes.len(), 1);
- let offset_index = offset_indexes.get(0).unwrap();
- let page_offset = offset_index.get(0).unwrap();
+ let offset_index = &offset_indexes[0];
+ let page_offset = &offset_index[0][0];
assert_eq!(4, page_offset.offset);
assert_eq!(152, page_offset.compressed_page_size);
assert_eq!(0, page_offset.first_row_index);
}
+
+ #[test]
+ fn test_page_index_reader_all_type() {
+ let test_file = get_test_file("alltypes_tiny_pages_plain.parquet");
+ let builder = ReadOptionsBuilder::new();
+ //enable read page index
+ let options = builder.with_page_index().build();
+ let reader_result = SerializedFileReader::new_with_options(test_file, options);
+ let reader = reader_result.unwrap();
+
+ // Test contents in Parquet metadata
+ let metadata = reader.metadata();
+ assert_eq!(metadata.num_row_groups(), 1);
+
+ let page_indexes = metadata.page_indexes().unwrap();
+ let row_group_offset_indexes = &metadata.offset_indexes().unwrap()[0];
+
+ // only one row group
+ assert_eq!(page_indexes.len(), 1);
+ let row_group_metadata = metadata.row_group(0);
+
+ //col0->id: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 7299, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][0] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Unordered,
+ );
+ assert_eq!(row_group_offset_indexes[0].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col1->bool_col:BOOLEAN UNCOMPRESSED DO:0 FPO:37329 SZ:3022/3022/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
+ if let Index::BOOLEAN(index) = &page_indexes[0][1] {
+ assert_eq!(index.indexes.len(), 82);
+ assert_eq!(row_group_offset_indexes[1].len(), 82);
+ } else {
+ unreachable!()
+ };
+ //col2->tinyint_col: INT32 UNCOMPRESSED DO:0 FPO:40351 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][2] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[2].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col4->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][3] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[3].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col5->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][4] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[4].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col6->bigint_col: INT64 UNCOMPRESSED DO:0 FPO:152326 SZ:71598/71598/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 90, num_nulls: 0]
+ if let Index::INT64(index) = &page_indexes[0][5] {
+ //Todo row_group_metadata.column(0).statistics().unwrap().min_bytes() only return 4 bytes
+ check_native_page_index(
Review Comment:
i try to use
```
row_group_metadata
.column(0)
.statistics()
.unwrap()
.min_bytes(),
```
get min values from one column chunk metadata in type In64, but it return only 4 bytes...
I think this is a bud.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1178551744
@tustvold do you know this error in IT
```
==========================================================
Testing file duration
==========================================================
Traceback (most recent call last):
File "/arrow/dev/archery/archery/integration/runner.py", line 246, in _run_ipc_test_case
run_binaries(producer, consumer, test_case)
File "/arrow/dev/archery/archery/integration/runner.py", line 286, in _produce_consume
consumer.stream_to_file(producer_stream_path, consumer_file_path)
File "/arrow/dev/archery/archery/integration/tester_csharp.py", line 63, in stream_to_file
self.run_shell_command(cmd)
File "/arrow/dev/archery/archery/integration/tester.py", line 49, in run_shell_command
subprocess.check_call(cmd, shell=True)
File "/opt/conda/envs/arrow/lib/python3.10/subprocess.py", line 369, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/arrow/csharp/artifacts/Apache.Arrow.IntegrationTest/Debug/net6.0/Apache.Arrow.IntegrationTest --mode stream-to-file -a /tmp/tmp3okdpdw5/124c7ca4_generated_datetime.consumer_stream_as_file < /tmp/tmp3okdpdw5/124c7ca4_generated_datetime.producer_file_as_stream' returned non-zero exit status 1.
################# FAILURES #################
FAILED TEST: datetime Java producing, C# consuming
1 failures
1
Error: `docker-compose --file /home/runner/work/arrow-rs/arrow-rs/docker-compose.yml run --rm -e ARCHERY_INTEGRATION_WITH_RUST=1 conda-integration` exited with a non-zero exit code 1, see the process log above.
The docker-compose command was invoked with the following parameters:
Defaults defined in .env:
ALMALINUX: 8
ARCH: amd64
ARCH_ALIAS: x86_64
ARCH_SHORT: amd64
ARROW_R_DEV: TRUE
BUILDKIT_INLINE_CACHE: 1
CLANG_TOOLS: 12
COMPOSE_DOCKER_CLI_BUILD: 1
CONAN: gcc10
CUDA: 9.1
DASK: latest
DEBIAN: 11
DEVTOOLSET_VERSION: -1
DOCKER_BUILDKIT: 1
DOCKER_VOLUME_PREFIX:
DOTNET: 6.0
FEDORA: 35
GCC_VERSION:
GO: 1.16
HDFS: 3.2.1
JDK: 8
KARTOTHEK: latest
LLVM: 13
MAVEN: 3.5.4
NODE: 16
NUMPY: latest
PANDAS: latest
PYTHON: 3.8
PYTHON_WHEEL_WINDOWS_IMAGE_REVISION: 2022-06-12
R: 4.2
REPO: apache/arrow-dev
R_CUSTOM_CCACHE: false
R_IMAGE: ubuntu-gcc-release
R_ORG: rhub
R_PRUNE_DEPS: FALSE
R_TAG: latest
SPARK: master
TURBODBC: latest
TZ: UTC
UBUNTU: 20.04
ULIMIT_CORE: -1
VCPKG: 38bb87c
Archery was called with:
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1177286737
@alamb @tustvold PTAL 😄
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r915853432
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ EMPTY_ARRAY(),
Review Comment:
```suggestion
EMPTY_ARRAY,
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916416883
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,7 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ EMPTY_ARRAY(),
Review Comment:
I think change `Index` to `Option<Index>` need write a lot useless code. Use `None` will change a lot of func to `.....()->Option<Index>`.
I think add `EMPTY_ARRAY ` is more convenient😊
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916931053
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ /// Sometimes reading page index from parquet file
+ /// will only return pageLocations without min_max index,
+ /// Use `EMPTY_ARRAY` representing None will be more convenient.
+ EMPTY_ARRAY,
Review Comment:
make sense.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916717329
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ /// Sometimes reading page index from parquet file
+ /// will only return pageLocations without min_max index,
+ /// Use `EMPTY_ARRAY` representing None will be more convenient.
+ EMPTY_ARRAY,
Review Comment:
I think I would prefer None to indicate absence of index information
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold merged pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r914564868
##########
parquet/src/file/serialized_reader.rs:
##########
@@ -1098,11 +1105,292 @@ mod tests {
let offset_indexes = metadata.offset_indexes().unwrap();
// only one row group
assert_eq!(offset_indexes.len(), 1);
- let offset_index = offset_indexes.get(0).unwrap();
- let page_offset = offset_index.get(0).unwrap();
+ let offset_index = &offset_indexes[0];
+ let page_offset = &offset_index[0][0];
assert_eq!(4, page_offset.offset);
assert_eq!(152, page_offset.compressed_page_size);
assert_eq!(0, page_offset.first_row_index);
}
+
+ #[test]
+ fn test_page_index_reader_all_type() {
+ let test_file = get_test_file("alltypes_tiny_pages_plain.parquet");
+ let builder = ReadOptionsBuilder::new();
+ //enable read page index
+ let options = builder.with_page_index().build();
+ let reader_result = SerializedFileReader::new_with_options(test_file, options);
+ let reader = reader_result.unwrap();
+
+ // Test contents in Parquet metadata
+ let metadata = reader.metadata();
+ assert_eq!(metadata.num_row_groups(), 1);
+
+ let page_indexes = metadata.page_indexes().unwrap();
+ let row_group_offset_indexes = &metadata.offset_indexes().unwrap()[0];
+
+ // only one row group
+ assert_eq!(page_indexes.len(), 1);
+ let row_group_metadata = metadata.row_group(0);
+
+ //col0->id: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 7299, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][0] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Unordered,
+ );
+ assert_eq!(row_group_offset_indexes[0].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col1->bool_col:BOOLEAN UNCOMPRESSED DO:0 FPO:37329 SZ:3022/3022/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: false, max: true, num_nulls: 0]
+ if let Index::BOOLEAN(index) = &page_indexes[0][1] {
+ assert_eq!(index.indexes.len(), 82);
+ assert_eq!(row_group_offset_indexes[1].len(), 82);
+ } else {
+ unreachable!()
+ };
+ //col2->tinyint_col: INT32 UNCOMPRESSED DO:0 FPO:40351 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][2] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[2].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col4->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][3] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[3].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col5->smallint_col: INT32 UNCOMPRESSED DO:0 FPO:77676 SZ:37325/37325/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 9, num_nulls: 0]
+ if let Index::INT32(index) = &page_indexes[0][4] {
+ check_native_page_index(
+ index,
+ 325,
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .min_bytes(),
+ row_group_metadata
+ .column(0)
+ .statistics()
+ .unwrap()
+ .max_bytes(),
+ BoundaryOrder::Ascending,
+ );
+ assert_eq!(row_group_offset_indexes[4].len(), 325);
+ } else {
+ unreachable!()
+ };
+ //col6->bigint_col: INT64 UNCOMPRESSED DO:0 FPO:152326 SZ:71598/71598/1.00 VC:7300 ENC:BIT_PACKED,RLE,PLAIN ST:[min: 0, max: 90, num_nulls: 0]
+ if let Index::INT64(index) = &page_indexes[0][5] {
+ //Todo row_group_metadata.column(0).statistics().unwrap().min_bytes() only return 4 bytes
+ check_native_page_index(
Review Comment:
i try to use
```
row_group_metadata
.column(0)
.statistics()
.unwrap()
.min_bytes(),
```
get min values from one column chunk metadata in type In64, but it return only 4 bytes...
I think this is a bug.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916717329
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ /// Sometimes reading page index from parquet file
+ /// will only return pageLocations without min_max index,
+ /// Use `EMPTY_ARRAY` representing None will be more convenient.
+ EMPTY_ARRAY,
Review Comment:
I think I would prefer None to indicate absence of index information. EMPTY_ARRAY makes me think of an empty byte array or something
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#discussion_r916718724
##########
parquet/src/file/page_index/index.rs:
##########
@@ -56,6 +56,10 @@ pub enum Index {
DOUBLE(NativeIndex<f64>),
BYTE_ARRAY(ByteArrayIndex),
FIXED_LEN_BYTE_ARRAY(ByteArrayIndex),
+ /// Sometimes reading page index from parquet file
+ /// will only return pageLocations without min_max index,
+ /// Use `EMPTY_ARRAY` representing None will be more convenient.
Review Comment:
```suggestion
/// `None` represents this lack of index information
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1177207934
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#2012](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9270235) into [master](https://codecov.io/gh/apache/arrow-rs/commit/230915703471e9bf1403af657a447fcab971d530?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (2309157) will **increase** coverage by `0.12%`.
> The diff coverage is `91.66%`.
> :exclamation: Current head 9270235 differs from pull request most recent head 1270878. Consider uploading reports for the commit 1270878 to get more accurate results
```diff
@@ Coverage Diff @@
## master #2012 +/- ##
==========================================
+ Coverage 83.58% 83.71% +0.12%
==========================================
Files 222 222
Lines 57529 57749 +220
==========================================
+ Hits 48088 48342 +254
+ Misses 9441 9407 -34
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [parquet/src/file/page\_index/index.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9wYWdlX2luZGV4L2luZGV4LnJz) | `69.23% <ø> (+46.15%)` | :arrow_up: |
| [parquet/src/record/api.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvcmVjb3JkL2FwaS5ycw==) | `96.82% <ø> (ø)` | |
| [arrow/src/compute/kernels/arity.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9hcml0eS5ycw==) | `81.05% <78.08%> (-9.86%)` | :arrow_down: |
| [...row/src/array/builder/fixed\_size\_binary\_builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIvZml4ZWRfc2l6ZV9iaW5hcnlfYnVpbGRlci5ycw==) | `85.52% <94.44%> (+21.42%)` | :arrow_up: |
| [parquet/src/file/serialized\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9zZXJpYWxpemVkX3JlYWRlci5ycw==) | `95.65% <97.77%> (+0.70%)` | :arrow_up: |
| [arrow/src/compute/kernels/arithmetic.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9hcml0aG1ldGljLnJz) | `92.35% <100.00%> (+0.40%)` | :arrow_up: |
| [parquet/src/file/metadata.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9tZXRhZGF0YS5ycw==) | `95.39% <100.00%> (ø)` | |
| [parquet/src/file/page\_index/index\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9wYWdlX2luZGV4L2luZGV4X3JlYWRlci5ycw==) | `90.90% <100.00%> (+10.60%)` | :arrow_up: |
| [arrow/src/array/array\_binary.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2JpbmFyeS5ycw==) | `93.18% <0.00%> (-1.11%)` | :arrow_down: |
| [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.43% <0.00%> (-0.20%)` | :arrow_down: |
| ... and [7 more](https://codecov.io/gh/apache/arrow-rs/pull/2012/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [2309157...1270878](https://codecov.io/gh/apache/arrow-rs/pull/2012?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.
URL: https://github.com/apache/arrow-rs/pull/2012
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2012:
URL: https://github.com/apache/arrow-rs/pull/2012#issuecomment-1178861924
See https://github.com/apache/arrow-rs/issues/1931
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.
Posted by GitBox <gi...@apache.org>.
Ted-Jiang closed pull request #2012: Add page index reader test for all types and support empty index.
URL: https://github.com/apache/arrow-rs/pull/2012
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org