You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/03/13 15:17:04 UTC

[GitHub] [parquet-format] wgtmac opened a new pull request, #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

wgtmac opened a new pull request, #194:
URL: https://github.com/apache/parquet-format/pull/194

   The specs has only added `bloom_filter_offset` to locate the bloom filter. The reader cannot load the bloom filter in a single shot until it parses the bloom filter header to get the total size.
   ```thrift
   struct ColumnMetaData {
     /** Byte offset from beginning of file to Bloom filter data. **/
     14: optional i64 bloom_filter_offset;
   }
   ```
   This patch proposes to add an optional bloom_filter_length field to track the size of bloom filter to facilitate I/O scheduling. The specs already do the similar things for column index and offset index.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] emkornfield commented on a diff in pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "emkornfield (via GitHub)" <gi...@apache.org>.
emkornfield commented on code in PR #194:
URL: https://github.com/apache/parquet-format/pull/194#discussion_r1157961261


##########
src/main/thrift/parquet.thrift:
##########
@@ -753,6 +753,9 @@ struct ColumnMetaData {
 
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data, in bytes. **/
+  15: optional i32 bloom_filter_length;

Review Comment:
   for future implementers we should add a comment noting that existing of one element does not indicate existence of the other (even though it is already implied by mapping).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] wgtmac merged pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac merged PR #194:
URL: https://github.com/apache/parquet-format/pull/194


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] wgtmac commented on a diff in pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #194:
URL: https://github.com/apache/parquet-format/pull/194#discussion_r1134163134


##########
src/main/thrift/parquet.thrift:
##########
@@ -753,6 +753,9 @@ struct ColumnMetaData {
 
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data, in bytes. **/
+  15: optional i32 bloom_filter_length;

Review Comment:
   On the writer side:
   - Old writer only writes offset.
   - New writer should write length as well.
   
   On the reader side:
   - Old reader only checks offset.
   - New reader checks offset and then checks if length exists.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] wgtmac commented on pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #194:
URL: https://github.com/apache/parquet-format/pull/194#issuecomment-1499283453

   > One minor comment otherwise looks OK to me, apologies for the late review.
   
   I have updated the comment. Please take a look, thanks! @emkornfield 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] wgtmac commented on a diff in pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #194:
URL: https://github.com/apache/parquet-format/pull/194#discussion_r1158000250


##########
src/main/thrift/parquet.thrift:
##########
@@ -753,6 +753,9 @@ struct ColumnMetaData {
 
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data, in bytes. **/
+  15: optional i32 bloom_filter_length;

Review Comment:
   Agreed.
   
   I have thought to add a comment to say that `bloom_filter_length` is introduced by v2.10.0 (the next unreleased version) and readers may not expect its existence from legacy files or some implementations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] wgtmac commented on pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #194:
URL: https://github.com/apache/parquet-format/pull/194#issuecomment-1466350305

   @emkornfield @pitrou @gszadovszky @shangxinli @mapleFU


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] wgtmac commented on a diff in pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #194:
URL: https://github.com/apache/parquet-format/pull/194#discussion_r1134163134


##########
src/main/thrift/parquet.thrift:
##########
@@ -753,6 +753,9 @@ struct ColumnMetaData {
 
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data, in bytes. **/
+  15: optional i32 bloom_filter_length;

Review Comment:
   On the writer side:
   - Old writer only writes offset.
   - New writer should write length as well.
   
   On the reader size:
   - Old reader only checks offset.
   - New reader checks offset then try to use length if exists.



##########
src/main/thrift/parquet.thrift:
##########
@@ -753,6 +753,9 @@ struct ColumnMetaData {
 
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data, in bytes. **/
+  15: optional i32 bloom_filter_length;

Review Comment:
   On the writer side:
   - Old writer only writes offset.
   - New writer should write length as well.
   
   On the reader side:
   - Old reader only checks offset.
   - New reader checks offset then try to use length if exists.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] mapleFU commented on a diff in pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #194:
URL: https://github.com/apache/parquet-format/pull/194#discussion_r1134158075


##########
src/main/thrift/parquet.thrift:
##########
@@ -753,6 +753,9 @@ struct ColumnMetaData {
 
   /** Byte offset from beginning of file to Bloom filter data. **/
   14: optional i64 bloom_filter_offset;
+
+  /** Size of Bloom filter data, in bytes. **/
+  15: optional i32 bloom_filter_length;

Review Comment:
   Seems that if length exists, offset must exists. However, if offset not exists, length should not exist?
   
   And reader should check that:
   1. Does it have offset
   2. If has, can it go fast path that using bloom_filter_length to read?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [parquet-format] wgtmac commented on pull request #194: PARQUET-2257: Add bloom_filter_length to ColumnMetaData

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #194:
URL: https://github.com/apache/parquet-format/pull/194#issuecomment-1476435933

   @emkornfield @pitrou Do you have any comment?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org