You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "singhpk234 (via GitHub)" <gi...@apache.org> on 2023/04/07 23:56:00 UTC

[GitHub] [iceberg] singhpk234 opened a new pull request, #7301: Parquet: Update parquet to 1.13.0

singhpk234 opened a new pull request, #7301:
URL: https://github.com/apache/iceberg/pull/7301

   ### About the change
   
   parquet 1.13.0 release notes : https://github.com/apache/parquet-mr/blob/apache-parquet-1.13.0/CHANGES.md?plain=1#L22-L78
   
   This has fix for : https://issues.apache.org/jira/browse/PARQUET-2160
   
   earlier iceberg pr to fix issue observed above : 
   - https://github.com/apache/iceberg/pull/5681 
   
   cc @rdblue @bryanck 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199411445


##########
docs/configuration.md:
##########
@@ -47,51 +47,52 @@ Iceberg tables support table properties to configure table behavior, like the de
 
 ### Write properties
 
-| Property                                            | Default                    | Description                                                                                                                                                                                       |
-|-----------------------------------------------------|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| write.format.default                                | parquet                    | Default file format for the table; parquet, avro, or orc                                                                                                                                          |
-| write.delete.format.default                         | data file format           | Default delete file format for the table; parquet, avro, or orc                                                                                                                                   |
-| write.parquet.row-group-size-bytes                  | 134217728 (128 MB)         | Parquet row group size                                                                                                                                                                            |
-| write.parquet.page-size-bytes                       | 1048576 (1 MB)             | Parquet page size                                                                                                                                                                                 |
-| write.parquet.page-row-limit                        | 20000                      | Parquet page row limit                                                                                                                                                                            |
-| write.parquet.dict-size-bytes                       | 2097152 (2 MB)             | Parquet dictionary page size                                                                                                                                                                      |
-| write.parquet.compression-codec                     | gzip                       | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed                                                                                                                          |
-| write.parquet.compression-level                     | null                       | Parquet compression level                                                                                                                                                                         |
-| write.parquet.bloom-filter-enabled.column.col1      | (not set)                  | Enables writing a bloom filter for the column: col1                                                                                                                                               |
-| write.parquet.bloom-filter-max-bytes                | 1048576 (1 MB)             | The maximum number of bytes for a bloom filter bitset                                                                                                                                             |
-| write.avro.compression-codec                        | gzip                       | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed                                                                                                                    |
-| write.avro.compression-level                        | null                       | Avro compression level                                                                                                                                                                            |
-| write.orc.stripe-size-bytes                         | 67108864 (64 MB)           | Define the default ORC stripe size, in bytes                                                                                                                                                      |
-| write.orc.block-size-bytes                          | 268435456 (256 MB)         | Define the default file system block size for ORC files                                                                                                                                           |
-| write.orc.compression-codec                         | zlib                       | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none                                                                                                                                         |
-| write.orc.compression-strategy                      | speed                      | ORC compression strategy: speed, compression                                                                                                                                                      |
-| write.orc.bloom.filter.columns                      | (not set)                  | Comma separated list of column names for which a Bloom filter must be created                                                                                                                     |
-| write.orc.bloom.filter.fpp                          | 0.05                       | False positive probability for Bloom filter (must > 0.0 and < 1.0)                                                                                                                                |
-| write.location-provider.impl                        | null                       | Optional custom implementation for LocationProvider                                                                                                                                               |
-| write.metadata.compression-codec                    | none                       | Metadata compression codec; none or gzip                                                                                                                                                          |
-| write.metadata.metrics.max-inferred-column-defaults | 100                        | Defines the maximum number of columns for which metrics are collected                                                                                                                             |
-| write.metadata.metrics.default                      | truncate(16)               | Default metrics mode for all columns in the table; none, counts, truncate(length), or full                                                                                                        |
-| write.metadata.metrics.column.col1                  | (not set)                  | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full                                                                                                |
-| write.target-file-size-bytes                        | 536870912 (512 MB)         | Controls the size of files generated to target about this many bytes                                                                                                                              |
-| write.delete.target-file-size-bytes                 | 67108864 (64 MB)           | Controls the size of delete files generated to target about this many bytes                                                                                                                       |
-| write.distribution-mode                             | none                       | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder |
-| write.delete.distribution-mode                      | hash                       | Defines distribution of write delete data                                                                                                                                                         |
-| write.update.distribution-mode                      | hash                       | Defines distribution of write update data                                                                                                                                                         |
-| write.merge.distribution-mode                       | none                       | Defines distribution of write merge data                                                                                                                                                          |
-| write.wap.enabled                                   | false                      | Enables write-audit-publish writes                                                                                                                                                                |
-| write.summary.partition-limit                       | 0                          | Includes partition-level summary stats in snapshot summaries if the changed partition count is less than this limit                                                                               |
-| write.metadata.delete-after-commit.enabled          | false                      | Controls whether to delete the oldest **tracked** version metadata files after commit                                                                                                             |
-| write.metadata.previous-versions-max                | 100                        | The max number of previous version metadata files to keep before deleting after commit                                                                                                            |
-| write.spark.fanout.enabled                          | false                      | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory                                                                                                   |
-| write.object-storage.enabled                        | false                      | Enables the object storage location provider that adds a hash component to file paths                                                                                                             |
-| write.data.path                                     | table location + /data     | Base location for data files                                                                                                                                                                      |
-| write.metadata.path                                 | table location + /metadata | Base location for metadata files                                                                                                                                                                  |
-| write.delete.mode                                   | copy-on-write              | Mode used for delete commands: copy-on-write or merge-on-read (v2 only)                                                                                                                           |
-| write.delete.isolation-level                        | serializable               | Isolation level for delete commands: serializable or snapshot                                                                                                                                     |
-| write.update.mode                                   | copy-on-write              | Mode used for update commands: copy-on-write or merge-on-read (v2 only)                                                                                                                           |
-| write.update.isolation-level                        | serializable               | Isolation level for update commands: serializable or snapshot                                                                                                                                     |
-| write.merge.mode                                    | copy-on-write              | Mode used for merge commands: copy-on-write or merge-on-read (v2 only)                                                                                                                            |
-| write.merge.isolation-level                         | serializable               | Isolation level for merge commands: serializable or snapshot                                                                                                                                      |
+| Property                                             | Default                     | Description                                                                                                                                                                                       |
+|------------------------------------------------------|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| write.format.default                                 | parquet                     | Default file format for the table; parquet, avro, or orc                                                                                                                                          |
+| write.delete.format.default                          | data file format            | Default delete file format for the table; parquet, avro, or orc                                                                                                                                   |
+| write.parquet.row-group-size-bytes                   | 134217728 (128 MB)          | Parquet row group size                                                                                                                                                                            |
+| write.parquet.page-size-bytes                        | 1048576 (1 MB)              | Parquet page size                                                                                                                                                                                 |
+| write.parquet.page-row-limit                         | 20000                       | Parquet page row limit                                                                                                                                                                            |
+ | write.parquet.dictionary.enabled                     | true                        | Enable dictionary encoding                                                                                                                                                                        |
+ | write.parquet.dict-size-bytes                        | 2097152 (2 MB)              | Parquet dictionary page size                                                                                                                                                                      |
+| write.parquet.compression-codec                      | gzip                        | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed                                                                                                                          |
+| write.parquet.compression-level                      | null                        | Parquet compression level                                                                                                                                                                         |
+| write.parquet.bloom-filter-enabled.column.col1       | (not set)                   | Hint to parquet to write a bloom filter for the column: col1                                                                                                                                      |
+| write.parquet.bloom-filter-max-bytes                 | 1048576 (1 MB)              | The maximum number of bytes for a bloom filter bitset                                                                                                                                             |
+| write.avro.compression-codec                         | gzip                        | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed                                                                                                                    |
+| write.avro.compression-level                         | null                        | Avro compression level                                                                                                                                                                            |
+| write.orc.stripe-size-bytes                          | 67108864 (64 MB)            | Define the default ORC stripe size, in bytes                                                                                                                                                      |
+| write.orc.block-size-bytes                           | 268435456 (256 MB)          | Define the default file system block size for ORC files                                                                                                                                           |
+| write.orc.compression-codec                          | zlib                        | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none                                                                                                                                         |
+| write.orc.compression-strategy                       | speed                       | ORC compression strategy: speed, compression                                                                                                                                                      |
+| write.orc.bloom.filter.columns                       | (not set)                   | Comma separated list of column names for which a Bloom filter must be created                                                                                                                     |
+| write.orc.bloom.filter.fpp                           | 0.05                        | False positive probability for Bloom filter (must > 0.0 and < 1.0)                                                                                                                                |
+| write.location-provider.impl                         | null                        | Optional custom implementation for LocationProvider                                                                                                                                               |
+| write.metadata.compression-codec                     | none                        | Metadata compression codec; none or gzip                                                                                                                                                          |
+| write.metadata.metrics.max-inferred-column-defaults  | 100                         | Defines the maximum number of columns for which metrics are collected                                                                                                                             |
+| write.metadata.metrics.default                       | truncate(16)                | Default metrics mode for all columns in the table; none, counts, truncate(length), or full                                                                                                        |
+| write.metadata.metrics.column.col1                   | (not set)                   | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full                                                                                                |
+| write.target-file-size-bytes                         | 536870912 (512 MB)          | Controls the size of files generated to target about this many bytes                                                                                                                              |
+| write.delete.target-file-size-bytes                  | 67108864 (64 MB)            | Controls the size of delete files generated to target about this many bytes                                                                                                                       |
+| write.distribution-mode                              | none                        | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder |
+| write.delete.distribution-mode                       | hash                        | Defines distribution of write delete data                                                                                                                                                         |
+| write.update.distribution-mode                       | hash                        | Defines distribution of write update data                                                                                                                                                         |
+| write.merge.distribution-mode                        | none                        | Defines distribution of write merge data                                                                                                                                                          |
+| write.wap.enabled                                    | false                       | Enables write-audit-publish writes                                                                                                                                                                |
+| write.summary.partition-limit                        | 0                           | Includes partition-level summary stats in snapshot summaries if the changed partition count is less than this limit                                                                               |
+| write.metadata.delete-after-commit.enabled           | false                       | Controls whether to delete the oldest **tracked** version metadata files after commit                                                                                                             |
+| write.metadata.previous-versions-max                 | 100                         | The max number of previous version metadata files to keep before deleting after commit                                                                                                            |
+| write.spark.fanout.enabled                           | false                       | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory                                                                                                   |
+| write.object-storage.enabled                         | false                       | Enables the object storage location provider that adds a hash component to file paths                                                                                                             |
+| write.data.path                                      | table location + /data      | Base location for data files                                                                                                                                                                      |
+| write.metadata.path                                  | table location + /metadata  | Base location for metadata files                                                                                                                                                                  |
+| write.delete.mode                                    | copy-on-write               | Mode used for delete commands: copy-on-write or merge-on-read (v2 only)                                                                                                                           |
+| write.delete.isolation-level                         | serializable                | Isolation level for delete commands: serializable or snapshot                                                                                                                                     |
+| write.update.mode                                    | copy-on-write               | Mode used for update commands: copy-on-write or merge-on-read (v2 only)                                                                                                                           |
+| write.update.isolation-level                         | serializable                | Isolation level for update commands: serializable or snapshot                                                                                                                                     |
+| write.merge.mode                                     | copy-on-write               | Mode used for merge commands: copy-on-write or merge-on-read (v2 only)                                                                                                                            |
+| write.merge.isolation-level                          | serializable                | Isolation level for merge commands: serializable or snapshot                                                                                                                                      |

Review Comment:
   Why was this entire table reformatted?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "bryanck (via GitHub)" <gi...@apache.org>.

bryanck commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1555343671

   > Thanks @singhpk234 for picking this up, and @amogh-jahagirdar for the review!
   > 
   > @bryanck do you want the honor of reverting #5681?
   
   Sure, PR incoming…


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1196603961


##########
build.gradle:
##########
@@ -137,6 +137,11 @@ allprojects {
   repositories {
     mavenCentral()
     mavenLocal()
+    maven {
+      url "https://repository.apache.org/content/groups/staging/"
+      mavenContent {
+      }
+    }

Review Comment:
   Great to see that everything is green :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199249686


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -339,6 +344,8 @@ private static class Context {
       private final int rowGroupSize;
       private final int pageSize;
       private final int pageRowLimit;
+

Review Comment:
   ACK



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1530490373

   @Fokko, added a fix for the failing UTs dues to BF changes and also updated the docs with the current behaviour of parquet, I think once your pr is in https://github.com/apache/parquet-mr/pull/1084 and parquet 1.13.1 is out we should be good to go, please let me know your thoughts !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1546528872

   Thank you @Fokko for the awesome work ! Added my +1 :) !
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1195781392


##########
build.gradle:
##########
@@ -137,6 +137,11 @@ allprojects {
   repositories {
     mavenCentral()
     mavenLocal()
+    maven {
+      url "https://repository.apache.org/content/groups/staging/"
+      mavenContent {
+      }
+    }

Review Comment:
   will remove when parquet 1.13.1 is officially released



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko merged pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko merged PR #7301:
URL: https://github.com/apache/iceberg/pull/7301


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199081161


##########
aliyun/src/test/java/org/apache/iceberg/aliyun/oss/mock/AliyunOSSMockLocalStore.java:
##########
@@ -36,8 +36,8 @@
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
+import org.apache.commons.codec.binary.Hex;

Review Comment:
   ACK reverting the changes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1198316221


##########
build.gradle:
##########
@@ -137,6 +137,11 @@ allprojects {
   repositories {
     mavenCentral()
     mavenLocal()
+    maven {
+      url "https://repository.apache.org/content/groups/staging/"
+      mavenContent {
+      }
+    }

Review Comment:
   1.13.1 is out: https://parquet.apache.org/blog/2023/05/18/1.13.1/
   
   🥳 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199411330


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -135,6 +135,9 @@ private TableProperties() {}
   public static final String DELETE_PARQUET_PAGE_ROW_LIMIT = "write.delete.parquet.page-row-limit";
   public static final int PARQUET_PAGE_ROW_LIMIT_DEFAULT = 20_000;
 
+  public static final String PARQUET_DICT_ENABLED = "write.parquet.enable.dictionary";

Review Comment:
   This doesn't match the format of Iceberg options. It should be `write.parquet.dict-enabled` to match `dict-size-bytes` or should use `enabled` as the last word if you're using `dictionary` as a part of the hierarchy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1528869332

   I've created a PR to support Hadoop 2.7.3 as well: https://github.com/apache/parquet-mr/pull/1084


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "bryanck (via GitHub)" <gi...@apache.org>.

bryanck commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1555369191

   I opened https://github.com/apache/iceberg/pull/7664 to revert the original workaround that is no longer needed with the upgrade to Parquet 1.13.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1544008745

   Ran a benchmark on the 1.2.1 branch:
   ```
   Apache Iceberg 1.2.1 with Apache Iceberg 1.12.3
   
   Benchmark                                                                            Mode  Cnt  Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  5,068 ± 0,151   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  1,961 ± 0,081   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  2,438 ± 0,028   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  1,061 ± 0,120   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,490 ± 0,186   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,377 ± 0,052   s/op
   
   Apache Iceberg 1.2.1 with Apache Parquet 1.13.1-SNAPSHOT
   
   Benchmark                                                                            Mode  Cnt  Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  4,509 ± 0,047   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  2,137 ± 0,176   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  2,446 ± 0,056   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  1,033 ± 0,085   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,471 ± 0,043   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,372 ± 0,039   s/op
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199032788


##########
aliyun/src/test/java/org/apache/iceberg/aliyun/oss/mock/AliyunOSSMockLocalStore.java:
##########
@@ -36,8 +36,8 @@
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
+import org.apache.commons.codec.binary.Hex;

Review Comment:
   Looking at https://github.com/apache/iceberg/pull/5024/files#r896336657 We'd rather avoid using `apache.commons`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199247009


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -339,6 +344,8 @@ private static class Context {
       private final int rowGroupSize;
       private final int pageSize;
       private final int pageRowLimit;
+

Review Comment:
   Nit: don't think we need the newline here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199412050


##########
parquet/src/test/java/org/apache/iceberg/parquet/TestBloomRowGroupFilter.java:
##########
@@ -197,6 +198,7 @@ public void createInputFile() throws IOException {
     try (FileAppender<Record> appender =
         Parquet.write(outFile)
             .schema(FILE_SCHEMA)
+            .set(PARQUET_DICT_ENABLED, "false")

Review Comment:
   @singhpk234 @Fokko, I don't think it makes sense to create an option that is public and must be supported by Iceberg moving forward just for this test case. Is it possible to set this with a Hadoop option or to do this some other way?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1555250580

   Thanks @singhpk234 for picking this up, and @amogh-jahagirdar for the review!
   
   @bryanck do you want the honor of reverting https://github.com/apache/iceberg/pull/5681?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1555364990

   I agree with @amogh-jahagirdar. It is probably not worth adding an extra table property for the sake of the test. I doubt anyone would ever configure it. Can we fix the test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199037966


##########
aliyun/src/test/java/org/apache/iceberg/aliyun/oss/mock/AliyunOSSMockLocalStore.java:
##########
@@ -36,8 +36,8 @@
 import java.util.List;
 import java.util.Locale;
 import java.util.Map;
+import org.apache.commons.codec.binary.Hex;

Review Comment:
   I noticed that it all over the codebase, I've created a separate ticket: https://github.com/apache/iceberg/issues/7658



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1555346282

   I guess there were a few oversights here that we want to address before release, I was thinking that it would make sense for certain use cases to disable dictionary encoding if they wanted to use bloom filters (controlling the space tradeoff). 
   
   So we should conclude
   
   1.) Do we really want a table property? Curious to know other's opinions. cc: @Fokko @rdblue @singhpk234 @aokolnychyi 
   
   2.) If the answer is yes, let's go back and correct the naming in the properties.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199543934


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -135,6 +135,9 @@ private TableProperties() {}
   public static final String DELETE_PARQUET_PAGE_ROW_LIMIT = "write.delete.parquet.page-row-limit";
   public static final int PARQUET_PAGE_ROW_LIMIT_DEFAULT = 20_000;
 
+  public static final String PARQUET_DICT_ENABLED = "write.parquet.enable.dictionary";

Review Comment:
   Apologies for my oversight here, i intended to have `.enabled` in suffix which i mentioned in the table prop but messed up here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1528874719

   Looks like this commit made the bloom tests on the Iceberg side fail:
   ```
   ➜  parquet-mr git:(4e9e79c89) ✗  git bisect bad
   4e9e79c895e61775066e11240729b574e2264b8b is the first bad commit
   commit 4e9e79c895e61775066e11240729b574e2264b8b
   Author: ChenLiang.Lu <31...@users.noreply.github.com>
   Date:   Mon Feb 27 15:45:46 2023 +0800
   
       PARQUET-2251 Avoid generating Bloomfilter when all pages of a column are encoded by dictionary in parquet v1 (#1033)
   
    .../apache/parquet/hadoop/ParquetFileWriter.java   |   3 +-
    .../apache/parquet/hadoop/TestBloomFiltering.java  |  18 ++-
    .../parquet/hadoop/TestStoreBloomFilter.java       | 132 +++++++++++++++++++++
    3 files changed, 151 insertions(+), 2 deletions(-)
    create mode 100644 parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestStoreBloomFilter.java
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1513897388

   There is a dependency on hadoop 3.2, I see there is a pr https://github.com/apache/iceberg/pull/5024 for the same, will wait for it to get in.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "rdblue (via GitHub)" <gi...@apache.org>.

rdblue commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199410277


##########
docs/configuration.md:
##########
@@ -47,51 +47,52 @@ Iceberg tables support table properties to configure table behavior, like the de
 
 ### Write properties
 
-| Property                                            | Default                    | Description                                                                                                                                                                                       |
-|-----------------------------------------------------|----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| write.format.default                                | parquet                    | Default file format for the table; parquet, avro, or orc                                                                                                                                          |
-| write.delete.format.default                         | data file format           | Default delete file format for the table; parquet, avro, or orc                                                                                                                                   |
-| write.parquet.row-group-size-bytes                  | 134217728 (128 MB)         | Parquet row group size                                                                                                                                                                            |
-| write.parquet.page-size-bytes                       | 1048576 (1 MB)             | Parquet page size                                                                                                                                                                                 |
-| write.parquet.page-row-limit                        | 20000                      | Parquet page row limit                                                                                                                                                                            |
-| write.parquet.dict-size-bytes                       | 2097152 (2 MB)             | Parquet dictionary page size                                                                                                                                                                      |
-| write.parquet.compression-codec                     | gzip                       | Parquet compression codec: zstd, brotli, lz4, gzip, snappy, uncompressed                                                                                                                          |
-| write.parquet.compression-level                     | null                       | Parquet compression level                                                                                                                                                                         |
-| write.parquet.bloom-filter-enabled.column.col1      | (not set)                  | Enables writing a bloom filter for the column: col1                                                                                                                                               |
-| write.parquet.bloom-filter-max-bytes                | 1048576 (1 MB)             | The maximum number of bytes for a bloom filter bitset                                                                                                                                             |
-| write.avro.compression-codec                        | gzip                       | Avro compression codec: gzip(deflate with 9 level), zstd, snappy, uncompressed                                                                                                                    |
-| write.avro.compression-level                        | null                       | Avro compression level                                                                                                                                                                            |
-| write.orc.stripe-size-bytes                         | 67108864 (64 MB)           | Define the default ORC stripe size, in bytes                                                                                                                                                      |
-| write.orc.block-size-bytes                          | 268435456 (256 MB)         | Define the default file system block size for ORC files                                                                                                                                           |
-| write.orc.compression-codec                         | zlib                       | ORC compression codec: zstd, lz4, lzo, zlib, snappy, none                                                                                                                                         |
-| write.orc.compression-strategy                      | speed                      | ORC compression strategy: speed, compression                                                                                                                                                      |
-| write.orc.bloom.filter.columns                      | (not set)                  | Comma separated list of column names for which a Bloom filter must be created                                                                                                                     |
-| write.orc.bloom.filter.fpp                          | 0.05                       | False positive probability for Bloom filter (must > 0.0 and < 1.0)                                                                                                                                |
-| write.location-provider.impl                        | null                       | Optional custom implementation for LocationProvider                                                                                                                                               |
-| write.metadata.compression-codec                    | none                       | Metadata compression codec; none or gzip                                                                                                                                                          |
-| write.metadata.metrics.max-inferred-column-defaults | 100                        | Defines the maximum number of columns for which metrics are collected                                                                                                                             |
-| write.metadata.metrics.default                      | truncate(16)               | Default metrics mode for all columns in the table; none, counts, truncate(length), or full                                                                                                        |
-| write.metadata.metrics.column.col1                  | (not set)                  | Metrics mode for column 'col1' to allow per-column tuning; none, counts, truncate(length), or full                                                                                                |
-| write.target-file-size-bytes                        | 536870912 (512 MB)         | Controls the size of files generated to target about this many bytes                                                                                                                              |
-| write.delete.target-file-size-bytes                 | 67108864 (64 MB)           | Controls the size of delete files generated to target about this many bytes                                                                                                                       |
-| write.distribution-mode                             | none                       | Defines distribution of write data: __none__: don't shuffle rows; __hash__: hash distribute by partition key ; __range__: range distribute by partition key or sort key if table has an SortOrder |
-| write.delete.distribution-mode                      | hash                       | Defines distribution of write delete data                                                                                                                                                         |
-| write.update.distribution-mode                      | hash                       | Defines distribution of write update data                                                                                                                                                         |
-| write.merge.distribution-mode                       | none                       | Defines distribution of write merge data                                                                                                                                                          |
-| write.wap.enabled                                   | false                      | Enables write-audit-publish writes                                                                                                                                                                |
-| write.summary.partition-limit                       | 0                          | Includes partition-level summary stats in snapshot summaries if the changed partition count is less than this limit                                                                               |
-| write.metadata.delete-after-commit.enabled          | false                      | Controls whether to delete the oldest **tracked** version metadata files after commit                                                                                                             |
-| write.metadata.previous-versions-max                | 100                        | The max number of previous version metadata files to keep before deleting after commit                                                                                                            |
-| write.spark.fanout.enabled                          | false                      | Enables the fanout writer in Spark that does not require data to be clustered; uses more memory                                                                                                   |
-| write.object-storage.enabled                        | false                      | Enables the object storage location provider that adds a hash component to file paths                                                                                                             |
-| write.data.path                                     | table location + /data     | Base location for data files                                                                                                                                                                      |
-| write.metadata.path                                 | table location + /metadata | Base location for metadata files                                                                                                                                                                  |
-| write.delete.mode                                   | copy-on-write              | Mode used for delete commands: copy-on-write or merge-on-read (v2 only)                                                                                                                           |
-| write.delete.isolation-level                        | serializable               | Isolation level for delete commands: serializable or snapshot                                                                                                                                     |
-| write.update.mode                                   | copy-on-write              | Mode used for update commands: copy-on-write or merge-on-read (v2 only)                                                                                                                           |
-| write.update.isolation-level                        | serializable               | Isolation level for update commands: serializable or snapshot                                                                                                                                     |
-| write.merge.mode                                    | copy-on-write              | Mode used for merge commands: copy-on-write or merge-on-read (v2 only)                                                                                                                            |
-| write.merge.isolation-level                         | serializable               | Isolation level for merge commands: serializable or snapshot                                                                                                                                      |
+| Property                                             | Default                     | Description                                                                                                                                                                                       |
+|------------------------------------------------------|-----------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| write.format.default                                 | parquet                     | Default file format for the table; parquet, avro, or orc                                                                                                                                          |
+| write.delete.format.default                          | data file format            | Default delete file format for the table; parquet, avro, or orc                                                                                                                                   |
+| write.parquet.row-group-size-bytes                   | 134217728 (128 MB)          | Parquet row group size                                                                                                                                                                            |
+| write.parquet.page-size-bytes                        | 1048576 (1 MB)              | Parquet page size                                                                                                                                                                                 |
+| write.parquet.page-row-limit                         | 20000                       | Parquet page row limit                                                                                                                                                                            |
+ | write.parquet.dictionary.enabled                     | true                        | Enable dictionary encoding                                                                                                                                                                        |

Review Comment:
   This doesn't match the setting above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.

aokolnychyi commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199479304


##########
core/src/main/java/org/apache/iceberg/TableProperties.java:
##########
@@ -135,6 +135,9 @@ private TableProperties() {}
   public static final String DELETE_PARQUET_PAGE_ROW_LIMIT = "write.delete.parquet.page-row-limit";
   public static final int PARQUET_PAGE_ROW_LIMIT_DEFAULT = 20_000;
 
+  public static final String PARQUET_DICT_ENABLED = "write.parquet.enable.dictionary";

Review Comment:
   +1, the name does match our pattern



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1545769682

   @singhpk234 can I invite you to give the 1.13.1 RC a try: https://lists.apache.org/thread/0yokbmfcbhz76ftjbxktwxfo5vrt57od Would be great if you can vote on the RC.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1195780604


##########
build.gradle:
##########
@@ -137,6 +137,12 @@ allprojects {
   repositories {
     mavenCentral()
     mavenLocal()
+    // testing purpose only, to be removed when spark-3.3 rc passes
+    maven {
+      url "https://repository.apache.org/content/groups/staging/"
+      mavenContent {
+      }
+    }

Review Comment:
   will remove when 1.13.1 is officially released :) !



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] amogh-jahagirdar commented on a diff in pull request #7301: Parquet: Update parquet to 1.13.1

Posted by "amogh-jahagirdar (via GitHub)" <gi...@apache.org>.

amogh-jahagirdar commented on code in PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#discussion_r1199246741


##########
parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java:
##########
@@ -339,6 +344,8 @@ private static class Context {
       private final int rowGroupSize;
       private final int pageSize;
       private final int pageRowLimit;
+

Review Comment:
   Nit: do we need the newline here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1528952880

   > Looks like this commit made the bloom tests on the Iceberg side fail:
   
   This is an interesting find @Fokko, this implies, even when we are enabling BF via iceberg table conf's (https://iceberg.apache.org/docs/latest/configuration/#write-properties) parquet may decide not to write a BF for the column if all it's pages are dictionary encoded (I am assuming this is because there will be no benefit of writing BF in this particular case), Do we need to mention this in our doc as well, as the given table prop is a hint for parquet to write BF, but it may not choose to ?   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1529410433

   @singhpk234 Yes, that's indeed the case. The dictionary is a better version of the bloom filter because it cannot contain false positives, this makes it redundant to also generate the bloom filter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "Fokko (via GitHub)" <gi...@apache.org>.

Fokko commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1532986138

   @singhpk234 I'm okay with the change, looking at the benchmark, I don't see much difference:
   
   
   With the change:
   
   ```
   Benchmark                                                                            Mode  Cnt  Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  4,917 ± 0,179   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  1,806 ± 0,036   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  2,037 ± 0,022   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  0,978 ± 0,077   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,435 ± 0,028   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,364 ± 0,021   s/op
   ```
   
   Master:
   
   ```
   Benchmark                                                                            Mode  Cnt  Score   Error  Units
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized                  ss    5  4,658 ± 0,222   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                     ss    5  1,774 ± 0,056   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readIceberg                                  ss    5  1,952 ± 0,060   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized    ss    5  1,019 ± 0,211   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceVectorized       ss    5  0,422 ± 0,208   s/op
   IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                    ss    5  0,363 ± 0,042   s/op
   ```
   
   It even looks a bit faster on some of the benchmarks, but when taking the error into account, the difference is minimal.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] bryanck commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "bryanck (via GitHub)" <gi...@apache.org>.

bryanck commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1500905311

   Should we also revert https://github.com/apache/iceberg/pull/5681 ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] singhpk234 commented on pull request #7301: Parquet: Update parquet to 1.13.0

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.

singhpk234 commented on PR #7301:
URL: https://github.com/apache/iceberg/pull/7301#issuecomment-1502058653

   > Should we also revert https://github.com/apache/iceberg/pull/5681 ?
   
   IMHO we should, was thinking of the same but in a separate commit, post validating the issues. 
   Presently this upgrade takes a dependency on hadoop 3.2, and interestingly some of the BloomFilter ut's fails as well. Taking a look into the same as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org