Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/09 13:46:07 UTC

[GitHub] [arrow-rs] thinkharderdev opened a new pull request, #2854: Fix page size on dictionary fallback

thinkharderdev opened a new pull request, #2854:
URL: https://github.com/apache/arrow-rs/pull/2854

   # Which issue does this PR close?
   
   
   Closes #2853 
   
   # Rationale for this change
    
   
   After falling back, `ByteArrayEncoder` wasn't tracking the number of values written, so once the dictionary page hit its size limit and the encoder fell back, all remaining data was written to a single data page.
   
   # What changes are included in this PR?
   
   
   Make sure `ByteArrayEncoder` tracks the number of encoded values after it falls back to the fallback encoder.
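
The change described above can be sketched with a simplified model. This is not the arrow-rs implementation: `SketchEncoder`, `dict_indices`, `fallback_values`, and `fall_back` are hypothetical names, and only `num_values` is taken from the diff in this PR. The sketch assumes the dictionary path tracks its own count through its index buffer, so only the fallback path needs the explicit counter.

```rust
/// Simplified stand-in for the byte array encoder. All names are
/// hypothetical except `num_values`, which appears in the PR diff.
struct SketchEncoder {
    /// `Some` while dictionary encoding is active, `None` after fallback.
    dict_indices: Option<Vec<usize>>,
    /// Values handed to the fallback (plain) path.
    fallback_values: Vec<Vec<u8>>,
    /// Number of values written through the fallback path.
    num_values: usize,
}

impl SketchEncoder {
    fn new() -> Self {
        SketchEncoder {
            dict_indices: Some(Vec::new()),
            fallback_values: Vec::new(),
            num_values: 0,
        }
    }

    /// Drop the dictionary state, as happens once the dictionary grows too large.
    fn fall_back(&mut self) {
        self.dict_indices = None;
    }

    fn encode(&mut self, values: &[&[u8]]) {
        match &mut self.dict_indices {
            // Placeholder for real dictionary indices: the index buffer's
            // length already records how many values were seen.
            Some(indices) => indices.extend(0..values.len()),
            None => {
                // The fix: count the values handed to the fallback encoder.
                self.num_values += values.len();
                self.fallback_values
                    .extend(values.iter().map(|v| v.to_vec()));
            }
        }
    }
}
```

Without the increment in the `None` arm, `num_values` would stay at zero after fallback even though the fallback buffer keeps growing.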
   
   # Are there any user-facing changes?
   
   This will change the way data pages are laid out in some cases. 
   
   
   No
   
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] thinkharderdev commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990816906


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -551,7 +551,10 @@ where
 
     match &mut encoder.dict_encoder {
         Some(dict_encoder) => dict_encoder.encode(values, indices),
-        None => encoder.fallback.encode(values, indices),
+        None => {
+            encoder.num_values += indices.len();

Review Comment:
   I think it already does that, right? https://github.com/apache/arrow-rs/blob/2ae23093c3f7edc278fe6daf57daf167c430143b/parquet/src/arrow/arrow_writer/byte_array.rs#L313




[GitHub] [arrow-rs] tustvold merged pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990799861


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -551,7 +551,10 @@ where
 
     match &mut encoder.dict_encoder {
         Some(dict_encoder) => dict_encoder.encode(values, indices),
-        None => encoder.fallback.encode(values, indices),
+        None => {
+            encoder.num_values += indices.len();

Review Comment:
   Should we be doing this regardless of whether we've fallen back? I think that currently this will fail to flush a dictionary-encoded data page even if it has reached sufficient size.




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990798877


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -551,7 +551,10 @@ where
 
     match &mut encoder.dict_encoder {
         Some(dict_encoder) => dict_encoder.encode(values, indices),
-        None => encoder.fallback.encode(values, indices),
+        None => {
+            encoder.num_values += indices.len();

Review Comment:
   I'm guessing the problem was that whilst the `estimated_data_page_size` would increase, the lack of any recorded values would cause the writer to erroneously not try to flush the page?
   
   In particular https://github.com/apache/arrow-rs/blob/master/parquet/src/column/writer/mod.rs#L567
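
The interaction guessed at above can be sketched as a small predicate. This is a simplified model, not the code behind the link: the function name `should_flush_page` and its parameters are hypothetical, and the early return on an empty page is an assumption about how the check behaves.

```rust
/// Hypothetical, simplified version of the "should a data page be flushed?"
/// check. The real check lives in the column writer and may differ.
fn should_flush_page(
    num_values: usize,
    estimated_page_size: usize,
    page_size_limit: usize,
) -> bool {
    // Assumption: a page with no recorded values is treated as empty and is
    // never flushed, however large the size estimate has grown.
    if num_values == 0 {
        return false;
    }
    estimated_page_size >= page_size_limit
}
```

Under this model, an encoder that never increments its value count after fallback would report `num_values == 0` forever, so the size comparison would never be reached and all remaining data would accumulate in one page.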







[GitHub] [arrow-rs] thinkharderdev commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990803379


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -1108,6 +1110,55 @@ mod tests {
         roundtrip(batch, Some(SMALL_SIZE / 2));
     }
 
+    #[test]
+    fn arrow_writer_page_size() {
+        let mut rng = thread_rng();
+        let schema =
+            Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));
+
+        let mut builder = StringBuilder::with_capacity(1_000, 2 * 1_000);
+
+        for _ in 0..10_000 {
+            let value = (0..200)
+                .map(|_| rng.gen_range(b'a'..=b'z') as char)
+                .collect::<String>();
+
+            builder.append_value(value);
+        }
+
+        let array = Arc::new(builder.finish());
+
+        let batch = RecordBatch::try_new(schema, vec![array]).unwrap();
+
+        let file = tempfile::tempfile().unwrap();
+
+        let props = WriterProperties::builder()
+            .set_max_row_group_size(usize::MAX)
+            .set_data_pagesize_limit(256)

Review Comment:
   So I think there are still some issues here. It is still ignoring the size limit. It is at least respecting the `write_batch_size` though. 




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990805688


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -1108,6 +1110,55 @@ mod tests {
         roundtrip(batch, Some(SMALL_SIZE / 2));
     }
 
+    #[test]
+    fn arrow_writer_page_size() {
+        let mut rng = thread_rng();
+        let schema =
+            Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));
+
+        let mut builder = StringBuilder::with_capacity(1_000, 2 * 1_000);
+
+        for _ in 0..10_000 {
+            let value = (0..200)
+                .map(|_| rng.gen_range(b'a'..=b'z') as char)
+                .collect::<String>();
+
+            builder.append_value(value);
+        }
+
+        let array = Arc::new(builder.finish());
+
+        let batch = RecordBatch::try_new(schema, vec![array]).unwrap();
+
+        let file = tempfile::tempfile().unwrap();
+
+        let props = WriterProperties::builder()
+            .set_max_row_group_size(usize::MAX)
+            .set_data_pagesize_limit(256)

Review Comment:
   That is expected and, I believe, consistent with other Parquet writers. The limit is best effort.
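
One way to see why such a limit can only be best effort is a small simulation. This is a sketch under stated assumptions, not the writer's actual logic: it assumes values are written in fixed-size batches and that the page size is compared against the limit only after each batch. The function name `simulate_page_sizes` and its parameters are hypothetical.

```rust
/// Returns the byte size of each page produced when values of the given
/// sizes are written in batches and the limit is checked between batches.
fn simulate_page_sizes(
    value_sizes: &[usize],
    batch_size: usize,
    page_size_limit: usize,
) -> Vec<usize> {
    let mut pages = Vec::new();
    let mut current = 0usize;
    // `chunks` panics on a chunk size of zero, so clamp to at least one.
    for batch in value_sizes.chunks(batch_size.max(1)) {
        // The whole batch is buffered before the limit is consulted.
        current += batch.iter().sum::<usize>();
        if current >= page_size_limit {
            pages.push(current);
            current = 0;
        }
    }
    // Whatever remains at the end forms a final, possibly small, page.
    if current > 0 {
        pages.push(current);
    }
    pages
}
```

With ten 200-byte values, a batch size of 4, and a limit of 256 bytes, this model produces pages of 800, 800, and 400 bytes: every page exceeds the limit, but none grows past one batch.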




[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990798748


##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -1108,6 +1110,55 @@ mod tests {
         roundtrip(batch, Some(SMALL_SIZE / 2));
     }
 
+    #[test]
+    fn arrow_writer_page_size() {
+        let mut rng = thread_rng();
+        let schema =
+            Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));
+
+        let mut builder = StringBuilder::with_capacity(1_000, 2 * 1_000);
+
+        for _ in 0..10_000 {
+            let value = (0..200)
+                .map(|_| rng.gen_range(b'a'..=b'z') as char)
+                .collect::<String>();
+
+            builder.append_value(value);
+        }
+
+        let array = Arc::new(builder.finish());
+
+        let batch = RecordBatch::try_new(schema, vec![array]).unwrap();
+
+        let file = tempfile::tempfile().unwrap();
+
+        let props = WriterProperties::builder()
+            .set_max_row_group_size(usize::MAX)
+            .set_data_pagesize_limit(256)

Review Comment:
   You could also set a smaller dictionary page size limit to verify that as well, but that's up to you.



##########
parquet/src/arrow/arrow_writer/mod.rs:
##########
@@ -1108,6 +1110,55 @@ mod tests {
         roundtrip(batch, Some(SMALL_SIZE / 2));
     }
 
+    #[test]
+    fn arrow_writer_page_size() {
+        let mut rng = thread_rng();

Review Comment:
   I think we should either seed this or loosen the assert below. Otherwise I worry that, depending on which values are generated, we may end up with more or fewer pages (the dictionary page will only spill once it has seen enough distinct values, which technically could occur at any point).
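
The seeding suggestion can be illustrated with a dependency-free sketch. `SeededRng` is a hypothetical stand-in (a xorshift64 generator) chosen only so the example needs no external crates; it is not what the PR's test uses, and a real test would more likely use a seedable generator from the `rand` crate, such as `StdRng::seed_from_u64`.

```rust
/// Tiny deterministic generator (xorshift64). Hypothetical helper for
/// illustration only; not part of arrow-rs or of this PR.
struct SeededRng(u64);

impl SeededRng {
    fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        SeededRng(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Builds a lowercase ASCII string of the given length.
    fn lowercase_string(&mut self, len: usize) -> String {
        (0..len)
            .map(|_| (b'a' + (self.next_u64() % 26) as u8) as char)
            .collect()
    }
}
```

Two generators created with the same seed yield the same strings, so the point at which the dictionary spills, and therefore the page count a test asserts on, stays the same from run to run.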




[GitHub] [arrow-rs] thinkharderdev commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990799846


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -551,7 +551,10 @@ where
 
     match &mut encoder.dict_encoder {
         Some(dict_encoder) => dict_encoder.encode(values, indices),
-        None => encoder.fallback.encode(values, indices),
+        None => {
+            encoder.num_values += indices.len();

Review Comment:
   yep, exactly




[GitHub] [arrow-rs] thinkharderdev commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
thinkharderdev commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990801173


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -551,7 +551,10 @@ where
 
     match &mut encoder.dict_encoder {
         Some(dict_encoder) => dict_encoder.encode(values, indices),
-        None => encoder.fallback.encode(values, indices),
+        None => {
+            encoder.num_values += indices.len();

Review Comment:
   Maybe. When we do it that way it causes a panic, which may also be a bug:
   
   `General("Must flush data pages before flushing dictionary")`




[GitHub] [arrow-rs] ursabot commented on pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#issuecomment-1272962194

   Benchmark runs are scheduled for baseline = c3aac93454c67b7b1b2ee38cd33aa93c1a8e568e and contender = 0268bba4c01c2b83986c023258ad4405c29cabff. 0268bba4c01c2b83986c023258ad4405c29cabff is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1543d8346d3a40c9a3d54498ba97a31f...c7bf39dd8aaf4a4ba1a977bfcc061922/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/6fc71ae9256c4670871443f397739a7d...953058e4f21d435198b158dd12381374/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/e42ff667be22491f9f1b72cdf9023783...8926ee8578d14b9f86ac56e654a0efa5/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8acaf9d6fd0b4cdd8f673cdda3059f6c...3a283b9fcdc34ad9835fdfd10921d5ef/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   



[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2854: Fix page size on dictionary fallback

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2854:
URL: https://github.com/apache/arrow-rs/pull/2854#discussion_r990806367


##########
parquet/src/arrow/arrow_writer/byte_array.rs:
##########
@@ -551,7 +551,10 @@ where
 
     match &mut encoder.dict_encoder {
         Some(dict_encoder) => dict_encoder.encode(values, indices),
-        None => encoder.fallback.encode(values, indices),
+        None => {
+            encoder.num_values += indices.len();

Review Comment:
   I think we need to reset `num_values` to 0 when we flush a data page.
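
The reset suggested in this comment can be sketched as follows. `PageBuffer`, `push`, and `flush_data_page` are hypothetical names for illustration; only the idea of a per-page `num_values` counter comes from the diff, and the real encoder's flush path may be structured differently.

```rust
/// Hypothetical per-page buffer with a value counter.
#[derive(Default)]
struct PageBuffer {
    bytes: Vec<u8>,
    num_values: usize,
}

impl PageBuffer {
    fn push(&mut self, value: &[u8]) {
        self.bytes.extend_from_slice(value);
        self.num_values += 1;
    }

    /// Hands back the buffered bytes and value count, resetting both so the
    /// next page starts from zero.
    fn flush_data_page(&mut self) -> (Vec<u8>, usize) {
        let page = std::mem::take(&mut self.bytes);
        let count = std::mem::replace(&mut self.num_values, 0);
        (page, count)
    }
}
```

If the counter were not reset, every later page would report the running total rather than its own value count.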


