Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/09 12:31:22 UTC

[GitHub] [arrow-rs] thinkharderdev opened a new issue, #2853: `parquet::arrow::arrow_writer::ArrowWriter` ignores page size properties

thinkharderdev opened a new issue, #2853:
URL: https://github.com/apache/arrow-rs/issues/2853

   **Describe the bug**
   
   `ArrowWriter` ignores the page size properties when writing to Parquet. It also appears to always write exactly two pages: a normally sized first page, followed by a second page containing all the remaining data.
   
   **To Reproduce**
   
   ```
       #[test]
       fn arrow_writer_page_size() {
           let mut rng = thread_rng();
           let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));
   
           let mut builder = StringBuilder::with_capacity(10_000, 2 * 100_000);
   
           for _ in 0..100_000 {
               let value = (0..200)
                   .map(|_| rng.gen_range(b'a'..=b'z') as char)
                   .collect::<String>();
   
               builder.append_value(value);
           }
   
           let array = Arc::new(builder.finish());
   
           let batch = RecordBatch::try_new(schema, vec![array]).unwrap();
   
           let file = tempfile::tempfile().unwrap();
   
           let props = WriterProperties::builder()
               .set_max_row_group_size(usize::MAX)
               .set_data_pagesize_limit(512)
               .set_write_batch_size(512)
               .build();
   
           let mut writer = ArrowWriter::try_new(
               file.try_clone().unwrap(),
               batch.schema(),
               Some(props),
           )
               .expect("Unable to write file");
           writer.write(&batch).unwrap();
           writer.close().unwrap();
   
           let reader = SerializedFileReader::new(file.try_clone().unwrap()).unwrap();
   
           let column = reader.metadata().row_group(0).columns();
   
           let page_locations = read_pages_locations(&file, column).unwrap();
   
           let offset_index = page_locations[0].clone();
   
           assert!(offset_index.len() > 2, "Expected more than two pages but got {:#?}", offset_index);
       }
   ```
   
   This outputs 
   ```
   thread 'arrow::arrow_writer::tests::arrow_writer_page_size' panicked at 'Expected more than two pages but got [
       PageLocation {
           offset: 1148953,
           compressed_page_size: 9595,
           first_row_index: 0,
       },
       PageLocation {
           offset: 1158548,
           compressed_page_size: 19251505,
           first_row_index: 5632,
       },
   ]'
   ```
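
As a rough sanity check on the output above, the second page's size is close to what you would get if every row after `first_row_index: 5632` were written into a single page. The sketch below assumes the data is uncompressed and PLAIN-encoded as `BYTE_ARRAY` (a 4-byte length prefix per value). Both are assumptions about this particular run and are not stated in the report. Note also that 5632 = 11 × 512, a multiple of the configured write batch size.

```rust
// Figures taken from the test and its panic output above.
const TOTAL_ROWS: u64 = 100_000;
const VALUE_LEN: u64 = 200;
const SECOND_PAGE_FIRST_ROW: u64 = 5_632;
const SECOND_PAGE_SIZE: u64 = 19_251_505;

// Assumption: PLAIN BYTE_ARRAY encoding stores a 4-byte length before each value.
const LENGTH_PREFIX: u64 = 4;

/// Bytes needed to hold every row from the second page's first row onward.
fn estimated_second_page_bytes() -> u64 {
    (TOTAL_ROWS - SECOND_PAGE_FIRST_ROW) * (VALUE_LEN + LENGTH_PREFIX)
}

fn main() {
    let estimate = estimated_second_page_bytes();
    println!("estimated {estimate} bytes, reported {SECOND_PAGE_SIZE} bytes");
}
```

Under these assumptions the estimate comes within a few hundred bytes of the reported `compressed_page_size`, which is consistent with all remaining rows landing in one page.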
   
   **Expected behavior**
   
   The writer should respect the page size properties and write similarly sized pages. 
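
To make the expectation concrete, here is a toy model of a writer that honours both settings. It is a hypothetical sketch, not the arrow-rs implementation: `split_into_pages` and its parameters are made-up names, and it assumes the page size limit is checked once after each mini-batch of `write_batch_size` values.

```rust
/// Toy model (hypothetical, not arrow-rs code): values are consumed in
/// mini-batches of `write_batch_size`; after each mini-batch the buffered
/// byte count is compared with `page_size_limit`, and a page is flushed
/// once the limit has been reached.
/// Returns the number of rows in each flushed page.
fn split_into_pages(
    value_sizes: &[usize],
    write_batch_size: usize,
    page_size_limit: usize,
) -> Vec<usize> {
    let mut pages = Vec::new();
    let mut buffered_bytes = 0usize;
    let mut buffered_rows = 0usize;

    // `max(1)` guards against a zero batch size, which `chunks` rejects.
    for batch in value_sizes.chunks(write_batch_size.max(1)) {
        buffered_bytes += batch.iter().sum::<usize>();
        buffered_rows += batch.len();
        if buffered_bytes >= page_size_limit {
            pages.push(buffered_rows);
            buffered_bytes = 0;
            buffered_rows = 0;
        }
    }
    // Flush whatever is left as a final, possibly smaller, page.
    if buffered_rows > 0 {
        pages.push(buffered_rows);
    }
    pages
}

fn main() {
    // 100,000 values of 204 bytes each (200 bytes plus an assumed 4-byte prefix).
    let sizes = vec![204usize; 100_000];
    let pages = split_into_pages(&sizes, 512, 512);
    println!("{} pages, first page holds {} rows", pages.len(), pages[0]);
}
```

With the settings from the test, this model flushes a page after every mini-batch, so the pages are similarly sized apart from the last one, rather than one small page followed by one very large one.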
   
   **Additional context**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #2853: `parquet::arrow::arrow_writer::ArrowWriter` ignores page size properties

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #2853: `parquet::arrow::arrow_writer::ArrowWriter` ignores page size properties
URL: https://github.com/apache/arrow-rs/issues/2853







[GitHub] [arrow-rs] alamb commented on issue #2853: `parquet::arrow::arrow_writer::ArrowWriter` ignores page size properties

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2853:
URL: https://github.com/apache/arrow-rs/issues/2853#issuecomment-1279419284

   `label_issue.py` automatically added labels {'parquet'} from #2854




[GitHub] [arrow-rs] alamb commented on issue #2853: `parquet::arrow::arrow_writer::ArrowWriter` ignores page size properties

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #2853:
URL: https://github.com/apache/arrow-rs/issues/2853#issuecomment-1282991054

   Reopening as @tustvold says it is not yet fixed https://github.com/apache/arrow-rs/pull/2890#issuecomment-1282800329

