Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/06/28 17:30:22 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #4459: Empty Offset Index for All Null Columns

tustvold opened a new issue, #4459:
URL: https://github.com/apache/arrow-rs/issues/4459

   **Describe the bug**
   
   Writing a column consisting solely of nulls results in an empty offset index for that column:
   
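   ```
   // Imports assumed for a standalone reproduction; adjust paths to your crate setup.
   use std::sync::Arc;

   use arrow_array::{ArrayRef, Int32Array, RecordBatch};
   use arrow_buffer::NullBuffer;
   use bytes::Bytes;
   use parquet::arrow::ArrowWriter;
   use parquet::file::reader::FileReader;
   use parquet::file::serialized_reader::{ReadOptionsBuilder, SerializedFileReader};

   #[test]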
   fn test_writer_all_null() {
       let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
       let b = Int32Array::new(vec![0; 5].into(), Some(NullBuffer::new_null(5)));
       let batch = RecordBatch::try_from_iter(vec![
           ("a", Arc::new(a) as ArrayRef),
           ("b", Arc::new(b) as ArrayRef),
       ])
       .unwrap();
   
       let mut buf = Vec::with_capacity(1024);
       let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None).unwrap();
       writer.write(&batch).unwrap();
       writer.close().unwrap();
   
       let bytes = Bytes::from(buf);
       let options = ReadOptionsBuilder::new().with_page_index().build();
       let reader = SerializedFileReader::new_with_options(bytes, options).unwrap();
       let index = reader.metadata().offset_index().unwrap();
   
       assert_eq!(index.len(), 1);
       assert_eq!(index[0].len(), 2); // 2 columns
       assert_eq!(index[0][0].len(), 1); // 1 page
       assert_eq!(index[0][1].len(), 1); // 1 page
   }
   ```
   
   This appears to have been a bug introduced by https://github.com/apache/arrow-rs/pull/4389
   
   In particular - https://github.com/apache/arrow-rs/pull/4389/files#diff-b1859e4da1d85e57a4185dc407458ac83a369dac132285689c27e878e3695ad6R695
   
   **To Reproduce**
   
   **Expected behavior**
   
   **Additional context**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #4459: Regression in parquet `42.0.0`: Bad parquet column indexes for All Null Columns, resulting in `Parquet error: StructArrayReader out of sync` on read

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #4459: Regression in parquet `42.0.0`: Bad parquet column indexes for All Null Columns, resulting in `Parquet error: StructArrayReader out of sync` on read
URL: https://github.com/apache/arrow-rs/issues/4459


[GitHub] [arrow-rs] alamb commented on issue #4459: Regression in parquet `42.0.0`: Bad parquet column indexes for All Null Columns, resulting in `Parquet error: StructArrayReader out of sync` on read

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4459:
URL: https://github.com/apache/arrow-rs/issues/4459#issuecomment-1611827803

   FYI We found this in our internal testing. I will post symptoms here to help anyone else who comes across this:
   
   We found that a query like this in IOx failed on read with `Parquet error: StructArrayReader out of sync`:
   
   
   ```
   $ datafusion-cli -c "SELECT col, time FROM 'data.parquet' WHERE 1684850057953220316 <= time::bigint"
   DataFusion CLI v27.0.0
   Arrow error: External error: Arrow: Parquet argument error: Parquet error: StructArrayReader out of sync in read_records, expected 0 skipped, got 11
   ```
   
   The workaround for DataFusion is to disable use of the page index:
   
   ```
   ❯ set datafusion.execution.parquet.enable_page_index = false;
   ```


[GitHub] [arrow-rs] alamb commented on issue #4459: Regression: Bad parquet column indexes written in parquet `42.0.0` Empty Offset Index for All Null Columns / Parquet error: StructArrayReader out of sync on read

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4459:
URL: https://github.com/apache/arrow-rs/issues/4459#issuecomment-1611824302

   I added some more stuff to the title of this ticket to make it easier to find / search for


[GitHub] [arrow-rs] cmackenzie1 commented on issue #4459: Regression in parquet `42.0.0`: Bad parquet column indexes for All Null Columns, resulting in `Parquet error: StructArrayReader out of sync` on read

Posted by "cmackenzie1 (via GitHub)" <gi...@apache.org>.
cmackenzie1 commented on issue #4459:
URL: https://github.com/apache/arrow-rs/issues/4459#issuecomment-1660725138

   I am still experiencing this error with DataFusion 28.0.0, which is based on Arrow 43.0.0. Is it possible the bug still exists?


[GitHub] [arrow-rs] tustvold commented on issue #4459: Regression in parquet `42.0.0`: Bad parquet column indexes for All Null Columns, resulting in `Parquet error: StructArrayReader out of sync` on read

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4459:
URL: https://github.com/apache/arrow-rs/issues/4459#issuecomment-1660740319

   As this was a writer bug, you will need to rewrite the affected files. There isn't a read-side patch, as the page indices are simply invalid.
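
One way to find files that need rewriting is to look for column chunks whose offset index contains no pages, which is the symptom the test in the original report checks for. The following is only a minimal sketch under stated assumptions: `PageLocation` here is a stand-in struct rather than the parquet crate's own type, and `empty_offset_index_columns` is a hypothetical helper name. With the real crate, the nested index would presumably come from `reader.metadata().offset_index()`, as in that test.

```rust
/// Stand-in for a page location entry (hypothetical mirror, not the parquet
/// crate's type): one entry per data page in a column chunk.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct PageLocation {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}

/// Offset index laid out as row group -> column -> pages, matching the
/// `index[row_group][column]` access pattern used in the report's test.
type OffsetIndex = Vec<Vec<Vec<PageLocation>>>;

/// Returns the (row_group, column) pairs whose offset index has no pages.
/// Assumes, as the report's test does, that every written column chunk
/// should have at least one page location.
fn empty_offset_index_columns(index: &OffsetIndex) -> Vec<(usize, usize)> {
    let mut affected = Vec::new();
    for (row_group, columns) in index.iter().enumerate() {
        for (column, pages) in columns.iter().enumerate() {
            if pages.is_empty() {
                affected.push((row_group, column));
            }
        }
    }
    affected
}
```

A file for which this helper returns a non-empty list would be a candidate for rewriting; an empty list does not prove the file is otherwise valid.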


[GitHub] [arrow-rs] cmackenzie1 commented on issue #4459: Regression in parquet `42.0.0`: Bad parquet column indexes for All Null Columns, resulting in `Parquet error: StructArrayReader out of sync` on read

Posted by "cmackenzie1 (via GitHub)" <gi...@apache.org>.
cmackenzie1 commented on issue #4459:
URL: https://github.com/apache/arrow-rs/issues/4459#issuecomment-1660746087

   Ah, yeah I see that now looking at the PR diff. Thanks for the explanation!

