Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/10 12:31:19 UTC

[GitHub] [arrow-datafusion] tustvold opened a new issue #1976: Parquet SQL Benchmarks Broken

tustvold opened a new issue #1976:
URL: https://github.com/apache/arrow-datafusion/issues/1976


   **Describe the bug**
   
   The parquet SQL benchmarks no longer run cleanly. In particular, the following query returns an error:
   
   ```
   select string_optional from t where dict_10_required = 'prefix#1' and dict_1000_required = 'prefix#1';
   ```
   
   ```
    Parquet argument error: Parquet error: 'block_size' must be a multiple of 128, got 90") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sql20TObt.parquet", size: 201093448 }, last_modified: Some(2022-03-10T12:17:51.953953953Z) }, partition_values: [] }]
   ```
   
   I suspected this was related to https://github.com/apache/arrow-rs/pull/1284, which was included in the 9.1 release of arrow, but rolling back to before this upgrade just alters the error message:
   
   ```
   Parquet argument error: EOF: eof decoding byte array") for files: [PartitionedFile { file_meta: FileMeta { sized_file: SizedFile { path: "/tmp/parquet_query_sqlg368pa.parquet", size: 200927005 }, last_modified: Some(2022-03-10T12:29:15.693863589Z) }, partition_values: [] }]
   ```
   
   It is unclear at this stage whether the encoder is writing gibberish or whether the change has introduced a bug in the decoder. Either way, if this is an upstream bug, we should have caught it upstream in arrow-rs.
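
For context on the first error message: its wording resembles a header check for Parquet's DELTA_BINARY_PACKED encoding, whose format specification requires the block size (stored as a ULEB128 varint at the start of the page data) to be a multiple of 128. That this is where the message originates is an assumption, not something confirmed in this thread. The sketch below is a minimal, standard-library-only illustration of that header layout and check; `read_uleb128`, `zigzag_decode`, and `parse_delta_header` are hypothetical helper names, not the arrow-rs implementation.

```python
def read_uleb128(buf, pos):
    """Decode one unsigned LEB128 varint from buf at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise EOFError("eof while decoding varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def zigzag_decode(n):
    """Map a zigzag-encoded unsigned integer back to a signed integer."""
    return (n >> 1) ^ -(n & 1)


def parse_delta_header(buf):
    """Parse the four header fields of a DELTA_BINARY_PACKED page (hypothetical helper).

    Layout per the Parquet format spec:
    <block size> <miniblocks per block> <total value count> <first value>
    The first three are ULEB128 varints; the first value is a zigzag ULEB128 varint.
    """
    pos = 0
    block_size, pos = read_uleb128(buf, pos)
    if block_size % 128 != 0:
        # Illustrative message only; it mirrors the wording seen in this issue.
        raise ValueError(f"'block_size' must be a multiple of 128, got {block_size}")
    miniblocks_per_block, pos = read_uleb128(buf, pos)
    total_values, pos = read_uleb128(buf, pos)
    first_raw, pos = read_uleb128(buf, pos)
    return {
        "block_size": block_size,
        "miniblocks_per_block": miniblocks_per_block,
        "total_values": total_values,
        "first_value": zigzag_decode(first_raw),
        "header_len": pos,
    }


# A well-formed header: block size 128, 4 miniblocks, 5 values, first value 7.
good = bytes([0x80, 0x01, 0x04, 0x05, 0x0E])
header = parse_delta_header(good)

# If the reader starts on the wrong byte (or the writer emitted something else),
# the first varint is arbitrary data, for example 0x5A == 90.
bad = bytes([0x5A, 0x04, 0x05, 0x0E])
try:
    parse_delta_header(bad)
    bad_error = None
except ValueError as err:
    bad_error = str(err)
```

Under that assumption, a check like this fails in the same way whether the writer produced a malformed header or the reader interpreted unrelated bytes as a delta header, so the error alone cannot distinguish between an encoder bug and a decoder bug.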
   
   Unfortunately, my go-to approach of reading the file with alternative tools has not yielded anything so far. I guess I need to go work out how to get Spark running...
   
   ```
   >>> pq.read_table('/home/raphael/Downloads/borked.parquet', columns=['string_optional'])
   OSError: Not yet implemented: Unsupported encoding.
   
   >>> duckdb.query(f"select string_optional from '/home/raphael/Downloads/borked.parquet'").fetchall()
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   RuntimeError: Unsupported page encoding
   ```
   
   **To Reproduce**
   
   Run the SQL benchmarks
   
   **Expected behavior**
   
   They run without errors
   
   **Additional context**
   
   There is a broader question of whether we should be running this benchmark suite as part of a nightly CI job or similar. This potentially relates to #1377.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] tustvold commented on issue #1976: Parquet SQL Benchmarks Broken

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1976:
URL: https://github.com/apache/arrow-datafusion/issues/1976#issuecomment-1064011980


   I was foiled by a lock file: downgrading to parquet 9.0.2 does resolve this issue, so https://github.com/apache/arrow-rs/pull/1284 is likely related.





[GitHub] [arrow-datafusion] alamb closed issue #1976: Parquet SQL Benchmarks Broken

Posted by GitBox <gi...@apache.org>.
alamb closed issue #1976:
URL: https://github.com/apache/arrow-datafusion/issues/1976


   

