Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/01/18 13:35:00 UTC

[jira] [Commented] (DRILL-8023) Empty dict page breaks the "old" Parquet reader

    [ https://issues.apache.org/jira/browse/DRILL-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477863#comment-17477863 ] 

ASF GitHub Bot commented on DRILL-8023:
---------------------------------------

jnturton opened a new pull request #2430:
URL: https://github.com/apache/drill/pull/2430


   # [DRILL-8023](https://issues.apache.org/jira/browse/DRILL-8023): Empty dict page breaks the "old" Parquet reader
   
   ## Description
   
   Dictionary pages of zero bytes caused the "old" Parquet reader to throw exceptions because it had no logic to discard them.
   
   ## Documentation
   None.
   
   ## Testing
   TestEmptyParquet#testEmptyDictPage
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



> Empty dict page breaks the "old" Parquet reader
> -----------------------------------------------
>
>                 Key: DRILL-8023
>                 URL: https://issues.apache.org/jira/browse/DRILL-8023
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>            Reporter: Alex Delgado
>            Assignee: James Turton
>            Priority: Major
>         Attachments: fastparquet_test.parquet.tar.gz, pyarrow_test.parquet.tar.gz
>
>
> If the Python libraries dask and pyarrow are used to export a DataFrame to Parquet and the resulting file contains a column that is entirely null, Apache Drill raises an "INTERNAL_ERROR ERROR: null" error when reading the file. Dask and Spark, by contrast, read the same dask+pyarrow Parquet files without error.
>  
> Example:
> Create the Parquet files in Python, once with the pyarrow engine and once with the fastparquet engine.
> {code:python}
> import pandas as pd
> import dask.dataframe as dd
> df = pd.DataFrame(
>     {
>         'A': [1, 2, 3],
>         'B': ['a', 'b', 'c'],
>         'C': [None, None, None]
>     }
> )
> ddf = dd.from_pandas(df, npartitions=1)
> ddf.to_parquet('data/pyarrow_test.parquet', engine='pyarrow')
> ddf.to_parquet('data/fastparquet_test.parquet', engine='fastparquet')
> {code}
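> The column-chunk metadata of the two outputs can then be compared with pyarrow. The sketch below is only illustrative: it assumes pyarrow is available to import and that dask wrote ordinary *.parquet part files under each output directory (part file names differ between dask versions). A dictionary_page_offset of None means no dictionary page was written for that column chunk.
> {code:python}
> import glob
> import pyarrow.parquet as pq
>
> # Look at the first part file dask wrote under each output directory.
> for out_dir in ('data/pyarrow_test.parquet', 'data/fastparquet_test.parquet'):
>     part = sorted(glob.glob(out_dir + '/*.parquet'))[0]
>     row_group = pq.ParquetFile(part).metadata.row_group(0)
>     print(part)
>     for i in range(row_group.num_columns):
>         col = row_group.column(i)
>         # dictionary_page_offset is None when the column chunk has no dictionary page
>         print(' ', col.path_in_schema, col.encodings,
>               'dictionary_page_offset =', col.dictionary_page_offset)
> {code}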
> Read these Parquet files with Drill:
> {code:java}
> Apache Drill 1.19.0
> "Everything is easier with Drill."
> apache drill> SELECT * FROM dfs.`data/fastparquet_test.parquet`;
> +---------------------+---+---+------+
> | __null_dask_index__ | A | B |  C   |
> +---------------------+---+---+------+
> | 0                   | 1 | a | null |
> | 1                   | 2 | b | null |
> | 2                   | 3 | c | null |
> +---------------------+---+---+------+
> 3 rows selected (0.179 seconds)
> apache drill> SELECT * FROM dfs.`data/pyarrow_test.parquet`;
> Error: INTERNAL_ERROR ERROR: null
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 25034075-69b0-415e-8bb2-d7aa3d834653 on 75a796902ffe:31010](state=,code=0)
> {code}
> Narrow it down to the column that is causing the issue:
> {code:java}
> apache drill> SELECT A, B FROM dfs.`data/pyarrow_test.parquet`;
> +---+---+
> | A | B |
> +---+---+
> | 1 | a |
> | 2 | b |
> | 3 | c |
> +---+---+
> 3 rows selected (0.145 seconds)
> apache drill> SELECT C FROM dfs.`data/pyarrow_test.parquet`;
> Error: INTERNAL_ERROR ERROR: null
> Fragment: 0:0
> Please, refer to logs for more information.
> [Error Id: 932ef1d1-7c56-4833-b906-0da0c7c155f9 on 75a796902ffe:31010] (state=,code=0)
> {code}
> Dependency versions:
> {code:java}
> Apache Drill 1.19.0
> Python 3.9.7
> dask==2021.10.0
> pyarrow==6.0.0
> fastparquet==0.7.1
> {code}
> Attached are the Parquet files I tested with.


