Posted to issues@drill.apache.org by "Alex Delgado (Jira)" <ji...@apache.org> on 2021/10/30 04:28:00 UTC
[jira] [Updated] (DRILL-8023) NULL Columns in Parquet from DASK+PyArrow Raising "INTERNAL_ERROR ERROR: null" in Drill

     [ https://issues.apache.org/jira/browse/DRILL-8023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Delgado updated DRILL-8023:
--------------------------------
    Description: 
If the Python libraries Dask and PyArrow are used to export a dataframe to Parquet, and the resulting file contains a column that is entirely null, Apache Drill fails to read it and raises an "INTERNAL_ERROR ERROR: null" error.  Dask and Spark are both able to read the same Dask+PyArrow Parquet files.

 

Example:

Create the Parquet files in Python, once with the pyarrow engine and once with the fastparquet engine.
{code:java}
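import pandas as pd
import dask.dataframe as dd

# Column 'C' contains only nulls; this is the column Drill fails on.
df = pd.DataFrame(
    {
        'A': [1, 2, 3],
        'B': ['a', 'b', 'c'],
        'C': [None, None, None]
    }
)

# A single partition keeps each output to one part file.
ddf = dd.from_pandas(df, npartitions=1)

# Write the same data with both engines so the results can be compared.
ddf.to_parquet('data/pyarrow_test.parquet', engine='pyarrow')
ddf.to_parquet('data/fastparquet_test.parquet', engine='fastparquet')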
{code}
Read these Parquet files with Drill:
{code:java}
Apache Drill 1.19.0
"Everything is easier with Drill."
apache drill> SELECT * FROM dfs.`data/fastparquet_test.parquet`;
+---------------------+---+---+------+
| __null_dask_index__ | A | B |  C   |
+---------------------+---+---+------+
| 0                   | 1 | a | null |
| 1                   | 2 | b | null |
| 2                   | 3 | c | null |
+---------------------+---+---+------+
3 rows selected (0.179 seconds)

apache drill> SELECT * FROM dfs.`data/pyarrow_test.parquet`;
Error: INTERNAL_ERROR ERROR: null

Fragment: 0:0

Please, refer to logs for more information.

[Error Id: 25034075-69b0-415e-8bb2-d7aa3d834653 on 75a796902ffe:31010](state=,code=0)
{code}
Narrow down to the column that is causing the issue:
{code:java}
apache drill> SELECT A, B FROM dfs.`data/pyarrow_test.parquet`;
+---+---+
| A | B |
+---+---+
| 1 | a |
| 2 | b |
| 3 | c |
+---+---+
3 rows selected (0.145 seconds)

apache drill> SELECT C FROM dfs.`data/pyarrow_test.parquet`;
Error: INTERNAL_ERROR ERROR: null

Fragment: 0:0
Please, refer to logs for more information.
[Error Id: 932ef1d1-7c56-4833-b906-0da0c7c155f9 on 75a796902ffe:31010] (state=,code=0)
{code}
Dependency versions:
{code:java}
Apache Drill 1.19.0
Python 3.9.7
dask==2021.10.0
pyarrow==6.0.0
fastparquet==0.7.1
{code}
Attached are the Parquet files I tested with.


> NULL Columns in Parquet from DASK+PyArrow Raising "INTERNAL_ERROR ERROR: null" in Drill
> ---------------------------------------------------------------------------------------
>
>                 Key: DRILL-8023
>                 URL: https://issues.apache.org/jira/browse/DRILL-8023
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>            Reporter: Alex Delgado
>            Priority: Major
>         Attachments: fastparquet_test.parquet.tar.gz, pyarrow_test.parquet.tar.gz


