Posted to user@drill.apache.org by sreeparna bhabani <bh...@gmail.com> on 2020/05/09 08:07:07 UTC

Error while reading from Parquet during CTAS : DATA_READ ERROR

Hi Team,

I am reaching out about an issue with Apache Drill that occurs while creating
a Parquet file from another Parquet file (generated by another tool). Please
find the details below. I have created the following Jira ticket with more details:
https://issues.apache.org/jira/browse/DRILL-7736

*Summary-*

I am rewriting one Parquet file from another Parquet file using CTAS with
PARTITION BY (). The source Parquet file was generated from Python. When I
try to rewrite the Parquet file in Drill, I get an error. The details of
the error are given below.

*Version of Apache Drill* -

1.17

*Memory config-*

DRILL_HEAP=16G
DRILL_MAX_DIRECT_MEMORY=32G

*Config information which I tried-*

exec.sort.disable_managed=true

store.parquet.reader.pagereader.async=true

store.parquet.reader.pagereader.bufferedread=false

planner.memory.max_query_memory_per_node=31147483648

drill.exec.memory.operator.output_batch_size=4194304
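
For reference, a sketch of how options like these can be applied from the Drill shell before running the CTAS. This assumes session-level settings via ALTER SESSION SET; the list above does not say whether each one was set at session or system scope, and some options may need ALTER SYSTEM SET instead.

```sql
-- Assumption: options applied per session; the values are the ones listed above.
ALTER SESSION SET `exec.sort.disable_managed` = true;
ALTER SESSION SET `store.parquet.reader.pagereader.async` = true;
ALTER SESSION SET `store.parquet.reader.pagereader.bufferedread` = false;
ALTER SESSION SET `planner.memory.max_query_memory_per_node` = 31147483648;

-- Check the effective values before running the CTAS.
SELECT * FROM sys.options WHERE name LIKE 'store.parquet.reader.pagereader%';
```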

*Details of volume of data-*

The number of rows I am trying to CTAS is 25245241, across 145 columns.

FYI - I am able to create the Parquet file using CTAS for a smaller number of rows.

*CTAS script-*

CREATE TABLE dfs.root.<Table_name>
PARTITION BY (<Column1>,<Column2>,<Column3>)
AS SELECT *
FROM dfs.root.<source_parquet>;

Please suggest how we can fix this.

Thanks and Regards,
*Sreeparna Bhabani*