Posted to user@drill.apache.org by Matt <bs...@gmail.com> on 2016/01/22 01:10:34 UTC

File size limit for CTAS?

Converting CSV files to Parquet with CTAS, and getting errors on some 
larger files:

With a source file of 16.34GB (as reported in the HDFS explorer):

~~~
create table `/parquet/customer_20151017` partition by (date_tm) AS 
select * from `/csv/customer/customer_20151017.csv`;
Error: SYSTEM ERROR: IllegalArgumentException: length: -484 (expected: >= 0)

Fragment 1:1

[Error Id: da53d687-a8d5-4927-88ec-e56d5da17112 on es07:31010] 
(state=,code=0)
~~~

But the same operation on a 70 MB file of the same format succeeds.

Given that the usual HDFS advice is to avoid large numbers of small files [1], 
is there a general guideline for the maximum file size to ingest into Parquet 
files with CTAS?

---

[1] HDFS put performance is very poor with a large number of small 
files, so I am trying to find the right amount of source rollup to perform. 
Pointers to HDFS configuration guides for beginners would also be 
appreciated. I have only used HDFS with Drill - no other Hadoop experience.

Re: File size limit for CTAS?

Posted by rahul challapalli <ch...@gmail.com>.
Ignoring the CTAS part, can you try running just the select query and see if
it completes? My suspicion is that some record or field in your large file is
causing Drill to break. It would also be helpful if you could share more
information from the drillbit.log from when this error happens (search for
da53d687-a8d5-4927-88ec-e56d5da17112).
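
For example, something along these lines (a minimal sketch, reusing the file
path from the original message) would separate a read failure from a write
failure:

~~~
-- Probe a few rows first as a quick sanity check that the file is readable.
select * from `/csv/customer/customer_20151017.csv` limit 10;

-- Then run the full select from the CTAS on its own; if this reproduces the
-- IllegalArgumentException, the problem is in reading the CSV rather than in
-- the Parquet write.
select * from `/csv/customer/customer_20151017.csv`;
~~~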

- Rahul
