Posted to dev@arrow.apache.org by Alexey Strokach <os...@gmail.com> on 2017/07/08 02:32:38 UTC

Error when converting csv to parquet in chunks, with the first chunk being all nulls

I am running into a problem converting a CSV file to a Parquet file in
chunks, where one of the string columns is null for the first several
million rows.

Self-contained dummy example:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from IPython.display import display  # display() implies a Jupyter session

csv_file = '/tmp/df.csv'
parquet_file = '/tmp/df.parquet'

# Column 'a' is null for the first three rows, then holds a string.
df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
df.to_csv(csv_file, index=False, na_rep='.')
display(df)

for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
                                      na_values=['.'], dtype={'a': str})):
    print(i)
    display(chunk)
    if i == 0:
        # Infer the schema from the first chunk and reuse it for all chunks.
        parquet_schema = pa.Table.from_pandas(chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                          compression='snappy')
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()  # finalize the file by writing the Parquet footer
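
The first chunk is entirely null, so the schema inferred from it is what
every later chunk gets checked against. A minimal sketch of inspecting that
inferred schema (assuming the same imports and CSV file as above; the exact
inferred type may vary by pyarrow version):

first_chunk = next(pd.read_csv(csv_file, chunksize=2,
                               na_values=['.'], dtype={'a': str}))
# With no non-null values in 'a', pyarrow has nothing to infer a string
# type from, so the schema does not come out as string.
print(pa.Table.from_pandas(first_chunk).schema)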

Any suggestions would be much appreciated.

Running pyarrow=0.4.1=np112py36_1, installed via conda, on Linux Mint 18.1.

And thanks a lot for developing pyarrow.parquet!
Alexey

Re: Error when converting csv to parquet in chunks, with the first chunk being all nulls

Posted by Alexey Strokach <os...@gmail.com>.
OK, awesome!

Thanks for the reply.

On Mon, Jul 10, 2017 at 1:42 PM, Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Alexey,
>
> You discovered a known bug in 0.4.1. If a column is made up only of None
> objects, then writing to Parquet fails. This is fixed upstream and will
> be included in the upcoming 0.5.0 release.
>
> Uwe

Re: Error when converting csv to parquet in chunks, with the first chunk being all nulls

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Alexey,

You discovered a known bug in 0.4.1. If a column is made up only of None
objects, then writing to Parquet fails. This is fixed upstream and will
be included in the upcoming 0.5.0 release.
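
Until 0.5.0 is out, a possible workaround is to declare the schema yourself
instead of inferring it from the all-null first chunk. A sketch of the idea,
untested against 0.4.1:

import pyarrow as pa
import pyarrow.parquet as pq

# Declare column 'a' as string up front; '/tmp/df.parquet' is the output
# path from the example above.
parquet_schema = pa.schema([pa.field('a', pa.string())])
parquet_writer = pq.ParquetWriter('/tmp/df.parquet', parquet_schema,
                                  compression='snappy')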

Uwe

