Posted to dev@arrow.apache.org by Alexey Strokach <os...@gmail.com> on 2017/07/08 02:32:38 UTC
Error when converting csv to parquet in chunks, with the first chunk being all nulls
I am running into a problem converting a csv file into a parquet file in
chunks, where one of the string columns is null for the first several
million rows.
Self-contained dummy example:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from IPython.display import display  # display() assumes an IPython session

csv_file = '/tmp/df.csv'
parquet_file = '/tmp/df.parquet'

# Column 'a' is null in the first three rows and a string only in the last.
df = pd.DataFrame([np.nan] * 3 + ['hello'], columns=['a'])
df.to_csv(csv_file, index=False, na_rep='.')
display(df)

for i, chunk in enumerate(pd.read_csv(csv_file, chunksize=2,
                                      na_values=['.'], dtype={'a': str})):
    print(i)
    display(chunk)
    if i == 0:
        # Infer the schema from the first (all-null) chunk and reuse it.
        parquet_schema = pa.Table.from_pandas(chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                          compression='snappy')
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()  # finalize the Parquet file footer
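
Inspecting the schema inferred from the first chunk may show where things go
wrong; a minimal sketch (I'm assuming from_pandas types an all-null object
column as null rather than string, which would explain the mismatch with
later chunks):

# Sketch: what does from_pandas infer for the all-null first chunk?
# If column 'a' comes out as null instead of string, later chunks that
# actually contain strings will clash with parquet_schema.
first_chunk = next(pd.read_csv(csv_file, chunksize=2,
                               na_values=['.'], dtype={'a': str}))
print(pa.Table.from_pandas(first_chunk).schema)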
Any suggestions would be much appreciated.
Running pyarrow=0.4.1=np112py36_1 installed using conda on Linux Mint 18.1
And thanks a lot for developing pyarrow.parquet!
Alexey
Re: Error when converting csv to parquet in chunks, with the first chunk being all nulls
Posted by Alexey Strokach <os...@gmail.com>.
OK, awesome!
Thanks for the reply.
On Mon, Jul 10, 2017 at 1:42 PM, Uwe L. Korn <uw...@xhochy.com> wrote:
> Hello Alexey,
>
> you discovered a known bug in 0.4.1. If a column is only made up of None
> objects, then writing to Parquet fails. This is fixed upstream and will
> be included in the upcoming 0.5.0 release.
>
> Uwe
Re: Error when converting csv to parquet in chunks, with the first chunk being all nulls
Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Alexey,
you discovered a known bug in 0.4.1. If a column is only made up of None
objects, then writing to Parquet fails. This is fixed upstream and will
be included in the upcoming 0.5.0 release.
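
For reference, a minimal reproduction of what fails; a sketch (pq.write_table
is the one-shot writer, and the /tmp path is just an example):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A column made up only of None objects triggers the 0.4.1 bug.
df = pd.DataFrame({'a': [None, None, None]})
table = pa.Table.from_pandas(df)

# Fails on pyarrow 0.4.1; succeeds from 0.5.0 onward.
pq.write_table(table, '/tmp/all_nulls.parquet')

Until 0.5.0 is out, declaring the schema explicitly, e.g.
pa.schema([pa.field('a', pa.string())]), instead of inferring it from an
all-null chunk may sidestep the problem, though upgrading is the reliable fix.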
Uwe
On Sat, Jul 8, 2017, at 04:32 AM, Alexey Strokach wrote:
> I am running into a problem converting a csv file into a parquet file in
> chunks, where one of the string columns is null for the first several
> million rows.