Posted to issues@arrow.apache.org by "Antoine Pitrou (JIRA)" <ji...@apache.org> on 2019/04/17 15:14:00 UTC

[jira] [Updated] (ARROW-5089) [C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size

     [ https://issues.apache.org/jira/browse/ARROW-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou updated ARROW-5089:
----------------------------------
    Component/s: C++

> [C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size
> --------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-5089
>                 URL: https://issues.apache.org/jira/browse/ARROW-5089
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Florian Jetter
>            Priority: Minor
>
> Currently, a workaround is in place for writing dictionary-encoded columns to Parquet.
> The workaround converts the dictionary-encoded array to its plain (dense) version before writing. This is painfully slow because the entire array is converted again for every row group.
> The following example is orders of magnitude slower than the non-dict-encoded version:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
>
> # Dictionary-encoded (categorical) column with 200,000 rows
> df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
>
> # A small chunk_size forces many row groups; each one triggers the
> # dictionary-to-plain conversion of the whole column again
> pq.write_table(
>     table,
>     buf,
>     chunk_size=100,
> )
> {code}
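> For comparison, here is a minimal sketch (not part of the original report; the names df_plain, table_plain and buf_plain are illustrative) of the same write without the categorical dtype. With a plain string column there is no dictionary-to-plain conversion per row group, so the write completes quickly even with the same small chunk_size:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
>
> # Same data as above, but as plain strings instead of a pandas categorical
> df_plain = pd.DataFrame({"col": ["A", "B"] * 100000})
> table_plain = pa.Table.from_pandas(df_plain)
> buf_plain = pa.BufferOutputStream()
>
> # Same small chunk_size, but no per-row-group conversion step
> pq.write_table(
>     table_plain,
>     buf_plain,
>     chunk_size=100,
> )
> {code}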



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)