Posted to issues@arrow.apache.org by "Larry Parker (Jira)" <ji...@apache.org> on 2020/08/03 21:14:00 UTC
[jira] [Created] (ARROW-9637) Speed degradation with categoricals
Larry Parker created ARROW-9637:
-----------------------------------
Summary: Speed degradation with categoricals
Key: ARROW-9637
URL: https://issues.apache.org/jira/browse/ARROW-9637
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Larry Parker
I have noticed a major slowdown when using categorical data types. For example, I have a Parquet file with 1 million rows and a query that sums 10 float columns and groups by two columns (a date column and a category column). The cardinality of the category column seems to have a major effect. When grouping on a category column of cardinality 10, performance is decent (the query runs in 150 ms). But with a cardinality of 100, the query takes 10 seconds. If I switch to a Parquet file that does *not* have categorical columns, the same query that took 10 seconds with categoricals runs in 350 ms.
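A minimal sketch of the kind of comparison described above. It is not the code behind the numbers in this report: the column names, the helpers build_frame and timed_group_sum, the synthetic data, and the choice of observed=True are all assumptions made for illustration, and it works on an in-memory DataFrame rather than reading from Parquet.

```python
import time

import numpy as np
import pandas as pd


def build_frame(n_rows=1_000_000, cardinality=100, n_dates=365, seed=0):
    """Synthetic stand-in: one date column, one categorical column, 10 floats."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2020-01-01", periods=n_dates, freq="D")
    data = {
        "date": dates[rng.integers(0, n_dates, n_rows)],
        # Zero-padded labels so category order matches plain string order.
        "cat": pd.Categorical.from_codes(
            rng.integers(0, cardinality, n_rows),
            categories=[f"c{i:03d}" for i in range(cardinality)],
        ),
    }
    for i in range(10):
        data[f"f{i}"] = rng.random(n_rows)
    return pd.DataFrame(data)


def timed_group_sum(df, observed=True):
    """Group by date and category, sum the float columns, return (result, seconds).

    `observed` is passed explicitly because its default differs between
    pandas versions; which setting the original query used is not stated.
    """
    start = time.perf_counter()
    result = df.groupby(["date", "cat"], observed=observed).sum()
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for cardinality in (10, 100):
        df_cat = build_frame(cardinality=cardinality)
        df_plain = df_cat.astype({"cat": str})  # same data, no categorical dtype
        _, t_cat = timed_group_sum(df_cat)
        _, t_plain = timed_group_sum(df_plain)
        print(f"cardinality={cardinality}: categorical {t_cat:.3f}s, plain {t_plain:.3f}s")
```

Running the same harness with observed=False, or on frames loaded back from Parquet, might help narrow down where the time goes, though nothing here establishes the cause.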
I would be happy to post the Pandas code that I'm using (including how I'm creating the Parquet file), but I first wanted to report this and see if it's a known issue.
Thanks.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)