Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/09/16 16:59:00 UTC
[jira] [Assigned] (ARROW-13965) [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance
[ https://issues.apache.org/jira/browse/ARROW-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou reassigned ARROW-13965:
--------------------------------------
Assignee: Edward Seidl
> [C++] dynamic_casts in parquet TypedColumnWriterImpl impacting performance
> --------------------------------------------------------------------------
>
> Key: ARROW-13965
> URL: https://issues.apache.org/jira/browse/ARROW-13965
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Environment: arrow 6.0.0-SNAPSHOT on both RHEL8 (gcc 8.4.1) and MacOS 11.5.2 (clang 11.0.0)
> Reporter: Edward Seidl
> Assignee: Edward Seidl
> Priority: Minor
> Labels: pull-request-available
> Fix For: 6.0.0
>
> Attachments: arrow_downcast.patch
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> The methods WriteDictionaryPage(), CheckDictionarySizeLimit(), WriteValues(), and WriteValuesSpaced() in TypedColumnWriterImpl (cpp/src/parquet/column_writer.cc) dynamic_cast the current_encoder_ object to either a DictEncoder or a ValueEncoderType pointer on every call. When WriteBatch() is called with a large number of values the cost is negligible, but when writing batches of 1 (as the stream API does), these dynamic_casts can consume a great deal of CPU. Profiling with gperftools against code I wrote to do a log-structured merge of several Parquet files, I measured the dynamic_casts taking as much as 25% of execution time.
> By modifying TypedColumnWriterImpl to save downcast observer pointers of the appropriate types, I was able to cut my execution time from 32 to 24 seconds, validating the gperftools results. I've attached a patch to show what I did.
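The general idea described above, casting once when the encoder is installed and reusing non-owning observer pointers on every write, might look roughly like the following. This is a minimal sketch and not the attached patch: Encoder, PlainEncoder, DictEncoder, and CachedCastWriter are hypothetical stand-ins invented for illustration, not the real Parquet classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for an encoder hierarchy (NOT the real Parquet types).
struct Encoder {
  virtual ~Encoder() = default;
};

struct PlainEncoder : Encoder {
  std::vector<int64_t> values;
  void Put(int64_t v) { values.push_back(v); }
};

struct DictEncoder : Encoder {
  std::vector<int64_t> dictionary;
  std::vector<int32_t> indices;
  // Append the dictionary index of v, adding v to the dictionary if it is new.
  void Put(int64_t v) {
    for (std::size_t i = 0; i < dictionary.size(); ++i) {
      if (dictionary[i] == v) {
        indices.push_back(static_cast<int32_t>(i));
        return;
      }
    }
    dictionary.push_back(v);
    indices.push_back(static_cast<int32_t>(dictionary.size() - 1));
  }
};

// Hypothetical writer that pays for dynamic_cast only when the encoder changes.
class CachedCastWriter {
 public:
  explicit CachedCastWriter(std::unique_ptr<Encoder> encoder) {
    SetEncoder(std::move(encoder));
  }

  // Refresh the cached observer pointers every time the owning pointer changes,
  // so they can never dangle or point at a stale encoder.
  void SetEncoder(std::unique_ptr<Encoder> encoder) {
    encoder_ = std::move(encoder);
    plain_ = dynamic_cast<PlainEncoder*>(encoder_.get());
    dict_ = dynamic_cast<DictEncoder*>(encoder_.get());
    assert(plain_ != nullptr || dict_ != nullptr);
  }

  // Hot path: a null check instead of a dynamic_cast per value.
  void WriteValue(int64_t v) {
    if (dict_ != nullptr) {
      dict_->Put(v);
    } else if (plain_ != nullptr) {
      plain_->Put(v);
    }
  }

  const DictEncoder* dict() const { return dict_; }
  const PlainEncoder* plain() const { return plain_; }

 private:
  std::unique_ptr<Encoder> encoder_;  // owning pointer
  PlainEncoder* plain_ = nullptr;     // non-owning observer, may be null
  DictEncoder* dict_ = nullptr;       // non-owning observer, may be null
};
```

The point that matters in a design like this is that the cached pointers are refreshed in the same place the owning pointer is replaced (for example on a fallback from dictionary to plain encoding); otherwise the observers would go stale.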