Posted to jira@arrow.apache.org by "Krisztian Szucs (Jira)" <ji...@apache.org> on 2021/01/12 14:45:00 UTC

[jira] [Commented] (ARROW-9441) [C++] Optimize RecordBatchReader::ReadAll

    [ https://issues.apache.org/jira/browse/ARROW-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17263381#comment-17263381 ] 

Krisztian Szucs commented on ARROW-9441:
----------------------------------------

Postponing it to 4.0

> [C++] Optimize RecordBatchReader::ReadAll
> -----------------------------------------
>
>                 Key: ARROW-9441
>                 URL: https://issues.apache.org/jira/browse/ARROW-9441
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Ji Liu
>            Priority: Major
>             Fix For: 3.0.0
>
>
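> Based on perf reports, more time is spent manipulating C++ data structures (25.86% under {{arrow::Table::FromRecordBatches}}) than reconstructing record batches from IPC messages (22.12% under {{arrow::ipc::RecordBatchStreamReaderImpl::ReadNext}}), which is not what we want.
> The following perf report was collected while running this Python code: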
> {code}
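> import pyarrow as pa
> # Read the whole IPC stream into a Table, repeated 100 times for profiling
> for i in range(100):
>     pa.ipc.open_stream('nyctaxi.arrow').read_all()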
> {code}
> {code}
> -   50.40%     0.06%  python           libarrow.so.100.0.0                  [.] arrow::RecordBatchReader::ReadAll
>    - 50.34% arrow::RecordBatchReader::ReadAll     
>       - 25.86% arrow::Table::FromRecordBatches    
>          - 18.41% arrow::SimpleRecordBatch::column
>             - 16.00% arrow::MakeArray
>                - 10.49% arrow::VisitTypeInline<arrow::internal::ArrayDataWrapper>  
>                     7.71% arrow::PrimitiveArray::SetData           
>                     1.87% arrow::StringArray::StringArray          
>            1.54% __pthread_mutex_lock                              
>            0.88% __pthread_mutex_unlock                            
>            0.67% std::_Hash_bytes                                  
>            0.60% arrow::ChunkedArray::ChunkedArray                 
>       - 22.30% arrow::RecordBatchReader::ReadAll                   
>          - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext
>             - 15.91% arrow::ipc::ReadRecordBatchInternal
>                - 15.15% arrow::ipc::LoadRecordBatch
>                   - 14.45% arrow::ipc::ArrayLoader::Load
>                      + 13.15% arrow::VisitTypeInline<arrow::ipc::ArrayLoader>
>             + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage 
>         1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch
> {code}
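> Perhaps {{ChunkedArray}} should internally hold a vector of {{ArrayData}} instead of boxed {{Array}} objects, so that {{Table::FromRecordBatches}} would not need to call {{MakeArray}} for each column of each batch (16.00% in the report above).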
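
To make the suggestion above concrete, here is a toy Python model of the idea. It is only a sketch: the classes ArrayData, BoxedArray, EagerChunkedArray and LazyChunkedArray are hypothetical stand-ins written for illustration, not Arrow's C++ or Python API, and the counter merely models the per-chunk MakeArray cost seen in the report. It assumes that boxing is the expensive step and that many chunks are never accessed individually after the table is built.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArrayData:
    """Toy stand-in for arrow::ArrayData: a type name plus raw values."""
    type_name: str
    values: List[int]


class BoxedArray:
    """Toy stand-in for a boxed arrow::Array (what MakeArray would return)."""
    created = 0  # counts how many wrappers have been built

    def __init__(self, data: ArrayData) -> None:
        BoxedArray.created += 1  # models the cost of one MakeArray call
        self.data = data

    def __len__(self) -> int:
        return len(self.data.values)


class EagerChunkedArray:
    """Boxes every chunk at construction time."""

    def __init__(self, chunks: List[ArrayData]) -> None:
        self.chunks = [BoxedArray(c) for c in chunks]
        self.length = sum(len(c) for c in self.chunks)


class LazyChunkedArray:
    """Keeps ArrayData and boxes a chunk only when it is requested."""

    def __init__(self, chunks: List[ArrayData]) -> None:
        self._data = chunks
        self._boxed: List[Optional[BoxedArray]] = [None] * len(chunks)
        # The length is computed from the raw data, so no boxing is needed.
        self.length = sum(len(c.values) for c in chunks)

    def chunk(self, i: int) -> BoxedArray:
        boxed = self._boxed[i]
        if boxed is None:
            boxed = BoxedArray(self._data[i])
            self._boxed[i] = boxed  # cache so each chunk is boxed at most once
        return boxed


if __name__ == "__main__":
    chunks = [ArrayData("int64", [1, 2, 3]) for _ in range(1000)]

    BoxedArray.created = 0
    EagerChunkedArray(chunks)
    print("eager boxing calls:", BoxedArray.created)

    BoxedArray.created = 0
    lazy = LazyChunkedArray(chunks)
    lazy.chunk(0)
    print("lazy boxing calls:", BoxedArray.created)
```

In this model the eager variant pays one boxing call per chunk at construction, while the lazy variant pays only for chunks that are actually requested. Whether the real ChunkedArray could defer boxing this way without breaking its public interface is a separate question that this sketch does not address.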



--
This message was sent by Atlassian Jira
(v8.3.4#803005)