Posted to issues@arrow.apache.org by "Liya Fan (JIRA)" <ji...@apache.org> on 2019/04/24 08:16:00 UTC

[jira] [Commented] (ARROW-5200) [Java] Provide light-weight arrow APIs

    [ https://issues.apache.org/jira/browse/ARROW-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824900#comment-16824900 ] 

Liya Fan commented on ARROW-5200:
---------------------------------

Sounds reasonable. Thanks a lot for your comments. 

We have opened a new Jira (ARROW-5209) to set up some performance benchmarks from our SQL engine, which is going to be open-sourced. The benchmarks are extracted from runs of TPC-H, an open SQL benchmark.

> [Java] Provide light-weight arrow APIs
> --------------------------------------
>
>                 Key: ARROW-5200
>                 URL: https://issues.apache.org/jira/browse/ARROW-5200
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>         Attachments: image-2019-04-23-15-19-34-187.png
>
>
> We are trying to incorporate Apache Arrow into the Apache Flink runtime. We find Arrow an amazing library, which greatly simplifies support for the columnar data format.
> However, in many scenarios we find the performance unacceptable. Our investigation shows that the reason is that there are too many redundant checks and computations in the Arrow API.
> For example, the following figure shows that a single call to the Float8Vector.get(int) method (one of the most frequently used APIs in Flink computation) involves 20+ method invocations.
> !image-2019-04-23-15-19-34-187.png!
>  
> There are many other APIs with similar problems. We understand that these checks ensure the integrity of the program. However, they also impact performance severely. In our evaluation, performance can degrade by two or three orders of magnitude compared to accessing data in heap memory.
> We think that, at least for some scenarios, the responsibility for integrity checks can be given to application owners. If they can be sure that all the checks would pass, we can provide them with light-weight APIs and the inherent high performance.
> The light-weight APIs perform only minimal checks, or no checks at all. Application owners can still develop and debug their code using the original heavy-weight APIs. Once all bugs have been fixed, they can switch to the light-weight APIs in their products and enjoy the resulting high performance.
>  
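
To make the heavy-weight versus light-weight distinction in the description above concrete, here is a minimal sketch. It is not Arrow code: DoubleColumn, get, and getUnchecked are hypothetical names made up for illustration, and the class only assumes an off-heap value buffer plus a validity bitmap. The checked accessor validates the index and the null bit on every call, while the unchecked one goes straight to the buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical fixed-width column of doubles. Illustration only, not an Arrow class. */
final class DoubleColumn {
    private final ByteBuffer data;      // off-heap value buffer, 8 bytes per slot
    private final ByteBuffer validity;  // one bit per slot, 1 = value is set
    private final int valueCount;

    DoubleColumn(int valueCount) {
        this.valueCount = valueCount;
        this.data = ByteBuffer.allocateDirect(valueCount * Double.BYTES)
                              .order(ByteOrder.nativeOrder());
        // Direct buffers are zero-initialized, so every slot starts out as null.
        this.validity = ByteBuffer.allocateDirect((valueCount + 7) / 8);
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException(
                "index " + index + " out of range [0, " + valueCount + ")");
        }
    }

    boolean isNull(int index) {
        checkIndex(index);
        return (validity.get(index >>> 3) & (1 << (index & 7))) == 0;
    }

    void set(int index, double value) {
        checkIndex(index);
        data.putDouble(index * Double.BYTES, value);
        int byteIndex = index >>> 3;
        validity.put(byteIndex, (byte) (validity.get(byteIndex) | (1 << (index & 7))));
    }

    /** Heavy-weight accessor: range check and null check on every call. */
    double get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("value at index " + index + " is null");
        }
        return data.getDouble(index * Double.BYTES);
    }

    /**
     * Light-weight accessor: skips the column-level range and null checks.
     * The caller is responsible for passing a valid, non-null index.
     * (ByteBuffer still applies its own bounds check on the absolute read.)
     */
    double getUnchecked(int index) {
        return data.getDouble(index * Double.BYTES);
    }
}
```

An application would develop and test against get, then switch hot loops to getUnchecked once it can guarantee that the indices are valid and the slots are non-null. Whether the saving matters in practice depends on what the JIT already inlines or eliminates, so this is something to measure rather than assume.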



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)