Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2020/06/15 17:58:38 UTC

[GitHub] [incubator-doris] xy720 opened a new issue #3877: [optimize] Optimize spark load/broker load reading parquet format file

xy720 opened a new issue #3877:
URL: https://github.com/apache/incubator-doris/issues/3877


   Currently, broker load supports reading parquet files from remote storage, and soon we will use the parquet format as the intermediate output in spark load.
   
   But because the metadata of a parquet file is stored in separate places (file meta/column meta/page header...), the broker reader needs to seek frequently to get data, which leads to a lot of RPCs. A large number of RPCs leads to huge network costs in cross-data-center scenarios.
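
To make the RPC cost concrete, here is a minimal Python sketch of the effect. It is not Doris code: `RemoteFile` and `BufferedReader` are hypothetical stand-ins, and the assumption that every remote read costs exactly one RPC is a simplification. The read-ahead wrapper is shown only as one common way to reduce the RPC count, not as the approach chosen for this issue.

```python
class RemoteFile:
    """Hypothetical stand-in for a remote file: every read costs one RPC."""

    def __init__(self, data: bytes):
        self.data = data
        self.rpc_count = 0

    def pread(self, offset: int, length: int) -> bytes:
        self.rpc_count += 1  # assumption: one RPC per remote read
        return self.data[offset:offset + length]


class BufferedReader:
    """Hypothetical read-ahead wrapper: fetches at least buffer_size bytes per RPC."""

    def __init__(self, remote: RemoteFile, buffer_size: int):
        self.remote = remote
        self.buffer_size = buffer_size
        self.buf_start = 0
        self.buf = b""

    def pread(self, offset: int, length: int) -> bytes:
        buf_end = self.buf_start + len(self.buf)
        # Go to the remote side only when the request is not fully buffered.
        if not (self.buf_start <= offset and offset + length <= buf_end):
            self.buf_start = offset
            self.buf = self.remote.pread(offset, max(length, self.buffer_size))
        rel = offset - self.buf_start
        return self.buf[rel:rel + length]


def scan(reader, file_size: int, chunk: int = 64) -> bytes:
    """Read the whole file in small pieces, as a page-by-page reader might."""
    out = bytearray()
    for offset in range(0, file_size, chunk):
        out += reader.pread(offset, chunk)
    return bytes(out)


data = bytes(range(256)) * 64  # 16 KiB of sample data

direct = RemoteFile(data)
assert scan(direct, len(data)) == data

backing = RemoteFile(data)
buffered = BufferedReader(backing, buffer_size=4096)
assert scan(buffered, len(data)) == data

print("RPCs without buffering:", direct.rpc_count)
print("RPCs with 4 KiB read-ahead:", backing.rpc_count)
```

In this toy setup the same 16 KiB scan needs far fewer remote reads once small requests are served from a local buffer. The real access pattern of a parquet reader is less regular, so the actual savings would differ.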
   
   The table below shows the large gap in load time between the two cases.
   
   |Cross-center|RPC count|Load time|Data size|
   |----|----|----|----|
   |No|15014|60s|560M|
   |Yes|16817|2h|560M|
   |No|169766|8min|5.8G|
   |Yes|150476|14h|5.8G|
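
As a rough reading of these numbers, the following Python sketch divides each load time by its RPC count. This attributes the entire load time to RPCs, which is an assumption, so the results are upper bounds on the average per-RPC cost rather than measured latencies. The helper name `ms_per_rpc` is made up for illustration.

```python
# Rows from the table above: (cross_center, rpc_count, load_time_seconds)
runs = [
    (False, 15014, 60),          # 60s
    (True, 16817, 2 * 3600),     # 2h
    (False, 169766, 8 * 60),     # 8min
    (True, 150476, 14 * 3600),   # 14h
]


def ms_per_rpc(rpc_count: int, seconds: float) -> float:
    """Average wall-clock milliseconds per RPC if all load time were RPC time."""
    return seconds * 1000.0 / rpc_count


for cross_center, rpc_count, seconds in runs:
    label = "cross-center" if cross_center else "same-center"
    print(f"{label:12s} {rpc_count:7d} RPCs  {ms_per_rpc(rpc_count, seconds):8.2f} ms/RPC")
```

Under that assumption, the same-center runs come out at a few milliseconds per RPC and the cross-center runs at several hundred milliseconds per RPC. The RPC counts are similar in both cases, so the gap comes from the cost of each RPC, not from the number of RPCs.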
   
   

