Posted to issues@phoenix.apache.org by "Istvan Toth (Jira)" <ji...@apache.org> on 2022/07/15 06:28:00 UTC

[jira] [Commented] (PHOENIX-6698) hive-connector will take long time to generate splits for large phoenix tables.

    [ https://issues.apache.org/jira/browse/PHOENIX-6698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567110#comment-17567110 ] 

Istvan Toth commented on PHOENIX-6698:
--------------------------------------

Let's circle back to a higher level.

Looking at org.apache.phoenix.mapreduce.PhoenixInputFormat.generateSplits(QueryPlan, Configuration), I cannot see anything there that should take a significant amount of time.
The actual splitting by guideposts was already done when preparing the query plan, and creating a PhoenixInputSplit is mostly just a POJO constructor call.

I suspect that the parallelization that you introduce here is only masking some other inefficiency in the split generation, and we should fix that instead / as well.

Can you provide some finer-grained profiling data on where exactly the (unmodified) generateSplits() is spending ~2 seconds per region?
Ideally, something like a flame graph generated by async-profiler would be best.

> hive-connector will take long time to generate splits for large phoenix tables.
> -------------------------------------------------------------------------------
>
>                 Key: PHOENIX-6698
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6698
>             Project: Phoenix
>          Issue Type: Improvement
>          Components: hive-connector
>    Affects Versions: 5.1.0
>            Reporter: jichen
>            Assignee: jichen
>            Priority: Minor
>             Fix For: connectors-6.0.0
>
>         Attachments: PHOENIX-6698.master.v1.patch
>
>
> In our production environment, the hive-phoenix connector takes nearly 30-40 minutes to generate splits for a large Phoenix table with more than 2048 regions. This is because the 'generateSplits' function in class PhoenixInputFormat uses only one thread to generate the splits for each scan. My proposal is to use multiple threads to generate the splits in parallel. The proposal has been validated in our production environment: by changing the code to generate splits in parallel with 24 threads, the time cost is reduced to 2 minutes.
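
For illustration only, the proposal described above (one task per scan on a fixed-size thread pool) could be sketched roughly as follows. This is not the attached patch and not Phoenix code: the class ParallelSplitSketch, the method generateInParallel, and the generic toSplit function are hypothetical stand-ins for whatever generateSplits() does per scan, and the thread count is simply a parameter (the description mentions 24).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Phoenix.
public class ParallelSplitSketch {

    // Applies toSplit to every scan on a fixed-size pool and returns the
    // results in the same order as the input list.
    static <S, T> List<T> generateInParallel(List<S> scans, Function<S, T> toSplit, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per scan; keep the futures in input order.
            List<Future<T>> futures = new ArrayList<>(scans.size());
            for (S scan : scans) {
                Callable<T> task = () -> toSplit.apply(scan);
                futures.add(pool.submit(task));
            }
            // Collect results in submission order; get() blocks until each is done
            // and rethrows any per-scan failure as an ExecutionException.
            List<T> splits = new ArrayList<>(scans.size());
            for (Future<T> future : futures) {
                splits.add(future.get());
            }
            return splits;
        } finally {
            pool.shutdown();
        }
    }
}
```

A sketch like this only hides per-scan latency behind concurrency. It does not explain why a single scan would take on the order of seconds in the first place, which is the question raised in the comment above.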



--
This message was sent by Atlassian Jira
(v8.20.10#820010)