Posted to dev@phoenix.apache.org by "Karan Mehta (JIRA)" <ji...@apache.org> on 2018/01/08 06:47:00 UTC

[jira] [Comment Edited] (PHOENIX-4489) HBase Connection leak in Phoenix MR Jobs

    [ https://issues.apache.org/jira/browse/PHOENIX-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315655#comment-16315655 ] 

Karan Mehta edited comment on PHOENIX-4489 at 1/8/18 6:46 AM:
--------------------------------------------------------------

[~vincentpoon] 
Technically, as we discussed, it shouldn't be a problem, since we go out of scope quickly after the generateSplits() method executes and the connection object should be garbage collected. However, if you check out PHOENIX-4503, the client is reading multiple Spark dataframes inside a loop (almost 50 times). Such code executes quickly and creates a large number of HConnections and ZKConnections in a short span of time, and I suspect that even though GC eventually gets triggered to clear them, it may take a while before that happens (until the JVM feels the need). This can cause issues for the application. I see many issues filed in this regard. See https://stackoverflow.com/questions/4138200/garbage-collection-on-a-local-variable
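To make the failure mode concrete, here is a minimal sketch in plain Java (no HBase dependency; LeakyConnection and the method names are hypothetical stand-ins, not Phoenix classes): a connection opened per call and never closed stays live after the local reference goes out of scope, so a tight loop accumulates open connections until the GC eventually gets around to them.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an HConnection/ZKConnection; not a Phoenix or HBase class.
class LeakyConnection implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();
    LeakyConnection() { OPEN.incrementAndGet(); }
    @Override public void close() { OPEN.decrementAndGet(); }
}

public class LeakDemo {
    // Mimics the current generateSplits(): opens a connection and lets the
    // local reference go out of scope without ever calling close().
    static void generateSplitsLeaky() {
        LeakyConnection conn = new LeakyConnection();
        // ... compute splits with conn, then simply return
    }

    // Fixed variant: try-with-resources guarantees close() on exit.
    static void generateSplitsClosed() {
        try (LeakyConnection conn = new LeakyConnection()) {
            // ... compute splits with conn
        }
    }

    public static void main(String[] args) {
        // Roughly the PHOENIX-4503 pattern: ~50 dataframe reads in one JVM.
        for (int i = 0; i < 50; i++) generateSplitsLeaky();
        // All 50 connections are still open; nothing forces the GC to reclaim them.
        System.out.println("open after leaky loop: " + LeakyConnection.OPEN.get());
        for (int i = 0; i < 50; i++) generateSplitsClosed();
        // The closed variant adds no new leaks; only the original 50 remain.
        System.out.println("open after closed loop: " + LeakyConnection.OPEN.get());
    }
}
```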

Also, since the connections are not instantiated via a factory, it is difficult to track how many are created and to limit the resources with a custom implementation. What do you think?
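A rough sketch of what factory-based instantiation would buy us (illustrative names only; this is not the HBaseFactoryProvider API): once all creation goes through one place, a custom implementation can count live connections and enforce a cap.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical factory: routing connection creation through one class makes it
// possible to count live connections and cap them. Names are illustrative.
class CountingConnectionFactory {
    private static final AtomicInteger LIVE = new AtomicInteger();
    private final int limit;

    CountingConnectionFactory(int limit) { this.limit = limit; }

    // A real implementation would return an HConnection; AutoCloseable keeps
    // this sketch dependency-free.
    AutoCloseable createConnection() {
        if (LIVE.incrementAndGet() > limit) {
            LIVE.decrementAndGet();
            throw new IllegalStateException("connection limit exceeded: " + limit);
        }
        return LIVE::decrementAndGet;  // close() decrements the live count
    }

    static int liveCount() { return LIVE.get(); }
}

public class FactoryDemo {
    public static void main(String[] args) throws Exception {
        CountingConnectionFactory factory = new CountingConnectionFactory(2);
        try (AutoCloseable a = factory.createConnection();
             AutoCloseable b = factory.createConnection()) {
            System.out.println("live=" + CountingConnectionFactory.liveCount()); // live=2
        }
        // Both connections released on scope exit.
        System.out.println("live=" + CountingConnectionFactory.liveCount()); // live=0
    }
}
```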

FYI, [~aertoria]



> HBase Connection leak in Phoenix MR Jobs
> ----------------------------------------
>
>                 Key: PHOENIX-4489
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4489
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Karan Mehta
>            Assignee: Karan Mehta
>         Attachments: PHOENIX-4489.001.patch
>
>
> Phoenix MR jobs use a custom class {{PhoenixInputFormat}} to determine the splits and the parallelism of the work. The class directly opens an HBase connection, which is not closed after use. Independently running MR jobs should not have any concern, however jobs that run through Phoenix-Spark can leak connections if this is left unclosed (since those jobs run as part of the same JVM). 
> Apart from this, the connection should be instantiated with {{HBaseFactoryProvider.getHConnectionFactory()}} instead of the default one. This can be useful if a separate client is trying to run jobs and wants to provide a custom implementation of {{HConnection}}. 
> [~jmahonin] Any ideas?
> [~jamestaylor] [~vincentpoon] Any concerns around this?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)