Posted to dev@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2021/03/29 19:22:00 UTC
[jira] [Resolved] (HBASE-25692) Failure to instantiate WALCellCodec leaks socket in replication

     [ https://issues.apache.org/jira/browse/HBASE-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack resolved HBASE-25692.
-----------------------------------
    Hadoop Flags: Reviewed
      Resolution: Fixed

Merged to 2.3+. Shout if you want it to go elsewhere [~elserj].

> Failure to instantiate WALCellCodec leaks socket in replication
> ---------------------------------------------------------------
>
>                 Key: HBASE-25692
>                 URL: https://issues.apache.org/jira/browse/HBASE-25692
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.1.0, 2.2.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, 2.3.1, 2.1.4, 2.0.6, 2.1.5, 2.2.1, 2.1.6, 2.1.7, 2.2.2, 2.1.8, 2.2.3, 2.3.3, 2.1.9, 2.2.4, 2.4.0, 2.2.5, 2.2.6, 2.3.2, 2.3.4, 2.4.1, 2.4.2
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.5.0, 2.4.3, 2.3.6
>
>
> I was looking at an HBase user's cluster with [~danilocop]. They had two otherwise identical clusters, one of which regularly had sockets in CLOSE_WAIT going from RegionServers to a distributed storage appliance.
> After a lot of analysis, we eventually figured out that these sockets in CLOSE_WAIT were directly related to an FSDataInputStream which we forgot to close inside of the RegionServer. The subtlety was that only one of these HBase clusters was set up to do replication (to the other cluster). The HBase cluster experiencing this problem was shipping edits to a peer, and had previously been using Phoenix. At some point, the cluster had Phoenix removed from it.
> What we found was that replication still had WALs to ship which were for Phoenix tables. Phoenix, in this version, still used the custom WALCellCodec; however, this codec class was missing from the RS classpath after the owner of the cluster removed Phoenix.
> When we try to instantiate the Codec implementation via ReflectionUtils, we end up throwing an UnsupportedOperationException which wraps a ClassNotFoundException. However, in WALFactory, we _only_ close the FSDataInputStream when we catch an IOException.
> Thus, replication sits in a "fast" loop, trying to ship these edits, each time leaking a new socket because of the InputStream not being closed. There is an obvious workaround for this specific issue, but we should not leak this inside HBase.
> The approximate 2.1.x stack trace that led us to this is below.
> {noformat}
> 2021-03-11 18:19:20,364 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader: Failed to read stream of replication entries
> java.io.IOException: Cannot get log reader
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:366)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:291)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:427)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:354)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:302)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:293)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:174)
> 	at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:100)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192)
> 	at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138)
> Caused by: java.lang.UnsupportedOperationException: Unable to find org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
> 	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:47)
> 	at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec.create(WALCellCodec.java:106)
> 	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.getCodec(ProtobufLogReader.java:301)
> 	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initAfterCompression(ProtobufLogReader.java:311)
> 	at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:81)
> 	at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:168)
> 	at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:321)
> 	... 10 more
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> 	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> 	at java.lang.Class.forName0(Native Method)
> 	at java.lang.Class.forName(Class.java:264)
> 	at org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:43)
> 	... 16 more
> {noformat}
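
The leak pattern described above can be illustrated with a minimal, self-contained sketch. This is not the HBase code or the merged patch: TrackedStream, initReader, openLeaky, and openSafe are hypothetical stand-ins for FSDataInputStream, reader initialization, and the reader-creation path in WALFactory. It only shows why a catch limited to IOException misses an unchecked exception such as UnsupportedOperationException, and one common way to close the stream on any failure.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

class StreamLeakSketch {
    /** Number of streams currently open; stands in for sockets held by the process. */
    static final AtomicInteger OPEN_STREAMS = new AtomicInteger();

    /** Hypothetical stand-in for FSDataInputStream. */
    static final class TrackedStream implements Closeable {
        private boolean closed;

        TrackedStream() {
            OPEN_STREAMS.incrementAndGet();
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                OPEN_STREAMS.decrementAndGet();
            }
        }
    }

    /** Hypothetical reader init: fails the way a missing codec class would. */
    static void initReader(TrackedStream in) throws IOException {
        throw new UnsupportedOperationException("Unable to find codec",
            new ClassNotFoundException("some.missing.Codec"));
    }

    /** Leaky variant: the stream is closed only when an IOException is caught. */
    static void openLeaky() throws IOException {
        TrackedStream in = new TrackedStream();
        try {
            initReader(in);
        } catch (IOException e) {
            in.close();
            throw e;
        }
        // An unchecked exception bypasses the catch above, so 'in' stays open.
    }

    /** Safe variant: the stream is closed on any failure, checked or unchecked. */
    static void openSafe() throws IOException {
        TrackedStream in = new TrackedStream();
        boolean success = false;
        try {
            initReader(in);
            success = true;
        } finally {
            if (!success) {
                in.close();
            }
        }
    }

    /** Retries the open the given number of times and reports streams left open. */
    static int leakedAfter(int attempts, boolean safe) {
        OPEN_STREAMS.set(0);
        for (int i = 0; i < attempts; i++) {
            try {
                if (safe) {
                    openSafe();
                } else {
                    openLeaky();
                }
            } catch (IOException | RuntimeException e) {
                // A replication reader would log the failure and retry here.
            }
        }
        return OPEN_STREAMS.get();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakedAfter(5, false));
        System.out.println("safe:  " + leakedAfter(5, true));
    }
}
```

In this sketch, five retries leave five streams open with the leaky variant and none with the safe one, which mirrors the "fast" retry loop leaking one socket per attempt.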



--
This message was sent by Atlassian Jira
(v8.3.4#803005)