You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/11/02 23:06:00 UTC

[jira] [Work logged] (HADOOP-17975) Fallback to simple auth does not work for a secondary DistributedFileSystem instance

     [ https://issues.apache.org/jira/browse/HADOOP-17975?focusedWorklogId=674254&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-674254 ]

ASF GitHub Bot logged work on HADOOP-17975:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 02/Nov/21 23:05
            Start Date: 02/Nov/21 23:05
    Worklog Time Spent: 10m 
      Work Description: fapifta commented on pull request #3579:
URL: https://github.com/apache/hadoop/pull/3579#issuecomment-958358986


   @symious actually the connection is not two different connection, but the same connection to the same DataNode. The problem is with what we send via the connection. In a secure environment, we are doing a SASL handshake, and for that we are sending a SASL auth header, but a DataNode expecting a single auth does not expect the SASL auth header, it expects an int with a version, and for that reads the first 4 bytes, which in the SASL auth case is 0xDEAD which translates to the first cryptic -8531 in the DN log.
   
   The setupIOStreams method works based on configuration, and server defaults coming in via the DfsClient (if I remember correctly), and the shared atomicBoolean is set to control the behaviour of the SASLDataTransferClient inside the DfsClient, so that the SASLDataTransferClient does not try to do a SASL auth, does not set a SASL header, but sends the plain request.
   
   The first client behaves correctly, but when the second client tries to connect to the DN, its shared atomicBoolean (instantiated in the DfsClient constructor) is not set before the patch, so it sends a SASL header and initiates a SASL handshake first, and fails because of the EOF it gets, as the DN closes the connection when the exception happens on its side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 674254)
    Time Spent: 1h 20m  (was: 1h 10m)

> Fallback to simple auth does not work for a secondary DistributedFileSystem instance
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-17975
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17975
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The following code snippet demonstrates what is necessary to cause a failure in connection to a non secure cluster with fallback to SIMPLE auth allowed from a secure cluster.
> {code:java}
>     Configuration conf = new Configuration();
>     conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);
>     URI fsUri = new URI("hdfs://<nn_uri>");
>     conf.setBoolean("fs.hdfs.impl.disable.cache", true);
>     FileSystem fs = FileSystem.get(fsUri, conf);
>     FSDataInputStream src = fs.open(new Path("/path/to/a/file"));
>     FileOutputStream dst = new FileOutputStream(File.createTempFile("foo", "bar"));
>     IOUtils.copyBytes(src, dst, 1024);
>     // The issue happens even if we re-enable cache at this point
>     //conf.setBoolean("fs.hdfs.impl.disable.cache", false);
>     // The issue does not happen when we close the first FileSystem object
>     // before creating the second.
>     //fs.close();
>     FileSystem fs2 = FileSystem.get(fsUri, conf);
>     FSDataInputStream src2 = fs2.open(new Path("/path/to/a/file"));
>     FileOutputStream dst2 = new FileOutputStream(File.createTempFile("foo", "bar"));
>     IOUtils.copyBytes(src2, dst2, 1024);
> {code}
> The problem is that when the DfsClient is created it creates an instance of AtomicBoolean, which is propagated down into the IPC layer, where the Client.Connection instance in setupIOStreams sets its value. This connection object is cached and re-used to multiplex requests against the same DataNode.
> In case of creating a second DfsClient, the AtomicBoolean reference in the client is a new AtomicBoolean, but the Client.Connection instance is the same, and as it has a socket already open to the DataNode, it returns immediatelly from setupIOStreams, leaving the fallbackToSimpleAuth AtomicBoolean false as it is created in the DfsClient.
> This AtomicBoolean on the other hand controls how the SaslDataTransferClient handles the connection in the above level, and with this value left on the default false, the SaslDataTransferClient of the second DfsClient will not fall back to SIMPLE authentication but will try to send a SASL handshake when connecting to the DataNode.
>  
> The access to the FileSystem via the second DfsClient fails with exceptions like the following one, then fails the read with a BlockMissingException like below:
> {code}
> WARN hdfs.DFSClient: Failed to connect to /<dn_ip>:<dn_port> for file <file> for block BP-531773307-<nn_ip>-1634685133591:blk_1073741826_1002, add to deadNodes and continue. 
> java.io.EOFException: Unexpected EOF while trying to read response from server
> 	at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:552)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessage(DataTransferSaslUtil.java:215)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:455)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:393)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:267)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:215)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
> 	at org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:648)
> 	at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2980)
> 	at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:822)
> 	at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:747)
> 	at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:380)
> 	at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:658)
> 	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:589)
> 	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:771)
> 	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
> 	at java.io.DataInputStream.read(DataInputStream.java:100)
> 	at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
> 	at DfsClientTest3.main(DfsClientTest3.java:30)
> {code}
> {code}
> org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-813026743-<nn_ip>-1495248833293:blk_1139767762_66027405 file=/path/to/file
> {code}
>  
> The DataNode in the meantime logs the following:
> {code}
> ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: <dn_host>:<dn_port>:DataXceiver error processing unknown operation  src: /<client_ip>:<client_port> dst: /<dn_ip>:<dn_port>
> java.io.IOException: Version Mismatch (Expected: 28, Received: -8531 )
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:70)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:222)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
> This happens only if the second client is connecting to the same DataNode as the first one did, so might seem intermittent in case the clients are reading different files, but happens always if the two client reads the same file with replication factor 1.
> We ran into this issue during running HBase ExportSnapshot tool to move a snapshot from a non-secure to a secure cluster, the issue is loosely related to HBASE-12819 and HBASE-20433 and similar problems, I am linking these so that HBase team will see how this is relevant for them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org