Posted to common-user@hadoop.apache.org by Tenghuan He <te...@gmail.com> on 2015/12/30 18:29:37 UTC

Directly reading from datanode using JAVA API got socketTimeoutException

Hello,

I want to read directly from DataNode blocks using the Java API, as in the following code, but I get a SocketTimeoutException.

I use reflection to call the private DFSClient method connectToDN(...) and get an IOStreamPair of in and out streams, where in is used to read bytes from the DataNode.
The workhorse code is


try {
    Method connectToDN;
    Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
    connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
    connectToDN.setAccessible(true);
    IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
    in = new DataInputStream(pair.in);
    System.out.println(in.getClass());
    byte[] b = new byte[10000];
    in.readFully(b);
} catch (Exception e) {
    e.printStackTrace();
}

and the exception is

java.net.SocketTimeoutException: 11000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765 remote=/192.168.179.135:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at BlocksList.main(BlocksList.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

Could anyone tell me where the problem is?

Thanks & Regards

Tenghuan He

Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Chris Nauroth <cn...@hortonworks.com>.
Yes, good point about the combination of one local short-circuit read + one remote read.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>
Date: Monday, January 4, 2016 at 9:42 AM
To: Chris Nauroth <cn...@hortonworks.com>
Cc: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

Thanks Chris
Your answer helps me a lot!
And I got another idea.
If I launch another thread that uses short-circuit local reads to read the data stored on the local machine's DataNode, which does not take up network bandwidth, the combined read may perform better when the amount of local data is comparable to the remote data.
Does this make sense?
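
(For reference, short-circuit local reads are typically enabled with hdfs-site.xml settings roughly like the following on both the client and the DataNode; the socket path here is only the commonly cited example value, and the native libhadoop library must be available.)

<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>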

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com> wrote:
I think you can achieve something close to this with just public APIs by launching multiple threads, calling FileSystem#open to get a separate input stream in each one, and then calling seek to position each stream at a different block boundary.  Seek is a cheap operation, basically just updating internal offsets.  Seeking forward does not require reading through the earlier data byte-by-byte, so you won't pay the cost of transferring that part of the data.
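
A rough sketch of that approach, assuming an illustrative file path and one thread per block (block size and length come from getFileStatus; the class name, path, buffer size, and error handling are placeholders rather than a tuned implementation):

import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelBlockRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/data/bigfile");            // hypothetical file
        FileStatus status = fs.getFileStatus(path);
        long blockSize = status.getBlockSize();
        long fileLen = status.getLen();
        int numBlocks = (int) Math.max(1, (fileLen + blockSize - 1) / blockSize);

        ExecutorService pool = Executors.newFixedThreadPool(numBlocks);
        for (int i = 0; i < numBlocks; i++) {
            final long start = i * blockSize;
            final long length = Math.min(blockSize, fileLen - start);
            pool.submit(() -> {
                // Each thread opens its own stream and seeks to its own block boundary.
                try (FSDataInputStream in = fs.open(path)) {
                    in.seek(start);                       // cheap offset update, no data transferred yet
                    byte[] buf = new byte[64 * 1024];
                    long remaining = length;
                    while (remaining > 0) {
                        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                        if (n < 0) {
                            break;
                        }
                        remaining -= n;                   // hand buf[0..n) to the application here
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}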

Whether or not this strategy would really improve performance is subject to a lot of other factors.  If the application's single-threaded reading already saturates the network bandwidth of the NIC, then starting multiple threads is unlikely to improve performance.  Those threads will just run into contention with each other on the scarce network bandwidth resources.  If instead the application reads data gradually and performs some CPU-intensive processing as it reads, then perhaps the NIC is not saturated, and multi-threading could help.

As usual with performance work, the actual outcomes are going to be highly situational.

I hope this helps.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>
Date: Thursday, December 31, 2015 at 5:17 PM
To: Chris Nauroth <cn...@hortonworks.com>
Cc: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

The following is what I want to do.
When reading a big file that spans multiple blocks, I want to read different blocks from different nodes in parallel and thus make reading the big file faster.
Is that possible?

Thanks

On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com> wrote:
Your code has connected to a DataNode's TCP port, and the DataNode server side is likely blocked expecting the client to send some kind of request defined in the Data Transfer Protocol.  The client code here does not write a request, so the DataNode server doesn't know what to do.  Instead, the client immediately goes into a blocking read.  Since the DataNode server side doesn't know what to do, it's never going to write any bytes back to the socket connection, and therefore the client eventually times out on the read.

Stepping back, please be aware that what you are trying to do is unsupported.  Relying on private implementation details like this is likely to be brittle and buggy.  As the HDFS code evolves in the future, there is no guarantee that what you do here will work the same way in future versions.  There might not even be a connectToDN method in future versions if we decide to do some internal refactoring.

If you can give a high-level description of what you want to achieve, then perhaps we can suggest a way to do it through the public API.

--Chris Nauroth

Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Chris Nauroth <cn...@hortonworks.com>.
Yes, good point about the combination of one local short-circuit read + one remote read.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Monday, January 4, 2016 at 9:42 AM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

Thanks Chris
Your answer helps me a lot!
And I got another idea.
If launching another thread using short-circuit local reads to read data on datanode of local machine which does not take up network bandwidth, the combination reading may have a better performance if the amount of local data is comparable to remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com>> wrote:
I think you can achieve something close to this with just public APIs by launching multiple threads, calling FileSystem#open to get a separate input stream in each one, and then calling seek to position each stream at a different block boundary.  Seek is a cheap operation, basically just updating internal offsets.  Seeking forward does not require reading through the earlier data byte-by-byte, so you won't pay the cost of transferring that part of the data.

Whether or not this strategy would really improve performance is subject to a lot of other factors.  If the application's single-threaded reading already saturates the network bandwidth of the NIC, then starting multiple threads is unlikely to improve performance.  Those threads will just run into contention with each other on the scarce network bandwidth resources.  If instead the application reads data gradually and performs some CPU-intensive processing as it reads, then perhaps the NIC is not saturated, and multi-threading could help.

As usual with performance work, the actual outcomes are going to be highly situational.

I hope this helps.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Thursday, December 31, 2015 at 5:17 PM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

The following is what I want to do.
When reading a big file across multi blocks, I want to read different blocks from different node in parallel thus make reading big file faster.
Is that possible?

Thanks

On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>> wrote:
Your code has connected to a DataNode's TCP port, and the DataNode server side is likely blocked expecting the client to send some kind of request defined in the Data Transfer Protocol.  The client code here does not write a request, so the DataNode server doesn't know what to do.  Instead, the client immediately goes into a blocking read.  Since the DataNode server side doesn't know what to do, it's never going to write any bytes back to the socket connection, and therefore the client eventually times out on the read.

Stepping back, please be aware that what you are trying to do is unsupported.  Relying on private implementation details like this is likely to be brittle and buggy.  As the HDFS code evolves in the future, there is no guarantee that what you do here will work the same way in future versions.  There might not even be a connectToDN method in future versions if we decide to do some internal refactoring.

If you can give a high-level description of what you want to achieve, then perhaps we can suggest a way to do it through the public API.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Wednesday, December 30, 2015 at 9:29 AM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Directly reading from datanode using JAVA API got socketTimeoutException

​Hello,

I want to directly read from datanode blocks using JAVA API as the following code, but I got socketTimeoutException

I use reflection to call the DFSClient private method connectToDN(...), and get IOStreamPair of in and out, where in is used to read bytes from datanode.
The workhorse code is

try {
    Method connectToDN;
    Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
    connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
    connectToDN.setAccessible(true);
    IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
    in = new DataInputStream(pair.in);
    System.out.println(in.getClass());
    byte[] b = new byte[10000];
    in.readFully(b);
} catch (Exception e) {
    e.printStackTrace();

}

and the exception is

java.net.SocketTimeoutException: 11000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765<http://192.168.179.1:53765> remote=/192.168.179.135:50010<http://192.168.179.135:50010>]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at BlocksList.main(BlocksList.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​

Could anyone tell me where the problem is?

Thanks & Begards

Tenghuan He



Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Chris Nauroth <cn...@hortonworks.com>.
Yes, good point about the combination of one local short-circuit read + one remote read.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Monday, January 4, 2016 at 9:42 AM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

Thanks Chris
Your answer helps me a lot!
And I got another idea.
If launching another thread using short-circuit local reads to read data on datanode of local machine which does not take up network bandwidth, the combination reading may have a better performance if the amount of local data is comparable to remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com>> wrote:
I think you can achieve something close to this with just public APIs by launching multiple threads, calling FileSystem#open to get a separate input stream in each one, and then calling seek to position each stream at a different block boundary.  Seek is a cheap operation, basically just updating internal offsets.  Seeking forward does not require reading through the earlier data byte-by-byte, so you won't pay the cost of transferring that part of the data.

Whether or not this strategy would really improve performance is subject to a lot of other factors.  If the application's single-threaded reading already saturates the network bandwidth of the NIC, then starting multiple threads is unlikely to improve performance.  Those threads will just run into contention with each other on the scarce network bandwidth resources.  If instead the application reads data gradually and performs some CPU-intensive processing as it reads, then perhaps the NIC is not saturated, and multi-threading could help.

As usual with performance work, the actual outcomes are going to be highly situational.

I hope this helps.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Thursday, December 31, 2015 at 5:17 PM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

The following is what I want to do.
When reading a big file across multi blocks, I want to read different blocks from different node in parallel thus make reading big file faster.
Is that possible?

Thanks

On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>> wrote:
Your code has connected to a DataNode's TCP port, and the DataNode server side is likely blocked expecting the client to send some kind of request defined in the Data Transfer Protocol.  The client code here does not write a request, so the DataNode server doesn't know what to do.  Instead, the client immediately goes into a blocking read.  Since the DataNode server side doesn't know what to do, it's never going to write any bytes back to the socket connection, and therefore the client eventually times out on the read.

Stepping back, please be aware that what you are trying to do is unsupported.  Relying on private implementation details like this is likely to be brittle and buggy.  As the HDFS code evolves in the future, there is no guarantee that what you do here will work the same way in future versions.  There might not even be a connectToDN method in future versions if we decide to do some internal refactoring.

If you can give a high-level description of what you want to achieve, then perhaps we can suggest a way to do it through the public API.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Wednesday, December 30, 2015 at 9:29 AM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Directly reading from datanode using JAVA API got socketTimeoutException

​Hello,

I want to directly read from datanode blocks using JAVA API as the following code, but I got socketTimeoutException

I use reflection to call the DFSClient private method connectToDN(...), and get IOStreamPair of in and out, where in is used to read bytes from datanode.
The workhorse code is

try {
    Method connectToDN;
    Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
    connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
    connectToDN.setAccessible(true);
    IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
    in = new DataInputStream(pair.in);
    System.out.println(in.getClass());
    byte[] b = new byte[10000];
    in.readFully(b);
} catch (Exception e) {
    e.printStackTrace();

}

and the exception is

java.net.SocketTimeoutException: 11000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765<http://192.168.179.1:53765> remote=/192.168.179.135:50010<http://192.168.179.135:50010>]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at BlocksList.main(BlocksList.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​

Could anyone tell me where the problem is?

Thanks & Begards

Tenghuan He



Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Chris Nauroth <cn...@hortonworks.com>.
Yes, good point about the combination of one local short-circuit read + one remote read.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Monday, January 4, 2016 at 9:42 AM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

Thanks Chris
Your answer helps me a lot!
And I got another idea.
If launching another thread using short-circuit local reads to read data on datanode of local machine which does not take up network bandwidth, the combination reading may have a better performance if the amount of local data is comparable to remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com>> wrote:
I think you can achieve something close to this with just public APIs by launching multiple threads, calling FileSystem#open to get a separate input stream in each one, and then calling seek to position each stream at a different block boundary.  Seek is a cheap operation, basically just updating internal offsets.  Seeking forward does not require reading through the earlier data byte-by-byte, so you won't pay the cost of transferring that part of the data.

Whether or not this strategy would really improve performance is subject to a lot of other factors.  If the application's single-threaded reading already saturates the network bandwidth of the NIC, then starting multiple threads is unlikely to improve performance.  Those threads will just run into contention with each other on the scarce network bandwidth resources.  If instead the application reads data gradually and performs some CPU-intensive processing as it reads, then perhaps the NIC is not saturated, and multi-threading could help.

As usual with performance work, the actual outcomes are going to be highly situational.

I hope this helps.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Thursday, December 31, 2015 at 5:17 PM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

The following is what I want to do.
When reading a big file across multi blocks, I want to read different blocks from different node in parallel thus make reading big file faster.
Is that possible?

Thanks

On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>> wrote:
Your code has connected to a DataNode's TCP port, and the DataNode server side is likely blocked expecting the client to send some kind of request defined in the Data Transfer Protocol.  The client code here does not write a request, so the DataNode server doesn't know what to do.  Instead, the client immediately goes into a blocking read.  Since the DataNode server side doesn't know what to do, it's never going to write any bytes back to the socket connection, and therefore the client eventually times out on the read.

Stepping back, please be aware that what you are trying to do is unsupported.  Relying on private implementation details like this is likely to be brittle and buggy.  As the HDFS code evolves in the future, there is no guarantee that what you do here will work the same way in future versions.  There might not even be a connectToDN method in future versions if we decide to do some internal refactoring.

If you can give a high-level description of what you want to achieve, then perhaps we can suggest a way to do it through the public API.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Wednesday, December 30, 2015 at 9:29 AM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Directly reading from datanode using JAVA API got socketTimeoutException

​Hello,

I want to directly read from datanode blocks using JAVA API as the following code, but I got socketTimeoutException

I use reflection to call the DFSClient private method connectToDN(...), and get IOStreamPair of in and out, where in is used to read bytes from datanode.
The workhorse code is

try {
    Method connectToDN;
    Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
    connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
    connectToDN.setAccessible(true);
    IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
    in = new DataInputStream(pair.in);
    System.out.println(in.getClass());
    byte[] b = new byte[10000];
    in.readFully(b);
} catch (Exception e) {
    e.printStackTrace();

}

and the exception is

java.net.SocketTimeoutException: 11000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765<http://192.168.179.1:53765> remote=/192.168.179.135:50010<http://192.168.179.135:50010>]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at BlocksList.main(BlocksList.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​

Could anyone tell me where the problem is?

Thanks & Begards

Tenghuan He



Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Tenghuan He <te...@gmail.com>.
Thanks Chris
Your answer helps me a lot!
And I got another idea.
If launching another thread using short-circuit local reads to read data on
datanode of local machine which does not take up network bandwidth, the
combination reading may have a better performance if the amount of local
data is comparable to remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> I think you can achieve something close to this with just public APIs by
> launching multiple threads, calling FileSystem#open to get a separate input
> stream in each one, and then calling seek to position each stream at a
> different block boundary.  Seek is a cheap operation, basically just
> updating internal offsets.  Seeking forward does not require reading
> through the earlier data byte-by-byte, so you won't pay the cost of
> transferring that part of the data.
>
> Whether or not this strategy would really improve performance is subject
> to a lot of other factors.  If the application's single-threaded reading
> already saturates the network bandwidth of the NIC, then starting multiple
> threads is unlikely to improve performance.  Those threads will just run
> into contention with each other on the scarce network bandwidth resources.
> If instead the application reads data gradually and performs some
> CPU-intensive processing as it reads, then perhaps the NIC is not
> saturated, and multi-threading could help.
>
> As usual with performance work, the actual outcomes are going to be highly
> situational.
>
> I hope this helps.
>
> --Chris Nauroth
>
> From: Tenghuan He <te...@gmail.com>
> Date: Thursday, December 31, 2015 at 5:17 PM
> To: Chris Nauroth <cn...@hortonworks.com>
> Cc: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Directly reading from datanode using JAVA API got
> socketTimeoutException
>
> The following is what I want to do.
> When reading a big file across multi blocks, I want to read different
> blocks from different node in parallel thus make reading big file faster.
> Is that possible?
>
> Thanks
>
> On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>
> wrote:
>
>> Your code has connected to a DataNode's TCP port, and the DataNode server
>> side is likely blocked expecting the client to send some kind of request
>> defined in the Data Transfer Protocol.  The client code here does not write
>> a request, so the DataNode server doesn't know what to do.  Instead, the
>> client immediately goes into a blocking read.  Since the DataNode server
>> side doesn't know what to do, it's never going to write any bytes back to
>> the socket connection, and therefore the client eventually times out on the
>> read.
>>
>> Stepping back, please be aware that what you are trying to do is
>> unsupported.  Relying on private implementation details like this is likely
>> to be brittle and buggy.  As the HDFS code evolves in the future, there is
>> no guarantee that what you do here will work the same way in future
>> versions.  There might not even be a connectToDN method in future versions
>> if we decide to do some internal refactoring.
>>
>> If you can give a high-level description of what you want to achieve,
>> then perhaps we can suggest a way to do it through the public API.
>>
>> --Chris Nauroth
>>
>> From: Tenghuan He <te...@gmail.com>
>> Date: Wednesday, December 30, 2015 at 9:29 AM
>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Subject: Directly reading from datanode using JAVA API got
>> socketTimeoutException
>>
>> ​Hello,
>>
>> I want to directly read from datanode blocks using JAVA API as the
>> following code, but I got socketTimeoutException
>>
>> I use reflection to call the DFSClient private method connectToDN(...),
>> and get IOStreamPair of in and out, where in is used to read bytes from
>> datanode.
>> The workhorse code is
>>
>> try {
>>     Method connectToDN;
>>     Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>>     connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>>     connectToDN.setAccessible(true);
>>     IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
>>     in = new DataInputStream(pair.in);
>>     System.out.println(in.getClass());
>>     byte[] b = new byte[10000];
>>     in.readFully(b);
>> } catch (Exception e) {
>>     e.printStackTrace();
>>
>> }
>>
>> and the exception is
>>
>> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765
>> remote=/192.168.179.135:50010]
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>> at java.io.FilterInputStream.read(FilterInputStream.java:133)
>> at java.io.DataInputStream.readFully(DataInputStream.java:195)
>> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>> at BlocksList.main(BlocksList.java:69)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​
>>
>> Could anyone tell me where the problem is?
>>
>> Thanks & Begards
>>
>> Tenghuan He
>>
>
>

Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Tenghuan He <te...@gmail.com>.
Thanks Chris
Your answer helps me a lot!
And I got another idea.
If launching another thread using short-circuit local reads to read data on
datanode of local machine which does not take up network bandwidth, the
combination reading may have a better performance if the amount of local
data is comparable to remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> I think you can achieve something close to this with just public APIs by
> launching multiple threads, calling FileSystem#open to get a separate input
> stream in each one, and then calling seek to position each stream at a
> different block boundary.  Seek is a cheap operation, basically just
> updating internal offsets.  Seeking forward does not require reading
> through the earlier data byte-by-byte, so you won't pay the cost of
> transferring that part of the data.
>
> Whether or not this strategy would really improve performance is subject
> to a lot of other factors.  If the application's single-threaded reading
> already saturates the network bandwidth of the NIC, then starting multiple
> threads is unlikely to improve performance.  Those threads will just run
> into contention with each other on the scarce network bandwidth resources.
> If instead the application reads data gradually and performs some
> CPU-intensive processing as it reads, then perhaps the NIC is not
> saturated, and multi-threading could help.
>
> As usual with performance work, the actual outcomes are going to be highly
> situational.
>
> I hope this helps.
>
> --Chris Nauroth
>
> From: Tenghuan He <te...@gmail.com>
> Date: Thursday, December 31, 2015 at 5:17 PM
> To: Chris Nauroth <cn...@hortonworks.com>
> Cc: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Directly reading from datanode using JAVA API got
> socketTimeoutException
>
> The following is what I want to do.
> When reading a big file across multi blocks, I want to read different
> blocks from different node in parallel thus make reading big file faster.
> Is that possible?
>
> Thanks
>
> On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>
> wrote:
>
>> Your code has connected to a DataNode's TCP port, and the DataNode server
>> side is likely blocked expecting the client to send some kind of request
>> defined in the Data Transfer Protocol.  The client code here does not write
>> a request, so the DataNode server doesn't know what to do.  Instead, the
>> client immediately goes into a blocking read.  Since the DataNode server
>> side doesn't know what to do, it's never going to write any bytes back to
>> the socket connection, and therefore the client eventually times out on the
>> read.
>>
>> Stepping back, please be aware that what you are trying to do is
>> unsupported.  Relying on private implementation details like this is likely
>> to be brittle and buggy.  As the HDFS code evolves in the future, there is
>> no guarantee that what you do here will work the same way in future
>> versions.  There might not even be a connectToDN method in future versions
>> if we decide to do some internal refactoring.
>>
>> If you can give a high-level description of what you want to achieve,
>> then perhaps we can suggest a way to do it through the public API.
>>
>> --Chris Nauroth
>>
>> From: Tenghuan He <te...@gmail.com>
>> Date: Wednesday, December 30, 2015 at 9:29 AM
>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Subject: Directly reading from datanode using JAVA API got
>> socketTimeoutException
>>
>> ​Hello,
>>
>> I want to directly read from datanode blocks using JAVA API as the
>> following code, but I got socketTimeoutException
>>
>> I use reflection to call the DFSClient private method connectToDN(...),
>> and get IOStreamPair of in and out, where in is used to read bytes from
>> datanode.
>> The workhorse code is
>>
>> try {
>>     Method connectToDN;
>>     Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>>     connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>>     connectToDN.setAccessible(true);
>>     IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
>>     in = new DataInputStream(pair.in);
>>     System.out.println(in.getClass());
>>     byte[] b = new byte[10000];
>>     in.readFully(b);
>> } catch (Exception e) {
>>     e.printStackTrace();
>>
>> }
>>
>> and the exception is
>>
>> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765
>> remote=/192.168.179.135:50010]
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>> at java.io.FilterInputStream.read(FilterInputStream.java:133)
>> at java.io.DataInputStream.readFully(DataInputStream.java:195)
>> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>> at BlocksList.main(BlocksList.java:69)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​
>>
>> Could anyone tell me where the problem is?
>>
>> Thanks & Begards
>>
>> Tenghuan He
>>
>
>

Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Tenghuan He <te...@gmail.com>.
Thanks Chris
Your answer helps me a lot!
And I got another idea.
If launching another thread using short-circuit local reads to read data on
datanode of local machine which does not take up network bandwidth, the
combination reading may have a better performance if the amount of local
data is comparable to remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> I think you can achieve something close to this with just public APIs by
> launching multiple threads, calling FileSystem#open to get a separate input
> stream in each one, and then calling seek to position each stream at a
> different block boundary.  Seek is a cheap operation, basically just
> updating internal offsets.  Seeking forward does not require reading
> through the earlier data byte-by-byte, so you won't pay the cost of
> transferring that part of the data.
>
> Whether or not this strategy would really improve performance is subject
> to a lot of other factors.  If the application's single-threaded reading
> already saturates the network bandwidth of the NIC, then starting multiple
> threads is unlikely to improve performance.  Those threads will just run
> into contention with each other on the scarce network bandwidth resources.
> If instead the application reads data gradually and performs some
> CPU-intensive processing as it reads, then perhaps the NIC is not
> saturated, and multi-threading could help.
>
> As usual with performance work, the actual outcomes are going to be highly
> situational.
>
> I hope this helps.
>
> --Chris Nauroth
>
> From: Tenghuan He <te...@gmail.com>
> Date: Thursday, December 31, 2015 at 5:17 PM
> To: Chris Nauroth <cn...@hortonworks.com>
> Cc: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Directly reading from datanode using JAVA API got
> socketTimeoutException
>
> The following is what I want to do.
> When reading a big file across multi blocks, I want to read different
> blocks from different node in parallel thus make reading big file faster.
> Is that possible?
>
> Thanks
>
> On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>
> wrote:
>
>> Your code has connected to a DataNode's TCP port, and the DataNode server
>> side is likely blocked expecting the client to send some kind of request
>> defined in the Data Transfer Protocol.  The client code here does not write
>> a request, so the DataNode server doesn't know what to do.  Instead, the
>> client immediately goes into a blocking read.  Since the DataNode server
>> side doesn't know what to do, it's never going to write any bytes back to
>> the socket connection, and therefore the client eventually times out on the
>> read.
>>
>> Stepping back, please be aware that what you are trying to do is
>> unsupported.  Relying on private implementation details like this is likely
>> to be brittle and buggy.  As the HDFS code evolves in the future, there is
>> no guarantee that what you do here will work the same way in future
>> versions.  There might not even be a connectToDN method in future versions
>> if we decide to do some internal refactoring.
>>
>> If you can give a high-level description of what you want to achieve,
>> then perhaps we can suggest a way to do it through the public API.
>>
>> --Chris Nauroth
>>
>> From: Tenghuan He <te...@gmail.com>
>> Date: Wednesday, December 30, 2015 at 9:29 AM
>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Subject: Directly reading from datanode using JAVA API got
>> socketTimeoutException
>>
>> ​Hello,
>>
>> I want to directly read from datanode blocks using JAVA API as the
>> following code, but I got socketTimeoutException
>>
>> I use reflection to call the DFSClient private method connectToDN(...),
>> and get IOStreamPair of in and out, where in is used to read bytes from
>> datanode.
>> The workhorse code is
>>
>> try {
>>     Method connectToDN;
>>     Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>>     connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>>     connectToDN.setAccessible(true);
>>     IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
>>     in = new DataInputStream(pair.in);
>>     System.out.println(in.getClass());
>>     byte[] b = new byte[10000];
>>     in.readFully(b);
>> } catch (Exception e) {
>>     e.printStackTrace();
>>
>> }
>>
>> and the exception is
>>
>> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765
>> remote=/192.168.179.135:50010]
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>> at java.io.FilterInputStream.read(FilterInputStream.java:133)
>> at java.io.DataInputStream.readFully(DataInputStream.java:195)
>> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>> at BlocksList.main(BlocksList.java:69)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​
>>
>> Could anyone tell me where the problem is?
>>
>> Thanks & Begards
>>
>> Tenghuan He
>>
>
>

Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Tenghuan He <te...@gmail.com>.
Thanks Chris
Your answer helps me a lot!
And I got another idea.
If launching another thread using short-circuit local reads to read data on
datanode of local machine which does not take up network bandwidth, the
combination reading may have a better performance if the amount of local
data is comparable to remote data.
Does this make sense?

Tenghuan He

On Sun, Jan 3, 2016 at 3:00 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> I think you can achieve something close to this with just public APIs by
> launching multiple threads, calling FileSystem#open to get a separate input
> stream in each one, and then calling seek to position each stream at a
> different block boundary.  Seek is a cheap operation, basically just
> updating internal offsets.  Seeking forward does not require reading
> through the earlier data byte-by-byte, so you won't pay the cost of
> transferring that part of the data.
>
> Whether or not this strategy would really improve performance is subject
> to a lot of other factors.  If the application's single-threaded reading
> already saturates the network bandwidth of the NIC, then starting multiple
> threads is unlikely to improve performance.  Those threads will just run
> into contention with each other on the scarce network bandwidth resources.
> If instead the application reads data gradually and performs some
> CPU-intensive processing as it reads, then perhaps the NIC is not
> saturated, and multi-threading could help.
>
> As usual with performance work, the actual outcomes are going to be highly
> situational.
>
> I hope this helps.
>
> --Chris Nauroth
>
> From: Tenghuan He <te...@gmail.com>
> Date: Thursday, December 31, 2015 at 5:17 PM
> To: Chris Nauroth <cn...@hortonworks.com>
> Cc: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Directly reading from datanode using JAVA API got
> socketTimeoutException
>
> The following is what I want to do.
> When reading a big file across multi blocks, I want to read different
> blocks from different node in parallel thus make reading big file faster.
> Is that possible?
>
> Thanks
>
> On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>
> wrote:
>
>> Your code has connected to a DataNode's TCP port, and the DataNode server
>> side is likely blocked expecting the client to send some kind of request
>> defined in the Data Transfer Protocol.  The client code here does not write
>> a request, so the DataNode server doesn't know what to do.  Instead, the
>> client immediately goes into a blocking read.  Since the DataNode server
>> side doesn't know what to do, it's never going to write any bytes back to
>> the socket connection, and therefore the client eventually times out on the
>> read.
>>
>> Stepping back, please be aware that what you are trying to do is
>> unsupported.  Relying on private implementation details like this is likely
>> to be brittle and buggy.  As the HDFS code evolves in the future, there is
>> no guarantee that what you do here will work the same way in future
>> versions.  There might not even be a connectToDN method in future versions
>> if we decide to do some internal refactoring.
>>
>> If you can give a high-level description of what you want to achieve,
>> then perhaps we can suggest a way to do it through the public API.
>>
>> --Chris Nauroth
>>
>> From: Tenghuan He <te...@gmail.com>
>> Date: Wednesday, December 30, 2015 at 9:29 AM
>> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
>> Subject: Directly reading from datanode using JAVA API got
>> socketTimeoutException
>>
>> ​Hello,
>>
>> I want to directly read from datanode blocks using JAVA API as the
>> following code, but I got socketTimeoutException
>>
>> I use reflection to call the DFSClient private method connectToDN(...),
>> and get IOStreamPair of in and out, where in is used to read bytes from
>> datanode.
>> The workhorse code is
>>
>> try {
>>     Method connectToDN;
>>     Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>>     connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>>     connectToDN.setAccessible(true);
>>     IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
>>     in = new DataInputStream(pair.in);
>>     System.out.println(in.getClass());
>>     byte[] b = new byte[10000];
>>     in.readFully(b);
>> } catch (Exception e) {
>>     e.printStackTrace();
>>
>> }
>>
>> and the exception is
>>
>> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
>> channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765
>> remote=/192.168.179.135:50010]
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>> at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>> at java.io.FilterInputStream.read(FilterInputStream.java:133)
>> at java.io.DataInputStream.readFully(DataInputStream.java:195)
>> at java.io.DataInputStream.readFully(DataInputStream.java:169)
>> at BlocksList.main(BlocksList.java:69)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​
>>
>> Could anyone tell me where the problem is?
>>
>> Thanks & Begards
>>
>> Tenghuan He
>>
>
>

Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Chris Nauroth <cn...@hortonworks.com>.
I think you can achieve something close to this with just public APIs by launching multiple threads, calling FileSystem#open to get a separate input stream in each one, and then calling seek to position each stream at a different block boundary.  Seek is a cheap operation, basically just updating internal offsets.  Seeking forward does not require reading through the earlier data byte-by-byte, so you won't pay the cost of transferring that part of the data.

Whether or not this strategy would really improve performance is subject to a lot of other factors.  If the application's single-threaded reading already saturates the network bandwidth of the NIC, then starting multiple threads is unlikely to improve performance.  Those threads will just run into contention with each other on the scarce network bandwidth resources.  If instead the application reads data gradually and performs some CPU-intensive processing as it reads, then perhaps the NIC is not saturated, and multi-threading could help.

As usual with performance work, the actual outcomes are going to be highly situational.

I hope this helps.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Thursday, December 31, 2015 at 5:17 PM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

The following is what I want to do.
When reading a big file across multi blocks, I want to read different blocks from different node in parallel thus make reading big file faster.
Is that possible?

Thanks

On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>> wrote:
Your code has connected to a DataNode's TCP port, and the DataNode server side is likely blocked expecting the client to send some kind of request defined in the Data Transfer Protocol.  The client code here does not write a request, so the DataNode server doesn't know what to do.  Instead, the client immediately goes into a blocking read.  Since the DataNode server side doesn't know what to do, it's never going to write any bytes back to the socket connection, and therefore the client eventually times out on the read.

Stepping back, please be aware that what you are trying to do is unsupported.  Relying on private implementation details like this is likely to be brittle and buggy.  As the HDFS code evolves in the future, there is no guarantee that what you do here will work the same way in future versions.  There might not even be a connectToDN method in future versions if we decide to do some internal refactoring.

If you can give a high-level description of what you want to achieve, then perhaps we can suggest a way to do it through the public API.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Wednesday, December 30, 2015 at 9:29 AM
To: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Directly reading from datanode using JAVA API got socketTimeoutException

​Hello,

I want to directly read from datanode blocks using JAVA API as the following code, but I got socketTimeoutException

I use reflection to call the DFSClient private method connectToDN(...), and get IOStreamPair of in and out, where in is used to read bytes from datanode.
The workhorse code is

try {
    Method connectToDN;
    Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
    connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
    connectToDN.setAccessible(true);
    IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
    in = new DataInputStream(pair.in);
    System.out.println(in.getClass());
    byte[] b = new byte[10000];
    in.readFully(b);
} catch (Exception e) {
    e.printStackTrace();

}

and the exception is

java.net.SocketTimeoutException: 11000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765<http://192.168.179.1:53765> remote=/192.168.179.135:50010<http://192.168.179.135:50010>]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at BlocksList.main(BlocksList.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​

Could anyone tell me where the problem is?

Thanks & Begards

Tenghuan He


Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Chris Nauroth <cn...@hortonworks.com>.
I think you can achieve something close to this with just public APIs by launching multiple threads, calling FileSystem#open to get a separate input stream in each one, and then calling seek to position each stream at a different block boundary.  Seek is a cheap operation, basically just updating internal offsets.  Seeking forward does not require reading through the earlier data byte-by-byte, so you won't pay the cost of transferring that part of the data.

Whether or not this strategy would really improve performance is subject to a lot of other factors.  If the application's single-threaded reading already saturates the network bandwidth of the NIC, then starting multiple threads is unlikely to improve performance.  Those threads will just run into contention with each other on the scarce network bandwidth resources.  If instead the application reads data gradually and performs some CPU-intensive processing as it reads, then perhaps the NIC is not saturated, and multi-threading could help.

As usual with performance work, the actual outcomes are going to be highly situational.

I hope this helps.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>>
Date: Thursday, December 31, 2015 at 5:17 PM
To: Chris Nauroth <cn...@hortonworks.com>>
Cc: "user@hadoop.apache.org<ma...@hadoop.apache.org>" <us...@hadoop.apache.org>>
Subject: Re: Directly reading from datanode using JAVA API got socketTimeoutException

The following is what I want to do.
When reading a big file across multi blocks, I want to read different blocks from different node in parallel thus make reading big file faster.
Is that possible?

Thanks

On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>> wrote:
Your code has connected to a DataNode's TCP port, and the DataNode server side is likely blocked expecting the client to send some kind of request defined in the Data Transfer Protocol.  The client code here does not write a request, so the DataNode server doesn't know what to do.  Instead, the client immediately goes into a blocking read.  Since the DataNode server side doesn't know what to do, it's never going to write any bytes back to the socket connection, and therefore the client eventually times out on the read.

Stepping back, please be aware that what you are trying to do is unsupported.  Relying on private implementation details like this is likely to be brittle and buggy.  As the HDFS code evolves in the future, there is no guarantee that what you do here will work the same way in future versions.  There might not even be a connectToDN method in future versions if we decide to do some internal refactoring.

If you can give a high-level description of what you want to achieve, then perhaps we can suggest a way to do it through the public API.

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>
Date: Wednesday, December 30, 2015 at 9:29 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Directly reading from datanode using JAVA API got socketTimeoutException

​Hello,

I want to read directly from datanode blocks using the Java API, as in the following code, but I got a SocketTimeoutException.

I use reflection to call the DFSClient private method connectToDN(...), and get an IOStreamPair of in and out, where in is used to read bytes from the datanode.
The workhorse code is

try {
    Method connectToDN;
    Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
    connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
    connectToDN.setAccessible(true);
    IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
    in = new DataInputStream(pair.in);
    System.out.println(in.getClass());
    byte[] b = new byte[10000];
    in.readFully(b);
} catch (Exception e) {
    e.printStackTrace();

}

and the exception is

java.net.SocketTimeoutException: 11000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765 remote=/192.168.179.135:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at BlocksList.main(BlocksList.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​

Could anyone tell me where the problem is?

Thanks & Regards

Tenghuan He


Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Tenghuan He <te...@gmail.com>.
The following is what I want to do.
When reading a big file that spans multiple blocks, I want to read different
blocks from different nodes in parallel and thus make reading the big file faster.
Is that possible?

Thanks

On Thu, Dec 31, 2015 at 2:34 AM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> Your code has connected to a DataNode's TCP port, and the DataNode server
> side is likely blocked expecting the client to send some kind of request
> defined in the Data Transfer Protocol.  The client code here does not write
> a request, so the DataNode server doesn't know what to do.  Instead, the
> client immediately goes into a blocking read.  Since the DataNode server
> side doesn't know what to do, it's never going to write any bytes back to
> the socket connection, and therefore the client eventually times out on the
> read.
>
> Stepping back, please be aware that what you are trying to do is
> unsupported.  Relying on private implementation details like this is likely
> to be brittle and buggy.  As the HDFS code evolves in the future, there is
> no guarantee that what you do here will work the same way in future
> versions.  There might not even be a connectToDN method in future versions
> if we decide to do some internal refactoring.
>
> If you can give a high-level description of what you want to achieve, then
> perhaps we can suggest a way to do it through the public API.
>
> --Chris Nauroth
>
> From: Tenghuan He <te...@gmail.com>
> Date: Wednesday, December 30, 2015 at 9:29 AM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Directly reading from datanode using JAVA API got
> socketTimeoutException
>
> ​Hello,
>
> I want to read directly from datanode blocks using the Java API, as in the
> following code, but I got a SocketTimeoutException.
>
> I use reflection to call the DFSClient private method connectToDN(...),
> and get an IOStreamPair of in and out, where in is used to read bytes from
> the datanode.
> The workhorse code is
>
> try {
>     Method connectToDN;
>     Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
>     connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
>     connectToDN.setAccessible(true);
>     IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
>     in = new DataInputStream(pair.in);
>     System.out.println(in.getClass());
>     byte[] b = new byte[10000];
>     in.readFully(b);
> } catch (Exception e) {
>     e.printStackTrace();
>
> }
>
> and the exception is
>
> java.net.SocketTimeoutException: 11000 millis timeout while waiting for
> channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765
> remote=/192.168.179.135:50010]
> at
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
> at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.DataInputStream.readFully(DataInputStream.java:195)
> at java.io.DataInputStream.readFully(DataInputStream.java:169)
> at BlocksList.main(BlocksList.java:69)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​
>
> Could anyone tell me where the problem is?
>
> Thanks & Regards
>
> Tenghuan He
>

Re: Directly reading from datanode using JAVA API got socketTimeoutException

Posted by Chris Nauroth <cn...@hortonworks.com>.
Your code has connected to a DataNode's TCP port, and the DataNode server side is likely blocked expecting the client to send some kind of request defined in the Data Transfer Protocol.  The client code here does not write a request, so the DataNode server doesn't know what to do.  Instead, the client immediately goes into a blocking read.  Since the DataNode server side doesn't know what to do, it's never going to write any bytes back to the socket connection, and therefore the client eventually times out on the read.

Stepping back, please be aware that what you are trying to do is unsupported.  Relying on private implementation details like this is likely to be brittle and buggy.  As the HDFS code evolves in the future, there is no guarantee that what you do here will work the same way in future versions.  There might not even be a connectToDN method in future versions if we decide to do some internal refactoring.

If you can give a high-level description of what you want to achieve, then perhaps we can suggest a way to do it through the public API.
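For reference, one public-API route to reading a specific byte range is the positioned read on FSDataInputStream, which avoids depending on private DataNode internals. The sketch below is a hedged illustration only; the path, offset, and buffer size are made-up values, not anything suggested in this thread.

// Sketch only: read a byte range through FileSystem/FSDataInputStream
// instead of dialing a DataNode directly. Path, offset, and length are
// assumed values for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PositionedReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hypothetical/bigfile");   // assumed path
        byte[] buf = new byte[10000];
        try (FSDataInputStream in = fs.open(file)) {
            // readFully(position, ...) reads at an absolute offset and does
            // not move the stream's current position.
            in.readFully(128L * 1024 * 1024, buf, 0, buf.length);  // assumed offset
        }
        System.out.println("Read " + buf.length + " bytes at the assumed offset");
    }
}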

--Chris Nauroth

From: Tenghuan He <te...@gmail.com>
Date: Wednesday, December 30, 2015 at 9:29 AM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Directly reading from datanode using JAVA API got socketTimeoutException

​Hello,

I want to read directly from datanode blocks using the Java API, as in the following code, but I got a SocketTimeoutException.

I use reflection to call the DFSClient private method connectToDN(...), and get an IOStreamPair of in and out, where in is used to read bytes from the datanode.
The workhorse code is

try {
    Method connectToDN;
    Class[] paraList = {DatanodeInfo.class, int.class, LocatedBlock.class};
    connectToDN = dfsClient.getClass().getDeclaredMethod("connectToDN", paraList);
    connectToDN.setAccessible(true);
    IOStreamPair pair = (IOStreamPair) connectToDN.invoke(dfsClient, datanode, timeout, lb);
    in = new DataInputStream(pair.in);
    System.out.println(in.getClass());
    byte[] b = new byte[10000];
    in.readFully(b);
} catch (Exception e) {
    e.printStackTrace();

}

and the exception is

java.net.SocketTimeoutException: 11000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.179.1:53765 remote=/192.168.179.135:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at BlocksList.main(BlocksList.java:69)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)​

Could anyone tell me where the problem is?

Thanks & Regards

Tenghuan He
