Posted to mapreduce-user@hadoop.apache.org by sandeep das <ya...@gmail.com> on 2015/11/19 06:44:36 UTC

Yarn application reading from Data node using short-circuit.

Hi,

While benchmarking, I noticed that a lot of TCP connections are initiated
while my Pig jobs run on YARN (MR2). These TCP connections are related to
the DataNode. Although short-circuit reads are enabled on my DataNodes, a
lot of TCP connections are still being created.

I wanted to check how we can make the YARN ApplicationMaster read data from
the DataNode using short-circuit reads, i.e. Unix domain sockets. I believe
that will improve the performance of our jobs.


Can someone please help me understand how I can make sure that MR2 jobs
created by Pig scripts read data from the DataNode using short-circuit reads
instead of TCP connections?


Regards,
Sandeep

Re: Yarn application reading from Data node using short-circuit.

Posted by sandeep das <ya...@gmail.com>.
Thanks Chris. I went through the description at the link and found that
I had not added the YARN user to the list of users allowed to read from the
Unix sockets.
I've added it now and am re-running the load to see if there is any
improvement.
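
For anyone checking the same thing, here is a minimal sketch that reads the short-circuit settings out of hdfs-site.xml text and tests whether a given user is in the allowed-users list. The thread does not say which property was changed; `dfs.block.local-path-access.user` (the legacy short-circuit setting from the linked documentation) is an assumption, and the sample values and helper names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Property names from the HDFS short-circuit local reads documentation.
# Which of them applies depends on the cluster (assumption, not from the thread).
SHORT_CIRCUIT_KEYS = (
    "dfs.client.read.shortcircuit",
    "dfs.domain.socket.path",
    "dfs.block.local-path-access.user",
)


def short_circuit_settings(hdfs_site_xml):
    """Return the short-circuit related properties found in hdfs-site.xml text."""
    root = ET.fromstring(hdfs_site_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in SHORT_CIRCUIT_KEYS:
            found[name] = (prop.findtext("value") or "").strip()
    return found


def user_allowed(settings, user):
    """True if `user` appears in the comma-separated allowed-users list."""
    allowed = settings.get("dfs.block.local-path-access.user", "")
    return user in [u.strip() for u in allowed.split(",") if u.strip()]


# Illustrative sample, not the configuration from this thread.
SAMPLE_HDFS_SITE = """<configuration>
  <property><name>dfs.client.read.shortcircuit</name><value>true</value></property>
  <property><name>dfs.domain.socket.path</name><value>/var/lib/hadoop-hdfs/dn_socket</value></property>
  <property><name>dfs.block.local-path-access.user</name><value>hdfs,yarn</value></property>
</configuration>"""

if __name__ == "__main__":
    settings = short_circuit_settings(SAMPLE_HDFS_SITE)
    print(settings)
    print("yarn allowed:", user_allowed(settings, "yarn"))
```

To run it against a real cluster, pass in the contents of the hdfs-site.xml that the DataNode and the job clients actually load; its location depends on the installation.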


Regards,
Sandeep

On Thu, Nov 19, 2015 at 10:52 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> Hello Sandeep,
>
> As long as you have enabled short-circuit read as per the documentation
> [1], I expect any Hadoop process will take advantage of it while reading a
> local replica.  However, short-circuit read will not completely eliminate
> TCP connection activity to the DataNode.  There will still be a TCP
> connection from the client to the DataNode to perform a handshake and
> establish the Unix domain socket.  This is a very small payload though
> compared to the transfer of block data over the Unix domain socket.
>
> [1]
> http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
>
> --Chris Nauroth
>
> From: sandeep das <ya...@gmail.com>
> Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Date: Wednesday, November 18, 2015 at 10:44 PM
> To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Yarn application reading from Data node using short-circuit.
>
> Hi,
>
> While benchmarking, I noticed that a lot of TCP connections are initiated
> while my Pig jobs run on YARN (MR2). These TCP connections are related to
> the DataNode. Although short-circuit reads are enabled on my DataNodes, a
> lot of TCP connections are still being created.
>
> I wanted to check how we can make the YARN ApplicationMaster read data from
> the DataNode using short-circuit reads, i.e. Unix domain sockets. I believe
> that will improve the performance of our jobs.
>
>
> Can someone please help me understand how I can make sure that MR2 jobs
> created by Pig scripts read data from the DataNode using short-circuit reads
> instead of TCP connections?
>
>
> Regards,
> Sandeep
>

Re: Yarn application reading from Data node using short-circuit.

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Sandeep,

As long as you have enabled short-circuit read as per the documentation [1], I expect any Hadoop process will take advantage of it while reading a local replica.  However, short-circuit read will not completely eliminate TCP connection activity to the DataNode.  There will still be a TCP connection from the client to the DataNode to perform a handshake and establish the Unix domain socket.  This is a very small payload though compared to the transfer of block data over the Unix domain socket.

[1] http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html
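
Since some TCP activity to the DataNode remains even with short-circuit reads, one rough way to compare runs is to count established connections on the DataNode data-transfer port before and after a configuration change. The sketch below parses text in the layout printed by `ss -tn`. The port 50010 is the Hadoop 2.x default for `dfs.datanode.address` and is an assumption about this cluster; the sample output and the function name are made up for illustration.

```python
# Hadoop 2.x default data-transfer port; an assumption about the cluster.
DATANODE_PORT = 50010


def count_datanode_connections(ss_output, port=DATANODE_PORT):
    """Count established TCP connections whose local or peer port is `port`.

    Expects text in the layout printed by `ss -tn`:
    State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
    """
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header, short lines, and anything not established.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        peer_port = fields[4].rsplit(":", 1)[-1]
        if str(port) in (local_port, peer_port):
            count += 1
    return count


# Illustrative sample, not output captured from this thread's cluster.
SAMPLE_SS = """State  Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB  0      0      10.0.0.5:41234      10.0.0.5:50010
ESTAB  0      0      10.0.0.5:50010      10.0.0.5:41234
ESTAB  0      0      10.0.0.5:22         10.0.0.9:53122
TIME-WAIT 0   0      10.0.0.5:41300      10.0.0.5:50010
"""

if __name__ == "__main__":
    print(count_datanode_connections(SAMPLE_SS))
```

A snapshot like this only counts connections; it says nothing about how much data each one carries, which is the distinction made above between the handshake and the block transfer.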

--Chris Nauroth

From: sandeep das <ya...@gmail.com>
Reply-To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Date: Wednesday, November 18, 2015 at 10:44 PM
To: "user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Yarn application reading from Data node using short-circuit.

Hi,

While benchmarking, I noticed that a lot of TCP connections are initiated while my Pig jobs run on YARN (MR2). These TCP connections are related to the DataNode. Although short-circuit reads are enabled on my DataNodes, a lot of TCP connections are still being created.

I wanted to check how we can make the YARN ApplicationMaster read data from the DataNode using short-circuit reads, i.e. Unix domain sockets. I believe that will improve the performance of our jobs.


Can someone please help me understand how I can make sure that MR2 jobs created by Pig scripts read data from the DataNode using short-circuit reads instead of TCP connections?


Regards,
Sandeep
