Posted to mapreduce-user@hadoop.apache.org by rab ra <ra...@gmail.com> on 2014/01/24 06:25:24 UTC

HDFS data transfer is faster than SCP based transfer?

Hello,

I have a use case that requires transferring input files from remote storage
over SCP (using the JSch jar). To optimize it, I pre-loaded all my input
files into HDFS and changed the application so that it copies the required
files from HDFS instead. When a tasktracker runs a task, it copies the input
files it needs from HDFS to its local directory. All my tasktrackers are
also datanodes. With this change, my use case runs faster. The only
modification in my application is that files are copied from HDFS instead
of being transferred over SCP. The use case also involves parallel
operations (run on the tasktrackers) that do a lot of file transfer, and all
of those transfers are now HDFS copies.

Can anyone confirm that HDFS transfer is faster, as I observed? Is it
because it uses TCP/IP? Can anyone give me good reasons that explain the
decrease in time?


with thanks and regards
rab

RE: HDFS data transfer is faster than SCP based transfer?

Posted by John Lilley <jo...@redpoint.net>.
Back to your original question, most likely the HDFS transfer is faster because HDFS operates in parallel on the cluster, whereas it sounds like your SCP transfer is coming from a single server.
There may be additional benefit to be gained by suggesting data-local scheduling of tasks, but we'd need to know more about your application.
john
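
As a rough back-of-the-envelope illustration of the parallelism argument above (not a measurement from this thread): when every task pulls from one SCP server, the tasks share that server's bandwidth, whereas HDFS reads can be spread across the datanodes. The bandwidth figures, node counts, and function names below are hypothetical.

```python
def per_task_rate_single_source(source_mb_s, tasks):
    """Per-task rate when all tasks share one server's outbound bandwidth."""
    return source_mb_s / tasks


def per_task_rate_parallel(node_mb_s, nodes, tasks):
    """Per-task rate when reads are spread evenly over `nodes` datanodes.

    A single task is assumed never to exceed one node's rate.
    """
    return min(node_mb_s, node_mb_s * nodes / tasks)


def copy_seconds(file_mb, rate_mb_s):
    """Time to copy one file at the given per-task rate."""
    return file_mb / rate_mb_s


if __name__ == "__main__":
    # Hypothetical cluster: 10 nodes at 100 MB/s each, 20 concurrent tasks,
    # each copying a 1000 MB input file.
    scp_rate = per_task_rate_single_source(100.0, 20)
    hdfs_rate = per_task_rate_parallel(100.0, 10, 20)
    print("SCP from one server:", copy_seconds(1000.0, scp_rate), "s per file")
    print("HDFS across nodes:  ", copy_seconds(1000.0, hdfs_rate), "s per file")
```

The model ignores replication placement, disk contention, and encryption cost, so it only shows the direction of the effect, not its size.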
From: rab ra [mailto:rabmdu@gmail.com]
Sent: Saturday, January 25, 2014 7:29 AM
To: user@hadoop.apache.org
Subject: RE: HDFS data transfer is faster than SCP based transfer?


The input files are passed as arguments to a binary that is executed by the map process. This binary cannot read from HDFS, and I can't rewrite it.
On 25 Jan 2014 19:47, "John Lilley" <jo...@redpoint.net> wrote:
There are no short-circuit writes, only reads, AFAIK.
Is it necessary to transfer from HDFS to local disk?  Can you read from HDFS directly using the FileSystem interface?
john

From: Shekhar Sharma [mailto:shekhar2581@gmail.com]
Sent: Saturday, January 25, 2014 3:44 AM
To: user@hadoop.apache.org
Subject: Re: HDFS data transfer is faster than SCP based transfer?


We have the concept of short-circuit reads, where the client reads block data directly from the local disk instead of going through the datanode, which improves read performance. Do we have a similar concept for writes, i.e. short-circuit writes?
On 25 Jan 2014 16:10, "Harsh J" <ha...@cloudera.com> wrote:
There is a lot of difference here. Both use TCP underneath, but note
that SCP encrypts the data in transit, whereas a stock HDFS
configuration does not.

You can also ask SCP to compress the transfer via the "-C" argument,
by the way. I am unsure whether you already applied that before your
test; it may change the difference you see. Also, if security is not a
concern during the transfer, the encryption algorithm can be changed to
a weaker one via "-c arcfour".
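
To make the two flags concrete, here is a small sketch that assembles an `scp` command line with optional compression (`-C`) and an explicit cipher (`-c`). The host, paths, and the `scp_command` helper are hypothetical. Note that the original poster uses JSch rather than the `scp` binary, so the equivalent settings would have to be applied in JSch's session configuration instead. Newer OpenSSH releases no longer offer `arcfour`; `ssh -Q cipher` lists what a given client supports.

```python
import subprocess


def scp_command(source, dest, compress=False, cipher=None):
    """Build an scp argument list; `-C` compresses, `-c` selects the cipher."""
    cmd = ["scp"]
    if compress:
        cmd.append("-C")
    if cipher:
        cmd += ["-c", cipher]
    return cmd + [source, dest]


def run_scp(source, dest, **options):
    """Run the transfer and raise CalledProcessError if scp exits non-zero."""
    subprocess.run(scp_command(source, dest, **options), check=True)


if __name__ == "__main__":
    # Hypothetical remote file and local target; only prints the command.
    print(scp_command("user@storage:/data/input.dat", "/tmp/input.dat",
                      compress=True, cipher="arcfour"))
```

Timing the same file with and without each option is a simple way to see how much of the SCP cost comes from encryption versus the network.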

On Fri, Jan 24, 2014 at 10:55 AM, rab ra <ra...@gmail.com> wrote:
> Hello
>
> I have a use case that requires transfer of input files from remote storage
> using SCP protocol (using jSCH jar).  To optimize this use case, I have
> pre-loaded all my input files into HDFS and modified my use case so that it
> copies required files from HDFS. So, when tasktrackers works, it copies
> required number of input files to its local directory from HDFS. All my
> tasktrackers are also datanodes. I could see my use case has run faster. The
> only modification in my application is that file copy from HDFS instead of
> transfer using SCP. Also, my use case involves parallel operations (run in
> tasktrackers) and they do lot of file transfer. Now all these transfers are
> replaced with HDFS copy.
>
> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because, it
> uses TCP/IP? Can anyone give me reasonable reasons to support the decrease
> of time?
>
>
> with thanks and regards
> rab



--
Harsh J

RE: HDFS data transfer is faster than SCP based transfer?

Posted by rab ra <ra...@gmail.com>.
The input files are passed as arguments to a binary that is executed by the
map process. This binary cannot read from HDFS, and I can't rewrite it.
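
For a binary that can only read local files, one workable pattern is to stage each input from HDFS into a scratch directory and then pass the local path to the binary. A minimal sketch, assuming the `hadoop` command is on the task's PATH; the binary path, the HDFS path, and the helper names are hypothetical.

```python
import posixpath
import subprocess
import tempfile


def hdfs_get_command(hdfs_path, local_dir):
    """Build the `hadoop fs -get` command that copies one HDFS file locally."""
    return ["hadoop", "fs", "-get", hdfs_path, local_dir]


def binary_command(binary, hdfs_path, local_dir):
    """Build the command that runs the binary on the staged local copy."""
    local_file = posixpath.join(local_dir, posixpath.basename(hdfs_path))
    return [binary, local_file]


def stage_and_run(binary, hdfs_path):
    """Copy the input out of HDFS, run the binary on it, then clean up."""
    with tempfile.TemporaryDirectory() as scratch:
        subprocess.run(hdfs_get_command(hdfs_path, scratch), check=True)
        result = subprocess.run(binary_command(binary, hdfs_path, scratch),
                                check=True)
        return result.returncode
```

Because the tasktrackers are also datanodes, the copy in `stage_and_run` may be served from a local replica, which is one plausible reason the staged approach beats a remote SCP pull.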
On 25 Jan 2014 19:47, "John Lilley" <jo...@redpoint.net> wrote:

>  There are no short-circuit writes, only reads, AFAIK.
>
> Is it necessary to transfer from HDFS to local disk?  Can you read from
> HDFS directly using the FileSystem interface?
>
> john
>
>
>
> *From:* Shekhar Sharma [mailto:shekhar2581@gmail.com]
> *Sent:* Saturday, January 25, 2014 3:44 AM
> *To:* user@hadoop.apache.org
> *Subject:* Re: HDFS data transfer is faster than SCP based transfer?
>
>
>
> We have the concept of short circuit reads which directly reads from data
> node which improve read performance. Do we have similar concept like short
> circuit writes
>
> On 25 Jan 2014 16:10, "Harsh J" <ha...@cloudera.com> wrote:
>
> There's a lot of difference here, although both do use TCP underneath,
> but do note that SCP securely encrypts data but stock HDFS
> configuration does not.
>
> You can also ask SCP to compress data transfer via the "-C" argument
> btw - unsure if you already applied that pre-test - it may help show
> up some difference. Also, the encryption algorithm can be changed to a
> weaker one if security is not a concern during the transfer, via "-c
> arcfour".
>
> On Fri, Jan 24, 2014 at 10:55 AM, rab ra <ra...@gmail.com> wrote:
> > Hello
> >
> > I have a use case that requires transfer of input files from remote storage
> > using SCP protocol (using jSCH jar).  To optimize this use case, I have
> > pre-loaded all my input files into HDFS and modified my use case so that it
> > copies required files from HDFS. So, when tasktrackers works, it copies
> > required number of input files to its local directory from HDFS. All my
> > tasktrackers are also datanodes. I could see my use case has run faster. The
> > only modification in my application is that file copy from HDFS instead of
> > transfer using SCP. Also, my use case involves parallel operations (run in
> > tasktrackers) and they do lot of file transfer. Now all these transfers are
> > replaced with HDFS copy.
> >
> > Can anyone tell me HDFS transfer is faster as I witnessed? Is it because, it
> > uses TCP/IP? Can anyone give me reasonable reasons to support the decrease
> > of time?
> >
> >
> > with thanks and regards
> > rab
>
>
>
> --
> Harsh J
>

RE: HDFS data transfer is faster than SCP based transfer?

Posted by John Lilley <jo...@redpoint.net>.
There are no short-circuit writes, only reads, AFAIK.
Is it necessary to transfer from HDFS to local disk?  Can you read from HDFS directly using the FileSystem interface?
john

From: Shekhar Sharma [mailto:shekhar2581@gmail.com]
Sent: Saturday, January 25, 2014 3:44 AM
To: user@hadoop.apache.org
Subject: Re: HDFS data transfer is faster than SCP based transfer?


We have the concept of short-circuit reads, where the client reads block data directly from the local disk instead of going through the datanode, which improves read performance. Do we have a similar concept for writes, i.e. short-circuit writes?
On 25 Jan 2014 16:10, "Harsh J" <ha...@cloudera.com> wrote:
There's a lot of difference here, although both do use TCP underneath,
but do note that SCP securely encrypts data but stock HDFS
configuration does not.

You can also ask SCP to compress data transfer via the "-C" argument
btw - unsure if you already applied that pre-test - it may help show
up some difference. Also, the encryption algorithm can be changed to a
weaker one if security is not a concern during the transfer, via "-c
arcfour".

On Fri, Jan 24, 2014 at 10:55 AM, rab ra <ra...@gmail.com> wrote:
> Hello
>
> I have a use case that requires transfer of input files from remote storage
> using SCP protocol (using jSCH jar).  To optimize this use case, I have
> pre-loaded all my input files into HDFS and modified my use case so that it
> copies required files from HDFS. So, when tasktrackers works, it copies
> required number of input files to its local directory from HDFS. All my
> tasktrackers are also datanodes. I could see my use case has run faster. The
> only modification in my application is that file copy from HDFS instead of
> transfer using SCP. Also, my use case involves parallel operations (run in
> tasktrackers) and they do lot of file transfer. Now all these transfers are
> replaced with HDFS copy.
>
> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because, it
> uses TCP/IP? Can anyone give me reasonable reasons to support the decrease
> of time?
>
>
> with thanks and regards
> rab



--
Harsh J

Re: HDFS data transfer is faster than SCP based transfer?

Posted by Shekhar Sharma <sh...@gmail.com>.
We have the concept of short-circuit reads, where the client reads block
data directly from the local disk instead of going through the datanode,
which improves read performance. Do we have a similar concept for writes,
i.e. short-circuit writes?
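
For reference, in Hadoop 2.x-era releases short-circuit local reads are switched on through two `hdfs-site.xml` properties, `dfs.client.read.shortcircuit` and `dfs.domain.socket.path`. The sketch below only renders such a fragment. The socket path is an example value, the `hdfs_site_fragment` helper is hypothetical, and the thread does not say whether the cluster in question runs a release that supports these properties.

```python
import xml.etree.ElementTree as ET

# Example values only; the socket path must match the datanode's own setting.
SHORT_CIRCUIT_PROPERTIES = {
    "dfs.client.read.shortcircuit": "true",
    "dfs.domain.socket.path": "/var/lib/hadoop-hdfs/dn_socket",
}


def hdfs_site_fragment(properties):
    """Render a dict of settings as a <configuration> element for hdfs-site.xml."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hdfs_site_fragment(SHORT_CIRCUIT_PROPERTIES))
```

The same properties would need to be visible to both the datanode and the reading client for the local read path to be used.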
On 25 Jan 2014 16:10, "Harsh J" <ha...@cloudera.com> wrote:

> There's a lot of difference here, although both do use TCP underneath,
> but do note that SCP securely encrypts data but stock HDFS
> configuration does not.
>
> You can also ask SCP to compress data transfer via the "-C" argument
> btw - unsure if you already applied that pre-test - it may help show
> up some difference. Also, the encryption algorithm can be changed to a
> weaker one if security is not a concern during the transfer, via "-c
> arcfour".
>
> On Fri, Jan 24, 2014 at 10:55 AM, rab ra <ra...@gmail.com> wrote:
> > Hello
> >
> > I have a use case that requires transfer of input files from remote
> storage
> > using SCP protocol (using jSCH jar).  To optimize this use case, I have
> > pre-loaded all my input files into HDFS and modified my use case so that
> it
> > copies required files from HDFS. So, when tasktrackers works, it copies
> > required number of input files to its local directory from HDFS. All my
> > tasktrackers are also datanodes. I could see my use case has run faster.
> The
> > only modification in my application is that file copy from HDFS instead
> of
> > transfer using SCP. Also, my use case involves parallel operations (run
> in
> > tasktrackers) and they do lot of file transfer. Now all these transfers
> are
> > replaced with HDFS copy.
> >
> > Can anyone tell me HDFS transfer is faster as I witnessed? Is it
> because, it
> > uses TCP/IP? Can anyone give me reasonable reasons to support the
> decrease
> > of time?
> >
> >
> > with thanks and regards
> > rab
>
>
>
> --
> Harsh J
>

Re: HDFS data transfer is faster than SCP based transfer?

Posted by Harsh J <ha...@cloudera.com>.
There are a lot of differences here. Both use TCP underneath, but note
that SCP encrypts the data in transit, while a stock HDFS
configuration does not.

You can also ask SCP to compress the transfer via the "-C" argument,
btw. I am unsure whether you already tried that before testing, but it
may make a noticeable difference. Also, if security is not a concern
during the transfer, the encryption algorithm can be changed to a
weaker, cheaper one via "-c arcfour".
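
As a rough illustration of the two flags mentioned above, here is a minimal
Python sketch that only assembles the scp command line and does not run it.
The helper name build_scp_command and the example paths are hypothetical, not
something from this thread. Whether a given cipher such as arcfour is accepted
depends on the installed SSH version; "ssh -Q cipher" lists the ones a client
supports.

```python
def build_scp_command(source, destination, compress=False, cipher=None):
    """Return the argument list for an scp call; nothing is executed here."""
    command = ["scp"]
    if compress:
        # -C asks scp (via ssh) to compress the data stream.
        command.append("-C")
    if cipher is not None:
        # -c selects the cipher used to encrypt the transfer.
        command.extend(["-c", cipher])
    command.extend([source, destination])
    return command


# Hypothetical paths, used only to show the resulting command lines.
baseline = build_scp_command("user@remote:/data/input.bin", "/tmp/input.bin")
tuned = build_scp_command(
    "user@remote:/data/input.bin", "/tmp/input.bin",
    compress=True, cipher="arcfour",
)

print(" ".join(baseline))
print(" ".join(tuned))
```

Timing the same set of files with the baseline and the tuned command line is
one way to see how much of the SCP cost comes from encryption and how much
compression helps for this particular data.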

On Fri, Jan 24, 2014 at 10:55 AM, rab ra <ra...@gmail.com> wrote:
> Hello
>
> I have a use case that requires transfer of input files from remote storage
> using SCP protocol (using jSCH jar).  To optimize this use case, I have
> pre-loaded all my input files into HDFS and modified my use case so that it
> copies required files from HDFS. So, when tasktrackers works, it copies
> required number of input files to its local directory from HDFS. All my
> tasktrackers are also datanodes. I could see my use case has run faster. The
> only modification in my application is that file copy from HDFS instead of
> transfer using SCP. Also, my use case involves parallel operations (run in
> tasktrackers) and they do lot of file transfer. Now all these transfers are
> replaced with HDFS copy.
>
> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because, it
> uses TCP/IP? Can anyone give me reasonable reasons to support the decrease
> of time?
>
>
> with thanks and regards
> rab



-- 
Harsh J

Re: HDFS data transfer is faster than SCP based transfer?

Posted by rab ra <ra...@gmail.com>.
It is not a single file but a lot of small files. The files are stored in
HDFS, and the map operations copy the required files from HDFS. One map
process runs on one node only. Each file is about 16 MB.
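
One factor raised elsewhere in this thread is that SCP pulls everything from a
single server while HDFS serves data from many datanodes in parallel. The
following Python sketch is a back-of-envelope model of that effect only. The
16 MB file size comes from this message; the file count, link speed and number
of datanodes are made-up assumptions, and the model ignores encryption, disk
speed, replication and data locality.

```python
def transfer_seconds(num_files, file_mb, link_mb_per_s, num_sources):
    """Idealised time to move all files when the load is spread evenly
    over `num_sources` servers, each with its own link of the given speed."""
    total_mb = num_files * file_mb
    aggregate_mb_per_s = link_mb_per_s * num_sources
    return total_mb / aggregate_mb_per_s


# Assumed numbers for illustration only (not measured in this thread).
NUM_FILES = 1000        # hypothetical number of input files
FILE_MB = 16            # file size mentioned in this message
LINK_MB_PER_S = 100     # hypothetical per-server network throughput
NUM_DATANODES = 10      # hypothetical cluster size

single_server = transfer_seconds(NUM_FILES, FILE_MB, LINK_MB_PER_S, 1)
spread_out = transfer_seconds(NUM_FILES, FILE_MB, LINK_MB_PER_S, NUM_DATANODES)

print("single SCP server:", single_server, "seconds")
print("spread over datanodes:", spread_out, "seconds")
```

In this idealised model the time shrinks in proportion to the number of
servers sharing the load; a real cluster will behave less neatly.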
On 24 Jan 2014 23:49, "Vinod Kumar Vavilapalli" <vi...@hortonworks.com>
wrote:

> Is it a single file? Lots of files? How big are the files? Is the copy on
> a single node or are you running some kind of a MapReduce program?
>
> +Vinod
> Hortonworks Inc.
> http://hortonworks.com/
>
>
> On Fri, Jan 24, 2014 at 7:21 AM, rab ra <ra...@gmail.com> wrote:
>
>> Hi
>>
>> Can anyone please answer my query?
>>
>> -Rab
>> ---------- Forwarded message ----------
>> From: "rab ra" <ra...@gmail.com>
>> Date: 24 Jan 2014 10:55
>> Subject: HDFS data transfer is faster than SCP based transfer?
>> To: <us...@hadoop.apache.org>
>>
>> Hello
>>
>> I have a use case that requires transfer of input files from remote
>> storage using SCP protocol (using jSCH jar).  To optimize this use case, I
>> have pre-loaded all my input files into HDFS and modified my use case so
>> that it copies required files from HDFS. So, when tasktrackers works, it
>> copies required number of input files to its local directory from HDFS. All
>> my tasktrackers are also datanodes. I could see my use case has run faster.
>> The only modification in my application is that file copy from HDFS instead
>> of transfer using SCP. Also, my use case involves parallel operations (run
>> in tasktrackers) and they do lot of file transfer. Now all these transfers
>> are replaced with HDFS copy.
>>
>> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because,
>> it uses TCP/IP? Can anyone give me reasonable reasons to support the
>> decrease of time?
>>
>>
>> with thanks and regards
>> rab
>>
>
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.

Re: HDFS data transfer is faster than SCP based transfer?

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.
Is it a single file? Lots of files? How big are the files? Is the copy on a
single node or are you running some kind of a MapReduce program?

+Vinod
Hortonworks Inc.
http://hortonworks.com/


On Fri, Jan 24, 2014 at 7:21 AM, rab ra <ra...@gmail.com> wrote:

> Hi
>
> Can anyone please answer my query?
>
> -Rab
> ---------- Forwarded message ----------
> From: "rab ra" <ra...@gmail.com>
> Date: 24 Jan 2014 10:55
> Subject: HDFS data transfer is faster than SCP based transfer?
> To: <us...@hadoop.apache.org>
>
> Hello
>
> I have a use case that requires transfer of input files from remote
> storage using SCP protocol (using jSCH jar).  To optimize this use case, I
> have pre-loaded all my input files into HDFS and modified my use case so
> that it copies required files from HDFS. So, when tasktrackers works, it
> copies required number of input files to its local directory from HDFS. All
> my tasktrackers are also datanodes. I could see my use case has run faster.
> The only modification in my application is that file copy from HDFS instead
> of transfer using SCP. Also, my use case involves parallel operations (run
> in tasktrackers) and they do lot of file transfer. Now all these transfers
> are replaced with HDFS copy.
>
> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because,
> it uses TCP/IP? Can anyone give me reasonable reasons to support the
> decrease of time?
>
>
> with thanks and regards
> rab
>


Fwd: HDFS data transfer is faster than SCP based transfer?

Posted by rab ra <ra...@gmail.com>.
Hi

Can anyone please answer my query?

-Rab
---------- Forwarded message ----------
From: "rab ra" <ra...@gmail.com>
Date: 24 Jan 2014 10:55
Subject: HDFS data transfer is faster than SCP based transfer?
To: <us...@hadoop.apache.org>

Hello

I have a use case that requires transfer of input files from remote storage
using SCP protocol (using jSCH jar).  To optimize this use case, I have
pre-loaded all my input files into HDFS and modified my use case so that it
copies required files from HDFS. So, when tasktrackers works, it copies
required number of input files to its local directory from HDFS. All my
tasktrackers are also datanodes. I could see my use case has run faster.
The only modification in my application is that file copy from HDFS instead
of transfer using SCP. Also, my use case involves parallel operations (run
in tasktrackers) and they do lot of file transfer. Now all these transfers
are replaced with HDFS copy.

Can anyone tell me HDFS transfer is faster as I witnessed? Is it because,
it uses TCP/IP? Can anyone give me reasonable reasons to support the
decrease of time?


with thanks and regards
rab

Re: HDFS data transfer is faster than SCP based transfer?

Posted by Shekhar Sharma <sh...@gmail.com>.
When you put or write data into HDFS, the client buffers 64 KB of data at a
time and pushes each packet through the DataNode pipeline. This continues
until 64 MB has been written, which is the block size defined by the client.

SCP, on the other hand, will try to buffer the entire data. Passing the data
in chunks would be faster than passing it as one large piece.

Please check how writes happen in HDFS. That will give you a clear picture.
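
To put numbers on the packet and block sizes quoted above, here is a small
Python sketch of the arithmetic. It assumes exactly the 64 KB packet and 64 MB
block sizes from this message (both are configurable in a real cluster) and
uses the 16 MB file size mentioned elsewhere in the thread. The function names
are hypothetical and this is not HDFS client code.

```python
PACKET_SIZE = 64 * 1024            # 64 KB per packet, as stated in this message
BLOCK_SIZE = 64 * 1024 * 1024      # 64 MB per block, as stated in this message


def packets_for(num_bytes, packet_size=PACKET_SIZE):
    """Number of packets needed to carry num_bytes (ceiling division)."""
    return -(-num_bytes // packet_size)


def blocks_for(num_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to hold num_bytes (ceiling division)."""
    return -(-num_bytes // block_size)


file_size = 16 * 1024 * 1024       # a 16 MB input file, as in this thread

print("packets per full block:", packets_for(BLOCK_SIZE))
print("packets for a 16 MB file:", packets_for(file_size))
print("blocks for a 16 MB file:", blocks_for(file_size))
```

Under these assumptions a 16 MB file fits in a single block and moves as a
few hundred small packets, so the pipeline starts forwarding data long before
the whole file has been read.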
On 24 Jan 2014 10:56, "rab ra" <ra...@gmail.com> wrote:

> Hello
>
> I have a use case that requires transfer of input files from remote
> storage using SCP protocol (using jSCH jar).  To optimize this use case, I
> have pre-loaded all my input files into HDFS and modified my use case so
> that it copies required files from HDFS. So, when tasktrackers works, it
> copies required number of input files to its local directory from HDFS. All
> my tasktrackers are also datanodes. I could see my use case has run faster.
> The only modification in my application is that file copy from HDFS instead
> of transfer using SCP. Also, my use case involves parallel operations (run
> in tasktrackers) and they do lot of file transfer. Now all these transfers
> are replaced with HDFS copy.
>
> Can anyone tell me HDFS transfer is faster as I witnessed? Is it because,
> it uses TCP/IP? Can anyone give me reasonable reasons to support the
> decrease of time?
>
>
> with thanks and regards
> rab
>

Fwd: HDFS data transfer is faster than SCP based transfer?

Posted by rab ra <ra...@gmail.com>.
Hi

Can anyone please answer my query?

-Rab
---------- Forwarded message ----------
From: "rab ra" <ra...@gmail.com>
Date: 24 Jan 2014 10:55
Subject: HDFS data transfer is faster than SCP based transfer?
To: <us...@hadoop.apache.org>

Hello

I have a use case that requires transferring input files from remote storage
using the SCP protocol (using the jSCH jar). To optimize this use case, I have
pre-loaded all my input files into HDFS and modified the application so that
it copies the required files from HDFS. So, when a tasktracker runs, it copies
the required number of input files from HDFS to its local directory. All my
tasktrackers are also datanodes. I could see that my use case ran faster.
The only modification in my application is that files are copied from HDFS
instead of being transferred using SCP. Also, my use case involves parallel
operations (run in tasktrackers), and they do a lot of file transfer. All
these transfers are now replaced with HDFS copies.

Can anyone tell me whether HDFS transfer is really faster, as I observed? Is
it because it uses TCP/IP? Can anyone give me plausible reasons for the
decrease in time?


with thanks and regards
rab
