Posted to mapreduce-user@hadoop.apache.org by Jay Vyas <ja...@gmail.com> on 2014/01/29 02:45:06 UTC

performance of "hadoop fs -put"

I'm finding that "hadoop fs -put" on a cluster is quite slow for me when I
have a large number of small files... much slower than native file ops.
Note that I'm using RawLocalFileSystem as the underlying backing
filesystem that is being written to in this case, so HDFS isn't the issue.

I see that the Put class creates a linked list whose length is the number
of elements in the path.

1) Is there a more performant way to run "fs -put"?

2) Has anyone else noted that "fs -put" has extra overhead?

I'm going to trace some more, but I just wanted to bounce this off the
mailing list... maybe others have also run into this issue.

** Is "hadoop fs -put" inherently slower than a Unix "cp" action, regardless
of filesystem -- and if so, why? **


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: performance of "hadoop fs -put"

Posted by Jay Vyas <ja...@gmail.com>.

No, I'm using a glob pattern; it's all done in one "put" statement.


On Tue, Jan 28, 2014 at 9:22 PM, Harsh J <ha...@cloudera.com> wrote:

> Are you calling one command per file? That's bound to be slow as it
> invokes a new JVM each time.
> On Jan 29, 2014 7:15 AM, "Jay Vyas" <ja...@gmail.com> wrote:
>
>> Im finding that "hadoop fs -put" on a cluster is quite slow for me when i
>> have large amounts of small files... much slower than native file ops.
>> Note that Im using the RawLocalFileSystem as the underlying backing
>> filesystem that is being written to in this case, so HDFS isnt the issue.
>>
>> I see that the Put class creates a linkedlist of # number of elements in
>> the path.
>>
>> 1) Is there a more performant way to run "fs -put"
>>
>> 2) Has anyone else noted that "fs -put" has extra overhead?
>>
>> Im going to trace some more but , just wanted to bounce this off the
>> mailing list... maybe others also have run into this issue.
>>
>> ** Is "hadoop fs -put" inherently slower than a unix "cp"action,
>> regardless of filesystem -- and if so , why? **
>>
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: performance of "hadoop fs -put"

Posted by Harsh J <ha...@cloudera.com>.

Are you calling one command per file? That's bound to be slow as it invokes
a new JVM each time.
On Jan 29, 2014 7:15 AM, "Jay Vyas" <ja...@gmail.com> wrote:

> Im finding that "hadoop fs -put" on a cluster is quite slow for me when i
> have large amounts of small files... much slower than native file ops.
> Note that Im using the RawLocalFileSystem as the underlying backing
> filesystem that is being written to in this case, so HDFS isnt the issue.
>
> I see that the Put class creates a linkedlist of # number of elements in
> the path.
>
> 1) Is there a more performant way to run "fs -put"
>
> 2) Has anyone else noted that "fs -put" has extra overhead?
>
> Im going to trace some more but , just wanted to bounce this off the
> mailing list... maybe others also have run into this issue.
>
> ** Is "hadoop fs -put" inherently slower than a unix "cp"action,
> regardless of filesystem -- and if so , why? **
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>
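
For readers who want to see the per-invocation cost Harsh describes, here is a
minimal sketch. It does not use Hadoop at all: a fresh Python interpreter per
file stands in for a fresh JVM per "hadoop fs -put" call, and an in-process
loop stands in for a single invocation that handles every file. All function
names and the file count are hypothetical, chosen only for illustration, and
the absolute timings will differ from JVM startup costs.

```python
import shutil
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def make_small_files(directory: Path, count: int) -> None:
    """Create `count` tiny files to stand in for the small-file workload."""
    for i in range(count):
        (directory / f"part-{i:04d}.txt").write_text("x")


def copy_one_process_per_file(src: Path, dst: Path) -> float:
    """Spawn a new interpreter for every file (analogy: one command per file)."""
    start = time.perf_counter()
    for f in sorted(src.iterdir()):
        subprocess.run(
            [sys.executable, "-c",
             "import shutil, sys; shutil.copy(sys.argv[1], sys.argv[2])",
             str(f), str(dst / f.name)],
            check=True,
        )
    return time.perf_counter() - start


def copy_in_one_process(src: Path, dst: Path) -> float:
    """Copy every file inside the current process (analogy: one command total)."""
    start = time.perf_counter()
    for f in sorted(src.iterdir()):
        shutil.copy(f, dst / f.name)
    return time.perf_counter() - start


def compare(count: int = 20) -> dict:
    """Run both strategies on the same files and report timings and counts."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src, dst_a, dst_b = root / "src", root / "a", root / "b"
        for d in (src, dst_a, dst_b):
            d.mkdir()
        make_small_files(src, count)
        per_file = copy_one_process_per_file(src, dst_a)
        single = copy_in_one_process(src, dst_b)
        return {
            "per_file_seconds": per_file,
            "single_seconds": single,
            "copied_per_file": len(list(dst_a.iterdir())),
            "copied_single": len(list(dst_b.iterdir())),
        }


if __name__ == "__main__":
    print(compare())
```

Both strategies copy the same set of files; only the number of process
launches differs. Note that Jay reports using a single glob-based "put", so
this startup cost would not by itself explain the slowdown in his case.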
