Posted to common-user@hadoop.apache.org by Sunil S Nandihalli <su...@gmail.com> on 2012/04/24 15:42:47 UTC

hadoop streaming and a directory containing large number of .tgz files

Hi Everybody,
 I am a newbie to Hadoop. I have about 40K .tgz files, each approximately
3MB in size. I would like to process them as if they were a single large
file formed by
"cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
How can I achieve this using Hadoop Streaming or some other similar
library?
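
In other words ("gnuparallel" here is GNU parallel's parallel command),
roughly this serial loop:

    # For each archive in the list: extract every member to stdout,
    # drop the first line of the extracted stream, and append the
    # result to one large output file.
    while read -r f; do
        tar -Oxvf "$f" | sed 1d
    done < list-of-files > output.txt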


thanks,
Sunil.

RE: hadoop streaming and a directory containing large number of .tgz files

Posted by Devaraj k <de...@huawei.com>.
Hi Sunil,

    Please check HarFileSystem (the Hadoop Archive filesystem); it should be useful for solving your problem.
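
For example, something along these lines (the HDFS paths and the archive
name below are only placeholders):

    # Pack the small .tgz files (here assumed to live under the placeholder
    # path /user/sunil/tgz-input) into one Hadoop Archive, so 40K tiny
    # files no longer exist as individual HDFS entries.
    hadoop archive -archiveName tgz-files.har -p /user/sunil tgz-input \
        /user/sunil/archives

    # The archived files can then be listed and read through the har://
    # filesystem, e.g. as the input path of a MapReduce/streaming job.
    hadoop fs -ls har:///user/sunil/archives/tgz-files.har

Note that HAR only packs the files; the mapper still has to untar each
.tgz it reads.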

Thanks
Devaraj
________________________________________
From: Sunil S Nandihalli [sunil.nandihalli@gmail.com]
Sent: Tuesday, April 24, 2012 7:12 PM
To: common-user@hadoop.apache.org
Subject: hadoop streaming and a directory containing large number of .tgz files

Hi Everybody,
 I am a newbie to Hadoop. I have about 40K .tgz files, each approximately
3MB in size. I would like to process them as if they were a single large
file formed by
"cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
How can I achieve this using Hadoop Streaming or some other similar
library?


thanks,
Sunil.

Re: hadoop streaming and a directory containing large number of .tgz files

Posted by Raj Vishwanathan <ra...@yahoo.com>.
Sunil

You could use identity mappers, a single identity reducer, and no output compression.
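
In streaming terms that could look roughly like this (Hadoop 1.x property
names; the paths, the path-list input file, and the untar-and-strip.sh
mapper script are just examples):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=1 \
        -D mapred.output.compress=false \
        -input /user/sunil/tgz-path-list.txt \
        -output /user/sunil/merged \
        -mapper untar-and-strip.sh \
        -reducer /bin/cat \
        -file untar-and-strip.sh

where untar-and-strip.sh (example name) reads one HDFS path per input
record and mirrors your original pipeline:

    #!/bin/bash
    # Example mapper: pull each archive out of HDFS, extract every member
    # to stdout, and drop its first line, like "tar -Oxvf {} | sed 1d".
    while read -r path; do
        hadoop fs -cat "$path" | tar -xzOf - | sed 1d
    done

The single cat reducer plus mapred.output.compress=false gives you one
uncompressed output file. Keep in mind the sort/shuffle step will reorder
lines, so this only works if line order across archives does not matter.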

Raj



>________________________________
> From: Sunil S Nandihalli <su...@gmail.com>
>To: common-user@hadoop.apache.org 
>Sent: Tuesday, April 24, 2012 7:01 AM
>Subject: Re: hadoop streaming and a directory containing large number of .tgz files
> 
>Sorry for forwarding this email again. I was not sure whether it actually went
>through, since I only just received the confirmation of my membership to the
>mailing list.
>Thanks,
>Sunil.
>
>On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli <
>sunil.nandihalli@gmail.com> wrote:
>
>> Hi Everybody,
>>  I am a newbie to Hadoop. I have about 40K .tgz files, each
>> approximately 3MB in size. I would like to process them as if they were a
>> single large file formed by
>> "cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
>> How can I achieve this using Hadoop Streaming or some other similar
>> library?
>>
>>
>> thanks,
>> Sunil.
>>
>
>
>

Re: hadoop streaming and a directory containing large number of .tgz files

Posted by Sunil S Nandihalli <su...@gmail.com>.
Sorry for forwarding this email again. I was not sure whether it actually went
through, since I only just received the confirmation of my membership to the
mailing list.
Thanks,
Sunil.

On Tue, Apr 24, 2012 at 7:12 PM, Sunil S Nandihalli <
sunil.nandihalli@gmail.com> wrote:

> Hi Everybody,
>  I am a newbie to Hadoop. I have about 40K .tgz files, each
> approximately 3MB in size. I would like to process them as if they were a
> single large file formed by
> "cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
> How can I achieve this using Hadoop Streaming or some other similar
> library?
>
>
> thanks,
> Sunil.
>