Posted to common-user@hadoop.apache.org by Sunil S Nandihalli <su...@gmail.com> on 2012/04/24 15:42:47 UTC
hadoop streaming and a directory containing large number of .tgz files
Hi Everybody,
I am new to Hadoop. I have about 40K .tgz files, each approximately 3 MB.
I would like to process them as if they were a single large file formed by
"cat list-of-files | gnuparallel 'tar -Oxvf {} | sed 1d' > output.txt"
How can I achieve this using hadoop-streaming or some other similar
library?
thanks,
Sunil.
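
For reference, here is a minimal Python sketch of what the shell pipeline above appears to compute for local files: each archive's members are written out in order, and the first line of each archive's combined output is dropped (the counterpart of `sed 1d`). The function names and the idea of a listing file with one archive path per line are assumptions for illustration, and archives whose members lack a trailing newline may behave slightly differently than under `sed`.

```python
import sys
import tarfile


def cat_archive(path, out):
    """Write every regular file in the gzip'd tar at `path` to `out`,
    skipping the first line of the archive's combined output."""
    skipped_first = False
    with tarfile.open(path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # directories, links, etc. carry no content
            for line in tar.extractfile(member):
                if not skipped_first:
                    skipped_first = True  # counterpart of `sed 1d`
                    continue
                out.write(line)


def main(list_path, out):
    # `list_path` is a hypothetical text file with one archive path per line.
    with open(list_path) as listing:
        for entry in listing:
            entry = entry.strip()
            if entry:
                cat_archive(entry, out)


# Runs only when a listing file is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1], sys.stdout.buffer)
```

This runs serially on one machine, so it only pins down the expected output; it does not replace the parallelism that gnuparallel or Hadoop would provide.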
RE: hadoop streaming and a directory containing large number of .tgz files
Posted by Devaraj k <de...@huawei.com>.
Hi Sunil,
Please take a look at HarFileSystem (the Hadoop Archive filesystem); it should help solve your problem.
Thanks
Devaraj
Re: hadoop streaming and a directory containing large number of .tgz files
Posted by Raj Vishwanathan <ra...@yahoo.com>.
Sunil
You could use identity mappers and a single identity reducer, with output compression turned off.
Raj
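
One possible way to wire this into Hadoop Streaming is sketched below. Note that it is a variation on the suggestion above, not the same thing: instead of an identity mapper, the mapper itself unpacks the archives. It assumes the job input is a text file of archive paths (one per line), that each line reaching the mapper is such a path, and that the `hadoop` command is available on every task node. The names `open_hdfs`, `open_local`, and `run_mapper` are hypothetical.

```python
import contextlib
import subprocess
import sys
import tarfile


def open_local(path):
    # Local files, handy for trying the mapper outside a cluster.
    return open(path, "rb")


@contextlib.contextmanager
def open_hdfs(path):
    # Assumption: `hadoop fs -cat` can be run from inside the task.
    proc = subprocess.Popen(["hadoop", "fs", "-cat", path],
                            stdout=subprocess.PIPE)
    try:
        yield proc.stdout
    finally:
        proc.stdout.close()
        proc.wait()


def run_mapper(lines, out, opener=open_local):
    """For each archive path in `lines`, emit its members' contents
    minus the first line of the archive's combined output."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        # "r|gz" reads the tar as a forward-only stream, so a pipe works.
        with opener(path) as raw, \
                tarfile.open(fileobj=raw, mode="r|gz") as tar:
            skipped_first = False
            for member in tar:
                if not member.isfile():
                    continue
                for row in tar.extractfile(member):
                    if not skipped_first:
                        skipped_first = True  # mirrors `sed 1d`
                        continue
                    out.write(row)


# Example: use `untar_mapper.py hdfs` as the streaming mapper command
# (the script name is hypothetical).
if __name__ == "__main__" and len(sys.argv) > 1:
    opener = open_hdfs if sys.argv[1] == "hdfs" else open_local
    run_mapper(sys.stdin, sys.stdout.buffer, opener)
```

With a single reducer the results land in one output file, but the lines pass through the shuffle and sort on the way, so their order may differ from that of the local pipeline.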
Re: hadoop streaming and a directory containing large number of .tgz files
Posted by Sunil S Nandihalli <su...@gmail.com>.
Sorry for resending this email. I was not sure whether it actually got
through, since I had only just received confirmation of my membership of
the mailing list.
Thanks,
Sunil.