Posted to common-user@hadoop.apache.org by Roshan James <ro...@gmail.com> on 2009/06/23 19:41:45 UTC

Doing MapReduce over Har files

When I run a MapReduce job over a har file as the input, I see that the
input splits refer to 64 MB byte boundaries inside the part file.

My mappers only know how to process the contents of each logical file inside
the har file. Is there some way I can take the offset range specified by the
input split and determine which logical files lie in that offset range? (How
else would one do MapReduce over a har file?)
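
For illustration, a minimal mapper sketch (old mapred API; the class name is a
placeholder and the surrounding job wiring is assumed) that just reports the
split each task is handed, which is how the 64 MB ranges inside the part file
show up:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SplitReportingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // When the raw archive directory (not har:///) is the input, the
        // split is a byte range of a part file, not a logical file.
        FileSplit split = (FileSplit) reporter.getInputSplit();
        out.collect(new Text(split.getPath().toString()),
                    new Text(split.getStart() + "+" + split.getLength()));
      }
    }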

Roshan

Re: Doing MapReduce over Har files

Posted by Mahadev Konar <ma...@yahoo-inc.com>.
Hi Roshan and Julian,
  The har file system can be used as an input filesystem. You can just
provide the input to MapReduce as har:///something/some.har, where
some.har is your har archive. This way MapReduce will use the har filesystem
as its input. The only limitation is that maps cannot run across logical
files in har.
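
For example, a minimal driver sketch (old mapred API; the paths, job name,
and identity mapper/reducer below are only placeholders, not anything
specific to your job):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class HarInputDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HarInputDriver.class);
        conf.setJobName("har-input-example");

        // The archive itself is the input path; the har:// scheme exposes
        // the logical files inside the archive to the input format.
        FileInputFormat.setInputPaths(conf, new Path("har:///user/roshan/some.har"));
        conf.setInputFormat(TextInputFormat.class);

        // Placeholders: use whatever mapper/reducer your job actually needs.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(conf, new Path("/user/roshan/har-output"));

        JobClient.runJob(conf);
      }
    }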

You can specify whatever input format these files had before you
included them in the har archive. The point is that har:/// can be used as
an input filesystem for MapReduce, which gives MapReduce a view of the
logical files inside the har.
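
The same har:/// URI also works with the plain FileSystem API, so you can
inspect that view of the logical files directly; a quick sketch (the path is
a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListHarContents {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path: point this at your own archive.
        Path har = new Path("har:///user/roshan/some.har");
        FileSystem fs = har.getFileSystem(conf);

        // Lists the logical files/directories stored in the archive,
        // not the underlying index and part files.
        for (FileStatus status : fs.listStatus(har)) {
          System.out.println(status.getPath() + "\t" + status.getLen());
        }
      }
    }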

Hope this helps.
mahadev


Re: Doing MapReduce over Har files

Posted by jchernandez <jc...@agnitio.es>.
I also need help with this. I need to know how to handle a HAR file when it
is the input to a MapReduce task. How do we read the HAR file so we can work
on the individual logical files? I suppose we need to create our own
InputFormat and RecordReader classes, but I'm not sure how to proceed.

Julian 


