Posted to mapreduce-dev@hadoop.apache.org by "Cardon, Tejay E" <te...@lmco.com> on 2012/12/12 19:02:58 UTC

One output file per node

First, I hope I'm posting this to the right list.  I wasn't sure if developer questions belonged here or on the user list.
Second, thanks for your thoughts.

So I have a situation in which I'm building an index across many files.  I don't want to send ALL the data across the wire by using reducers, so I'd like to use a map-only job.  However, I don't want one file per mapper; I'd like to consolidate them to only one file per node.  Effectively, I'd like to have the output of the combiner go to a file, but I know I can't trust the combiner to always run on all outputs from the map.

Is this possible?  Perhaps some crafty partitioner that somehow sends all records to a reducer on the local node?? (I don't see this working)

Thanks,
Tejay



Follow me on Eureka<https://eureka.isgs.lmco.com/#people/cardonte> and Brainstorm<http://brainstorm.isgs.lmco.com/Person.aspx?id=1200>




Re: One output file per node

Posted by Radim Kolar <hs...@filez.com>.
You need a custom OutputCommitter.
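Something like the skeleton below shows where that logic would hook
in. It is only a rough, untested sketch: the class name and the
per-host merge idea are illustrative, and the actual merge is omitted.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

// Hypothetical committer: after the default per-task commit, merge
// this task's files into a single per-node output file.
public class PerNodeOutputCommitter extends FileOutputCommitter {

  public PerNodeOutputCommitter(Path outputPath,
      TaskAttemptContext context) throws IOException {
    super(outputPath, context);
  }

  @Override
  public void commitTask(TaskAttemptContext context) throws IOException {
    // Let the default committer move the task output into place first.
    super.commitTask(context);
    // A real implementation would now append the committed files to a
    // file keyed on the local hostname, e.g.:
    //   String host =
    //       java.net.InetAddress.getLocalHost().getHostName();
    // The append/rename logic is omitted here.
  }
}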

Re: One output file per node

Posted by Radim Kolar <hs...@filez.com>.
If you have strong data locality demands, then try
http://peregrine_mapreduce.bitbucket.org/  It's 2x faster than Hadoop
for multipass job types. It also has very fast node recovery. I plan
to do this for HDFS; the concept is similar to "virtual nodes".

It's not Hadoop- or HDFS-compatible, and it has no ecosystem. I am not
sure if it is still under development; there have been no commits in
the last few months.

https://bitbucket.org/burtonator/peregrine/commits

Re: One output file per node

Posted by Robert Evans <ev...@yahoo-inc.com>.
Tejay,

The way the scheduler works, you are not guaranteed to get one reducer
per node.  Reducers are not scheduled based on locality of any kind,
and even if they were, the scheduler typically treats rack-local the
same as node-local.  The partitioner interface only allows you to say
which numeric partition an entry should go to, nothing else. There is
no way to map that numeric partition to a particular machine.  You
could try to play games, but they would be very difficult to get
right, especially for the corner cases where a task can fail and may
be rerun.  If your partitioner is not exactly deterministic, you could
lose some data and double-count other data in the case of a failure.
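For reference, this is the entire contract Map/Reduce gives you:

// From org.apache.hadoop.mapreduce.Partitioner: you return a
// partition number, and the framework alone decides which reducer
// (and therefore which machine) that number ends up on.
public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value,
      int numPartitions);
}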

Why don't you want to send all of the data over the wire?  When you write
it out to HDFS it will all be sent over the wire. How do you plan on using
these indexes after they are generated?  Do you plan to read from all of
the indexes in parallel to search for a single entry or do you want to
merge them together again before actually using them?

You could do your original proposal by simulating the combiner within
the map itself.  If your data is small enough, you could aggregate the
data within the mapper and then only output the aggregate once all the
entries have been processed.  If it is too big to fit into memory, you
could look at having a disk-backed data structure with in-memory
caching, or even simulate Map/Reduce itself: write all of the data out
to a local file, sort the data, and read it back in already
partitioned.
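The first option looks roughly like the sketch below (an untested,
word-count-style illustration, not your actual index format):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: aggregate in memory per map task and emit the
// totals exactly once, in cleanup(), after all input records for
// this task have been processed.
public class AggregatingMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private final Map<String, Long> counts = new HashMap<String, Long>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    for (String term : value.toString().split("\\s+")) {
      Long old = counts.get(term);
      counts.put(term, old == null ? 1L : old + 1L);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      context.write(new Text(e.getKey()),
          new LongWritable(e.getValue()));
    }
  }
}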

--Bobby


Re: One output file per node

Posted by Aloke Ghoshal <al...@gmail.com>.
Hi Tejay,

Building a consolidated index file for all your source files (for
terms within the source files) may not be doable this way. On the
other hand, building one index file per node is doable if you run a
Reducer per node and use a Partitioner.

- Run one Reducer per node
- Let the Mapper output carry *NodeHostName:Term* as the key
- Use a Partitioner based on the NodeHostName portion of the key
(KeyFieldBasedPartitioner) & a GroupingComparator based on the Term
portion (a rough sketch of the partitioner follows below)
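Illustrative only (class name mine, untested); the comparator half is
analogous, comparing just the Term portion. Note that the framework,
not the partitioner, decides which physical node runs each reducer.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the NodeHostName prefix of a "NodeHostName:Term" key,
// so all keys tagged with one host land in the same reduce partition.
public class HostNamePartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    String host = key.toString().split(":", 2)[0];
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}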

Regards,
Aloke
