Posted to common-user@hadoop.apache.org by Glenn Gore <Gl...@melbourneit.com.au> on 2010/10/04 05:17:20 UTC

Parallel out-of-order Map capability?

Hi,

Any advice on the best way to achieve the following?

Running MapReduce against data set "A", where every line is looked up against a different data set housed in system "B". The size of these datasets makes it practically infeasible to store and access them from every tasknode. Data set "B" is accessible via a network client model, scales very well, and is capable of supporting lookups from the full hadoop cluster under load (the joy of in-memory hash lookups).

To get the highest throughput, an async architecture is used: for every line, the Map fires off the relevant data to the network servers, saves the context, and moves to the next line. When data is ready to be received, the network client invokes a callback and the Map output for that line is generated.

The issue with the above is that the stream of data between client and server often arrives out of order and in batches. The Map function then exhibits many strange behaviours, from "Spill errors" to freezing and locking up.

Setting the network client to sequential mode works fine, though throughput drops from ~60k transactions per second down to ~300 tps (on the server side) due to the round-trip latency incurred for every line.

So - is there an easy way to modify the MapReduce job to allow this behaviour, where I can read in lines, queue them for processing, and write them out as a separate step?
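(A minimal sketch of that read/queue/write separation in plain Java, outside of Hadoop. The lookup() method is a hypothetical stand-in for the network client, and none of the names below come from the thread. The key point is that results are drained back on the submitting thread, in input order, which matters because Hadoop's output collector is not thread-safe.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipelinedLookup {
    // Hypothetical stand-in for the remote in-memory hash lookup.
    static String lookup(String line) {
        return line.toUpperCase();
    }

    // Queue each line to a fixed pool of synchronous workers, then
    // drain the futures in submission order as a separate write step.
    static List<String> process(List<String> lines, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<String>> futures = new ArrayList<>();
        for (String line : lines) {
            futures.add(pool.submit(() -> lookup(line)));  // queue step
        }
        List<String> out = new ArrayList<>();
        try {
            for (Future<String> f : futures) {
                out.add(f.get());                          // write step, in order
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return out;
    }
}
```

(Inside a real Mapper, one would submit from map() and drain either when the queue reaches a bound or in the task's cleanup/close hook, emitting only from the map thread.)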

Regards

Glenn Gore

Re: Parallel out-of-order Map capability?

Posted by Steve Kuo <ku...@gmail.com>.
How about batching up calls from n lines, making a synchronous call to the server
with the batch, getting the batch results, and going through the result set one by
one?  This assumes the server can return batched calls in order.
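(A sketch of that batching pattern in plain Java. lookupBatch() is a hypothetical placeholder for the real network client's batched call; the batch size n amortizes the round-trip latency across n lines.)

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedLookup {
    // Hypothetical synchronous batched call: the server answers a
    // whole batch in one round trip, preserving input order.
    static List<String> lookupBatch(List<String> batch) {
        List<String> out = new ArrayList<>();
        for (String s : batch) {
            out.add(s.toUpperCase());
        }
        return out;
    }

    // Accumulate n lines, flush each full batch synchronously,
    // then flush whatever remains at the end of the input.
    static List<String> process(List<String> lines, int n) {
        List<String> results = new ArrayList<>();
        List<String> batch = new ArrayList<>();
        for (String line : lines) {
            batch.add(line);
            if (batch.size() == n) {
                results.addAll(lookupBatch(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            results.addAll(lookupBatch(batch));  // partial final batch
        }
        return results;
    }
}
```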

Re: Parallel out-of-order Map capability?

Posted by Antonio Piccolboni <an...@piccolboni.info>.
While I don't understand all the aspects of your situation and problems, I
will attempt one suggestion. You are talking about not being able to
replicate data at each node. What if you just put both datasets in hdfs? With
a replication factor of 3 and 30 nodes, on average 1/10 of each
data set will be stored at each node. Since B is at the moment served from
memory, we are talking tens of GBs at most, not something that could put
fear into any hadoop cluster worth the name. If the lookup you are talking
about is expressible as an equijoin between the datasets, such as f(A.row) =
g(B.row) where f and g are functions with fast enough implementations, then
the standard way to approach this is to do just that: copy your B dataset
into hdfs and do a join with A.

It's impossible for me to diagnose anything from your description (pasting
an extract of the error log would be the starting point), but not knowing
how out of order things can be, that is, how big that "context" can grow,
certainly makes me uncomfortable. If you could disable or limit the
out-of-order aspect of how the server responds, keep the batch aspect, and
set the batch size to something small to start with, you might get a
workable compromise. I hope this helps
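(The equijoin suggestion boils down to building a hash table over B and probing it with each row of A, which is the shape a map-side join takes once B is in hdfs. A minimal sketch in plain Java; the key extractors f and g are hypothetical, here just taking the first comma-separated field.)

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EquiJoin {
    // Hypothetical key extractors f and g from the equijoin
    // condition f(A.row) = g(B.row); illustrative only.
    static String f(String aRow) { return aRow.split(",")[0]; }
    static String g(String bRow) { return bRow.split(",")[0]; }

    // Build an in-memory index over B, then probe it with each
    // row of A, emitting the joined rows that match.
    static List<String> join(List<String> a, List<String> b) {
        Map<String, String> index = new HashMap<>();
        for (String bRow : b) {
            index.put(g(bRow), bRow);
        }
        List<String> out = new ArrayList<>();
        for (String aRow : a) {
            String match = index.get(f(aRow));
            if (match != null) {
                out.add(aRow + "|" + match);
            }
        }
        return out;
    }
}
```

(In Hadoop itself this is the standard join pattern: a reduce-side join keyed on f/g, or a map-side join when one side's partition fits in task memory.)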


Antonio

