You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Jim Twensky <ji...@gmail.com> on 2009/01/02 07:45:24 UTC

Re: Shared thread safe variables?

Aaron,

I actually do something different than word count. I count all possible
phrases for every sentence in my corpus. So for instance, if I have a
sentence like "Hello world", my mappers emit:

Hello 1
World 1
Hello World 1

As you can easily realize, for longer sentences the number of intermediate
records grow much more than the original input size.

Anyway, I did what I said last week based on your previous replies and it
worked well. Thank you for the advice.

Jim

On Wed, Dec 31, 2008 at 4:06 AM, Aaron Kimball <aa...@cloudera.com> wrote:

> Hmm. Check your math on the data set size. Your input corpus may be a few
> (dozen, hundred) TB, but how many distinct words are there? The output data
> set should be at least a thousand times smaller. If you've got the hardware
> to do that initial word count step on a few TB of data, the second pass
> will
> not be a major performance concern.
>
> MapReduce is, to borrow from a tired analogy, a lot like driving a freight
> train. The raw speed of any given algorithm on it might not sound
> impressive, but even if its got a much higher constant-factor of time
> associated with it, the ability to provide nearly-flat parallelism as your
> data set grows really large more than makes up for it in the long run.
> - Aaron
>
> On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky <ji...@gmail.com>
> wrote:
>
> > Hello again,
> >
> > I think I found an answer to my question. If I write a new
> > WritableComparable object that extends IntWritable and then overwrite the
> > compareTo method, I can change the sorting order from ascending to
> > descending. That will solve my problem for getting the top 100 most
> > frequent
> > words at each combiner/reducer.
> >
> > Jim
> >
> > On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky <ji...@gmail.com>
> > wrote:
> >
> > > Hi Aaron,
> > >
> > > Thanks for the advice. I actually thought of using multiple combiners
> and
> > a
> > > single reducer but I was worried about the key sorting phase to be a
> > vaste
> > > for my purpose. If the input is just a bunch of (word,count) pairs
> which
> > is
> > > in the order of TeraBytes, wouldn't sorting be an overkill? That's why
> I
> > > thought a single serial program might perform better but I'm not sure
> how
> > > long it would take to sort the keys in such a case so probably it is
> > nothing
> > > beyond speculation and I should go and give it a try to see how well it
> > > performs.
> > >
> > > Secondly, I didn't quite understand how I can take advantage of the
> > sorted
> > > keys if I use an inverting mapper that transforms (k,v) --> (v,k)
> pairs.
> > In
> > > both cases, the combiners and the single reducer will still have to
> > iterate
> > > over all the (v,k) pairs to find the top 100 right? Or is there a way
> to
> > say
> > > something like "Give me the last 100 keys" at each reducer/combiner?
> > >
> > > Thanks in advance,
> > > Jim
> > >
> > >
> > > On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball <aa...@cloudera.com>
> > wrote:
> > >
> > >> (Addendum to my own post -- an identity mapper is probably not what
> you
> > >> want. You'd actually want an inverting mapper that transforms (k, v)
> -->
> > >> (v,
> > >> k), to take advantage of the key-based sorting.)
> > >>
> > >> - Aaron
> > >>
> > >> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <aa...@cloudera.com>
> > >> wrote:
> > >>
> > >> > Hi Jim,
> > >> >
> > >> > The ability to perform locking of shared mutable state is a distinct
> > >> > anti-goal of the MapReduce paradigm. One of the major benefits of
> > >> writing
> > >> > MapReduce programs is knowing that you don't have to worry about
> > >> deadlock in
> > >> > your code. If mappers could lock objects, then the failure and
> restart
> > >> > semantics of individual tasks would be vastly more complicated.
> (What
> > >> > happens if a map task crashes after it obtains a lock? Does it
> > >> automatically
> > >> > release the lock? Does some rollback mechanism undo everything that
> > >> happened
> > >> > after the lock was acquired? How would that work if--by
> > definition--the
> > >> > mapper node is no longer available?)
> > >> >
> > >> > A word frequency histogram function can certainly be written in
> > >> MapReduce
> > >> > without such state. You've got the right intuition, but a serial
> > program
> > >> is
> > >> > not necessarily the best answer. Take the existing word count
> program.
> > >> This
> > >> > converts bags of words into (word, count) pairs. Then pass this
> > through
> > >> a
> > >> > second pass, via an identity mapper to a set of combiners that each
> > emit
> > >> the
> > >> > 100 most frequent words, to a single reducer that emits the 100 most
> > >> > frequent words obtained by the combiners.
> > >> >
> > >> > Many other more complicated problems which seem to require shared
> > state,
> > >> in
> > >> > truth, only require a second (or n+1'th) MapReduce pass. Adding
> > multiple
> > >> > passes is a very valid technique for building more complex
> dataflows.
> > >> >
> > >> > Cheers,
> > >> > - Aaron
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <jim.twensky@gmail.com
> > >> >wrote:
> > >> >
> > >> >> Hello,
> > >> >>
> > >> >> I was wondering if Hadoop provides thread safe shared variables
> that
> > >> can
> > >> >> be
> > >> >> accessed from individual mappers/reducers along with a proper
> locking
> > >> >> mechanism. To clarify things, let's say that in the word count
> > example,
> > >> I
> > >> >> want to know the word that has the highest frequency and how many
> > times
> > >> it
> > >> >> occured. I believe that the latter can be done using the counters
> > that
> > >> >> come
> > >> >> with the Hadoop framework but I don't know how to get the word
> itself
> > >> as a
> > >> >> String. Of course, the problem can be more complicated like the top
> > 100
> > >> >> words or so.
> > >> >>
> > >> >> I thought of writing a serial program which can go over the final
> > >> output
> > >> >> of
> > >> >> the word count but this wouldn't be a good idea if the output file
> > gets
> > >> >> too
> > >> >> large. However, if there is a way to define and use shared
> variables,
> > >> this
> > >> >> would be really easy to do on the fly during the word count's
> reduce
> > >> >> phase.
> > >> >>
> > >> >> Thanks,
> > >> >> Jim
> > >> >>
> > >> >
> > >> >
> > >>
> > >
> > >
> >
>