You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Andreas Paepcke <pa...@gmail.com> on 2011/01/01 18:03:17 UTC

Pointer to Spillable HashMap or to Counter Argument?

Newbie issue.

I find myself wanting a spillable hashmap facility
within my UDFs. Maybe I'm still not thinking hadoopy
enough. But hashmaps are often convenient as
temporary tools when operating over bags that
are passed into a UDF.

Yet if the bag sizes passed into the UDF are not
known to be bounded, heap exhaustion is a danger.
A spillable hashmap sounds like the most intuitive
solution. I've seen this topic popping up on the Web,
but I have not found either an implementation, or a
strong argument against such a facility in principle.

I understand that one can often write algorithms
that simply stream tuples into a UDF, rather than
passing in entire bags. But for efficiency it seems like
bagging can be a good idea.

Any pointers to an implementation or counter indication
argument?

Thanks,

Andreas

Re: Pointer to Spillable HashMap or to Counter Argument?

Posted by Andreas Paepcke <pa...@cs.stanford.edu>.

Thanks, Dmitriy, for your thoughts and pointers.
I'm now re-implementing my tfidf test case using a flow
of tuples, rather than clinging to bags, and I'll construct
the Accumulator interface support.

I'm just a neurotic about communication costs,
and once a bunch of data is in my grasp on a node,
in my address space, on my heap, I want to do as
much as I can in one scan. But, I'll change my ways,
you'll see :-)

Happy New Year to you too, and to the rest of the list,
of course.

Andreas


On Sat, Jan 1, 2011 at 12:04 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Andreas,
>
> A map usually implies random access (if you do not need random access to
> the
> keys, it is likely a different data structure would do). This also implies
> that an on-disk Map would incur extremely high IO cost. Bags are already
> spillable, and even though they are a whole lot more sequential in nature,
> hitting the point where they start spilling usually means increasing job
> run
> time by orders of magnitude (if they finish at all).  It is possible that
> you can get this to work (I see a cs.stanford in the cc list :)), but it
> seems like the effort would be better spent in making your algorithm not
> require having the whole map in memory, or reducing the size of the map. If
> the efficiency you refer to is efficiency of the job, I doubt you will get
> to it by means of a spillable map.
>
> I am not sure why you think passing in an entire bag is more efficient than
> the accumulator interface -- in most cases, using the accumulator
> implementation speeds up the job. There are some measurements at the very
> bottom of this page: http://wiki.apache.org/pig/PigAccumulatorSpec -- and
> I
> don't believe those bags were spilling to disk in either implementation,
> this was just processing time.
>
> -Dmitriy
> P.S. Happy New Year!
>
> On Sat, Jan 1, 2011 at 9:03 AM, Andreas Paepcke <pa...@gmail.com> wrote:
>
> > Newbie issue.
> >
> > I find myself wanting a spillable hashmap facility
> > within my UDFs. Maybe I'm still not thinking hadoopy
> > enough. But hashmaps are often convenient as
> > temporary tools when operating over bags that
> > are passed into a UDF.
> >
> > Yet if the bag sizes passed into the UDF are not
> > known to be bounded, heap exhaustion is a danger.
> > A spillable hashmap sounds like the most intuitive
> > solution. I've seen this topic popping up on the Web,
> > but I have not found either an implementation, or a
> > strong argument against such a facility in principle.
> >
> > I understand that one can often write algorithms
> > that simply stream tuples into a UDF, rather than
> > passing in entire bags. But for efficiency it seems like
> > bagging can be a good idea.
> >
> > Any pointers to an implementation or counter indication
> > argument?
> >
> > Thanks,
> >
> > Andreas
> >
>

Re: Pointer to Spillable HashMap or to Counter Argument?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.

Andreas,

A map usually implies random access (if you do not need random access to the
keys, it is likely a different data structure would do). This also implies
that an on-disk Map would incur extremely high IO cost. Bags are already
spillable, and even though they are a whole lot more sequential in nature,
hitting the point where they start spilling usually means increasing job run
time by orders of magnitude (if they finish at all).  It is possible that
you can get this to work (I see a cs.stanford in the cc list :)), but it
seems like the effort would be better spent in making your algorithm not
require having the whole map in memory, or reducing the size of the map. If
the efficiency you refer to is efficiency of the job, I doubt you will get
to it by means of a spillable map.

I am not sure why you think passing in an entire bag is more efficient than
the accumulator interface -- in most cases, using the accumulator
implementation speeds up the job. There are some measurements at the very
bottom of this page: http://wiki.apache.org/pig/PigAccumulatorSpec -- and I
don't believe those bags were spilling to disk in either implementation,
this was just processing time.

-Dmitriy
P.S. Happy New Year!

On Sat, Jan 1, 2011 at 9:03 AM, Andreas Paepcke <pa...@gmail.com> wrote:

> Newbie issue.
>
> I find myself wanting a spillable hashmap facility
> within my UDFs. Maybe I'm still not thinking hadoopy
> enough. But hashmaps are often convenient as
> temporary tools when operating over bags that
> are passed into a UDF.
>
> Yet if the bag sizes passed into the UDF are not
> known to be bounded, heap exhaustion is a danger.
> A spillable hashmap sounds like the most intuitive
> solution. I've seen this topic popping up on the Web,
> but I have not found either an implementation, or a
> strong argument against such a facility in principle.
>
> I understand that one can often write algorithms
> that simply stream tuples into a UDF, rather than
> passing in entire bags. But for efficiency it seems like
> bagging can be a good idea.
>
> Any pointers to an implementation or counter indication
> argument?
>
> Thanks,
>
> Andreas
>