Posted to common-dev@hadoop.apache.org by Jingkei Ly <ji...@gmail.com> on 2009/03/25 15:23:34 UTC

Limit of 64 slots when doing a map-side join

Am I right in thinking that the CompositeInputFormat is limited to joining
64 files?

I believe this comes about because TupleWritable uses a single long-type
instance field in order to maintain a bitset of tuple slots that have been
written to - I'm guessing this is for performance reasons, but it also
means that a TupleWritable only has 64 bits to play with when joining.
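
To illustrate the limit, here is a minimal sketch of a long used as a
slot bitmask. LongSlotMask and its methods are hypothetical names made up
for this example; this is not the actual TupleWritable source. It relies
only on the Java rule that a shift of a long uses the low 6 bits of the
shift distance, so slot 64 silently aliases slot 0.

```java
// Hypothetical illustration only; not the real TupleWritable code.
final class LongSlotMask {
    private long written; // bit i set => slot i has been written

    void setWritten(int slot) {
        // Java masks the shift distance of a long to its low 6 bits,
        // so (1L << 64) is the same value as (1L << 0).
        written |= 1L << slot;
    }

    boolean has(int slot) {
        return (written & (1L << slot)) != 0;
    }

    public static void main(String[] args) {
        LongSlotMask mask = new LongSlotMask();
        mask.setWritten(64);              // a 65th slot
        System.out.println(mask.has(64)); // prints "true"
        System.out.println(mask.has(0));  // prints "true": slot 0 was never written
    }
}
```

No exception is thrown in this sketch; the 65th slot simply becomes
indistinguishable from the first.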

If my assumptions above are true, could replacing this long with a
java.util.BitSet be appropriate for making the map-side join package
more scalable?
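
A minimal sketch of what the BitSet-based bookkeeping might look like.
BitSetSlotMask and its methods are hypothetical names for illustration,
not a patch against the real class. It only shows that the slot count is
no longer capped at 64; any serialized form that currently depends on a
single long would presumably need to change as well.

```java
import java.util.BitSet;

// Hypothetical sketch; not the real TupleWritable and not a tested patch.
final class BitSetSlotMask {
    private final BitSet written;

    BitSetSlotMask(int slots) {
        // The argument is only an initial size hint; BitSet grows as needed.
        written = new BitSet(slots);
    }

    void setWritten(int slot) {
        written.set(slot);
    }

    boolean has(int slot) {
        return written.get(slot);
    }

    int count() {
        return written.cardinality(); // number of slots written so far
    }

    public static void main(String[] args) {
        BitSetSlotMask mask = new BitSetSlotMask(128);
        mask.setWritten(64);
        System.out.println(mask.has(64)); // prints "true"
        System.out.println(mask.has(0));  // prints "false": no aliasing
    }
}
```

The trade-off is an extra object per tuple and a variable-length bitmap,
which may matter if the single long was indeed chosen for performance.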

Re: Limit of 64 slots when doing a map-side join

Posted by jason hadoop <ja...@gmail.com>.
Be aware that there is a job run time cost for each additional data set in
your join.

On the clusters we were working with (2 GHz Xeon Dell 2950s), each additional
data set in the join operator added roughly 30 seconds to the job run time.

As a result of that, we would merge data sets in groups of 10, as a
background operation. This had the side effect of keeping us out of the > 32
tables == badness problem.
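
The grouping step could be planned along these lines. This is only a
sketch under assumptions: MergeBatches, batch() and the data set names
are hypothetical, and the merge jobs themselves are not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list of data set names into merge groups.
final class MergeBatches {
    static List<List<String>> batch(List<String> dataSets, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < dataSets.size(); i += groupSize) {
            int end = Math.min(i + groupSize, dataSets.size());
            groups.add(new ArrayList<>(dataSets.subList(i, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> dataSets = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            dataSets.add("dataset-" + i); // made-up names
        }
        // 45 inputs merged in groups of 10 leave 5 inputs for the join.
        System.out.println(batch(dataSets, 10).size()); // prints "5"
    }
}
```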

That is part of the reason we never bothered to attempt to fix the table
count in join problem, as our operational path took us away from the issue.


On Wed, Mar 25, 2009 at 10:30 AM, Jingkei Ly <ji...@gmail.com> wrote:

> Yes, we are leaning on the map-side join package quite heavily too - it is
> an excellent addition to the MapReduce model that's proving really useful.
> However, while HADOOP-5571 is an immediate problem for us, I can imagine
> that we will probably be wanting to join over 64 files soon as well,
> especially if we move onto larger clusters.

-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: Limit of 64 slots when doing a map-side join

Posted by Jingkei Ly <ji...@gmail.com>.
Yes, we are leaning on the map-side join package quite heavily too - it is
an excellent addition to the MapReduce model that's proving really useful.
However, while HADOOP-5571 is an immediate problem for us, I can imagine
that we will probably be wanting to join over 64 files soon as well,
especially if we move onto larger clusters.

2009/3/25 jason hadoop <ja...@gmail.com>

> That code is highly optimized and quite difficult to follow. We have always
> limited our joins to 31 members and ignored the problem.
> But I think your jira and fixing it are the correct choices.
>
> There is, in my opinion, a decent write up on how to use map side joins in
> chapter 8 of my book, so I suspect more people will use this soon, as map
> side join is an incredibly powerful tool.
>
> In one of our production applications it took the run time from 5+ hours to
> about 12 minutes.

Re: Limit of 64 slots when doing a map-side join

Posted by jason hadoop <ja...@gmail.com>.
That code is highly optimized and quite difficult to follow. We have always
limited our joins to 31 members and ignored the problem.
But I think your jira and fixing it are the correct choices.

There is, in my opinion, a decent write up on how to use map side joins in
chapter 8 of my book, so I suspect more people will use this soon, as map
side join is an incredibly powerful tool.

In one of our production applications it took the run time from 5+ hours to
about 12 minutes.

On Wed, Mar 25, 2009 at 7:23 AM, Jingkei Ly <ji...@gmail.com> wrote:

> Am I right in thinking that the CompositeInputFormat is limited to joining
> 64 files?
>
> I believe this comes about because TupleWritable uses a single long-type
> instance field in order to maintain a bitset of tuple slots that have been
> written to - I'm guessing this is for performance reasons, but it also
> implies that the TupleWritable only has 64 bits to play with when joining.
>
> If my assumptions above are true, could replacing this long with a
> java.util.BitSet be appropriate in terms of making the map-side join package
> more scalable?
>