Posted to common-user@hadoop.apache.org by jkupferman <jk...@umail.ucsb.edu> on 2008/05/26 06:51:54 UTC

Speed up a job that's been running for 60+ hours (long)

Hi everyone,
I am using Hadoop (0.17) to try to do some large-scale user comparisons, and
although the programs are all written, it's taking incredibly long to run and
it seems like it should be going faster. I would really like some insight as
to what I could do to speed this up aside from just "add more computers". I
would really appreciate some help from all of the sagacious Hadoop
core-users.

The basic idea is that there are a bunch of users, each of which is in some
groups. I would like to know how many users each combination of groups has
in common. I laid out the data using sequence files, which seems to be
working well and quickly; each sequence file entry has a Text user name and
a MapWritable which contains all of the groups they are in. The map
function takes in each user and then outputs all of the combinations of the
groups it is a part of, along with a 1 which is the instance counter (like
in WordCount). So user x, which is a member of groups 1, 2, 3, 4, will output
1-2, 1-3, 1-4, 2-3, 2-4, 3-4 as keys. Given that there are a lot of users, I
added a combiner which reduces the number of records about 10x. The reducer
is really simple: it just sums up the total for each combination and then
outputs it to a file. Just as an aside, I make sure to use IntWritables just
about everywhere, which I hoped would help since there are inevitably tons of
comparisons going on.
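
(To make that concrete, here is roughly the shape of the map step -- a
simplified sketch against the 0.17 "mapred" API, not my exact code. It assumes
the group ids are the keys of the MapWritable and that they are kept in a
canonical order so each pair is emitted one way only.)

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative sketch: emits every unordered pair of groups a user belongs to.
public class GroupPairMapper extends MapReduceBase
    implements Mapper<Text, MapWritable, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text pair = new Text();

  public void map(Text user, MapWritable groups,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    Writable[] ids = groups.keySet().toArray(new Writable[0]);
    for (int i = 0; i < ids.length; i++) {
      for (int j = i + 1; j < ids.length; j++) {
        pair.set(ids[i] + "-" + ids[j]);   // e.g. "1-2" for groups 1 and 2
        output.collect(pair, ONE);         // (n^2 - n) / 2 records per user
      }
    }
  }
}

The combiner and reducer are then just the usual WordCount-style sum over each
pair key.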

This is being done on about 4 GB of user data on a 20 Large-instance cluster
on Amazon's EC2. With that much data, there are about 240 map tasks, and I
have it set to run 10 map tasks per tasktracker. With those settings, the
slaves are running at about 100% CPU and memory is just about at capacity, but
there is almost no paging. Although the tasks seem to be progressing, some of
the tasks that have just completed have run for 30+ hours. Some of the tasks
have failed with a "Lost task tracker:" error, which I intend to fix with
HADOOP-3403_0_20080516.patch whenever this job finishes.
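
(For what it's worth, the only job-side knob here is the map-task hint --
sketched below with a made-up job class name. The 10-maps-per-node limit is
mapred.tasktracker.map.tasks.maximum, which each tasktracker reads from
hadoop-site.xml at startup, so it can't be changed per job.)

JobConf conf = new JobConf(GroupPairJob.class);  // job class name is hypothetical
conf.setNumMapTasks(240);                        // a hint only; the input splits decide the real count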

It seemed to me that the problem might have been calling the collector so
many times, since users can be in thousands of groups and it does about n^2
comparisons. I tried another version which outputs only n times by having
each entry output a map, but this did not prove much better on the test
trials I ran, and the extra work in the reducer is a real killer.
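
(My reconstruction of that variant, to show what I mean -- not the actual
code, and it reuses the imports from the sketch above. Each user emits one
record per group, carrying a MapWritable of its co-occurring groups, and the
reducer then has to merge all of those maps per group, which is the extra
work I mentioned.)

public static class GroupMapEmitMapper extends MapReduceBase
    implements Mapper<Text, MapWritable, IntWritable, MapWritable> {

  public void map(Text user, MapWritable groups,
                  OutputCollector<IntWritable, MapWritable> output, Reporter reporter)
      throws IOException {
    for (Writable g : groups.keySet()) {
      MapWritable others = new MapWritable();
      for (Writable h : groups.keySet()) {
        if (!h.equals(g)) {
          others.put(h, new IntWritable(1));   // this user contributes 1 to pair (g, h)
        }
      }
      output.collect((IntWritable) g, others); // n records per user, assuming IntWritable group ids
    }
  }
}

Presumably the reason it did not help much is that each of those n records
still carries up to n-1 entries, so the total volume being shuffled is about
the same.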

It is not clear to me what is dragging down this job, or what I can do to
increase the rate at which it is computing. Although there is quite a bit of
data, it doesn't seem like it should be taking this long on 20 nodes. Any
help/questions/comments would be greatly appreciated. Thanks for all of your
help.



Re: Speed up a job that's been running for 60+ hours (long)

Posted by Tamer Elsayed <te...@gmail.com>.
Hi,

The number of pairs you are emitting is totally dominated by the most frequent
users (i.e., users who have the longest list of groups). If you can accept
approximate results, then I'd suggest that you drop the *top* 1% (or even
0.1%) of the users based on their frequencies. In a very similar study at the
University of Maryland, with about a million documents (which correspond to
groups in your problem), we managed to get a linear-time approximation of
this problem of quadratic complexity. Here is the link to the study:

http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

In a search-related application, we found experimentally that this trick
(specifically, dropping 0.1%) results in a drop of just 2% in
effectiveness. Of course, this might be different for different
applications.
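
Something like the following is what I mean -- just a sketch of the cut
applied inside your pair-emitting mapper; the threshold value is purely
illustrative and would in practice come from a prior pass over the per-user
group counts:

// Fragment only: a frequency cut in front of the pair emission.
private static final int MAX_GROUPS = 500;   // illustrative; use the 99th/99.9th percentile

public void map(Text user, MapWritable groups,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  if (groups.size() > MAX_GROUPS) {
    return;   // drop the heaviest users entirely (this is what makes the result approximate)
  }
  // ... emit group pairs for the remaining users as before ...
}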

Tamer



-- 
Proud to be a follower of the "Best of Mankind"
"وَاذْكُرْ رَبَّكَ إِذَا نَسِيتَ وَقُلْ عَسَى أَنْ يَهْدِيَنِي رَبِّي
لأقْرَبَ مِنْ هَذَا رَشَدًا"

Re: Speed up a job that's been running for 60+ hours (long)

Posted by jkupferman <jk...@umail.ucsb.edu>.
Hi Yuri,
So each user actually outputs (n^2-n)/2 records, where n is the number of
groups it is a member of. If the groups were arranged in an array from
0...n-1, then group x will output a pair with every group from x+1 to n-1.

But yes, it does output a LOT of records. This is why I used the combiner,
which has been shown to decrease the number of output records about 10x, and
based on my understanding the combiner is run locally, so only the combined
records actually make it to the sort.
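
(For reference, the combiner/reducer is just the usual WordCount-style sum --
roughly the following, with illustrative names rather than my exact code; it
is wired in with conf.setCombinerClass(...) and conf.setReducerClass(...).)

// Imports: java.io.IOException, java.util.Iterator, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text pair, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // add up this pair's 1s (or partial sums from the combiner)
    }
    output.collect(pair, new IntWritable(sum));
  }
}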

I took a look at the implementation and the output is buffered, so hopefully
that helps, since if it were written directly to disk on every output it's
understandable why it would be slow. I have io.file.buffer.size set to
4096; since I am outputting so much, should I increase this size quite a
bit? How big should I be looking to make it?
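
(To show which knobs I mean -- the values below are just guesses for
illustration, not measured recommendations, and the job class name is made up.
From what I can tell, io.sort.mb and io.sort.factor control the map-side sort
buffer and merge width, which seem more relevant than io.file.buffer.size when
a mapper emits this many records.)

JobConf conf = new JobConf(GroupPairJob.class);
conf.setInt("io.file.buffer.size", 65536);   // file I/O buffer (default 4096 bytes)
conf.setInt("io.sort.mb", 200);              // map-side sort buffer (default 100 MB)
conf.setInt("io.sort.factor", 25);           // streams merged at once (default 10)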

Thanks for the help







Re: Speed up a job that's been running for 60+ hours (long)

Posted by Ted Dunning <td...@veoh.com>.
Uhh... Emitting all combinations is kinda exponential in the number of
groups.  Even worse, if some user is a member of more than 30 groups, you
will have billions (literally) of records for that user.

Try emitting all *pairs* of groups for each user.  Then filter the pairs to
only retain "interesting" pairs.  Once you have a (restricted) list of
pairs, do the same sort of thing for interesting triples.

If you are fairly restrictive at each step, the output will stay very sparse
and you won't have to consider all of the long combinations that are doomed
to be boring because their sub-combinations are already known to be boring.
This should be much, much faster if you are fierce about restricting how
combinations progress to the next level.

For interestingness, you can use simple thresholding on the number of
occurrences (say, only promote a pair if you find >= 10 occurrences).
Slightly more clever, but probably no faster, would be to use a better
statistical estimate such as the log-likelihood ratio test that I am always
touting and then limit the number of sequences that are allowed to be promoted.
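
Thresholding in the reducer is about as simple as it gets -- a sketch along
these lines (the cutoff of 10 and the names are illustrative; the combiner
stays a plain sum so real counts are not lost):

// Imports as in a standard WordCount job (java.util.Iterator, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*).
public static class ThresholdReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  private static final int MIN_COUNT = 10;   // only "interesting" pairs are promoted

  public void reduce(Text pair, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    if (sum >= MIN_COUNT) {
      output.collect(pair, new IntWritable(sum));   // keep only pairs worth carrying to the next level
    }
  }
}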



Re: Speed up a job that's been running for 60+ hours (long)

Posted by Yuri Kudryavcev <ma...@ykud.com>.
Hi.

I really would like some input on this case, since I'm trying to scale up a
similar algorithm.

I may be totally wrong, so please correct me :)
So you're emitting C(n,2) group pairs from every user record by pairing up
its groups?
For n = 100 groups for an average user, that's 4950 output records for every
user. Do you see similar numbers in the logs?
I think inflating the intermediate record count by that factor is what
degrades performance.

- Yuri.
