Posted to user@pig.apache.org by David King <dk...@ketralnis.com> on 2011/11/30 07:05:36 UTC

Performance problem and profiling

I have a pig script that I've translated from an old Python job. The old script worked by reading a bunch of lines of JSON into sqlite and running queries against that. The sqlite DB ended up being about 1gb on disk by the end of the job (it's about a year's worth of data) and the whole job ran in 40 to 60 minutes single-threaded on a single machine.

The pig version (sadly I require pig 0.6, as I'm running on EMR), much to my surprise, is *way* slower. Run against a few days' worth of data it takes about 15 seconds, against a month's worth it takes about 1.1 minutes, against two months about 3 minutes, but against 3 months of data in local mode on my 4-proc laptop it takes hours. So long, in fact, that it didn't finish in an entire work-day and I had to kill it. On Amazon's EMR I did get a job to finish after 5h44m using 10 m1.small nodes, which is pretty nuts compared to the single-proc Python version.

There are about 15 thousand JSON files totalling 2.1gb (uncompressed), so it's not that big. And the code is, I think, pretty simple. Take a look: http://pastebin.com/3y7e2ZTq . The loader mentioned there is pretty simple too, it's basically a hack of ElephantBird's JSON loader to dive deeper into the JSON and make bags out of JSON lists in addition to simpler maps that EB does http://pastebin.com/dFKX3AJc
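
To give a rough idea of how it gets used (the names and paths below are simplified stand-ins for illustration, not the real ones from the pastebin):

    -- illustrative only: made-up bucket, loader class, and field names
    raw     = LOAD 's3://my-bucket/events/*.json'
              USING com.example.DeepJsonLoader() AS (json:map[]);
    -- scalar JSON fields come back as map values and get cast where needed;
    -- the tweak over ElephantBird is that JSON lists come back as bags too
    authors = FOREACH raw GENERATE (chararray)json#'author' AS author;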

While it's running in local mode on my laptop it outputs a lot of messages (about one per minute) like this:

2011-11-29 18:34:21,518 [Low Memory Detector] INFO  org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700522216(1660666K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
2011-11-29 18:34:30,773 [Low Memory Detector] INFO  org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700519216(1660663K) committed = 2060255232(2011968K) max = 2060255232(2011968K)
2011-11-29 18:34:40,953 [Low Memory Detector] INFO  org.apache.pig.impl.util.SpillableMemoryManager - low memory handler called (Collection threshold exceeded) init = 65404928(63872K) used = 1700518024(1660662K) committed = 2060255232(2011968K) max = 2060255232(2011968K)

But I'm not sure how to read EMR's debugging logs to know if it's doing that in mapreduce mode on EMR too. So my questions are:

1. Is that pig script doing anything that's really that performance-intensive? Is the loader doing anything obviously bad? Why on earth is this so slow? This whole dataset should fit in RAM on 2 nodes, let alone 10.

2. How do people generally go about profiling these scripts?

3. Is that "Low Memory Detector" error in local mode anything to be worried about? Or is it just telling me that some intermediate dataset doesn't fit in RAM and is being spilled to disc?


Re: Performance problem and profiling

Posted by David King <dk...@ketralnis.com>.
> It would be hard to give you any pointers without seeing the script that
> you are using.

I did link the script:

>> There are about 15 thousand JSON files totalling 2.1gb (uncompressed), so
>> it's not that big. And the code is, I think, pretty simple. Take a look:
>> http://pastebin.com/3y7e2ZTq . The loader mentioned there is pretty
>> simple too, it's basically a hack of ElephantBird's JSON loader to dive
>> deeper into the JSON and make bags out of JSON lists in addition to simpler
>> maps that EB does http://pastebin.com/dFKX3AJc

> Lastly, do not consider execution speed on your laptop as a benchmark.
> Hadoop gets its power by running in distributed mode on multiple
> nodes. Local mode will generally perform much worse than a single-threaded
> process since it's trying to mimic what happens on the cluster, which
> requires quite a bit of coordination between mappers and reducers.

5+ hours over 10 nodes in mapreduce mode is pretty damning, local mode slowness or no.

Re: Performance problem and profiling

Posted by Jonathan Coveney <jc...@gmail.com>.
Do you know what stage it is giving you the heap error on? In my
experience, I've seen a couple of common things that can lead to heap errors...
1) not giving it enough heap for bags to spill properly (but >1GB has been
fine in my experience)
2) a gigantic tuple. Bags are the only object in Pig that spills, so a
tuple has to be able to fit in memory. That doesn't seem to be the error here.
3) Distincts. They are succinct, but can be dangerous. Usually distincts in
nested foreaches are the real culprit, but is yours dying on this stage:

userwordvals = DISTINCT userwordvals;
?
If so, I'd try rewriting that as: userwordvals = FOREACH (GROUP
userwordvals BY (all,that,stuff)) GENERATE flatten(group);
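
Written out in two steps (and guessing at the key -- swap in whatever fields
userwordvals actually has), that would look something like:

    -- two-step version of the same dedup; (user, word, val) is just a guessed key
    by_key       = GROUP userwordvals BY (user, word, val);
    userwordvals = FOREACH by_key GENERATE FLATTEN(group) AS (user, word, val);

Semantically it's the same set of unique tuples; it just goes through a plain
group-by instead of the DISTINCT operator.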

If that works, that's your culprit. Either way, it'd be helpful to know
what portion it is dying on.

And yeah, it's amazing how poorly pig/hadoop handles lots of small files.
The recommendation for filecrush is welcome -- I was looking for exactly
that the other day.


Re: Performance problem and profiling

Posted by David King <dk...@ketralnis.com>.
> We went through some grief with small files and inefficiencies there.
[...]
>> Hadoop was engineered to efficiently process a small number of large files
>> and not the other way around. Since PIG utilizes Hadoop it will have a
>> similar limitation. Some improvements have been made on that front
>> (CombinedInputFormat) but the performance is still lacking.

Combining all of the files into just three large files reduces the run-time to 20 minutes on 2 nodes! (compared to 5h40m on 10 nodes). Going to one file per data-day (instead of one per data-hour which is what it was before) keeps it at a still-comfortable 33mins on 2 nodes. I didn't think that 15k files was in the "lots" range, but there you go. Thank you guys so much :)
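
For anyone else hitting this: the concatenation itself doesn't need anything clever. Even a dumb pass-through Pig job would do it. This is only a sketch, with made-up paths:

    -- pass-through consolidation sketch; paths are illustrative
    lines    = LOAD 's3://my-bucket/events/2011/*/*.json'
               USING TextLoader() AS (line:chararray);
    -- GROUP ALL forces everything through a single reducer,
    -- so the output comes back as one part file
    all_rows = GROUP lines ALL;
    out      = FOREACH all_rows GENERATE FLATTEN(lines);
    STORE out INTO 's3://my-bucket/events-combined' USING PigStorage();

One reducer is obviously a bottleneck, but for a one-off pass over 2.1gb of data it's fine.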

This just leaves me with the question of how to get the job to actually complete on my laptop in local mode. It doesn't have to be fast, but having it not die with out-of-memory errors would be a good start for testing purposes. I'm giving it a 2gb heap, which seems like it should be fine since the larger intermediate chunks should be able to spill over to disk, right? Giving it a 3gb heap doesn't seem to change the behaviour; it just takes a few more minutes to die.


Re: Performance problem and profiling

Posted by Jeremy Hanna <je...@gmail.com>.
actually - he just put it on github :)
https://github.com/edwardcapriolo/filecrush

On Nov 30, 2011, at 9:03 AM, Jeremy Hanna wrote:

> We went through some grief with small files and inefficiencies there.  First we went the route of CombinedInputFormat.  That worked for us for a while but then we started getting errors relating to the number of open files.  So we used a utility that Ed Capriolo in the Hadoop/Hive/Cassandra community wrote called FileCrush that crushes down files below a certain file size threshold into larger files so that hadoop can more efficiently deal with them - http://www.jointhegrid.com/hadoop_filecrush/index.jsp.  Ultimately we have some custom code for our specific problem - a parallel downloader from S3 (tons of small files) to HDFS that writes directly to larger sequence files.  FileCrush worked well though fwiw.
> 
> Anyway, you might try CombinedInputFormat and if that doesn't work, perhaps FileCrush - if that's indeed what your bottleneck is.
> 
> Hope that helps.
> 
> Jeremy


Re: Performance problem and profiling

Posted by Jeremy Hanna <je...@gmail.com>.
We went through some grief with small files and inefficiencies there.  First we went the route of CombinedInputFormat.  That worked for us for a while but then we started getting errors relating to the number of open files.  So we used a utility that Ed Capriolo in the Hadoop/Hive/Cassandra community wrote called FileCrush that crushes down files below a certain file size threshold into larger files so that hadoop can more efficiently deal with them - http://www.jointhegrid.com/hadoop_filecrush/index.jsp.  Ultimately we have some custom code for our specific problem - a parallel downloader from S3 (tons of small files) to HDFS that writes directly to larger sequence files.  FileCrush worked well though fwiw.

Anyway, you might try CombinedInputFormat and if that doesn't work, perhaps FileCrush - if that's indeed what your bottleneck is.

Hope that helps.

Jeremy



Re: Performance problem and profiling

Posted by Alex Rovner <al...@gmail.com>.
David,

Hadoop was engineered to efficiently process a small number of large files
and not the other way around. Since PIG utilizes Hadoop it will have a
similar limitation. Some improvements have been made on that front
(CombinedInputFormat) but the performance is still lacking.

The version of PIG you are using is plagued with memory issues, which would
affect performance, especially when accumulating large amounts of data in
the reducers.

It would be hard to give you any pointers without seeing the script that
you are using.

Generally though, you might want to consider the following:

1. Using a newer version of PIG (there might be a workaround that can be
put in place on EMR to do that).
2. Increasing the memory available to each mapper / reducer (see the sketch
just after this list).
3. Reducing the number of input files by concatenating small files into one
or a few large files. For example, combine daily files into monthly or
yearly files.
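
For #2 on a 0.20-era Hadoop cluster, the relevant property is
mapred.child.java.opts. Newer versions of PIG let you set it directly in the
script; on 0.6 / EMR you would have to push it through the job or cluster
configuration instead. Roughly:

    -- sketch only: script-level SET of arbitrary Hadoop properties needs a
    -- newer PIG than 0.6; on 0.6 put the same property in mapred-site.xml
    -- or the EMR job configuration instead
    SET mapred.child.java.opts '-Xmx1024m';

The default is only 200m per task, which is easy to blow through when a
reducer accumulates a lot of data.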

Lastly, do not consider execution speed on your laptop as a benchmark.
Hadoop gets its power by running in distributed mode on multiple
nodes. Local mode will generally perform much worse than a single-threaded
process since it's trying to mimic what happens on the cluster, which
requires quite a bit of coordination between mappers and reducers.

Alex

