Posted to common-user@hadoop.apache.org by Han JU <ju...@gmail.com> on 2013/04/26 11:21:54 UTC

M/R job optimization

Hi,

I've implemented an algorithm with Hadoop; it's a series of 4 jobs. My
question is that in one of the jobs, map and reduce tasks show 100% finished
in about 1m 30s, but I have to wait another 5m for the job to finish.
This job writes about 720 MB of compressed data to HDFS with replication factor
1, in SequenceFile format. I've tried copying the same data to HDFS manually;
it takes less than 20 seconds. What happens during these extra 5 minutes?

Any idea on how to optimize this part?

Thanks.

-- 
*JU Han*

UTC   -  Université de Technologie de Compiègne
*     **GI06 - Fouille de Données et Décisionnel*

+33 0619608888

Re: M/R job optimization

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
I do not think the skewed-reducer hint is the problem here, since Han
mentioned that he has to wait 5 minutes after the job shows 100% map and
100% reduce progress. It may have something to do with the output
committer; FileOutputCommitter needs to be looked at to see what it is
doing for those 5 minutes. Why does committing the job take so much time?
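To see why a commit could take time, it helps to know the pattern FileOutputCommitter follows: each task writes its output under a `_temporary` subdirectory of the job's output directory, and committing largely means renaming those files into their final place and cleaning `_temporary` up. Below is a minimal local-filesystem sketch of that pattern for illustration only (the class, paths and attempt name are made up; the real committer operates against HDFS):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the FileOutputCommitter pattern: task output goes under
// _temporary/<attempt>, and "commit" is mostly renames plus cleanup.
// Local-filesystem simulation; not Hadoop's actual code.
public class CommitPatternDemo {
    static Path runCommit() throws IOException {
        Path out = Files.createTempDirectory("job-output");
        Path attempt = out.resolve("_temporary").resolve("attempt_0001");
        Files.createDirectories(attempt);
        Files.writeString(attempt.resolve("part-00000"), "part data\n");

        // Commit: move the task's file into the final directory,
        // then remove the _temporary working area.
        Files.move(attempt.resolve("part-00000"), out.resolve("part-00000"));
        Files.delete(attempt);
        Files.delete(out.resolve("_temporary"));
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path out = runCommit();
        System.out.println(Files.exists(out.resolve("part-00000")));
    }
}
```

Watching the job's output directory while the job sits at 100% (e.g. listing whether `_temporary` still exists) would show whether the time is going into this commit/cleanup step.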

Thanks,
Rahul


On Mon, Apr 29, 2013 at 9:29 PM, Ted Xu <tx...@gopivotal.com> wrote:

> Hi Han,
>
> I think your point is valid. In fact you can change the progress report
> logic by manually calling the Reporter API, but by default it is quite
> straightforward. Reducer progress is divided into 3 phases, namely the
> copy phase, the merge/sort phase and the reduce phase, each worth ~33%.
> In your case it looks like your program is stuck in the reduce phase. To
> better track the cause, you can check the task log, as Ted Dunning suggested before.
>
>
> On Mon, Apr 29, 2013 at 11:17 PM, Han JU <ju...@gmail.com> wrote:
>
>> Thanks Ted and .. Ted ..
>> I've been looking at the progress while the job is executing.
>> In fact, I think it's not a skewed partition problem. I've looked at the
>> mapper output files: all are of the same size, and each reducer takes a
>> single group.
>> What I want to know is how the Hadoop M/R framework calculates the
>> progress percentage.
>> For example, my reducer:
>>
>> reducer(...) {
>>   call_of_another_func() // lots of complicated calculations
>> }
>>
>> Will the percentage reflect the calculation inside the function call?
>> I ask because I observed that in the job, all reducers reached 100%
>> fairly quickly, then got stuck there. During this time, the datanodes
>> seem to be working.
>>
>> Thanks.
>>
>>
>> 2013/4/26 Ted Dunning <td...@maprtech.com>
>>
>>> Have you checked the logs?
>>>
>>> Is there a task that is taking a long time?  What is that task doing?
>>>
>>> There are two basic possibilities:
>>>
>>> a) you have a skewed join like the other Ted mentioned.  In this case,
>>> the straggler will be seen to be working on data.
>>>
>>> b) you have a hung process.  This can be more difficult to diagnose, but
>>> indicates that there is a problem with your cluster.
>>>
>>>
>>>
>>> On Fri, Apr 26, 2013 at 2:21 AM, Han JU <ju...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've implemented an algorithm with Hadoop; it's a series of 4 jobs. My
>>>> question is that in one of the jobs, map and reduce tasks show 100% finished
>>>> in about 1m 30s, but I have to wait another 5m for the job to finish.
>>>> This job writes about 720 MB of compressed data to HDFS with replication
>>>> factor 1, in SequenceFile format. I've tried copying the same data to HDFS
>>>> manually; it takes less than 20 seconds. What happens during these extra 5 minutes?
>>>>
>>>> Any idea on how to optimize this part?
>>>>
>>>> Thanks.
>>>>
>>>> --
>>>> *JU Han*
>>>>
>>>> UTC   -  Université de Technologie de Compiègne
>>>> *     **GI06 - Fouille de Données et Décisionnel*
>>>>
>>>> +33 0619608888
>>>>
>>>
>>>
>>
>>
>> --
>> *JU Han*
>>
>> Software Engineer Intern @ KXEN Inc.
>> UTC   -  Université de Technologie de Compiègne
>> *     **GI06 - Fouille de Données et Décisionnel*
>>
>> +33 0619608888
>>
>
>
>
> --
> Regards,
> Ted Xu
>

Re: M/R job optimization

Posted by Ted Xu <tx...@gopivotal.com>.
Hi Han,

I think your point is valid. In fact you can change the progress report
logic by manually calling the Reporter API, but by default it is quite
straightforward. Reducer progress is divided into 3 phases, namely the
copy phase, the merge/sort phase and the reduce phase, each worth ~33%.
In your case it looks like your program is stuck in the reduce phase. To
better track the cause, you can check the task log, as Ted Dunning suggested before.
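As a rough illustration of that three-phase split, the reported percentage can be modelled as each phase contributing one third of the total. This is a simplified sketch, not Hadoop's actual internal accounting:

```java
// Toy model of reducer progress reporting: copy, merge/sort and reduce
// phases each contribute one third of the overall percentage.
// Illustrative only; the real bookkeeping lives inside the framework.
public class ReducerProgressModel {
    /**
     * @param phase         0 = copy, 1 = merge/sort, 2 = reduce
     * @param phaseProgress fraction of the current phase done, in [0, 1]
     * @return overall reducer progress in [0, 1]
     */
    static double overallProgress(int phase, double phaseProgress) {
        return (phase + phaseProgress) / 3.0;
    }

    public static void main(String[] args) {
        // Halfway through merge/sort -> 50% overall.
        System.out.println(overallProgress(1, 0.5));
        // Reduce phase just started -> about 67% overall.
        System.out.println(overallProgress(2, 0.0));
    }
}
```

Under this model, a reducer that has finished consuming its input shows 100% even if user code is still running, which is consistent with the behaviour described in this thread.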


On Mon, Apr 29, 2013 at 11:17 PM, Han JU <ju...@gmail.com> wrote:

> Thanks Ted and .. Ted ..
> I've been looking at the progress while the job is executing.
> In fact, I think it's not a skewed partition problem. I've looked at the
> mapper output files: all are of the same size, and each reducer takes a
> single group.
> What I want to know is how the Hadoop M/R framework calculates the
> progress percentage.
> For example, my reducer:
>
> reducer(...) {
>   call_of_another_func() // lots of complicated calculations
> }
>
> Will the percentage reflect the calculation inside the function call?
> I ask because I observed that in the job, all reducers reached 100%
> fairly quickly, then got stuck there. During this time, the datanodes
> seem to be working.
>
> Thanks.
>
>
> 2013/4/26 Ted Dunning <td...@maprtech.com>
>
>> Have you checked the logs?
>>
>> Is there a task that is taking a long time?  What is that task doing?
>>
>> There are two basic possibilities:
>>
>> a) you have a skewed join like the other Ted mentioned.  In this case,
>> the straggler will be seen to be working on data.
>>
>> b) you have a hung process.  This can be more difficult to diagnose, but
>> indicates that there is a problem with your cluster.
>>
>>
>>
>> On Fri, Apr 26, 2013 at 2:21 AM, Han JU <ju...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've implemented an algorithm with Hadoop; it's a series of 4 jobs. My
>>> question is that in one of the jobs, map and reduce tasks show 100% finished
>>> in about 1m 30s, but I have to wait another 5m for the job to finish.
>>> This job writes about 720 MB of compressed data to HDFS with replication
>>> factor 1, in SequenceFile format. I've tried copying the same data to HDFS
>>> manually; it takes less than 20 seconds. What happens during these extra 5 minutes?
>>>
>>> Any idea on how to optimize this part?
>>>
>>> Thanks.
>>>
>>> --
>>> *JU Han*
>>>
>>> UTC   -  Université de Technologie de Compiègne
>>> *     **GI06 - Fouille de Données et Décisionnel*
>>>
>>> +33 0619608888
>>>
>>
>>
>
>
> --
> *JU Han*
>
> Software Engineer Intern @ KXEN Inc.
> UTC   -  Université de Technologie de Compiègne
> *     **GI06 - Fouille de Données et Décisionnel*
>
> +33 0619608888
>



-- 
Regards,
Ted Xu

Re: M/R job optimization

Posted by Han JU <ju...@gmail.com>.
Thanks Ted and .. Ted ..
I've been looking at the progress while the job is executing.
In fact, I think it's not a skewed partition problem. I've looked at the
mapper output files: all are of the same size, and each reducer takes a
single group.
What I want to know is how the Hadoop M/R framework calculates the progress
percentage.
For example, my reducer:

reducer(...) {
  call_of_another_func() // lots of complicated calculations
}

Will the percentage reflect the calculation inside the function call?
I ask because I observed that in the job, all reducers reached 100% fairly
quickly, then got stuck there. During this time, the datanodes seem to be
working.

Thanks.
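One way to picture what was asked above: reduce-phase progress tracks how much reducer input has been consumed, not how much CPU time a helper function burns. Here is a self-contained toy simulation of that behaviour (plain Java, no Hadoop dependency; all names are illustrative, not Hadoop APIs):

```java
import java.util.List;

// Toy model: "progress" is the fraction of input groups handed to reduce().
// Heavy computation inside the call does not advance it, which is why a
// reducer can sit at a fixed percentage while still doing real work.
public class ReduceProgressSim {
    private int groupsSeen = 0;
    private final int totalGroups;

    ReduceProgressSim(int totalGroups) { this.totalGroups = totalGroups; }

    double progress() { return (double) groupsSeen / totalGroups; }

    void reduce(String key, List<Integer> values) {
        groupsSeen++;                 // progress moves per input group...
        expensiveComputation(values); // ...not per unit of work done here
    }

    private void expensiveComputation(List<Integer> values) {
        long acc = 0;
        for (int v : values) acc += (long) v * v; // stand-in for real work
    }
}
```

Note the last group: in this model, progress reads 100% as soon as the final group is handed to reduce(), even while the expensive computation for it is still running, which matches reducers reaching 100% quickly and then appearing stuck.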


2013/4/26 Ted Dunning <td...@maprtech.com>

> Have you checked the logs?
>
> Is there a task that is taking a long time?  What is that task doing?
>
> There are two basic possibilities:
>
> a) you have a skewed join like the other Ted mentioned.  In this case, the
> straggler will be seen to be working on data.
>
> b) you have a hung process.  This can be more difficult to diagnose, but
> indicates that there is a problem with your cluster.
>
>
>
> On Fri, Apr 26, 2013 at 2:21 AM, Han JU <ju...@gmail.com> wrote:
>
>> Hi,
>>
>> I've implemented an algorithm with Hadoop, it's a series of 4 jobs. My
>> questionis that in one of the jobs, map and reduce tasks show 100% finished
>> in about 1m 30s, but I have to wait another 5m for this job to finish.
>> This job writes about 720mb compressed data to HDFS with replication
>> factor 1, in sequence file format. I've tried copying these data to hdfs,
>> it takes only < 20 seconds. What happened during this 5 more minutes?
>>
>> Any idea on how to optimize this part?
>>
>> Thanks.
>>
>> --
>> *JU Han*
>>
>> UTC   -  Université de Technologie de Compiègne
>> *     **GI06 - Fouille de Données et Décisionnel*
>>
>> +33 0619608888
>>
>
>


-- 
*JU Han*

Software Engineer Intern @ KXEN Inc.
UTC   -  Université de Technologie de Compiègne
*     **GI06 - Fouille de Données et Décisionnel*

+33 0619608888

Re: M/R job optimization

Posted by Ted Dunning <td...@maprtech.com>.
Have you checked the logs?

Is there a task that is taking a long time?  What is that task doing?

There are two basic possibilities:

a) you have a skewed join like the other Ted mentioned.  In this case, the
straggler will be seen to be working on data.

b) you have a hung process.  This can be more difficult to diagnose, but
indicates that there is a problem with your cluster.
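One practical way to tell the two cases apart is to compare per-task runtimes (e.g. copied from the JobTracker web UI): a skewed task runs much longer than its siblings while its counters keep advancing, whereas a hung task stops updating its counters entirely. The helper below is a hypothetical sketch, not part of this thread — it simply flags tasks whose runtime far exceeds the median:

```java
import java.util.*;

// Hypothetical helper: flag straggler tasks whose runtime is much
// longer than the median runtime of all tasks in the same phase.
public class StragglerCheck {
    static List<Integer> stragglers(long[] secs, double factor) {
        long[] sorted = secs.clone();
        Arrays.sort(sorted);
        long median = sorted[sorted.length / 2];
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < secs.length; i++)
            if (secs[i] > median * factor) out.add(i);
        return out;
    }

    public static void main(String[] args) {
        // Four tasks finish in ~90s; task 4 takes over 6 minutes.
        long[] taskSeconds = {90, 95, 88, 92, 390};
        System.out.println(stragglers(taskSeconds, 2.0));  // prints [4]
    }
}
```

If the flagged task's counters are still moving, suspect skew (case a); if they are frozen, take a thread dump of that task's JVM and look for where it is stuck (case b).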



On Fri, Apr 26, 2013 at 2:21 AM, Han JU <ju...@gmail.com> wrote:

> Hi,
>
> I've implemented an algorithm with Hadoop; it's a series of 4 jobs. My
> question is that in one of the jobs, the map and reduce tasks show 100% finished
> in about 1m 30s, but I have to wait another 5m for the job to finish.
> This job writes about 720 MB of compressed data to HDFS with replication
> factor 1, in sequence file format. I've tried copying the same data to HDFS
> directly, and it takes < 20 seconds. What happens during these extra 5 minutes?
>
> Any idea on how to optimize this part?
>
> Thanks.
>
> --
> *JU Han*
>
> UTC   -  Université de Technologie de Compiègne
> *     **GI06 - Fouille de Données et Décisionnel*
>
> +33 0619608888
>

Re: M/R job optimization

Posted by Ted Xu <tx...@gopivotal.com>.
Hi Han,

It may be caused by skewed partitioning, which means some reducers are
assigned much more data than average, causing a long tail. To verify that,
check the task counters and see whether the partitioning is balanced.

Some tools implement specific algorithms to handle this issue, for
example Pig's skewed join (http://wiki.apache.org/pig/PigSkewedJoinSpec)
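To see how such a long tail arises, here is a small standalone sketch (hypothetical, not from this thread) that mimics the assignment rule of Hadoop's default HashPartitioner, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`: one hot key sends all of its records to a single reducer, and that reducer's runtime then dominates the job.

```java
import java.util.*;

// Hypothetical illustration of skewed partitioning under the default
// HashPartitioner assignment rule.
public class SkewCheck {
    // Count how many records each reduce partition would receive.
    static long[] partitionCounts(List<String> keys, int numReducers) {
        long[] counts = new long[numReducers];
        for (String key : keys)
            counts[(key.hashCode() & Integer.MAX_VALUE) % numReducers]++;
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add("user42");   // hot key
        for (int i = 0; i < 100; i++) keys.add("user" + i);  // long tail
        long[] counts = partitionCounts(keys, 4);
        // One partition holds at least the hot key's 1000 records, so
        // that reducer finishes long after the others.
        System.out.println(Arrays.toString(counts));
    }
}
```

Comparing the "Reduce input records" counter across tasks in a real job gives you the same picture without any extra code.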


On Fri, Apr 26, 2013 at 5:21 PM, Han JU <ju...@gmail.com> wrote:

> Hi,
>
> I've implemented an algorithm with Hadoop; it's a series of 4 jobs. My
> question is that in one of the jobs, the map and reduce tasks show 100% finished
> in about 1m 30s, but I have to wait another 5m for the job to finish.
> This job writes about 720 MB of compressed data to HDFS with replication
> factor 1, in sequence file format. I've tried copying the same data to HDFS
> directly, and it takes < 20 seconds. What happens during these extra 5 minutes?
>
> Any idea on how to optimize this part?
>
> Thanks.
>
> --
> *JU Han*
>
> UTC   -  Université de Technologie de Compiègne
> *     **GI06 - Fouille de Données et Décisionnel*
>
> +33 0619608888
>


Regards,
----
Ted Xu
