Posted to common-user@hadoop.apache.org by Mathias De Maré <ma...@gmail.com> on 2009/08/05 13:16:58 UTC
Re: Some tasks fail to report status between the end of the map and the beginning of the merge
> On Wed, Aug 5, 2009 at 9:38 AM, Jothi Padmanabhan <jo...@yahoo-inc.com> wrote:
> Hi,
>
> Could you please try setting this parameter
> mapred.merge.recordsBeforeProgress to a lower number?
> See https://issues.apache.org/jira/browse/HADOOP-4714
>
> Cheers
> Jothi
Hm, that bug looks like it applies during the merge, but in my case the block
happens right before the merge (seemingly right after all of the map tasks
finish).
I tried setting mapred.merge.recordsBeforeProgress to 100, and it didn't
make a difference.
On Wed, Aug 5, 2009 at 10:32 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
> 10 mins reminds me of the parameter mapred.task.timeout. This is configurable.
> Alternatively you might just do a sysout to let the tracker know of its
> existence (not an ideal solution though).
>
> Thanks,
> Amogh
Well, the map tasks take around 30 minutes to run. Letting the task idle for
many more minutes after that is a lot of wasted time, imho. I tried
with a 20-minute timeout now, but I still get timeouts.
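Explicitly reporting progress from inside the map task (rather than a plain
sysout) is presumably the cleaner variant of that suggestion; a minimal sketch
against the old org.apache.hadoop.mapred API, where the key/value types and
the status message are illustrative only:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ProgressReportingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // ... long-running per-record work goes here ...

        // Ping the TaskTracker so the attempt is not killed after
        // mapred.task.timeout; both calls below count as progress.
        reporter.progress();
        reporter.setStatus("still crawling " + key);

        output.collect(new Text(key.toString()), value);
    }
}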
I don't know if it's useful, but here are the settings of the map tasks at
the moment:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>3</value>
    <description>The total amount of buffer memory to use while sorting
    files, in megabytes. By default, gives each merge stream 1MB, which
    should minimize seeks.</description>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
    <description>The maximum number of map tasks that will be run
    simultaneously by a task tracker.
    </description>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
    <description>The maximum number of reduce tasks that will be run
    simultaneously by a task tracker.
    </description>
  </property>
  <property>
    <name>mapred.max.split.size</name>
    <value>1000000</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
  <property>
    <name>mapred.merge.recordsBeforeProgress</name>
    <value>100</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>1200000</value>
  </property>
</configuration>
Ideally, I would like to get rid of the delay that causes the timeouts, yet
also increase the split size somewhat (though I think a larger split size
would increase the delay even more?).
The map tasks take around 8,000-11,000 records as input, and can produce up to
1,000,000 records as output (in case this is relevant).
Mathias
Re: Some tasks fail to report status between the end of the map and the beginning of the merge
Posted by Mathias De Maré <ma...@gmail.com>.
2009/8/12 Mathias De Maré <ma...@gmail.com>
> Thank you, that's very useful.
> In addition, I changed the way the tasks work, so they store their data in
> HBase now (since it's more suited for handling small files).
> I'm not 100% sure yet if the problems have been resolved (still doing
> extensive testing), but I think I might have gotten rid of them (and I'll
> add the 'skipping records' option in case I do get a failure).
>
Hi,
I can get everything to 'run' successfully now, but there are still some
tasks that crash.
I was thinking perhaps my Writable class is the issue, so I'll just post it
here. Does anyone notice anything that could cause a hang? In particular the
readFields and write methods could perhaps be the reason (but I just don't
see it).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

/**
 * Contains information on a URL: which other URLs link to it and whether it
 * has been crawled previously.
 * @author mathias
 */
public class URLInfo implements Writable, WritableComparable {

    String url;
    Text[] linkedfrom;
    int urlStatus;
    int seconds;

    public URLInfo() {
        url = "";
        linkedfrom = new Text[0];
        urlStatus = Constants.URL_NEW;
        seconds = 1;
    }

    /**
     * @param url
     * @param linkedfrom Must only contain domain names, nothing appended
     * @param urlStatus
     * @param seconds
     */
    public URLInfo(String url, Text[] linkedfrom, int urlStatus, int seconds) {
        this.url = url;
        this.linkedfrom = linkedfrom;
        this.urlStatus = urlStatus;
        this.seconds = seconds;
    }

    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order: url, linkedfrom, urlStatus, seconds.
        new Text(url).write(out);
        new ArrayWritable(Text.class, linkedfrom).write(out);
        new IntWritable(urlStatus).write(out);
        new IntWritable(seconds).write(out);
    }

    public void readFields(DataInput in) throws IOException {
        // Deserialize in the same order as write().
        url = Text.readString(in);
        ArrayWritable aw = new ArrayWritable(Text.class);
        aw.readFields(in);
        Writable[] linkedfromWritable = aw.get();
        linkedfrom = new Text[linkedfromWritable.length];
        for (int i = 0; i < linkedfromWritable.length; i++) {
            linkedfrom[i] = (Text) linkedfromWritable[i];
        }
        IntWritable iw = new IntWritable();
        iw.readFields(in);
        urlStatus = iw.get();
        IntWritable iw2 = new IntWritable();
        iw2.readFields(in);
        seconds = iw2.get();
    }

    public int compareTo(Object o) {
        return url.compareToIgnoreCase(((URLInfo) o).url);
    }

    public void setURLStatus(int urlStatus) {
        this.urlStatus = urlStatus;
    }

    public int getURLStatus() {
        return urlStatus;
    }

    public void setLinkedFrom(Text[] linkedfrom) {
        this.linkedfrom = linkedfrom;
    }

    public Text[] getLinkedFrom() {
        return linkedfrom;
    }

    public String getURL() {
        return new String(url);
    }

    public int getSeconds() {
        return seconds;
    }

    public void setSeconds(int seconds) {
        this.seconds = seconds;
    }

    @Override
    public String toString() {
        return new String(url);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof URLInfo)) {
            return false;
        }
        URLInfo urlObject = (URLInfo) obj;
        return this.getURL().equals(urlObject.getURL());
    }
}
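One quick way to rule out an asymmetry between write() and readFields() is to
round-trip an instance through Hadoop's in-memory buffers; a minimal sketch,
reusing the URLInfo class above (the sample values are made up):

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;

public class URLInfoRoundTrip {
    public static void main(String[] args) throws Exception {
        URLInfo original = new URLInfo("http://example.org/page",
                new Text[] { new Text("example.com") }, Constants.URL_NEW, 5);

        // Serialize into an in-memory buffer.
        DataOutputBuffer out = new DataOutputBuffer();
        original.write(out);

        // Deserialize the same bytes back into a fresh instance.
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        URLInfo copy = new URLInfo();
        copy.readFields(in);

        // equals() only compares the URL, so this at least checks that the
        // field order survived the round trip for that field.
        System.out.println("round trip ok: " + original.equals(copy));
    }
}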
Re: Some tasks fail to report status between the end of the map and the beginning of the merge
Posted by Mathias De Maré <ma...@gmail.com>.
Thank you, that's very useful.
In addition, I changed the way the tasks work, so they store their data in
HBase now (since it's more suited for handling small files).
I'm not 100% sure yet if the problems have been resolved (still doing
extensive testing), but I think I might have gotten rid of them (and I'll
add the 'skipping records' option in case I do get a failure).
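For what it's worth, the write into HBase from a task boils down to something
like this with the 0.20-era client API (a sketch only; the table name, column
family and qualifier are made-up placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlStore {
    // Store one fetched page in HBase, keyed by its URL.
    public static void storePage(String url, byte[] pageData) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "urls");
        Put put = new Put(Bytes.toBytes(url));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("raw"), pageData);
        table.put(put); // auto-flush is on by default, so the row is sent immediately
    }
}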
Mathias
On Mon, Aug 10, 2009 at 5:46 PM, Koji Noguchi <kn...@yahoo-inc.com> wrote:
> > but I didn't find a config option
> > that allows ignoring tasks that fail.
> >
> If 0.18,
>
> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)
> (mapred.max.map.failures.percent)
>
>
>
> http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)
> (mapred.max.reduce.failures.percent)
>
>
> If 0.19 or later, you can also try skipping records.
>
>
> Koji
>
Re: Some tasks fail to report status between the end of the map and the beginning of the merge
Posted by Koji Noguchi <kn...@yahoo-inc.com>.
> but I didn't find a config option
> that allows ignoring tasks that fail.
>
If 0.18,
http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxMapTaskFailuresPercent(int)
(mapred.max.map.failures.percent)
http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/JobConf.html#setMaxReduceTaskFailuresPercent(int)
(mapred.max.reduce.failures.percent)
If 0.19 or later, you can also try skipping records.
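A minimal sketch of setting both on the job configuration with the old
org.apache.hadoop.mapred API (the 5% threshold and the skip settings are only
illustrative values):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class FailureTolerantJobSetup {
    public static JobConf configure(JobConf conf) {
        // Allow up to 5% of map/reduce tasks to fail without failing the job
        // (mapred.max.map.failures.percent / mapred.max.reduce.failures.percent).
        conf.setMaxMapTaskFailuresPercent(5);
        conf.setMaxReduceTaskFailuresPercent(5);

        // On 0.19 or later: after two failed attempts of a task,
        // start skipping the record(s) that make it crash.
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
        return conf;
    }
}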
Koji
On 8/9/09 2:18 AM, "Mathias De Maré" <ma...@gmail.com> wrote:
> I changed the maximum split size to 30000, and now most tasks actually
> succeed.
> However, I still have the failure problem with some tasks (with a job I was
> running yesterday, I got a failure after 1900 tasks).
> The problem is that these very few failures can bring down the entire job,
> as they sometimes seem to just keep failing.
> I looked through the mapred-default.xml, but I didn't find a config option
> that allows ignoring tasks that fail. Is there a way to do this (it seems
> like the only alternative I have, since I can't make the failures stop)?
>
> Mathias
>
> 2009/8/5 Mathias De Maré <ma...@gmail.com>
>
>>
>> On Wed, Aug 5, 2009 at 9:38 AM, Jothi Padmanabhan
>> <jo...@yahoo-inc.com>wrote:
>>> Hi,
>>>
>>> Could you please try setting this parameter
>>> mapred.merge.recordsBeforeProgress to a lower number?
>>> See https://issues.apache.org/jira/browse/HADOOP-4714
>>>
>>> Cheers
>>> Jothi
>>
>>
>> Hm, that bug looks like it's applicable during the merge, but my case is a
>> block right before the merge (but seemingly right after all of the map tasks
>> finish).
>> I tried putting mapred.merge.recordsBeforeProgress to 100, and it didn't
>> make a difference.
>>
>> On Wed, Aug 5, 2009 at 10:32 AM, Amogh Vasekar <am...@yahoo-inc.com>wrote:
>>
>>> 10 mins reminds me of parameter mapred.task.timeout . This is
>>> configurable. Or alternatively you might just do a sysout to let tracker
>>> know of its existence ( not an ideal solution though )
>>>
>>> Thanks,
>>> Amogh
>>
>>
>> Well, the map tasks take around 30 minutes to run. Letting the task idle
>> for a large number of minutes after that is a lot of useless time, imho. I
>> tried with 20 minutes now, but I still get timeouts.
>>
>> I don't know if it's useful, but here are the settings of the map tasks at
>> the moment:
>>
>> <configuration>
>> <property>
>> <name>mapred.job.tracker</name>
>> <value>localhost:9001</value>
>> </property>
>> <property>
>> <name>io.sort.mb</name>
>> <value>3</value>
>> <description>The total amount of buffer memory to use while sorting
>> files, in megabytes. By default, gives each merge stream 1MB, which
>> should minimize seeks.</description>
>> </property>
>> <property>
>> <name>mapred.tasktracker.map.tasks.maximum</name>
>> <value>4</value>
>> <description>The maximum number of map tasks that will be run
>> simultaneously by a task tracker.
>> </description>
>> </property>
>>
>> <property>
>> <name>mapred.tasktracker.reduce.tasks.maximum</name>
>> <value>4</value>
>> <description>The maximum number of reduce tasks that will be run
>> simultaneously by a task tracker.
>> </description>
>> </property>
>>
>> <property>
>> <name>mapred.max.split.size</name>
>> <value>1000000</value>
>> </property>
>>
>> <property>
>> <name>mapred.child.java.opts</name>
>> <value>-Xmx400m</value>
>> </property>
>>
>> <property>
>> <name>mapred.merge.recordsBeforeProgress</name>
>> <value>100</value>
>> </property>
>>
>> <property>
>> <name>mapred.task.timeout</name>
>> <value>1200000</value>
>> </property>
>>
>> </configuration>
>>
>> Ideally, I would want to get rid of the delay that causes the timeouts, yet
>> also increase the split size somewhat (though I think a larger split size
>> would increase the delay even more?).
>> The map tasks take around 8000-11000 records as input, and can produce up
>> to 1 000 000 records as output (in case this is relevant).
>>
>> Mathias
>>
>>
Re: Some tasks fail to report status between the end of the map and the beginning of the merge
Posted by Mathias De Maré <ma...@gmail.com>.
I changed the maximum split size to 30000, and now most tasks actually
succeed.
However, I still have the failure problem with some tasks (with a job I was
running yesterday, I got a failure after 1900 tasks).
The problem is that these very few failures can bring down the entire job,
as they sometimes seem to just keep failing.
I looked through the mapred-default.xml, but I didn't find a config option
that allows ignoring tasks that fail. Is there a way to do this (it seems
like the only alternative I have, since I can't make the failures stop)?
Mathias
2009/8/5 Mathias De Maré <ma...@gmail.com>
>
> On Wed, Aug 5, 2009 at 9:38 AM, Jothi Padmanabhan <jo...@yahoo-inc.com> wrote:
>> Hi,
>>
>> Could you please try setting this parameter
>> mapred.merge.recordsBeforeProgress to a lower number?
>> See https://issues.apache.org/jira/browse/HADOOP-4714
>>
>> Cheers
>> Jothi
>
>
> Hm, that bug looks like it's applicable during the merge, but my case is a
> block right before the merge (but seemingly right after all of the map tasks
> finish).
> I tried putting mapred.merge.recordsBeforeProgress to 100, and it didn't
> make a difference.
>
> On Wed, Aug 5, 2009 at 10:32 AM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
>
>> 10 mins reminds me of parameter mapred.task.timeout . This is
>> configurable. Or alternatively you might just do a sysout to let tracker
>> know of its existence ( not an ideal solution though )
>>
>> Thanks,
>> Amogh
>
>
> Well, the map tasks take around 30 minutes to run. Letting the task idle
> for a large number of minutes after that is a lot of useless time, imho. I
> tried with 20 minutes now, but I still get timeouts.
>
> I don't know if it's useful, but here are the settings of the map tasks at
> the moment:
>
> <configuration>
> <property>
> <name>mapred.job.tracker</name>
> <value>localhost:9001</value>
> </property>
> <property>
> <name>io.sort.mb</name>
> <value>3</value>
> <description>The total amount of buffer memory to use while sorting
> files, in megabytes. By default, gives each merge stream 1MB, which
> should minimize seeks.</description>
> </property>
> <property>
> <name>mapred.tasktracker.map.tasks.maximum</name>
> <value>4</value>
> <description>The maximum number of map tasks that will be run
> simultaneously by a task tracker.
> </description>
> </property>
>
> <property>
> <name>mapred.tasktracker.reduce.tasks.maximum</name>
> <value>4</value>
> <description>The maximum number of reduce tasks that will be run
> simultaneously by a task tracker.
> </description>
> </property>
>
> <property>
> <name>mapred.max.split.size</name>
> <value>1000000</value>
> </property>
>
> <property>
> <name>mapred.child.java.opts</name>
> <value>-Xmx400m</value>
> </property>
>
> <property>
> <name>mapred.merge.recordsBeforeProgress</name>
> <value>100</value>
> </property>
>
> <property>
> <name>mapred.task.timeout</name>
> <value>1200000</value>
> </property>
>
> </configuration>
>
> Ideally, I would want to get rid of the delay that causes the timeouts, yet
> also increase the split size somewhat (though I think a larger split size
> would increase the delay even more?).
> The map tasks take around 8000-11000 records as input, and can produce up
> to 1 000 000 records as output (in case this is relevant).
>
> Mathias
>
>