Posted to user@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2013/12/01 09:35:53 UTC

Re: Test naivebayes task running really slowly and not in distributed mode

Did the training run use both machines?

How large is the input for the test run?

Is it contained in a single file?




On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos <
fernandoleandro1991@gmail.com> wrote:

> Hello everyone,
>
> I'm trying to do a text classification task. My dataset is not that big: I
> have around 700,000 small comments.
>
> Following the 20newsgroups example, I created the vectors from the text,
> split them and trained the model. Now I'm trying to test it, but it is
> really slow and I cannot get it to run on the cluster. Whatever I do,
> it always runs on just one machine. And I think the testnb algorithm is
> supposed to run using MapReduce, right?
>
> I also tried this example here (
>
> http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/
> )
> but again, the other box in the cluster is not executing any tasks. In fact,
> when I run testnb or the MapReduceClassifier proposed in the tutorial
> above, I get one job executing one task, and this task runs really
> slowly (about 6 minutes to reach 0.13% of the task).
>
> I think I must be doing something wrong, so the cluster is not working
> the way it is supposed to.
>
> I have a cluster of 2 boxes configured with Hadoop 0.20.205.0 and
> Mahout 0.8.
>
> I also tried versions 0.7 and 0.6 of Mahout, but nothing changed.
>
> Any help would be appreciated.
>
>
> The logs I have from this task:
>
>
> *stdout logs*
>
> Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
> /usr/local/hadoop/lib/libhadoop.so which might have disabled stack
> guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c
> <libfile>', or link it with '-z noexecstack'.
>
>
> *syslog logs*
>
> 2013-11-30 17:09:19,191 WARN org.apache.hadoop.util.NativeCodeLoader:
> Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> 2013-11-30 17:09:19,400 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
> already exists!
> 2013-11-30 17:09:19,472 INFO org.apache.hadoop.util.ProcessTree:
> setsid exited with exit code 0
> 2013-11-30 17:09:19,474 INFO org.apache.hadoop.mapred.Task:  Using
> ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5810d963
> 2013-11-30 17:09:19,543 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb
> = 100
> 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: data
> buffer = 79691776/99614720
> 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: record
> buffer = 262144/327680
>
>
>
>
>
> --
> Fernando Santos
> +55 61 8129 8505
>

Re: Test naivebayes task running really slowly and not in distributed mode

Posted by Fernando Santos <fe...@gmail.com>.

I figured out what the problem was.

First of all, the data was not big enough for the job to be split into more
than one task: the training file was 30MB and my block size was 64MB.

Besides that, I set the number of map (mapred.map.tasks) and reduce (
mapred.reduce.tasks) tasks in Hadoop's mapred-site.xml file.

After that, the algorithm ran in an acceptable time.
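
To illustrate the split arithmetic, here is a rough sketch in Python. It is
a simplification, not Hadoop's actual code: it assumes the split size equals
the HDFS block size (the usual default) and that the last split may run up to
10% over, as FileInputFormat allows. The function name estimate_splits is
made up for this example.

```python
MB = 1024 * 1024

# FileInputFormat lets the final split be up to 10% larger than the
# split size before it cuts another one.
SPLIT_SLOP = 1.1


def estimate_splits(file_size, split_size):
    """Estimate how many input splits (and so map tasks) one splittable
    file yields. Simplified: assumes split_size is the effective split
    size, which by default is the HDFS block size."""
    if file_size <= 0:
        return 0
    splits = 0
    remaining = file_size
    # Cut full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left over becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    block = 64 * MB
    for name, size in [("training", 30 * MB), ("test", 2 * MB)]:
        print(name, "file:", estimate_splits(size, block), "split(s)")
```

With a 64MB block size, both the 30MB training file and the 2MB test file
come out as a single split, so each job gets a single map task and only one
machine has work to do.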





-- 
Fernando Santos
+55 61 8129 8505

Re: Test naivebayes task running really slowly and not in distributed mode

Posted by Fernando Santos <fe...@gmail.com>.

The training and test sets are each in a single file (part-r-00000). The
training file is 30MB and the test file is 2MB.




-- 
Fernando Santos
+55 61 8129 8505

Re: Test naivebayes task running really slowly and not in distributed mode

Posted by Fernando Santos <fe...@gmail.com>.

Hello Ted,

No, the training also ran on one machine. What sometimes happens is that
the boxes take turns executing jobs one at a time, never together. For
example, if there are 3 jobs, the first runs on box 1, the next on box 2 and
the next on box 1 again.

The full dataset is a CSV of around 70MB. I turned it into a sequence file,
applied seq2sparse, then split it and trained. The training task was quite
fast, taking a few minutes to execute. But the test is really slow, as I
said, and it also runs on one machine.






-- 
Fernando Santos
+55 61 8129 8505