Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/01/18 13:04:55 UTC

Random thought: line separators

As I troll through the code at times trying to polish here and there I
notice small issues to bring up --

Line separators. Lots of code independently reads
System.getProperty("line.separator") in order to output a platform-specific
line break. I'd argue this is actually slightly bad, since it means Mahout's
input/output formats aren't fixed at all but can vary by platform. Output
written on Windows might not be read properly on Unix, for example.

It'd be simpler and more compatible to use '\n' always. Thoughts?

(And, recall we don't really support Windows so well anyway, which is
the odd man out in this regard.)
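The proposal is easy to sketch: emit a literal '\n' when writing records
rather than consulting the platform property. A minimal, hypothetical
illustration (the class and method names are mine, not Mahout code):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

public class FixedNewlineExample {

  /** Writes one record per line using a fixed '\n', so the output is
      byte-identical on Windows, Unix, and Mac. */
  public static String writeRecords(String[] records) throws IOException {
    StringWriter buffer = new StringWriter();
    BufferedWriter writer = new BufferedWriter(buffer);
    for (String record : records) {
      writer.write(record);
      writer.write('\n'); // fixed separator, not System.getProperty("line.separator")
    }
    writer.flush();
    return buffer.toString();
  }

  public static void main(String[] args) throws IOException {
    System.out.print(writeRecords(new String[] {"alpha", "beta"}));
  }
}
```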

Re: Random thought: line separators

Posted by Olivier Grisel <ol...@ensta.org>.
2010/1/18 Robin Anil <ro...@gmail.com>:
> Could you check the logs? You should see a bigger stack trace that might
> lead back to the Mahout classes.

In the tasktracker logs I could find a more complete stack trace
(Jetty-related, no sign of Mahout classes), and Google pointed me to this:

  https://issues.apache.org/jira/browse/MAPREDUCE-5

According to the comments, the mapper output exhausts the heap space of
the reducer. I'll try changing the settings of my reducers.
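Per the comments on MAPREDUCE-5, the usual workaround is to give the reduce
tasks more heap or to buffer less shuffle data in memory. A sketch of
Hadoop 0.20-era mapred-site.xml settings; the values here are illustrative
guesses, not recommendations from this thread:

```xml
<!-- Give each child task more heap. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<!-- Keep a smaller fraction of the reducer heap for in-memory shuffle data. -->
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <value>0.50</value>
</property>
```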

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: Random thought: line separators

Posted by Robin Anil <ro...@gmail.com>.
Could you check the logs? You should see a bigger stack trace that might
lead back to the Mahout classes.



On Mon, Jan 18, 2010 at 9:19 PM, Olivier Grisel <ol...@ensta.org> wrote:

> 2010/1/18 Olivier Grisel <ol...@ensta.org>:
> > 2010/1/18 Robin Anil <ro...@gmail.com>:
> >> Could you be specific about which map/reduce job you encountered the
> >> error in?
> >
> > I thought it was on:
> >
> > hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
> > org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
> > "wikipediadump/chunk-0001.xml" -o wikipediainput-eof-exception -c
> > examples/src/test/resources/country.txt
> >
> > I just ran it again... successfully... The next time I encounter that
> > error I will note the complete stack trace, however uninformative it
> > looks.
>
> I ran the same jobs again on all the chunks and could reproduce the error:
>
> $ hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
> "wikipediadump" -o wikipediainput-eof-exception -c
> examples/src/test/resources/country.txt
> [...]
> 10/01/18 16:20:46 INFO mapred.JobClient:  map 100% reduce 83%
> 10/01/18 16:21:42 INFO mapred.JobClient: Task Id :
> attempt_201001172109_0010_r_000000_2, Status : FAILED
> java.io.EOFException
>        at java.io.DataInputStream.readByte(DataInputStream.java:250)
>        at
> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>        at
> org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>        at org.apache.hadoop.io.Text.readString(Text.java:400)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>
> I have no idea where it could possibly stem from.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
>

Re: Random thought: line separators

Posted by Olivier Grisel <ol...@ensta.org>.
2010/1/18 Olivier Grisel <ol...@ensta.org>:
> 2010/1/18 Robin Anil <ro...@gmail.com>:
>> Could you be specific about which map/reduce job you encountered the error in?
>
> I thought it was on:
>
> hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
> "wikipediadump/chunk-0001.xml" -o wikipediainput-eof-exception -c
> examples/src/test/resources/country.txt
>
> I just ran it again... successfully... The next time I encounter that
> error I will note the complete stack trace, however uninformative it looks.

I ran the same jobs again on all the chunks and could reproduce the error:

$ hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
"wikipediadump" -o wikipediainput-eof-exception -c
examples/src/test/resources/country.txt
[...]
10/01/18 16:20:46 INFO mapred.JobClient:  map 100% reduce 83%
10/01/18 16:21:42 INFO mapred.JobClient: Task Id :
attempt_201001172109_0010_r_000000_2, Status : FAILED
java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

I have no idea where it could possibly stem from.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: Random thought: line separators

Posted by Olivier Grisel <ol...@ensta.org>.
2010/1/18 Robin Anil <ro...@gmail.com>:
> Could you be specific about which map/reduce job you encountered the error in?

I thought it was on:

hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i
"wikipediadump/chunk-0001.xml" -o wikipediainput-eof-exception -c
examples/src/test/resources/country.txt

I just ran it again... successfully... The next time I encounter that
error I will note the complete stack trace, however uninformative it looks.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: Random thought: line separators

Posted by Robin Anil <ro...@gmail.com>.
Could you be specific about which map/reduce job you encountered the error in?

On Mon, Jan 18, 2010 at 7:28 PM, Olivier Grisel <ol...@ensta.org> wrote:

> 2010/1/18 Robin Anil <ro...@gmail.com>:
> > It's this kind of thing that forced the move to sequence files instead of
> > the TextKeyValueInput format and other text/CSV-based formats. I'm kind of
> > regretting the decision to go with a tab-separated format for
> > BayesClassifier, which I wrote 2 years ago. I will modify it to use sparse
> > vectors or sequence files, whichever fits.
> >
> > My thought is that this kind of functionality should only be used by the
> > format converters that convert to and from sequence files, and when
> > storing to sequence files, just enforce the '\n' rule for line breaks.
>
> By the way, I tried to run the Bayesian classifier's feature
> extractor on the following Wikipedia chunk:
>
> s3://enwiki-pages-articles/enwiki-20090810-pages-articles/chunk-0001.xml
>
> And I got an EOFException in Hadoop-related classes (no Mahout classes
> in the stack trace). I wonder if this is related, or maybe it's related
> to the Java serialization used in that step.
>
> The feature extractor works on all the other chunks I tried, though. All
> those chunks were extracted on a Linux machine.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
>

Re: Random thought: line separators

Posted by Olivier Grisel <ol...@ensta.org>.
2010/1/18 Robin Anil <ro...@gmail.com>:
> It's this kind of thing that forced the move to sequence files instead of
> the TextKeyValueInput format and other text/CSV-based formats. I'm kind of
> regretting the decision to go with a tab-separated format for
> BayesClassifier, which I wrote 2 years ago. I will modify it to use sparse
> vectors or sequence files, whichever fits.
>
> My thought is that this kind of functionality should only be used by the
> format converters that convert to and from sequence files, and when
> storing to sequence files, just enforce the '\n' rule for line breaks.

By the way, I tried to run the Bayesian classifier's feature
extractor on the following Wikipedia chunk:

s3://enwiki-pages-articles/enwiki-20090810-pages-articles/chunk-0001.xml

And I got an EOFException in Hadoop-related classes (no Mahout classes
in the stack trace). I wonder if this is related, or maybe it's related
to the Java serialization used in that step.

The feature extractor works on all the other chunks I tried, though. All
those chunks were extracted on a Linux machine.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: Random thought: line separators

Posted by Robin Anil <ro...@gmail.com>.
It's this kind of thing that forced the move to sequence files instead of
the TextKeyValueInput format and other text/CSV-based formats. I'm kind of
regretting the decision to go with a tab-separated format for
BayesClassifier, which I wrote 2 years ago. I will modify it to use sparse
vectors or sequence files, whichever fits.

My thought is that this kind of functionality should only be used by the
format converters that convert to and from sequence files, and when
storing to sequence files, just enforce the '\n' rule for line breaks.
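The converter idea can be sketched in a few lines: normalize every
platform's line ending to '\n' before the text ever reaches a sequence
file. This is a hypothetical helper, not existing Mahout code:

```java
public class LineBreakNormalizer {

  /** Collapses Windows (\r\n) and old Mac (\r) line endings to a single
      '\n', so records stored in sequence files are platform-independent. */
  public static String normalize(String text) {
    return text.replace("\r\n", "\n").replace('\r', '\n');
  }

  public static void main(String[] args) {
    System.out.print(normalize("one\r\ntwo\rthree\n"));
  }
}
```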

Robin



On Mon, Jan 18, 2010 at 5:34 PM, Sean Owen <sr...@gmail.com> wrote:

> As I troll through the code at times trying to polish here and there I
> notice small issues to bring up --
>
> Line separators. Lots of code independently reads
> System.getProperty("line.separator") in order to output a platform-specific
> line break. I'd argue this is actually slightly bad, since it means Mahout's
> input/output formats aren't fixed at all but can vary by platform. Output
> written on Windows might not be read properly on Unix, for example.
>
> It'd be simpler and more compatible to use '\n' always. Thoughts?
>
> (And, recall we don't really support Windows so well anyway, which is
> the odd man out in this regard.)
>