Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/03/10 10:17:17 UTC

Coding gotcha in WikipediaToSequenceFile.java

In WikipediaToSequenceFile there is a coding mistake: new Job(conf)
copies the conf structure, so any changes made to conf afterwards are
not reflected in the job's configuration. If you ever uncomment the
second block of code, it will not take effect.

conf.set(stuff);
conf.set(more stuff);
Job job = new Job(conf);

......

    /*
     * conf.set("mapred.compress.map.output", "true");
     * conf.set("mapred.map.output.compression.type", "BLOCK");
     * conf.set("mapred.output.compress", "true");
     * conf.set("mapred.output.compression.type", "BLOCK");
     * conf.set("mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec");
     */
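The copy happens inside the Job constructor, so only settings applied before construction (or applied through job.getConfiguration() afterwards) reach the job. Here is a minimal, self-contained sketch of those copy semantics; the Conf and Job classes below are simplified stand-ins for illustration, not Hadoop's real classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins mimicking Hadoop's Configuration/Job copy behavior.
class Conf {
    private final Map<String, String> props = new HashMap<>();
    Conf() {}
    Conf(Conf other) { props.putAll(other.props); } // copy constructor
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
}

class Job {
    private final Conf conf;
    Job(Conf conf) { this.conf = new Conf(conf); } // Job keeps a *copy* of conf
    Conf getConfiguration() { return conf; }
}

public class ConfCopyGotcha {
    public static void main(String[] args) {
        Conf conf = new Conf();
        Job job = new Job(conf);
        // Too late: this edits the original conf, not the Job's copy.
        conf.set("mapred.compress.map.output", "true");
        System.out.println(job.getConfiguration().get("mapred.compress.map.output"));
        // Fix: mutate the Job's own copy (or set values before constructing the Job).
        job.getConfiguration().set("mapred.compress.map.output", "true");
        System.out.println(job.getConfiguration().get("mapred.compress.map.output"));
    }
}
```

With real Hadoop the fix is the same: call conf.set(...) before new Job(conf), or use job.getConfiguration().set(...) afterwards.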


-- 
Lance Norskog
goksron@gmail.com

Re: Coding gotcha in WikipediaToSequenceFile.java

Posted by Vasil Vasilev <va...@gmail.com>.
I had a similar issue with seq2sparse: the LLR parameter was not taken
into account. The place in the code was CollocDriver.

On 3/10/11, Robin Anil <ro...@gmail.com> wrote:
> The compression part shouldn't be in Mahout code. It should be supplied
> externally by the user based on their Hadoop setup. We can remove that.

Re: Coding gotcha in WikipediaToSequenceFile.java

Posted by Robin Anil <ro...@gmail.com>.
The compression part shouldn't be in Mahout code. It should be supplied
externally by the user based on their Hadoop setup. We can remove that.
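Supplying the settings externally might look like the following, assuming the job is launched through `hadoop jar` with Hadoop's GenericOptionsParser/ToolRunner handling the `-D` options; the jar name and remaining arguments here are illustrative placeholders:

```shell
# Pass compression settings via Hadoop generic options instead of hard-coding them.
# -D options must precede the job's own arguments.
hadoop jar mahout-examples-job.jar \
  org.apache.mahout.text.wikipedia.WikipediaToSequenceFile \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compress=true \
  -Dmapred.output.compression.type=BLOCK \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  <input> <output>
```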

