Posted to user@mahout.apache.org by wine lover <wi...@gmail.com> on 2011/06/27 22:36:15 UTC

parameter setting for using Seqdirectory and SequenceFile

Hello Everyone,

When using seqdirectory to convert a directory of documents to SequenceFile
format, it asks for a chunk size parameter:
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64>

In the build-reuters.sh example, the chunk size is set to 5, but I do not
know why. Is this parameter input-dependent or system-dependent? Is there any
rule for setting it?

When using seq2sparse to create vectors from a SequenceFile, I notice that
build-reuters.sh uses it as follows:
$MAHOUT seq2sparse \
    -i mahout-work/reuters-out-seqdir/ \
    -o mahout-work/reuters-out-seqdir-sparse-lda \
    -wt tf -seq -nr 3 \

What does "-nr 3" stand for?

Thanks,

Wenyia

Re: parameter setting for using Seqdirectory and SequenceFile

Posted by Drew Farris <dr...@apache.org>.
Hi Wenyia,

The chunk size property will cause seqdirectory to output smaller
sequence files. Using multiple small files as input will allow a
greater number of map tasks to be run in parallel because each file
will be assigned to its own map task.

In the case of the Reuters example, forcing the chunk size to 5 MB will
cause 3 separate files to be generated instead of a single sequence
file.

The FileSystem block size of 64 MB is treated as an upper bound for input
splits, so unless input smaller than 64 MB is chunked into smaller files,
only a single mapper will be run.
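
The arithmetic behind this can be sketched in a few lines of shell. This is
only a rough model, not Mahout or Hadoop code: the function names are
hypothetical, sizes are whole megabytes, and the 12 MB input size used in the
note below is an assumed figure chosen to be consistent with the 3 files
mentioned above. Real chunk boundaries depend on where documents end, so
actual counts may differ.

```shell
#!/bin/sh
# Rough model only (hypothetical helper names, not part of Mahout):
# estimate how -chunk affects the number of files and map tasks.

# estimate_chunk_files TOTAL_MB CHUNK_MB
# Ceiling of TOTAL_MB / CHUNK_MB: the approximate number of sequence files.
estimate_chunk_files() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# estimate_map_tasks TOTAL_MB CHUNK_MB BLOCK_MB
# Assumes every file gets at least one map task, and that a file larger
# than the block size is split into ceil(file size / block size) tasks.
estimate_map_tasks() {
    total=$1; chunk=$2; block=$3
    full=$(( total / chunk ))   # number of files filled up to the chunk size
    rest=$(( total % chunk ))   # size of the final, partial file (0 if none)
    tasks=$(( full * ((chunk + block - 1) / block) ))
    if [ "$rest" -gt 0 ]; then
        tasks=$(( tasks + (rest + block - 1) / block ))
    fi
    echo "$tasks"
}
```

Under these assumptions, `estimate_map_tasks 12 5 64` prints 3, while
`estimate_map_tasks 12 64 64` prints 1, which is the single-mapper case
described above.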

Drew

Re: parameter setting for using Seqdirectory and SequenceFile

Posted by Konstantin Shmakov <ks...@gmail.com>.
Try
mahout seq2sparse --help



-- 
ksh: