Posted to user@mahout.apache.org by Mark <st...@gmail.com> on 2011/06/06 18:04:18 UTC

SequenceFilesFromDirectory

I've been running through the examples described in the Mahout in 
Action book and I have some questions regarding the 
SequenceFilesFromDirectory.java class.

This class expects a directory of files containing one document per 
file. Is there another Mahout class, or some option I can pass to 
SequenceFilesFromDirectory.java, to parse multiple documents per file? 
For example, my files contain one document per line. I would like to 
parse each line of each file and create a sequence file from the 
result. Is this possible with SequenceFilesFromDirectory, or would I 
have to write this myself?

Thanks

Re: SequenceFilesFromDirectory

Posted by Mark <st...@gmail.com>.
Actually, I think I found something that would work: 
SequenceFilesFromCsvFilter.java

I am trying to use it as follows:

  bin/mahout seqdirectory -i input -o output -filter org.apache.mahout.text.SequenceFilesFromCsvFilter -ow

But I am receiving the following exception:

Caused by: java.lang.NumberFormatException: null
     at java.lang.Integer.parseInt(Integer.java:417)
     at java.lang.Integer.parseInt(Integer.java:499)
     at org.apache.mahout.text.SequenceFilesFromCsvFilter.<init>(SequenceFilesFromCsvFilter.java:56)

I believe this is because this class requires a keyColumn and a 
valueColumn option. Is there any way for me to pass these options along?

When I try adding them to the above seqdirectory command I receive:

   Unexpected -kcol while processing Job-Specific Options:


Any ideas?

Thanks

On 6/6/11 10:30 AM, Mark wrote:
> Thanks

Re: SequenceFilesFromDirectory

Posted by Mark <st...@gmail.com>.
Thanks

On 6/6/11 10:28 AM, Robin Anil wrote:
> Mark, you need to write your own tool to convert data into sequence files.
> It's pretty easy: instantiate SequenceFile.Writer with both key and value as
> Text and write your data to the file.
>
> If your data is very large, you might want to consider writing a map-only
> MapReduce job that reads your input and writes <Text,Text> output using
> SequenceFileOutputFormat.
>
> Robin

Re: SequenceFilesFromDirectory

Posted by Robin Anil <ro...@gmail.com>.
Mark, you need to write your own tool to convert data into sequence files.
It's pretty easy: instantiate SequenceFile.Writer with both key and value as
Text and write your data to the file.

If your data is very large, you might want to consider writing a map-only
MapReduce job that reads your input and writes <Text,Text> output using
SequenceFileOutputFormat.

Robin
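
For the one-document-per-line input asked about in this thread, the
record-building half of such a tool might look like the sketch below. It
uses only the JDK so it can be run without Hadoop. The class name
LineDocuments, the method toRecords, and the "/<file name>/<line index>"
key scheme are all hypothetical choices made for illustration, not part of
Mahout. The comment in main marks where a SequenceFile.Writer with Text
keys and values would take over, assuming Hadoop is on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mahout: turns "one document per line"
// input into (key, value) records, one record per non-blank line.
class LineDocuments {

  // The key scheme "/<file name>/<line index>" is an assumption made for
  // this sketch; any key that is unique per document would do.
  static List<String[]> toRecords(String fileName, List<String> lines) {
    List<String[]> records = new ArrayList<>();
    for (int i = 0; i < lines.size(); i++) {
      String doc = lines.get(i).trim();
      if (doc.isEmpty()) {
        continue; // blank lines hold no document
      }
      records.add(new String[] {"/" + fileName + "/" + i, doc});
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.err.println("usage: LineDocuments <file>...");
      return;
    }
    for (String arg : args) {
      Path file = Paths.get(arg);
      List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
      for (String[] rec : toRecords(file.getFileName().toString(), lines)) {
        // With Hadoop available, a SequenceFile.Writer created for Text keys
        // and Text values would be used here instead of printing, e.g.:
        //   writer.append(new Text(rec[0]), new Text(rec[1]));
        System.out.println(rec[0] + "\t" + rec[1]);
      }
    }
  }
}
```

Keeping the line index in the key means each clustered document can be
traced back to its source file and line. For very large inputs the same
per-line logic would move into the mapper of the map-only job mentioned
above.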

On Mon, Jun 6, 2011 at 10:53 PM, Mark <st...@gmail.com> wrote:

> I am looking to run clustering algorithms on these documents, which I
> thought (I could be wrong) requires sequence files. Is this not the case?
>
> Thanks

Re: SequenceFilesFromDirectory

Posted by Mark <st...@gmail.com>.
I am looking to run clustering algorithms on these documents, which I 
thought (I could be wrong) requires sequence files. Is this not the 
case?

Thanks

On 6/6/11 10:11 AM, Daniel McEnnis wrote:
> Mark,
>
> Generally speaking, Mahout has pretty good performance over log files
> like the ones you're describing, so they typically don't get changed
> into sequence files.  You'll need to write a converter yourself if you
> really need sequence files (such as for key management).
>
> Daniel.

Re: SequenceFilesFromDirectory

Posted by Daniel McEnnis <dm...@gmail.com>.
Mark,

Generally speaking, Mahout has pretty good performance over log files
like the ones you're describing, so they typically don't get changed
into sequence files.  You'll need to write a converter yourself if you
really need sequence files (such as for key management).

Daniel.

On Mon, Jun 6, 2011 at 12:04 PM, Mark <st...@gmail.com> wrote:
> I've been running through the examples described in the Mahout in Action
> book and I have some questions regarding the SequenceFilesFromDirectory.java
> class.
>
> This class expects a directory of files containing one document per file.
> Is there another Mahout class, or some option I can pass to
> SequenceFilesFromDirectory.java, to parse multiple documents per file? For
> example, my files contain one document per line. I would like to parse each
> line of each file and create a sequence file from the result. Is this
> possible with SequenceFilesFromDirectory, or would I have to write this myself?
>
> Thanks
>