Posted to user@mahout.apache.org by Mark <st...@gmail.com> on 2011/06/06 18:04:18 UTC
SequenceFilesFromDirectory
I've been running through the examples as described in the Mahout In
Action book and I have some questions regarding the
SequenceFilesFromDirectory.java class.
This class expects a directory of files that each contain one document.
Is there another Mahout class, or some options I can supply to
SequenceFilesFromDirectory.java, to parse multiple documents per file?
For example, my files contain one document per line. I would like to parse
each line of each file and create a sequence file from the result. Is this
possible with SequenceFilesFromDirectory, or would I have to write this
myself?
Thanks
Re: SequenceFilesFromDirectory
Posted by Mark <st...@gmail.com>.
Actually I think I found something that would work:
SequenceFilesFromCsvFilter.java
I am trying to use it as follows:
bin/mahout seqdirectory -i input -o output -filter org.apache.mahout.text.SequenceFilesFromCsvFilter -ow
But I am receiving the following exception:
Caused by: java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:417)
    at java.lang.Integer.parseInt(Integer.java:499)
    at org.apache.mahout.text.SequenceFilesFromCsvFilter.<init>(SequenceFilesFromCsvFilter.java:56)
I believe this is because this class requires keyColumn and valueColumn
options. Is there any way for me to pass these options along?
When I try adding them to the seqdirectory command above, I receive:
Unexpected -kcol while processing Job-Specific Options:
Any ideas?
Thanks
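For what it's worth, Integer.parseInt throws NumberFormatException when it is
handed null, which fits the guess that the column options were never set. A
minimal sketch, assuming the options arrive as a map of strings (the class
MissingOptionDemo and the option name keyColumn are only illustrative here,
not Mahout's actual code):

```java
import java.util.HashMap;
import java.util.Map;

class MissingOptionDemo {

    // Looks up an option and parses it as an int, reporting what happened.
    // options.get(name) returns null when the option was never supplied.
    static String parseOption(Map<String, String> options, String name) {
        try {
            return "parsed " + Integer.parseInt(options.get(name));
        } catch (NumberFormatException e) {
            return "NumberFormatException";
        }
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        System.out.println(parseOption(options, "keyColumn")); // option missing
        options.put("keyColumn", "0");
        System.out.println(parseOption(options, "keyColumn")); // option present
    }
}
```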
On 6/6/11 10:30 AM, Mark wrote:
> Thanks
Re: SequenceFilesFromDirectory
Posted by Mark <st...@gmail.com>.
Thanks
On 6/6/11 10:28 AM, Robin Anil wrote:
> Mark, you need to write your own tool to convert data into sequence files.
> It's pretty easy: instantiate SequenceFile.Writer with both key and value as
> Text and write your data to the file.
>
> If your data is very large, you might want to consider writing a map-only
> MapReduce job which can read your input and write Output<Text,Text> in
> SequenceFileOutputFormat.
>
> Robin
Re: SequenceFilesFromDirectory
Posted by Robin Anil <ro...@gmail.com>.
Mark, you need to write your own tool to convert data into sequence files.
It's pretty easy: instantiate SequenceFile.Writer with both key and value as
Text and write your data to the file.
If your data is very large, you might want to consider writing a map-only
MapReduce job which can read your input and write Output<Text,Text> in
SequenceFileOutputFormat.
Robin
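A minimal sketch of the per-line split such a tool would need, written with
the Java standard library only so it runs without Hadoop on the classpath.
The class name LineDocuments and the "<fileName>:<lineNumber>" key scheme are
assumptions made for illustration; with Hadoop available, each record would
instead be handed to the writer as
writer.append(new Text(key), new Text(value)).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LineDocuments {

    // Turns the lines of one input file into (key, value) records.
    // Keys use a hypothetical "<fileName>:<lineNumber>" scheme with 1-based
    // line numbers; blank lines are skipped but still counted.
    static Map<String, String> toRecords(String fileName, List<String> lines) {
        Map<String, String> records = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String doc = lines.get(i).trim();
            if (doc.isEmpty()) {
                continue;
            }
            records.put(fileName + ":" + (i + 1), doc);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("first document", "", "second document");
        // Print each record as key<TAB>value; a real tool would append it
        // to a sequence file here instead.
        toRecords("part-0.txt", lines)
                .forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```

Keeping the split separate from the writing step means the same method could
be reused inside a map-only job if the data grows too large for one process.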
On Mon, Jun 6, 2011 at 10:53 PM, Mark <st...@gmail.com> wrote:
> I am looking to run clustering algorithms on these documents, which I
> thought (I could be wrong) requires sequence files. Is this not the case?
>
> Thanks
Re: SequenceFilesFromDirectory
Posted by Mark <st...@gmail.com>.
I am looking to run clustering algorithms on these documents, which I
thought (I could be wrong) requires sequence files. Is this not the
case?
Thanks
On 6/6/11 10:11 AM, Daniel McEnnis wrote:
> Mark,
>
> Generally speaking, Mahout has pretty good performance over log files
> like the ones you're describing, so they typically don't get changed
> into sequence files. You'll need to write one for yourself if you
> really need sequence files (such as for key management).
>
> Daniel.
Re: SequenceFilesFromDirectory
Posted by Daniel McEnnis <dm...@gmail.com>.
Mark,
Generally speaking, Mahout has pretty good performance over log files
like the ones you're describing, so they typically don't get changed
into sequence files. You'll need to write one for yourself if you
really need sequence files (such as for key management).
Daniel.
On Mon, Jun 6, 2011 at 12:04 PM, Mark <st...@gmail.com> wrote:
> I've been running through the examples as described in the Mahout In Action
> book and I have some questions regarding the SequenceFilesFromDirectory.java
> class.
>
> This class expects a directory of files that each contain one document.
> Is there another Mahout class, or some options I can supply to
> SequenceFilesFromDirectory.java, to parse multiple documents per file? For
> example, my files contain one document per line. I would like to parse each
> line of each file and create a sequence file from the result. Is this possible
> with SequenceFilesFromDirectory, or would I have to write this myself?
>
> Thanks
>