Posted to user@mahout.apache.org by rahul raghavendhra <ra...@gmail.com> on 2011/12/28 13:00:55 UTC

Mahout Seqfile format

I am new to Mahout. I just want to know how a text file is converted into
a seqfile and then into sparse vectors.
Can any kind of text file be converted into a seq file using ./mahout
seqdirectory?

thanks in advance..

./rahul

Re: Mahout Seqfile format

Posted by Lance Norskog <go...@gmail.com>.
From sequence file to sparse vector file is the fun part. There are
(roughly) two phases:
1) parse the text file and decide what is a document
2) analyze the document with the Lucene text search API and create
vectors from the output.

#1 you can figure out from example code, like the Wikipedia, Reuters
and Newsgroups code.
#2 takes some technical background, but you can use it as a black box.
It is explained in Chapter 14 of Mahout in Action.
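
As a rough illustration of phase 2, here is a minimal Python sketch that turns
documents into sparse term-frequency vectors. It is not Mahout's code: the
regex tokenizer is only a stand-in for a Lucene analyzer, and the names
tokenize and build_sparse_vectors are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Stand-in for a Lucene analyzer (assumption): lowercase the text and
    # keep runs of letters and digits as terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_sparse_vectors(documents):
    """documents maps doc id -> text. Returns (dictionary, vectors)."""
    dictionary = {}  # term -> integer index, assigned in first-seen order
    vectors = {}     # doc id -> {term index: term frequency}
    for doc_id, text in documents.items():
        vector = {}
        for term, freq in Counter(tokenize(text)).items():
            index = dictionary.setdefault(term, len(dictionary))
            vector[index] = freq
        vectors[doc_id] = vector
    return dictionary, vectors


docs = {
    "doc1": "Mahout reads sequence files",
    "doc2": "Sequence files hold key value pairs",
}
dictionary, vectors = build_sparse_vectors(docs)
```

Only the terms that occur in a document get an entry in its vector, which is
what makes the representation sparse.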

On Wed, Dec 28, 2011 at 8:57 AM, Josh Patterson <jo...@cloudera.com> wrote:
> Rahul,
> Currently the text file to sequence file functionality is contained in
> (as of Mahout 0.6 / trunk):
>
> org.apache.mahout.text.SequenceFilesFromDirectory
>
> and it writes a K/V pair to a standard sequence file in the form of:
>
> { filepath (Text), contents of file (Text) }
>
> In the current single process form of the code it uses a custom
> PathFilter (SequenceFilesFromDirectoryFilter) to recursively walk down
> a directory and its child directories to write the contained files
> into a series of sequence files based on a variety of options like
> "chunk size".
>
> An example of running this would be:
>
> bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
>
> Josh
>
> On Wed, Dec 28, 2011 at 7:00 AM, rahul raghavendhra
> <ra...@gmail.com> wrote:
>> I am new to Mahout.. i just want to know how text file is converted into
>> seqfile and then to sparse vectors..
>> any kind of text file can  be converted into seq file using ./mahout
>> seqdirectory ?
>>
>> thanks in advance..
>>
>> ./rahul
>
>
>
> --
> Twitter: @jpatanooga
> Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com



-- 
Lance Norskog
goksron@gmail.com

Re: CSV to Mahout Seqfile

Posted by Pat Ferrel <pa...@gmail.com>.
I think you need to do something with your strings. Usually this means converting them into terms and giving each term a separate id, making each term a numeric feature. Remember that all IDs must be usable by Mahout, which typically means replacing all of your ids with sequential ints running from 0 up to the number of features or rows. So your “id” column must be converted into values from 0 up to the number of “ids”. I do this with a bi-directional dictionary so you can convert them back into your application ids once they are processed.

How many classifiers are you creating, and to what purpose? There may be other ways to do what you need. This sounds like a job for a search engine, since it can digest strings and CSVs, but not if you really need a classifier rather than similarity.
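
The bi-directional dictionary mentioned above could look roughly like the
following Python sketch. It is only an illustration under assumptions: the
class name BiDictionary, its methods, and the sample ids are hypothetical and
not part of Mahout.

```python
class BiDictionary:
    """Maps application ids to sequential ints (starting at 0) and back."""

    def __init__(self):
        self._to_index = {}  # application id -> sequential int
        self._to_id = []     # sequential int -> application id

    def index_of(self, external_id):
        # Assign the next free index the first time an id is seen.
        if external_id not in self._to_index:
            self._to_index[external_id] = len(self._to_id)
            self._to_id.append(external_id)
        return self._to_index[external_id]

    def id_of(self, index):
        # Reverse lookup, used after processing to recover application ids.
        return self._to_id[index]

    def __len__(self):
        return len(self._to_id)


ids = BiDictionary()
rows = ["review-9001", "review-17", "review-9001"]  # made-up application ids
indexes = [ids.index_of(row) for row in rows]
```

A repeated id maps to the same index, and id_of reverses the mapping once the
results come back.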
   
On Aug 8, 2014, at 8:22 PM, Suneel Marthi <su...@gmail.com> wrote:

See
http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program




On Fri, Aug 8, 2014 at 11:05 PM, Aniket <sa...@gmail.com> wrote:

> Hi,
> 
> I am working on project & want to run a dataset on mahout for naive bayes
> classifier.
> dataset has csv format with columns ( id , rating ,summary, review, label).
> 
> id : numeric
> rating : numeric ( 1 to 5)
> summary : 4-5 texts strings
> review : more texts and strings
> label : positive or negative.
> 
> I am not able to figure out how to do csv to seq. files because csv has
> texts
> as well as numeric attributes. Can you please help with this ?
> 
> Thanks.
> Aniket
> 
> 


Re: CSV to Mahout Seqfile

Posted by Suneel Marthi <su...@gmail.com>.
See
http://stackoverflow.com/questions/13663567/mahout-csv-to-vector-and-running-the-program




On Fri, Aug 8, 2014 at 11:05 PM, Aniket <sa...@gmail.com> wrote:

> Hi,
>
> I am working on project & want to run a dataset on mahout for naive bayes
> classifier.
> dataset has csv format with columns ( id , rating ,summary, review, label).
>
> id : numeric
> rating : numeric ( 1 to 5)
> summary : 4-5 texts strings
> review : more texts and strings
> label : positive or negative.
>
> I am not able to figure out how to do csv to seq. files because csv has
> texts
> as well as numeric attributes. Can you please help with this ?
>
> Thanks.
> Aniket
>
>

CSV to Mahout Seqfile

Posted by Aniket <sa...@gmail.com>.
Hi,

I am working on a project and want to run a dataset through Mahout's naive Bayes
classifier.
The dataset is in CSV format with columns (id, rating, summary, review, label).

id : numeric
rating : numeric (1 to 5)
summary : 4-5 text strings
review : more text and strings
label : positive or negative

I am not able to figure out how to convert the CSV to sequence files, because the CSV has text
as well as numeric attributes. Can you please help with this?

Thanks.
Aniket


Re: Mahout Seqfile format

Posted by Lance Norskog <go...@gmail.com>.
When you open a SequenceFile, there are API calls getKeyClass and
getValueClass which will give you the Writable classes.

On Thu, Dec 29, 2011 at 12:42 PM, Sean Owen <sr...@gmail.com> wrote:
> SequenceFile isn't quite one format -- it's a container format for
> key-value pairs, where keys and values can be of any type. Yes it's
> the same SequenceFile in Hadoop and Mahout, though one file written
> using SequenceFile may hold completely different data types.
>
> I suppose it's like saying that XHTML and ebXML are both XML
> documents, but they are not the same format. You can deal with
> both as XML files; you can't render ebXML as a web page though.
>
> On Thu, Dec 29, 2011 at 9:32 AM, Josh Patterson <jo...@cloudera.com> wrote:
>> They both map to the same class in hadoop:
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
>>
>> JP
>>
>> On Thu, Dec 29, 2011 at 3:32 AM, rahul raghavendhra
>> <ra...@gmail.com> wrote:
>>> Thanks jose,
>>>  thank you for your reply, i have one more silly doubt, mahout sequence
>>> file format and hadoop sequence file format are same or different ?
>>> please reply
>>> ./rahul
>>>



-- 
Lance Norskog
goksron@gmail.com

Re: Mahout Seqfile format

Posted by Sean Owen <sr...@gmail.com>.
SequenceFile isn't quite one format -- it's a container format for
key-value pairs, where keys and values can be of any type. Yes, it's
the same SequenceFile in Hadoop and Mahout, though two files written
using SequenceFile may hold completely different data types.

I suppose it's like saying that XHTML and ebXML are both XML
documents, but they are not the same format. You can deal with
both as XML files; you can't render ebXML as a web page though.
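
To make the container idea concrete, here is a toy Python sketch. It is not
the real SequenceFile layout: the JSON-lines encoding, the header fields
key_class and value_class, and the function names are all assumptions made
for illustration.

```python
import json


def write_container(key_type, value_type, records):
    # The header records which key and value types this particular file holds.
    header = {"key_class": key_type.__name__, "value_class": value_type.__name__}
    lines = [json.dumps(header)]
    for key, value in records:
        if not isinstance(key, key_type) or not isinstance(value, value_type):
            raise TypeError("record does not match the declared key/value types")
        lines.append(json.dumps([key, value]))
    return "\n".join(lines)


def read_header(blob):
    # A reader inspects the header first to learn how to interpret the records.
    return json.loads(blob.split("\n", 1)[0])


# Same container, different payload types.
text_file = write_container(str, str, [("reuters/doc1.txt", "some text")])
count_file = write_container(str, int, [("mahout", 3)])
```

Both blobs can be opened by the same reader, but code that expects text
values cannot sensibly consume the file of counts.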

On Thu, Dec 29, 2011 at 9:32 AM, Josh Patterson <jo...@cloudera.com> wrote:
> They both map to the same class in hadoop:
>
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
>
> JP
>
> On Thu, Dec 29, 2011 at 3:32 AM, rahul raghavendhra
> <ra...@gmail.com> wrote:
>> Thanks jose,
>>  thank you for your reply, i have one more silly doubt, mahout sequence
>> file format and hadoop sequence file format are same or different ?
>> please reply
>> ./rahul
>>

Re: Mahout Seqfile format

Posted by Josh Patterson <jo...@cloudera.com>.
They both map to the same class in hadoop:

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html

JP

On Thu, Dec 29, 2011 at 3:32 AM, rahul raghavendhra
<ra...@gmail.com> wrote:
> Thanks jose,
>  thank you for your reply, i have one more silly doubt, mahout sequence
> file format and hadoop sequence file format are same or different ?
> please reply
> ./rahul
>
>>On Wed, Dec 28, 2011 at 10:27 PM, Josh Patterson <jo...@cloudera.com> wrote:
>
>> >Rahul,
>> >Currently the text file to sequence file functionality is contained in
>> >(as of Mahout 0.6 / trunk):
>> >org.apache.mahout.text.SequenceFilesFromDirectory
>> >
>
>>and it writes a K/V pair to a standard sequence file in the form of:
>>
>> >{ filepath (Text), contents of file (Text) }
>>
>> >In the current single process form of the code it uses a custom
>> >PathFilter (SequenceFilesFromDirectoryFilter) to recursively walk down
>> >a directory and its child directories to write the contained files
>> >into a series of sequence files based on a variety of options like
>> >"chunk size".
>>
>> >An example of running this would be:
>>
>> >bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
>>
>> >Josh
>>
>> On Wed, Dec 28, 2011 at 7:00 AM, rahul raghavendhra
>> <ra...@gmail.com> wrote:
>> > I am new to Mahout.. i just want to know how text file is converted into
>> > seqfile and then to sparse vectors..
>> > any kind of text file can  be converted into seq file using ./mahout
>> > seqdirectory ?
>> >
>> > thanks in advance..
>> >
>> > ./rahul
>>
>>
>>
>> --
>> Twitter: @jpatanooga
>> Solution Architect @ Cloudera
>> hadoop: http://www.cloudera.com
>>



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com

Re: Mahout Seqfile format

Posted by rahul raghavendhra <ra...@gmail.com>.
Thanks jose,
Thank you for your reply. I have one more silly doubt: are the Mahout sequence
file format and the Hadoop sequence file format the same or different?
Please reply.
./rahul

>On Wed, Dec 28, 2011 at 10:27 PM, Josh Patterson <jo...@cloudera.com> wrote:

> >Rahul,
> >Currently the text file to sequence file functionality is contained in
> >(as of Mahout 0.6 / trunk):
> >org.apache.mahout.text.SequenceFilesFromDirectory
> >

>and it writes a K/V pair to a standard sequence file in the form of:
>
> >{ filepath (Text), contents of file (Text) }
>
> >In the current single process form of the code it uses a custom
> >PathFilter (SequenceFilesFromDirectoryFilter) to recursively walk down
> >a directory and its child directories to write the contained files
> >into a series of sequence files based on a variety of options like
> >"chunk size".
>
> >An example of running this would be:
>
> >bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
>
> >Josh
>
> On Wed, Dec 28, 2011 at 7:00 AM, rahul raghavendhra
> <ra...@gmail.com> wrote:
> > I am new to Mahout.. i just want to know how text file is converted into
> > seqfile and then to sparse vectors..
> > any kind of text file can  be converted into seq file using ./mahout
> > seqdirectory ?
> >
> > thanks in advance..
> >
> > ./rahul
>
>
>
> --
> Twitter: @jpatanooga
> Solution Architect @ Cloudera
> hadoop: http://www.cloudera.com
>

Re: Mahout Seqfile format

Posted by Josh Patterson <jo...@cloudera.com>.
Rahul,
Currently the text file to sequence file functionality is contained in
(as of Mahout 0.6 / trunk):

org.apache.mahout.text.SequenceFilesFromDirectory

and it writes a K/V pair to a standard sequence file in the form of:

{ filepath (Text), contents of file (Text) }

In the current single process form of the code it uses a custom
PathFilter (SequenceFilesFromDirectoryFilter) to recursively walk down
a directory and its child directories to write the contained files
into a series of sequence files based on a variety of options like
"chunk size".

An example of running this would be:

bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles

Josh
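
For readers who want to see the idea outside Mahout, the directory walk
described above can be sketched in Python. This is an illustration only, not
the SequenceFilesFromDirectory implementation: the function names, the use of
relative paths as keys, and the byte-based chunk limit are assumptions.

```python
import os
import tempfile


def read_documents(root, charset="utf-8"):
    # Recursively walk root and yield one (file path, file contents) pair
    # per file, mirroring the { filepath, contents } key/value layout.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding=charset) as handle:
                yield os.path.relpath(path, root), handle.read()


def chunk_pairs(pairs, max_chunk_bytes):
    # Group pairs into chunks, starting a new chunk when the size limit
    # would be exceeded (a rough analogue of the "chunk size" option).
    chunk, size = [], 0
    for key, value in pairs:
        record_size = len(key.encode("utf-8")) + len(value.encode("utf-8"))
        if chunk and size + record_size > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append((key, value))
        size += record_size
    if chunk:
        yield chunk


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "a.txt"), "w", encoding="utf-8") as f:
        f.write("alpha")
    with open(os.path.join(root, "sub", "b.txt"), "w", encoding="utf-8") as f:
        f.write("beta")
    pairs = list(read_documents(root))
    chunks = list(chunk_pairs(pairs, max_chunk_bytes=8))
```

Each chunk here stands in for one output sequence file; a record larger than
the limit still gets a chunk of its own.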

On Wed, Dec 28, 2011 at 7:00 AM, rahul raghavendhra
<ra...@gmail.com> wrote:
> I am new to Mahout.. i just want to know how text file is converted into
> seqfile and then to sparse vectors..
> any kind of text file can  be converted into seq file using ./mahout
> seqdirectory ?
>
> thanks in advance..
>
> ./rahul



-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com