Posted to dev@mahout.apache.org by "XiaoboGu (JIRA)" <ji...@apache.org> on 2011/08/07 11:52:28 UTC

[jira] [Created] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

universal map-reduce job to convert csv file to vectorwritable sequencefile
---------------------------------------------------------------------------

                 Key: MAHOUT-781
                 URL: https://issues.apache.org/jira/browse/MAHOUT-781
             Project: Mahout
          Issue Type: Improvement
          Components: Classification
    Affects Versions: 0.6
            Reporter: XiaoboGu
            Priority: Minor




--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

RE: [jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by Jeff Eastman <je...@Narus.com>.

We already have something pretty close in org.apache.mahout.clustering.conversion.InputDriver. It could be generalized to handle arbitrary delimiters (currently uses space).
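As a rough illustration of that generalization (a sketch only, not the actual InputDriver code), a line parser could take the delimiter as a parameter instead of assuming a space. The class name DelimitedLineParser is hypothetical and the example uses only the Java standard library:

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Mahout. */
final class DelimitedLineParser {
  private final Pattern splitter;

  DelimitedLineParser(String delimiter) {
    // Quote the delimiter so characters such as '|' or '.' are taken literally.
    this.splitter = Pattern.compile(Pattern.quote(delimiter));
  }

  /** Splits one line on the configured delimiter and parses each field as a double. */
  double[] parse(String line) {
    String[] fields = splitter.split(line.trim());
    double[] values = new double[fields.length];
    for (int i = 0; i < fields.length; i++) {
      values[i] = Double.parseDouble(fields[i].trim());
    }
    return values;
  }

  public static void main(String[] args) {
    double[] fromSpaces = new DelimitedLineParser(" ").parse("1.0 2.5 3");
    double[] fromCommas = new DelimitedLineParser(",").parse("1.0, 2.5, 3");
    System.out.println(fromSpaces.length + " values, second comma-separated value = " + fromCommas[1]);
  }
}
```

Empty fields (for example, two consecutive delimiters) make Double.parseDouble throw a NumberFormatException in this sketch; a real driver would need to decide whether to skip or reject such lines.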

-----Original Message-----
From: Sean Owen (JIRA) [mailto:jira@apache.org] 
Sent: Sunday, August 07, 2011 11:10 AM
To: dev@mahout.apache.org
Subject: [jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080613#comment-13080613 ] 

Sean Owen commented on MAHOUT-781:
----------------------------------

Xiaobo, the process is that you first explain what you are trying to accomplish, then propose a patch, then it is reviewed, and then it is committed. I think you have skipped the first step.

Where does this become useful to Mahout? I can sort of imagine it as a utility class, but not in core.
What is test-data.zip?
Is there a need to preserve line numbers in the output? It seems like no reducer is needed.

> universal map-reduce job to convert csv file to vectorwritable sequencefile
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-781
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-781
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: XiaoboGu
>            Priority: Minor
>         Attachments: csv2seq.patch, csv2seq.patch, test-data.zip
>
>


[jira] [Updated] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

XiaoboGu updated MAHOUT-781:
----------------------------

    Attachment: test-data.zip

You can test with this command:
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/mahout-examples-0.6-SNAPSHOT-job.jar org.apache.mahout.text.Csv2Seq --input input1.csv --output out.seq  --target class --header ./header.csv --predictors duration protocal_type --types numeric word --categories 2 --features 100


[jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081617#comment-13081617 ] 

XiaoboGu commented on MAHOUT-781:
---------------------------------

Here are the purposes of Csv2Seq:
1. Almost all raw data is in CSV format, which many Mahout algorithms can't consume directly. A command-line tool saves users from writing, compiling, and debugging Java code every time they deal with a new CSV file schema.
2. It takes advantage of Map-Reduce to do the data transformation quickly; this transformation is a bottleneck for many algorithms, such as AdaptiveLogisticRegression.

For your spec-related questions, here are the answers:
1. This is just a utility tool. The core functionality for dealing with the CSV schema comes from CsvRecordFactory, which is in Mahout-core. It is CsvRecordFactory's responsibility to deal with many of the problems you asked about, and I found that CsvRecordFactory does a great job.
2. In many practical use cases, the CSV header, and even the predictor columns and their types, are too long to write on the command line. Storing them in a separate local file and letting the tool read them is more convenient.
3. Here is the command-line parameter specification:
input : the HDFS path of the CSV files to convert
output : the target HDFS directory in which to write the sequence file
target : the name of the target/label variable/column
header : the path of a local file containing the CSV header content
key : the name of the key variable/column
predictors : a list of predictor variables
types : a list of predictor variable types (numeric, word, or text)
categories : the number of target categories to be considered
features : the number of internal hashed features to use

target, key, predictors, types, and categories are used to construct a CsvRecordFactory object inside each Csv2SeqMapper object. Before processing records, the mapper calls CsvRecordFactory's firstLine method with the CSV header content as the parameter; every record is then processed by CsvRecordFactory's processLine method. A RandomAccessSparseVector(numFeatures) object is created for each record, and the resulting Vector is written to the sequence file with new Text(key) as the SequenceFile key. If the user does not specify key, target is used as the key. This is the exact input SequenceFile format required by Naïve Bayes.
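The per-record flow described above can be sketched in plain Java. This is only an illustration under stated assumptions: it uses no Mahout or Hadoop classes, the class Csv2SeqFlowSketch and its hashing scheme are hypothetical stand-ins for CsvRecordFactory and RandomAccessSparseVector, and a Map<Integer, Double> plays the role of the sparse vector. The column names are taken from the test command in this thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java stand-in for the per-record flow; it does not use Mahout or Hadoop classes. */
final class Csv2SeqFlowSketch {
  private final String keyColumn;
  private final List<String> predictors;
  private final List<String> types; // "numeric" or "word", parallel to predictors
  private final int numFeatures;
  private final Map<String, Integer> columnIndex = new HashMap<>();

  Csv2SeqFlowSketch(String keyColumn, List<String> predictors, List<String> types, int numFeatures) {
    this.keyColumn = keyColumn;
    this.predictors = predictors;
    this.types = types;
    this.numFeatures = numFeatures;
  }

  /** Analogue of the header step: remember where each named column sits. */
  void firstLine(String header) {
    String[] names = header.split(",");
    for (int i = 0; i < names.length; i++) {
      columnIndex.put(names[i].trim(), i);
    }
  }

  /** Returns the value of the key column, which would become the output key. */
  String keyOf(String line) {
    return line.split(",")[columnIndex.get(keyColumn)].trim();
  }

  /** Analogue of the per-record step: hash each predictor into a sparse vector. */
  Map<Integer, Double> processLine(String line) {
    String[] fields = line.split(",");
    Map<Integer, Double> vector = new HashMap<>();
    for (int i = 0; i < predictors.size(); i++) {
      String name = predictors.get(i);
      String value = fields[columnIndex.get(name)].trim();
      if ("numeric".equals(types.get(i))) {
        // Numeric predictor: one slot per column name, weighted by the value.
        vector.merge(slot(name), Double.parseDouble(value), Double::sum);
      } else {
        // Word predictor: one slot per (column, value) pair with weight 1.
        vector.merge(slot(name + ":" + value), 1.0, Double::sum);
      }
    }
    return vector;
  }

  /** Maps a string to a feature index in [0, numFeatures). */
  private int slot(String s) {
    return Math.floorMod(s.hashCode(), numFeatures);
  }

  public static void main(String[] args) {
    Csv2SeqFlowSketch sketch = new Csv2SeqFlowSketch(
        "class", Arrays.asList("duration", "protocol_type"), Arrays.asList("numeric", "word"), 100);
    sketch.firstLine("duration,protocol_type,class");
    String record = "12,tcp,normal";
    System.out.println(sketch.keyOf(record) + " -> " + sketch.processLine(record));
  }
}
```

Here the key column is passed explicitly; the fallback from key to target described above would be a one-line choice when constructing the object. Colliding hash slots are summed, which is one common convention for hashed features but is an assumption of this sketch.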




[jira] [Resolved] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-781.
------------------------------

    Resolution: Won't Fix
    

[jira] [Updated] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

XiaoboGu updated MAHOUT-781:
----------------------------

    Attachment: csv2seq.patch

Now it is runnable; a runnable test command line is:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/mahout-examples-0.6-SNAPSHOT-job.jar org.apache.mahout.text.Csv2Seq --input input1.csv --output out.seq  --target class --header ./header.csv --predictors duration protocol_type --types numeric word --categories 2 --features 100


[jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080613#comment-13080613 ] 

Sean Owen commented on MAHOUT-781:
----------------------------------

Xiaobo, the process is that you first explain what you are trying to accomplish, then propose a patch, then it is reviewed, and then it is committed. I think you have skipped the first step.

Where does this become useful to Mahout? I can sort of imagine it as a utility class, but not in core.
What is test-data.zip?
Is there a need to preserve line numbers in the output? It seems like no reducer is needed.


[jira] [Updated] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

XiaoboGu updated MAHOUT-781:
----------------------------

    Attachment: csv2seq.patch


[jira] [Issue Comment Edited] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "Dan Brickley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080769#comment-13080769 ] 

Dan Brickley edited comment on MAHOUT-781 at 8/8/11 6:54 AM:
-------------------------------------------------------------

A utility does sound useful. Good idea Xiaobo.

I was happy to find Danny Bickson's post here - http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html - which offers a simple CSV importer. It takes sparse from/to/value affinity tuples and converts them into (and out of) a Mahout binary representation. Would MAHOUT-781 include this functionality?

It would be good to have a spec. There are lots of subtle variations on the CSV theme.

Can lines containing #-prefixed comments be included? Are extra blank lines acceptable or do they cause an error? Are header fields represented somehow inline, or only in a separate --header document? Is whitespace between field values discarded, included in the values we pass on, or considered invalid?

If the utility also covers conversion back to CSV, it should be possible to test round-tripping...


[jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080711#comment-13080711 ] 

XiaoboGu commented on MAHOUT-781:
---------------------------------

We have discussed this at length on the dev list:

http://mail-archives.apache.org/mod_mbox/mahout-dev/201108.mbox/%3cCACOCgckzcvBdkRc_+AzAXxdyvA5BeXwCLsZV7KV8Js2bR3ykGg@mail.gmail.com%3e



[jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "Dan Brickley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080769#comment-13080769 ] 

Dan Brickley commented on MAHOUT-781:
-------------------------------------

A utility does sound useful. Good idea Xiaobo.

I was happy to find Danny Bickson's post here - http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html - which offers a simple CSV importer. It takes sparse from/to/value affinity tuples and converts them into (and out of) a Mahout binary representation. Would MAHOUT-781 include this functionality?

It would be good to have a spec. There are lots of subtle variations on the CSV theme.

Can lines containing #-prefixed comments be included? Are extra blank lines acceptable or do they cause an error? Are header fields represented somehow inline, or only in a separate --header document? Is whitespace between field values discarded, included in the values we pass on, or silently discarded?

If the utility also covers conversion back to CSV, it should be possible to test round-tripping...


[jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084095#comment-13084095 ] 

XiaoboGu commented on MAHOUT-781:
---------------------------------

I think it's very easy to skip invalid lines according to some rules in the map function before passing the line to CsvRecordFactory's processLine function.
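A minimal sketch of such a pre-filter is shown below. The rules (skip blank lines, skip #-prefixed comments, require an expected field count) are illustrative assumptions, not a spec from the patch, and the class name LineFilterSketch is hypothetical. The naive comma split does not handle quoted fields that contain commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-filter; the rules below are illustrative, not taken from the patch. */
final class LineFilterSketch {
  /** Returns true if the line should be handed on for parsing. */
  static boolean isValid(String line, int expectedFields) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return false; // blank line or #-prefixed comment
    }
    // The -1 limit keeps trailing empty fields, so "a,b," counts as three fields.
    return trimmed.split(",", -1).length == expectedFields;
  }

  /** Keeps only the lines that pass isValid, preserving their order. */
  static List<String> keepValid(List<String> lines, int expectedFields) {
    List<String> kept = new ArrayList<>();
    for (String line : lines) {
      if (isValid(line, expectedFields)) {
        kept.add(line);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("# a comment", "", "12,tcp,normal", "12,tcp");
    System.out.println(keepValid(lines, 3));
  }
}
```

In a mapper, a check like isValid would run first and the map call would simply return without emitting anything for rejected lines.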

Can you help to commit the patch to trunk please?


[jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080580#comment-13080580 ] 

XiaoboGu commented on MAHOUT-781:
---------------------------------

Can you help to commit the latest patch to trunk, please?


[jira] [Commented] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080712#comment-13080712 ] 

Ted Dunning commented on MAHOUT-781:
------------------------------------

Discussion on the mailing list is good.  There still needs to be a synthesis on the JIRA.  Email conversations tend to be long and strung out.  What is needed on the JIRA is a summary of the final state of the discussion.


[jira] [Updated] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

Posted by "XiaoboGu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

XiaoboGu updated MAHOUT-781:
----------------------------

    Attachment: csv2seq.patch

A few revisions to the command-line option descriptions.
