Posted to dev@mahout.apache.org by "Dan Brickley (JIRA)" <ji...@apache.org> on 2011/08/08 08:56:27 UTC

[jira] [Issue Comment Edited] (MAHOUT-781) universal map-reduce job to convert csv file to vectorwritable sequencefile

    [ https://issues.apache.org/jira/browse/MAHOUT-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080769#comment-13080769 ] 

Dan Brickley edited comment on MAHOUT-781 at 8/8/11 6:54 AM:
-------------------------------------------------------------

A utility does sound useful. Good idea Xiaobo.

I was happy to find Danny Bickson's post here - http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html - which offers a simple CSV importer. It takes sparse from/to/value affinity tuples and converts them into (and out of) a Mahout binary representation. Would MAHOUT-781 include this functionality?

It would be good to have a spec. There are lots of subtle variations on the CSV theme.

Can lines containing #-prefixed comments be included? Are extra blank lines acceptable or do they cause an error? Are header fields represented somehow inline, or only in a separate --header document? Is whitespace between field values discarded, included in the values we pass on, or considered invalid?
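
To make those questions concrete, here is a rough Python sketch of one possible set of answers for from/to/value lines. It is only an illustration, not the behaviour of the attached csv2seq patch or of any existing Mahout code; the function name parse_triples and all of its parameters are made up for this example.

```python
import csv
import io


def parse_triples(text, comment_prefix="#", skip_blank=True,
                  strip_fields=True, has_header=False):
    """Parse 'from,to,value' lines into (row, col, value) tuples.

    Every keyword argument is a hypothetical spec decision:
      comment_prefix: lines starting with this prefix are ignored
      skip_blank:     blank lines are ignored (True) or rejected (False)
      strip_fields:   surrounding whitespace is dropped (True) or rejected (False)
      has_header:     the first non-comment, non-blank line is an inline header
    """
    triples = []
    header_pending = has_header
    for lineno, raw in enumerate(io.StringIO(text), start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            if skip_blank:
                continue
            raise ValueError(f"line {lineno}: blank line not allowed")
        if comment_prefix and line.lstrip().startswith(comment_prefix):
            continue
        if header_pending:
            # Inline header: consumed and ignored in this sketch.
            header_pending = False
            continue
        fields = next(csv.reader([line]))
        if strip_fields:
            fields = [f.strip() for f in fields]
        elif any(f != f.strip() for f in fields):
            raise ValueError(f"line {lineno}: whitespace around field")
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(fields)}")
        triples.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return triples


sample = "# affinity tuples\nfrom,to,value\n\n1, 2, 0.5\n3,4,1.0\n"
print(parse_triples(sample, has_header=True))
```

A spec would essentially pin down the default for each of these switches, and say whether the stricter settings should fail the whole job or just skip the offending line.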

If the utility also covers conversion back to CSV, it should be possible to test round-tripping...
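
A round-trip test could look roughly like the Python sketch below. Note the assumptions: a plain dict keyed by (row, col) stands in for the Mahout binary representation, and write_triples, read_triples and round_trips are hypothetical names, not part of Mahout or of the attached patch. Values are written with repr() so that floats survive the trip through text unchanged.

```python
import csv
import io


def write_triples(entries):
    """Serialise {(row, col): value} to 'row,col,value' CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for (row, col), value in sorted(entries.items()):
        # repr() gives the shortest string that parses back to the same float.
        writer.writerow([row, col, repr(value)])
    return buf.getvalue()


def read_triples(text):
    """Parse 'row,col,value' CSV text back into {(row, col): value}."""
    entries = {}
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # assumption for this sketch: blank lines are skipped
        row, col, value = fields
        entries[(int(row), int(col))] = float(value)
    return entries


def round_trips(entries):
    """True if writing and then reading gives back the same entries."""
    return read_triples(write_triples(entries)) == entries


original = {(0, 1): 0.1, (2, 0): -3.5, (5, 5): 1e-12}
assert round_trips(original)
```

The same check run in the other direction (CSV to binary and back to CSV) would only pass byte-for-byte if the spec fixes things like field order, whitespace and number formatting.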

      was (Author: danbri):
    A utility does sound useful. Good idea Xiaobo.

I was happy to find Danny Bickson's post here - http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html - which offers simple CSV importer. It takes sparse from/to/value affinity tuples and converts them into (and out of) a Mahout binary representation. Would MAHOUT-781 include this functionality?

It would be good to have a spec. There are lots of subtle variations on the CSV theme.

Can lines containing #-prefixed comments be included? Are extra blank lines acceptable or do they cause an error? Are header fields represented somehow inline, or only in a separate --header document? Is whitespace between field values discarded, included in the values we pass on, or silently discarded?

If the utility also covers conversion back to CSV, it should be possible to test round-tripping...
  
> universal map-reduce job to convert csv file to vectorwritable sequencefile
> ---------------------------------------------------------------------------
>
>                 Key: MAHOUT-781
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-781
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.6
>            Reporter: XiaoboGu
>            Priority: Minor
>         Attachments: csv2seq.patch, csv2seq.patch, test-data.zip
>
>


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira