Posted to dev@mahout.apache.org by "Isabel Drost (JIRA)" <ji...@apache.org> on 2011/03/23 15:54:06 UTC

[jira] [Resolved] (MAHOUT-590) add TSV (Tab Separated Value) input file support to SequenceFilesFromDirectory

     [ https://issues.apache.org/jira/browse/MAHOUT-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Isabel Drost resolved MAHOUT-590.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5

Patch committed.

> add TSV (Tab Separated Value) input file support to SequenceFilesFromDirectory
> ------------------------------------------------------------------------------
>
>                 Key: MAHOUT-590
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-590
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Utils
>    Affects Versions: 0.4
>         Environment: Mac OS X 10.6.6, java version "1.6.0_22"
> RHL Linux 2.6.18
>            Reporter: Shige Takeda
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: 0001-added-TSV-input-file-support.patch, MAHOUT-590.patch, MAHOUT-590.patch
>
>
> I would like to add TSV (Tab Separated Value) input file type support to SequenceFilesFromDirectory.
> Here is my real use case:
> I have 36M input records, each consisting of an ID, CONTENT, and various other attributes, and I want to convert them to sequence files so that I can cluster the records by the term vectors of CONTENT. The problem is that my home directory has a quota of 50k files, so I cannot create 36M files there, and therefore I was not able to convert the records with the SequenceFilesFromDirectory utility. Meanwhile, the source data is already in TSV format, where each line of a file is ID\tCONTENT\t..., because that format is convenient as input and output for Pig and most Hadoop streaming programs. NOTE: each CONTENT field is up to around 2k bytes. I would therefore prefer that SequenceFilesFromDirectory support TSV directly, instead of taking two steps: TSV to text files, then text files to sequence files.
> I'm attaching the patch.
> Hope this makes sense to other folks.
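
For readers who have not opened the attachments, the idea in the description can be illustrated roughly as follows. This is only a sketch, not the code from the attached patch: the class name TsvRecords, its methods, and the column-index parameters are hypothetical, and it assumes the ID is in column 0 and CONTENT in column 1 as in the ID\tCONTENT\t... layout described above. It shows how each TSV line could become one key/value record without creating one file per record.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not taken from the MAHOUT-590 patch.
final class TsvRecords {

    private TsvRecords() {
    }

    // Splits one TSV line and returns (key, value) from the requested columns.
    // Returns null when a column index is negative or the line has too few columns.
    static Map.Entry<String, String> parseLine(String line, int keyColumn, int valueColumn) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split("\t", -1);
        if (keyColumn < 0 || valueColumn < 0
                || keyColumn >= fields.length || valueColumn >= fields.length) {
            return null;
        }
        return new AbstractMap.SimpleImmutableEntry<>(fields[keyColumn], fields[valueColumn]);
    }

    // Converts many lines to records, skipping lines that cannot be parsed.
    static List<Map.Entry<String, String>> parseLines(Iterable<String> lines,
                                                      int keyColumn, int valueColumn) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            Map.Entry<String, String> record = parseLine(line, keyColumn, valueColumn);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the ID\tCONTENT\t... layout.
        List<String> lines = Arrays.asList(
                "doc1\tfirst document text\tattrA",
                "doc2\tsecond document text\tattrB");
        for (Map.Entry<String, String> record : parseLines(lines, 0, 1)) {
            System.out.println(record.getKey() + " => " + record.getValue());
        }
    }
}
```

In an actual conversion, each pair would presumably be appended to a Hadoop SequenceFile.Writer as Text key and Text value rather than printed, so that 36M records end up in a handful of sequence files instead of 36M small text files.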

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira