Posted to dev@mahout.apache.org by "Josh Patterson (JIRA)" <ji...@apache.org> on 2012/06/28 04:34:44 UTC

[jira] [Updated] (MAHOUT-833) Make conversion to sequence files map-reduce

     [ https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Patterson updated MAHOUT-833:
----------------------------------

    Attachment: MAHOUT-833-final.patch

This patch implements the MR versions of both SequenceFilesFromDirectory and SequenceFilesFromMailArchives.

A few notes:

- I couldn't find a place in the serial version of SequenceFilesFromMailArchives that explicitly turns on block compression for the sequence files in code. This could be done through the conf files in Hadoop 0.20.205, but as far as I can tell it wasn't being done in code.

- The serial version of SequenceFilesFromMailArchives does not seem to be working correctly in trunk. It passes its tests, but when it's run on a .gz file from the ASF mail archives it reports 0 records extracted. The MR version in this patch works as intended, but I have not yet changed the serial version.

- The MR version of SequenceFilesFromMailArchives keeps as much of the serial version's functionality and code as I could manage. To use FileLineIterable in the mbox parsing code, for instance, I had to add a constructor.

- I ended up using the old MR API because I needed certain functionality that had not yet been ported to the new API as of Hadoop 0.20.205.
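
One way to cross-check the "0 records extracted" symptom noted above is to count the message separator lines in a gzipped mbox independently of Mahout. The following is only a minimal sketch, not code from the patch: the class name MboxRecordCounter, its methods, and the separator regex are all hypothetical, and it assumes each message begins with a conventional mbox "From " line that ends in a four-digit year.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Mahout or of the attached patch.
class MboxRecordCounter {

  // Assumption: a message starts at a separator line such as
  // "From alice@example.org Thu Jun 28 04:34:44 2012".
  // Header lines like "From: Alice <...>" do not match ("From:" vs "From ").
  private static final Pattern MESSAGE_START =
      Pattern.compile("^From \\S+@\\S.*\\d{4}$");

  // Decompresses the bytes and counts lines that look like message separators.
  static int countGzipped(byte[] gzippedMbox) {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new ByteArrayInputStream(gzippedMbox)),
        StandardCharsets.UTF_8))) {
      int count = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        if (MESSAGE_START.matcher(line).matches()) {
          count++;
        }
      }
      return count;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Test helper: gzips a string in memory so the demo needs no files.
  static byte[] gzip(String text) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    String mbox =
        "From alice@example.org Thu Jun 28 04:34:44 2012\n"
        + "Subject: one\n\nbody\n"
        + "From bob@example.org Thu Jun 28 05:00:00 2012\n"
        + "From: Bob <bob@example.org>\n\nbody\n";
    System.out.println(countGzipped(gzip(mbox))); // prints 2
  }
}
```

If a count like this is non-zero for an archive on which the serial converter reports 0 records, that would suggest the problem lies in how the serial code reads the .gz input rather than in the archive itself.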
                
> Make conversion to sequence files map-reduce
> --------------------------------------------
>
>                 Key: MAHOUT-833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-833
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.7
>            Reporter: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>         Attachments: MAHOUT-833-final.patch, MAHOUT-833.patch
>
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be able to do their work in parallel.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira