Posted to dev@mahout.apache.org by "Josh Patterson (Commented) (JIRA)" <ji...@apache.org> on 2011/11/14 02:33:52 UTC

[jira] [Commented] (MAHOUT-833) Make conversion to sequence files map-reduce

    [ https://issues.apache.org/jira/browse/MAHOUT-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13149411#comment-13149411 ] 

Josh Patterson commented on MAHOUT-833:
---------------------------------------

What directory layouts do we most commonly expect for input files of this type?

I ask so that I can work out how best to feed pathnames to the map tasks and subdivide the work. The answer depends on whether the input looks like:

- "lots of directories, few files per directory"

- "few directories, lots of files per directory"

Currently the code is built around "tagging along" on the recursive FileSystem.listStatus( ... ) filter code path, but the MR version will have to work differently.

One approach I've kicked around is to walk the directory list and hash each entry into a group, so that regardless of directory each map task gets a (generally) even number of documents to process. Out of the box, though, that doesn't try to keep all of the files from one directory in the same sequence file. Does that matter here? I want to say no, but then again, why not ask.
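
To make the trade-off concrete, here is a minimal sketch of the hashing idea in plain Java. It is not the existing Mahout code: the class PathGrouper and its methods are hypothetical names, paths are treated as plain strings rather than Hadoop Path objects, and String.hashCode() stands in for whatever hash would really be used. groupByFile hashes the full path, which tends to even out group sizes; groupByParentDir hashes only the parent directory, which keeps a directory's files together at the cost of less even groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Mahout or Hadoop).
final class PathGrouper {

    // Hash the full path: files spread across groups regardless of directory.
    static List<List<String>> groupByFile(List<String> paths, int numGroups) {
        return group(paths, numGroups, false);
    }

    // Hash only the parent directory: files in one directory share a group.
    static List<List<String>> groupByParentDir(List<String> paths, int numGroups) {
        return group(paths, numGroups, true);
    }

    private static List<List<String>> group(List<String> paths, int numGroups,
                                            boolean keyOnParent) {
        if (numGroups <= 0) {
            throw new IllegalArgumentException("numGroups must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < numGroups; i++) {
            groups.add(new ArrayList<>());
        }
        for (String path : paths) {
            String key = keyOnParent ? parentOf(path) : path;
            // floorMod keeps the bucket index non-negative for negative hash codes.
            int bucket = Math.floorMod(key.hashCode(), numGroups);
            groups.get(bucket).add(path);
        }
        return groups;
    }

    // Everything before the last '/', or the empty string if there is none.
    static String parentOf(String path) {
        int slash = path.lastIndexOf('/');
        return slash < 0 ? "" : path.substring(0, slash);
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "/in/a/1.txt", "/in/a/2.txt", "/in/b/1.txt", "/in/b/2.txt", "/in/c/1.txt");
        System.out.println("by file: " + groupByFile(paths, 3));
        System.out.println("by dir:  " + groupByParentDir(paths, 3));
    }
}
```

Each resulting group would correspond to the list of pathnames handed to one map task. With "few directories, lots of files per dir", the by-directory variant can leave some groups empty, which is one reason to prefer hashing the full path if per-directory locality turns out not to matter.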
                
> Make conversion to sequence files map-reduce
> --------------------------------------------
>
>                 Key: MAHOUT-833
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-833
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.5
>            Reporter: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>
> Given input that is on HDFS, the SequenceFilesFrom****.java classes should be able to do their work in parallel.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira