Posted to dev@accumulo.apache.org by "Adam Fuchs (Created) (JIRA)" <ji...@apache.org> on 2012/02/06 23:02:59 UTC

[jira] [Created] (ACCUMULO-375) Wikipedia Ingest needs more parallelism

Wikipedia Ingest needs more parallelism
---------------------------------------

                 Key: ACCUMULO-375
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-375
             Project: Accumulo
          Issue Type: Improvement
            Reporter: Adam Fuchs


The Wikipedia ingest map job uses an input format derived from FileInputFormat, which launches one mapper per file. Given the partitioning strategy and workload distribution, it makes sense to launch multiple mappers per file instead. Each mapper can then take a chunk of the articles in the file, with chunks chosen by the same partitioning strategy that assigns row IDs.
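
One possible reading of this proposal, as a minimal sketch rather than the actual ingest code: create several splits per input file, each tagged with a group index, and have each mapper keep only the articles whose partition falls in its group. Every name here (`partition_id`, `make_splits`, `articles_for_split`) is hypothetical, and the CRC32-modulo partitioner is an assumed stand-in for whatever function assigns row IDs.

```python
from zlib import crc32


def partition_id(article_id: str, num_partitions: int) -> int:
    """Assumed stand-in for the row-ID partitioner: a stable hash of the
    article ID, reduced modulo the number of partitions."""
    return crc32(article_id.encode("utf-8")) % num_partitions


def make_splits(files, num_groups):
    """Emit num_groups splits per file instead of one.
    Each split is a (file, group_index) pair handled by one mapper."""
    return [(f, group) for f in files for group in range(num_groups)]


def articles_for_split(article_ids, group, num_groups, num_partitions):
    """Yield only the articles this mapper owns: those whose partition,
    reduced modulo num_groups, equals the mapper's group index."""
    for article_id in article_ids:
        if partition_id(article_id, num_partitions) % num_groups == group:
            yield article_id


if __name__ == "__main__":
    # Two input files, four mappers per file, 32 partitions (all illustrative).
    for split in make_splits(["part-0.xml", "part-1.xml"], 4):
        print(split)
```

In this sketch every mapper still scans its whole file and skips the articles that belong to other groups, so the parallelism gain is in processing and writing rather than in reading. Because the group is derived from the partition, all articles sharing a row ID are handled by the same mapper for a given file.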

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (ACCUMULO-375) Wikipedia Ingest needs more parallelism

Posted by "Adam Fuchs (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/ACCUMULO-375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Fuchs reassigned ACCUMULO-375:
-----------------------------------

    Assignee: Adam Fuchs
    
> Wikipedia Ingest needs more parallelism
> ---------------------------------------
>
>                 Key: ACCUMULO-375
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-375
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>
> The Wikipedia ingest map job uses an input format derived from FileInputFormat, which launches one mapper per file. Given the partitioning strategy and workload distribution, it makes sense to launch multiple mappers per file instead. Each mapper can then take a chunk of the articles in the file, with chunks chosen by the same partitioning strategy that assigns row IDs.
