Posted to dev@nutch.apache.org by "Patrick Mézard (Jira)" <ji...@apache.org> on 2020/06/10 12:02:00 UTC

[jira] [Created] (NUTCH-2793) CSV indexer does not work in distributed mode

Patrick Mézard created NUTCH-2793:
-------------------------------------

             Summary: CSV indexer does not work in distributed mode
                 Key: NUTCH-2793
                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 1.17
            Reporter: Patrick Mézard


Reasons are discussed in https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768 and following comments.

To summarize, the indexer interface is not aware of tasks, so it cannot generate a unique output file name for each reducer.

But this seems achievable, because IndexWriters initializes each writer with calls to two open functions:
 * The first passes the general configuration and a "name"
 * The second passes the indexer parameters

https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214

Fortunately, "name" is generated by calling getUniqueFile which does exactly what we want:

[https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]

I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file name. This is a breaking change because it modifies the output name, but it allows the indexer to work in distributed mode.
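
To illustrate the kind of per-task name this would give, here is a minimal standalone sketch. It does not call Hadoop or Nutch: uniqueFile below is a hypothetical helper that only mimics the "<name>-<task type>-<5-digit partition><extension>" pattern which, as far as I understand, Hadoop's FileOutputFormat.getUniqueFile produces. The "part" base name and ".csv" extension are assumptions for the example, not necessarily what the PR will use.

```java
import java.util.Locale;

public class UniqueOutputName {

    // Hypothetical helper (not Hadoop or Nutch API): builds a name such as
    // "part-r-00003.csv" from a base name, a task type character
    // ('m' for map, 'r' for reduce), a partition number and an extension.
    public static String uniqueFile(String name, char taskType, int partition, String extension) {
        return String.format(Locale.ROOT, "%s-%c-%05d%s", name, taskType, partition, extension);
    }

    public static void main(String[] args) {
        // Each reducer gets its own output name, so no two tasks
        // write to the same file.
        for (int reducer = 0; reducer < 3; reducer++) {
            System.out.println(uniqueFile("part", 'r', reducer, ".csv"));
        }
    }
}
```

With three reducers this prints part-r-00000.csv, part-r-00001.csv and part-r-00002.csv, instead of three tasks all targeting "nutch.csv".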

A PR will follow the ticket creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)