Posted to dev@nutch.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/06/10 12:15:00 UTC

[jira] [Commented] (NUTCH-2793) CSV indexer does not work in distributed mode

    [ https://issues.apache.org/jira/browse/NUTCH-2793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130593#comment-17130593 ] 

ASF GitHub Bot commented on NUTCH-2793:
---------------------------------------

pmezard opened a new pull request #534:
URL: https://github.com/apache/nutch/pull/534


   Before the change, the output file name was hard-coded to "nutch.csv".
   When running in distributed mode, multiple reducers would clobber each
   other's output.
   
   After the change, the file name is taken from the first open(cfg, name)
   initialization call, where name is a unique file name generated by
   IndexerOutputFormat, which is derived from Hadoop's FileOutputFormat.
   The CSV files are now named like part-r-000xx.
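
   To illustrate the difference, here is a minimal sketch of per-reducer
   naming. It is not Nutch or Hadoop code: the class PartFileNames and the
   method uniqueName are hypothetical, and the five-digit zero padding is
   an assumption based on the part-r-000xx pattern mentioned above.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch or Hadoop.
class PartFileNames {

    // Builds a name such as "part-r-00003" from a base name and a reducer
    // partition number (assumed format: five digits, zero padded).
    static String uniqueName(String base, int partition) {
        return String.format(Locale.ROOT, "%s-r-%05d", base, partition);
    }

    public static void main(String[] args) {
        Set<String> hardCoded = new LinkedHashSet<>();
        Set<String> perReducer = new LinkedHashSet<>();

        // Simulate three reducers each choosing an output file name.
        for (int reducer = 0; reducer < 3; reducer++) {
            hardCoded.add("nutch.csv");                   // same name every time
            perReducer.add(uniqueName("part", reducer));  // one name per reducer
        }

        // One distinct name means all reducers write to the same file.
        System.out.println("hard-coded:  " + hardCoded);
        // Three distinct names mean no reducer overwrites another.
        System.out.println("per-reducer: " + perReducer);
    }
}
```

   With the hard-coded name, all three simulated reducers end up with a
   single target file; with the per-reducer name, each gets its own.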


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> CSV indexer does not work in distributed mode
> ---------------------------------------------
>
>                 Key: NUTCH-2793
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2793
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.17
>            Reporter: Patrick Mézard
>            Priority: Major
>
> Reasons are discussed in https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768 and following comments.
> To summarize, the indexer interface is not aware of tasks, so it cannot generate a unique output name per reducer.
> But it seems achievable because IndexWriters initializes each writer with calls to two open functions:
>  * One passing the general configuration and a "name"
>  * One passing the indexer parameters
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java#L214]
> Fortunately, "name" is generated by calling getUniqueFile which does exactly what we want:
> [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java#L43]
> I propose we use it instead of "nutch.csv" as the CSVIndexWriter output file name. This is a breaking change because it modifies the output name, but it allows the indexer to work in distributed mode.
> PR will follow the ticket creation.
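
The two-step initialization described in the ticket can be sketched roughly as follows. This is a simplified stand-in, not the actual Nutch interface: the class TwoStepOpenSketch, the use of plain maps for configuration, and the "separator" parameter are all hypothetical. It only shows how a writer could remember the unique name passed to the first open call and use it in place of a hard-coded default.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index writer; the real Nutch classes differ.
class TwoStepOpenSketch {

    private String outputName = "nutch.csv"; // old hard-coded default
    private String separator = ",";

    // First call: general configuration plus a unique, per-task name.
    void open(Map<String, String> conf, String name) {
        this.outputName = name; // keep the unique name for later use
    }

    // Second call: writer-specific parameters.
    void open(Map<String, String> params) {
        this.separator = params.getOrDefault("separator", ",");
    }

    String outputName() {
        return outputName;
    }

    String separator() {
        return separator;
    }

    public static void main(String[] args) {
        TwoStepOpenSketch writer = new TwoStepOpenSketch();

        // The caller supplies a name that is unique for this reducer task.
        writer.open(new HashMap<>(), "part-r-00002");

        Map<String, String> params = new HashMap<>();
        params.put("separator", ";");
        writer.open(params);

        System.out.println(writer.outputName() + " with separator " + writer.separator());
    }
}
```

The point of the sketch is the ordering: because the unique name arrives in the first call, it is already available when the writer later creates its output file.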



--
This message was sent by Atlassian Jira
(v8.3.4#803005)