Posted to dev@nutch.apache.org by GitBox <gi...@apache.org> on 2020/06/10 15:21:58 UTC

[GitHub] [nutch] sebastian-nagel commented on a change in pull request #534: NUTCH-2793 indexer-csv: make it work in distributed mode

sebastian-nagel commented on a change in pull request #534:
URL: https://github.com/apache/nutch/pull/534#discussion_r438197577



##########
File path: src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java
##########
@@ -192,7 +189,7 @@ protected int find(String value, int start) {
 
   @Override
   public void open(Configuration conf, String name) throws IOException {

Review comment:
       This method has been deprecated since the switch to the XML-based index writer configuration (see [NUTCH-1480](https://issues.apache.org/jira/browse/NUTCH-1480) and [the wiki page IndexWriters](https://cwiki.apache.org/confluence/display/NUTCH/IndexWriters)). "name" was just an arbitrary name, not a file name indicating a task-specific output path. We would need a method which takes both the IndexWriterParams and the output path. This would require changes in the [IndexWriter interface](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriter.java) and also in the classes [IndexWriters](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexWriters.java) and [IndexerMapReduce](https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java). I'm also not sure whether the output path alone is sufficient. We'll eventually need an [OutputCommitter](https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/OutputCommitter.html) and need to think about situations where we have multiple index writers (e.g. via [exchanges](https://cwiki.apache.org/confluence/display/NUTCH/Exchanges)). See also the [discussion in NUTCH-1541](https://issues.apache.org/jira/browse/NUTCH-1541?focusedCommentId=13797768&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13797768).
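
       To make the idea of "a method which takes both" more concrete, here is a rough sketch. It is not the Nutch API: `PathAwareIndexWriter`, `taskFile` and the writer id `indexer_csv_1` are hypothetical names, a plain `Map<String, String>` stands in for IndexWriterParams, and `java.nio.file.Path` stands in for a Hadoop output path. It only illustrates how a task-specific path, which also includes the writer id, could keep the output of several index writers apart. It says nothing about the OutputCommitter question.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Nutch.
// Map<String, String> stands in for IndexWriterParams and
// java.nio.file.Path stands in for a Hadoop output path.
final class TaskOutputSketch {

  /** A writer that is opened with its parameters AND a task-specific output path. */
  interface PathAwareIndexWriter {
    void open(Map<String, String> params, Path taskOutput) throws IOException;
  }

  /**
   * Derives the file that one task of one writer would write to:
   * <outpath>/<writerId>/part-<taskId>.csv
   * Putting the writer id into the path keeps several writers apart.
   */
  static Path taskFile(String outpath, String writerId, int taskId) {
    String fileName = String.format(Locale.ROOT, "part-%05d.csv", taskId);
    return Paths.get(outpath, writerId, fileName);
  }
}
```

       With the default `outpath` of `csvindexwriter`, task 3 of a writer with the assumed id `indexer_csv_1` would then write to `csvindexwriter/indexer_csv_1/part-00003.csv`, and a second CSV writer in the same job would get its own subdirectory.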

##########
File path: src/plugin/indexer-csv/README.md
##########
@@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | &quot;
 maxfieldlength | Max. length of a single field value in characters | 4096
 maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
 header | Write CSV column headers | true
-outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
\ No newline at end of file
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter

Review comment:
       Still "local filesystem"? Possibly we could use the outpath to overcome the problem of multiple index writers.



