Posted to dev@nutch.apache.org by GitBox <gi...@apache.org> on 2021/06/14 08:04:50 UTC

[GitHub] [nutch] sebastian-nagel commented on a change in pull request #649: NUTCH-2868 urlnormalizer-protocol fails with StringIndexOutOfBoundsException

sebastian-nagel commented on a change in pull request #649:
URL: https://github.com/apache/nutch/pull/649#discussion_r650374772



##########
File path: src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
##########
@@ -177,6 +183,7 @@ public void setConf(Configuration conf) {
       if (reader == null) {
         Path path = new Path(file);
         FileSystem fs = path.getFileSystem(conf);
+        LOG.info("Reading {} rules file {} from {}", pluginName, file, fs);

Review comment:
       Yes, this log message is not really useful. I will change it.

##########
File path: src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/ProtocolURLNormalizer.java
##########
@@ -82,15 +82,21 @@ private synchronized void readConfiguration(Reader configReader) throws IOExcept
     String line, host;
     String protocol;
     int delimiterIndex;
+    int lineNumber = 0;
 
     while ((line = reader.readLine()) != null) {
+      lineNumber++;
       if (StringUtils.isNotBlank(line) && !line.startsWith("#")) {
         line = line.trim();
         delimiterIndex = line.indexOf(" ");
         // try tabulator
         if (delimiterIndex == -1) {
           delimiterIndex = line.indexOf("\t");
         }
+        if (delimiterIndex == -1) {
+          LOG.warn("Invalid line {} without delimiter: {}", lineNumber, line);

Review comment:
       This was seen with a line that contains no host name (or an empty one). But yes, I'll extend the unit tests.
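For illustration, the guard discussed above can be sketched as a standalone parser. This is a minimal sketch, not the actual ProtocolURLNormalizer code: the class name ProtocolRulesSketch, the parse method, the returned map, and the use of System.err in place of the plugin's logger are all assumptions made to keep the example self-contained. It shows why a line holding a single token (for example a protocol with no host in front of it) has no delimiter left after trim(), so that an unguarded substring(0, -1) would throw StringIndexOutOfBoundsException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the rules-file parsing; not the Nutch class.
class ProtocolRulesSketch {

  /** Parses "host<space or tab>protocol" lines and skips invalid ones. */
  static Map<String, String> parse(Reader configReader) throws IOException {
    Map<String, String> rules = new LinkedHashMap<>();
    BufferedReader reader = new BufferedReader(configReader);
    String line;
    int lineNumber = 0;

    while ((line = reader.readLine()) != null) {
      lineNumber++;
      // Ignore blank lines and comments.
      if (line.trim().isEmpty() || line.startsWith("#")) {
        continue;
      }
      line = line.trim();
      int delimiterIndex = line.indexOf(' ');
      // Fall back to a tab as delimiter.
      if (delimiterIndex == -1) {
        delimiterIndex = line.indexOf('\t');
      }
      // Without this check, substring(0, -1) below would throw
      // StringIndexOutOfBoundsException.
      if (delimiterIndex == -1) {
        System.err.println(
            "Invalid line " + lineNumber + " without delimiter: " + line);
        continue;
      }
      String host = line.substring(0, delimiterIndex);
      String protocol = line.substring(delimiterIndex + 1).trim();
      rules.put(host, protocol);
    }
    return rules;
  }
}
```

Calling ProtocolRulesSketch.parse on a reader over a rules file that contains a single-token line reports that line with its number and continues with the remaining rules instead of failing.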




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org