You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by GitBox <gi...@apache.org> on 2020/08/03 13:58:14 UTC

[GitHub] [nutch] derhecht opened a new pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

derhecht opened a new pull request #545:
URL: https://github.com/apache/nutch/pull/545


   see NUTCH-1190 - I just create this PR and do some formatting here


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] jorgelbg commented on a change in pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
jorgelbg commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r468779261



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -324,16 +358,19 @@ public Configuration getConf() {
 
   private void readConfiguration() throws IOException {
     LOG.info("Reading content type mappings from file contenttype-mapping.txt");
-    BufferedReader reader = new BufferedReader(
-        conf.getConfResourceAsReader("contenttype-mapping.txt"));
-    String line;
-    String parts[];
-    boolean formatWarningShown = false;
+    try(BufferedReader reader = new BufferedReader(

Review comment:
       nit: Not sure why we need this `try` block? If a different exception gets thrown then it should be added to the method signature.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] sebastian-nagel commented on a change in pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
sebastian-nagel commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464464410



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL res = conf.getResource("date-styles.txt");

Review comment:
       Maybe a more specific file name?  So, that the usage of the resource is clear to every (new) user - since the `conf/` folder is shared for all Nutch components and plugins. Additionally, a comment in the config file might be useful which explains what it is used for.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] derhecht commented on a change in pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
derhecht commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r469118584



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -324,16 +358,19 @@ public Configuration getConf() {
 
   private void readConfiguration() throws IOException {
     LOG.info("Reading content type mappings from file contenttype-mapping.txt");
-    BufferedReader reader = new BufferedReader(
-        conf.getConfResourceAsReader("contenttype-mapping.txt"));
-    String line;
-    String parts[];
-    boolean formatWarningShown = false;
+    try(BufferedReader reader = new BufferedReader(

Review comment:
       it will auto-close the reader (stream), which was not done here at all.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] jorgelbg commented on a change in pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
jorgelbg commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464527051



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL dateStylesResource = conf.getResource("date-styles.txt");
+    if (dateStylesResource == null) {
+      LOG.warn("Can't find resource: date-styles.txt");

Review comment:
       Right now the config file is optional. If this is the desired behavior then I would also add to the message (since it is a warning) something that the default list of dates will be used.

##########
File path: conf/date-styles.txt
##########
@@ -0,0 +1,56 @@
+#

Review comment:
       The standard ASF license header should be included in all files [for example](https://github.com/apache/nutch/blob/master/conf/host-urlnormalizer.txt.template#L1-L14)

##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -82,6 +87,20 @@
   private HashMap<String, String> mimeMap = null;
   private boolean mapMimes = false;
   private String mapFieldName;
+  
+  /** Date-styles used to parse date. */
+  private String[] dateStyles = new String[] {

Review comment:
       Not sure if we want to keep these here? It gets overwritten with the content of the config file if the additional config file loading is successful. _If_ we want to keep them I would rename them to something like `defaultDateStyles` and potentially only use the config file for additional formats.

##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL dateStylesResource = conf.getResource("date-styles.txt");
+    if (dateStylesResource == null) {
+      LOG.warn("Can't find resource: date-styles.txt");
+    } else {
+      try {
+        List lines = FileUtils.readLines(new File(dateStylesResource.getFile()));
+        for (int i = 0; i < lines.size(); i++) {
+          String dateStyle = (String) lines.get(i);
+          
+          if (StringUtils.isBlank(dateStyle)) {
+            lines.remove(i);

Review comment:
       Since we don't need the line number for the next stages, perhaps could skip the line if it is empty, `\n`, or a comment (`#`). The idea would be to avoid modifying the counter and if we reach the end of the block then we know that the line has information that should be kept.

##########
File path: conf/date-styles.txt
##########
@@ -0,0 +1,56 @@
+#
+# Set of date time formats
+# used by the plugin index-more when filling the index field `lastModified'.
+#
+# Format (line separated date time patterns following definition in https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html, comment lines start with `#'):

Review comment:
       nit: perhaps we should trim this to ~120 columns?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] derhecht commented on pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
derhecht commented on pull request #545:
URL: https://github.com/apache/nutch/pull/545#issuecomment-668079788


   > Ok, thanks! One final question: this PR would obsolete #544 and also address NUTCH-2813, right?
   
   I would say so. This here is just a bigger change then #544 and I created I to "ensure" merge of #544 if this one wouldn't be merged at all.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] sebastian-nagel commented on pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
sebastian-nagel commented on pull request #545:
URL: https://github.com/apache/nutch/pull/545#issuecomment-668078252


   Ok, thanks! One final question: this PR would obsolete #544 and also address NUTCH-2813, right?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] derhecht commented on a change in pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
derhecht commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464478821



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL res = conf.getResource("date-styles.txt");

Review comment:
       ok, any suggestion? the only default file naming thing I can see is that every one else is creating a template instead really/directly uses a file.
   so what about:
   index-more-last-modified-date-styles.txt
   ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] sebastian-nagel commented on a change in pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
sebastian-nagel commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464477144



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL res = conf.getResource("date-styles.txt");

Review comment:
       Just in case: I meant the file name, not the variable name. A typical Nutch user only edits the configuration file but does not look into the source code to figure out what the file is used for.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] sebastian-nagel closed pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
sebastian-nagel closed pull request #545:
URL: https://github.com/apache/nutch/pull/545


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] sebastian-nagel commented on a change in pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
sebastian-nagel commented on a change in pull request #545:
URL: https://github.com/apache/nutch/pull/545#discussion_r464548994



##########
File path: src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
##########
@@ -316,6 +324,39 @@ public void setConf(Configuration conf) {
         LOG.error(org.apache.hadoop.util.StringUtils.stringifyException(e));
       }
     }
+    
+    URL res = conf.getResource("date-styles.txt");

Review comment:
       Well, that sounds too long. Maybe only index-more-date-styles.txt? But we can leave it as the usage is now described in the file itself.
   
   Yes, you're right about the *.template : if someone builds Nutch from the source code all config files in `conf/`are "instantiated" during the first compile from the *.template versions. That's to avoid that the version control system complains about modified configuration files if users need to adapt the settings for their specific use case. Could you rename the file to *.template? - it's then automatically copied when `ant runtime` is called.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [nutch] sebastian-nagel commented on pull request #545: [NUTCH-1190] Move data formats used to parse "lastModified" to a config file

Posted by GitBox <gi...@apache.org>.
sebastian-nagel commented on pull request #545:
URL: https://github.com/apache/nutch/pull/545#issuecomment-674565089


   Squashed all commits into 2c3d8642 and merged into master. Resolved NUTCH-1190 and NUTCH-2813. Thanks, @derhecht!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org