Posted to commits@nutch.apache.org by sn...@apache.org on 2017/12/18 15:49:44 UTC

[nutch] branch master updated (961c725 -> fc89e4f)

This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


    from 961c725  NUTCH-2034 CrawlDB update job to count documents in CrawlDb rejected by URL filters (patch contributed by Luis Lopez)
     add 62f6d9f  Add a new IndexingFilter that uses JEXL to decide whether to index a document.
     add 36bfac1  Some improvements based on reviewer's feedback.
     add d72591a  Better tests.
     add c7c795a  Merge branch 'master' of https://github.com/apache/nutch into index-jexl-filter
     add a985e30  Fixed per reviewers' comments. Changed the package name to be more specific, added package-info.java, added to more build targets.
     add bea8621  doclint does not like self-closing tags.
     new 34236ff  fix for NUTCH-2370 contributed by msharan@usc.edu
     new d758a31  NUTCH-2474 CrawlDbReader -stats fails with ClassCastException - replace CrawlDbStatCombiner by CrawlDbStatReducer and ensure that data is properly processed independently whether and how often combiner is called - simplify calculation of minimum and maximum
     new 26669eb  - filter out NaN scores which break the quantile calculation
     new 194fc37  Extend indexer-elastic-rest to support languages
     new 153525c  fix formatting
     new 5ccebc9  add languages to default config
     new 9fcc2a4  fix delete
     new 42bdc65  NUTCH-2439 Upgrade Apache Tika dependency to 1.17
     new 2be2052  Add tika-config.xml to suppress Tika warnings on stderr
     new e0326de  make fully configurable
     new e7b077e  NUTCH-2480 Upgrade crawler-commons dependency to 0.9
     new 52a1c50  fix indentation
     new 67dc52c  scope variables
     new 416c457  NUTCH-2354 Upgrade Hadoop dependencies to 2.7.4
     new e7d5c13  NUTCH-2362 Upgrade MaxMind GeoIP version in index-geoip
     new e0e06f5  NUTCH-2035 urlfilter-regex case insensitive rules
     new 35193c2  NUTCH-2478 HTML parser should resolve base URL <base href=...> - fix parse-html and parse-tika - add unit test for parse-html
     new 8f692d1  NUTCH-2478 HTML parser should resolve base URL <base href=...> - finally fix parse-tika: - href attribute of base element dropped in DOM - need to call tikamd.get("Content-Location") - port HTML parser test from parse-html to parse-tika - add method to DomUtil which prints DocumentFragment
     new 4da6b19  fix for NUTCH-2477 (refactor checker classes) contributed by Jurian Broertjes
     new 9fb5777  Improve command-line help for URL filter and normalizer checker
     new 22fc7f0  NUTCH-2322 URL not available for Jexl operations - apply patch contributed by Markus Jelsma
     new e0a27c7  NUTCH-2034 CrawlDB update job to count documents in CrawlDb rejected by URL filters (patch contributed by Luis Lopez)
     new fc89e4f  NUTCH-2415 Create a JEXL based IndexingFilter Merge branch 'pipldev-index-jexl-filter', closes #219

The 23 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 build.xml                                          |   4 +
 conf/nutch-default.xml                             |  18 +++
 default.properties                                 |   1 +
 src/plugin/build.xml                               |   2 +
 .../{headings => index-jexl-filter}/build.xml      |   6 +-
 .../ivy.xml                                        |   0
 .../plugin.xml                                     |  14 +--
 .../nutch/indexer/jexl/JexlIndexingFilter.java     | 131 +++++++++++++++++++++
 .../apache/nutch/indexer/jexl}/package-info.java   |  16 ++-
 .../nutch/indexer/jexl/TestJexlIndexingFilter.java | 124 +++++++++++++++++++
 10 files changed, 301 insertions(+), 15 deletions(-)
 copy src/plugin/{headings => index-jexl-filter}/build.xml (88%)
 copy src/plugin/{urlnormalizer-slash => index-jexl-filter}/ivy.xml (100%)
 copy src/plugin/{mimetype-filter => index-jexl-filter}/plugin.xml (74%)
 create mode 100644 src/plugin/index-jexl-filter/src/java/org/apache/nutch/indexer/jexl/JexlIndexingFilter.java
 copy src/plugin/{scoring-similarity/src/java/org/apache/nutch/scoring/similarity/util => index-jexl-filter/src/java/org/apache/nutch/indexer/jexl}/package-info.java (51%)
 create mode 100644 src/plugin/index-jexl-filter/src/test/org/apache/nutch/indexer/jexl/TestJexlIndexingFilter.java
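
The changes above introduce the index-jexl-filter plugin (NUTCH-2415), whose IndexingFilter evaluates a JEXL expression to decide whether each document is indexed. As a rough, stdlib-only sketch of that contract (Commons JEXL is not assumed on the classpath here, so a hard-coded predicate stands in for a compiled expression, and all names below are illustrative rather than the plugin's real API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the indexing-filter contract: a predicate over
// document fields decides whether the document passes through (non-null)
// or is dropped from indexing (null). The real plugin compiles a JEXL
// expression instead of the hard-coded predicate used here.
public class JexlFilterSketch {

  // Stand-in for a compiled JEXL expression such as: lang == 'en'
  static boolean evaluate(Map<String, Object> fields) {
    return "en".equals(fields.get("lang"));
  }

  // Mirrors the IndexingFilter convention: returning null skips the doc.
  static Map<String, Object> filter(Map<String, Object> doc) {
    return evaluate(doc) ? doc : null;
  }

  public static void main(String[] args) {
    Map<String, Object> doc = new HashMap<>();
    doc.put("lang", "en");
    System.out.println(filter(doc) != null); // indexed
    doc.put("lang", "de");
    System.out.println(filter(doc) != null); // skipped
  }
}
```

In the actual plugin the expression would be compiled by Apache Commons JEXL and supplied through configuration rather than hard-coded.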

-- 
To stop receiving notification emails like this one, please contact
"commits@nutch.apache.org" <co...@nutch.apache.org>.

[nutch] 14/23: NUTCH-2354 Upgrade Hadoop dependencies to 2.7.4

Posted by sn...@apache.org.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 416c457a9ddcd22f5746432a2777b9e6aa47877d
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 15 15:42:09 2017 +0100

    NUTCH-2354 Upgrade Hadoop dependencies to 2.7.4
---
 ivy/ivy.xml | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 7333a19..520afa0 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -52,7 +52,7 @@
 		<dependency org="com.tdunning" name="t-digest" rev="3.2" />
 
 		<!-- Hadoop Dependencies -->
-		<dependency org="org.apache.hadoop" name="hadoop-common" rev="2.7.2" conf="*->default">
+		<dependency org="org.apache.hadoop" name="hadoop-common" rev="2.7.4" conf="*->default">
 			<exclude org="hsqldb" name="hsqldb" />
 			<exclude org="net.sf.kosmosfs" name="kfs" />
 			<exclude org="net.java.dev.jets3t" name="jets3t" />
@@ -60,10 +60,10 @@
 			<exclude org="org.mortbay.jetty" name="jsp-*" />
 			<exclude org="ant" name="ant" />
 		</dependency>
-        <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.7.2" conf="*->default"/>
-        <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.7.2" conf="*->default"/>
-        <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.7.2" conf="*->default"/>
-        <!-- End of Hadoop Dependencies -->
+		<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="2.7.4" conf="*->default"/>
+		<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="2.7.4" conf="*->default"/>
+		<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.7.4" conf="*->default"/>
+		<!-- End of Hadoop Dependencies -->
 
 		<dependency org="org.apache.tika" name="tika-core" rev="1.17" />
 		<dependency org="com.ibm.icu" name="icu4j" rev="55.1" />
@@ -103,9 +103,9 @@
 			<artifact name="mrunit" maven:classifier="hadoop2" />
 			<exclude org="log4j" module="log4j" />
 		</dependency>
-		<dependency org="org.mortbay.jetty" name="jetty-client" rev="6.1.22" conf="test->default" />
-		<dependency org="org.mortbay.jetty" name="jetty" rev="6.1.22" conf="test->default" />
-		<dependency org="org.mortbay.jetty" name="jetty-util" rev="6.1.22" conf="test->default" />
+		<dependency org="org.mortbay.jetty" name="jetty-client" rev="6.1.26" conf="test->default" />
+		<dependency org="org.mortbay.jetty" name="jetty" rev="6.1.26" conf="test->default" />
+		<dependency org="org.mortbay.jetty" name="jetty-util" rev="6.1.26" conf="test->default" />
 		<dependency org="tomcat" name="jasper-runtime" rev="5.5.23" conf="test->default" />
 		<dependency org="tomcat" name="jasper-compiler" rev="5.5.23" conf="test->default">
 			<exclude org="ant" name="ant" />


[nutch] 23/23: NUTCH-2415 Create a JEXL based IndexingFilter Merge branch 'pipldev-index-jexl-filter', closes #219


commit fc89e4fdaf371fc101516ba87ccd6170fd011c30
Merge: 961c725 e0a27c7
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Mon Dec 18 16:48:38 2017 +0100

    NUTCH-2415 Create a JEXL based IndexingFilter
    Merge branch 'pipldev-index-jexl-filter', closes #219

 build.xml                                          |   4 +
 conf/nutch-default.xml                             |  18 +++
 default.properties                                 |   1 +
 src/plugin/build.xml                               |   2 +
 src/plugin/index-jexl-filter/build.xml             |  22 ++++
 src/plugin/index-jexl-filter/ivy.xml               |  41 +++++++
 src/plugin/index-jexl-filter/plugin.xml            |  37 ++++++
 .../nutch/indexer/jexl/JexlIndexingFilter.java     | 131 +++++++++++++++++++++
 .../apache/nutch/indexer/jexl/package-info.java    |  30 +++++
 .../nutch/indexer/jexl/TestJexlIndexingFilter.java | 124 +++++++++++++++++++
 10 files changed, 410 insertions(+)


[nutch] 21/23: NUTCH-2322 URL not available for Jexl operations - apply patch contributed by Markus Jelsma


commit 22fc7f0defb22588c4ade33b5693303f18d96253
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Sun Dec 17 15:32:04 2017 +0100

    NUTCH-2322 URL not available for Jexl operations
    - apply patch contributed by Markus Jelsma
---
 src/java/org/apache/nutch/crawl/CrawlDatum.java    | 18 ++++++++++++------
 src/java/org/apache/nutch/crawl/CrawlDbReader.java |  2 +-
 src/java/org/apache/nutch/crawl/Generator.java     |  2 +-
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/CrawlDatum.java b/src/java/org/apache/nutch/crawl/CrawlDatum.java
index e54c791..1facf0a 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDatum.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDatum.java
@@ -23,14 +23,15 @@ import java.util.Map.Entry;
 
 import org.apache.commons.jexl2.JexlContext;
 import org.apache.commons.jexl2.Expression;
-import org.apache.commons.jexl2.JexlEngine;
 import org.apache.commons.jexl2.MapContext;
 
 import org.apache.hadoop.io.*;
 import org.apache.nutch.util.*;
+import org.apache.nutch.protocol.ProtocolStatus;
 
 /* The crawl state of a url. */
 public class CrawlDatum implements WritableComparable<CrawlDatum>, Cloneable {
+
   public static final String GENERATE_DIR_NAME = "crawl_generate";
   public static final String FETCH_DIR_NAME = "crawl_fetch";
   public static final String PARSE_DIR_NAME = "crawl_parse";
@@ -525,12 +526,13 @@ public class CrawlDatum implements WritableComparable<CrawlDatum>, Cloneable {
     }
   }
   
-  public boolean evaluate(Expression expr) {
-    if (expr != null) {
+  public boolean evaluate(Expression expr, String url) {
+    if (expr != null && url != null) {
       // Create a context and add data
       JexlContext jcontext = new MapContext();
       
       // https://issues.apache.org/jira/browse/NUTCH-2229
+      jcontext.set("url", url);
       jcontext.set("status", getStatusName(getStatus()));
       jcontext.set("fetchTime", (long)(getFetchTime()));
       jcontext.set("modifiedTime", (long)(getModifiedTime()));
@@ -542,24 +544,28 @@ public class CrawlDatum implements WritableComparable<CrawlDatum>, Cloneable {
       // Set metadata variables
       for (Map.Entry<Writable, Writable> entry : getMetaData().entrySet()) {
         Object value = entry.getValue();
+        Text tkey = (Text)entry.getKey();
         
         if (value instanceof FloatWritable) {
           FloatWritable fvalue = (FloatWritable)value;
-          Text tkey = (Text)entry.getKey();
           jcontext.set(tkey.toString(), fvalue.get());
         }
         
         if (value instanceof IntWritable) {
           IntWritable ivalue = (IntWritable)value;
-          Text tkey = (Text)entry.getKey();
           jcontext.set(tkey.toString(), ivalue.get());
         }
         
         if (value instanceof Text) {
           Text tvalue = (Text)value;
-          Text tkey = (Text)entry.getKey();     
           jcontext.set(tkey.toString().replace("-", "_"), tvalue.toString());
         }
+        
+        if (value instanceof ProtocolStatus) {
+          ProtocolStatus pvalue = (ProtocolStatus)value;
+          jcontext.set(tkey.toString().replace("-", "_"), pvalue.toString());
+        }
+
       }
                   
       try {
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index af30664..ddd25ef 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -700,7 +700,7 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
       
       // check expr
       if (expr != null) {
-        if (!value.evaluate(expr)) {
+        if (!value.evaluate(expr, key.toString())) {
           return;
         }
       }
diff --git a/src/java/org/apache/nutch/crawl/Generator.java b/src/java/org/apache/nutch/crawl/Generator.java
index e5f4831..d85d578 100644
--- a/src/java/org/apache/nutch/crawl/Generator.java
+++ b/src/java/org/apache/nutch/crawl/Generator.java
@@ -252,7 +252,7 @@ public class Generator extends NutchTool implements Tool {
       
       // check expr
       if (expr != null) {
-        if (!crawlDatum.evaluate(expr)) {
+        if (!crawlDatum.evaluate(expr, key.toString())) {
           return;
         }
       }
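
The patch above makes the URL available as a JEXL variable (jcontext.set("url", url)) alongside the crawl fields, and rewrites '-' to '_' in metadata key names so they are legal JEXL identifiers. Since commons-jexl2 is not assumed here, the sketch below mimics the context population with a plain java.util.Map standing in for org.apache.commons.jexl2.MapContext; the field values are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of CrawlDatum.evaluate()'s context population: the URL and
// crawl status become JEXL variables, and metadata keys have '-'
// replaced by '_' so an expression can reference them by name.
public class JexlContextSketch {

  static Map<String, Object> buildContext(String url, String status,
      Map<String, String> metadata) {
    Map<String, Object> ctx = new HashMap<>();
    ctx.put("url", url); // newly available per NUTCH-2322
    ctx.put("status", status);
    for (Map.Entry<String, String> e : metadata.entrySet()) {
      ctx.put(e.getKey().replace("-", "_"), e.getValue());
    }
    return ctx;
  }

  public static void main(String[] args) {
    Map<String, String> md = new HashMap<>();
    md.put("Content-Type", "text/html");
    Map<String, Object> ctx =
        buildContext("http://example.com/", "db_fetched", md);
    // An expression like url =~ '^http://example.com/.*' could now
    // be evaluated against this context by a JEXL engine.
    System.out.println(ctx.get("url"));
    System.out.println(ctx.get("Content_Type"));
  }
}
```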


[nutch] 15/23: NUTCH-2362 Upgrade MaxMind GeoIP version in index-geoip


commit e7d5c137f88816fd4b5d5054ca7fb151bae0e97e
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 15 16:25:02 2017 +0100

    NUTCH-2362 Upgrade MaxMind GeoIP version in index-geoip
---
 src/plugin/index-geoip/ivy.xml    |  4 +++-
 src/plugin/index-geoip/plugin.xml | 14 +++++---------
 2 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/src/plugin/index-geoip/ivy.xml b/src/plugin/index-geoip/ivy.xml
index 1b626f0..aa56a68 100644
--- a/src/plugin/index-geoip/ivy.xml
+++ b/src/plugin/index-geoip/ivy.xml
@@ -36,10 +36,12 @@
   </publications>
 
   <dependencies>
-    <dependency org="com.maxmind.geoip2" name="geoip2" rev="2.3.1" >
+    <dependency org="com.maxmind.geoip2" name="geoip2" rev="2.10.0" >
       <!-- Exlude due to classpath issues -->
       <exclude org="org.apache.httpcomponents" name="httpclient" />
       <exclude org="org.apache.httpcomponents" name="httpcore" />
+      <exclude org="commons-codec" name="commons-codec" />
+      <exclude org="commons-logging" name="commons-logging" />
     </dependency>
   </dependencies>
   
diff --git a/src/plugin/index-geoip/plugin.xml b/src/plugin/index-geoip/plugin.xml
index 214fbd0..821ecc0 100644
--- a/src/plugin/index-geoip/plugin.xml
+++ b/src/plugin/index-geoip/plugin.xml
@@ -25,15 +25,11 @@
       <library name="index-geoip.jar">
          <export name="*"/>
       </library>
-      <library name="commons-codec-1.6.jar"/>
-      <library name="commons-logging-1.1.1.jar"/>
-      <library name="geoip2-2.3.1.jar"/>
-      <library name="google-http-client-1.20.0.jar"/>
-      <library name="jackson-annotations-2.5.0.jar"/>
-      <library name="jackson-core-2.5.3.jar"/>
-      <library name="jackson-databind-2.5.3.jar"/>
-      <library name="jsr305-1.3.9.jar"/>
-      <library name="maxmind-db-1.0.0.jar"/>
+      <library name="geoip2-2.10.0.jar"/>
+      <library name="jackson-annotations-2.9.0.jar"/>
+      <library name="jackson-core-2.9.2.jar"/>
+      <library name="jackson-databind-2.9.2.jar"/>
+      <library name="maxmind-db-1.2.2.jar"/>
    </runtime>
 
    <requires>


[nutch] 17/23: NUTCH-2478 HTML parser should resolve base URL - fix parse-html and parse-tika - add unit test for parse-html


commit 35193c2ddcbe8f24ea09eeabd9e90f7bc52097d5
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Tue Dec 12 23:35:19 2017 +0100

    NUTCH-2478 HTML parser should resolve base URL <base href=...>
    - fix parse-html and parse-tika
    - add unit test for parse-html
---
 .../apache/nutch/parse/html/DOMContentUtils.java   |  7 ++----
 .../org/apache/nutch/parse/html/HtmlParser.java    |  9 ++++++--
 .../apache/nutch/parse/html/TestHtmlParser.java    | 26 +++++++++++++++++++++-
 .../apache/nutch/parse/tika/DOMContentUtils.java   |  7 ++----
 .../org/apache/nutch/parse/tika/TikaParser.java    |  9 ++++++--
 5 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
index 4527dd7..1f1061d 100644
--- a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
+++ b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
@@ -254,7 +254,7 @@ public class DOMContentUtils {
   }
 
   /** If Node contains a BASE tag then it's HREF is returned. */
-  public URL getBase(Node node) {
+  public String getBase(Node node) {
 
     NodeWalker walker = new NodeWalker(node);
 
@@ -276,10 +276,7 @@ public class DOMContentUtils {
           for (int i = 0; i < attrs.getLength(); i++) {
             Node attr = attrs.item(i);
             if ("href".equalsIgnoreCase(attr.getNodeName())) {
-              try {
-                return new URL(attr.getNodeValue());
-              } catch (MalformedURLException e) {
-              }
+              return attr.getNodeValue();
             }
           }
         }
diff --git a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
index 7f60939..e940eb1 100644
--- a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
+++ b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
@@ -207,11 +207,16 @@ public class HtmlParser implements Parser {
 
     if (!metaTags.getNoFollow()) { // okay to follow links
       ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
-      URL baseTag = utils.getBase(root);
+      URL baseTag = null;
+      try {
+        baseTag = new URL(base, utils.getBase(root));
+      } catch (MalformedURLException e) {
+        baseTag = base;
+      }
       if (LOG.isTraceEnabled()) {
         LOG.trace("Getting links...");
       }
-      utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
+      utils.getOutlinks(baseTag, l, root);
       outlinks = l.toArray(new Outlink[l.size()]);
       if (LOG.isTraceEnabled()) {
         LOG.trace("found " + outlinks.length + " outlinks in "
diff --git a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
index 0b39206..8fe94e6 100644
--- a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
+++ b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
@@ -19,10 +19,12 @@ package org.apache.nutch.parse.html;
 
 import java.lang.invoke.MethodHandles;
 import java.nio.charset.Charset;
+import java.nio.charset.StandardCharsets;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.html.HtmlParser;
+import org.apache.nutch.parse.Outlink;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.Parser;
 import org.apache.nutch.protocol.Content;
@@ -78,17 +80,26 @@ public class TestHtmlParser {
       { "HTML5, utf-16, BOM", "utf-16",
           "\ufeff<!DOCTYPE html>\n<html>\n<head>\n" + encodingTestContent } };
 
+  private static final String resolveBaseUrlTestContent = //
+      "<html>\\n<head>\n" + //
+      "  <title>Test Resolve Base URLs (NUTCH-2478)</title>\n" + //
+      "  <base href=\"//www.example.com/\">\n" + //
+      "</head>\n<body>\n" + //
+      "  <a href=\"index.html\">outlink</a>\n" + //
+      "</body>\n</html>";
+
   private Configuration conf;
   private Parser parser;
 
   public TestHtmlParser() {
     conf = NutchConfiguration.create();
+    conf.set("plugin.includes", "parse-html");
     parser = new HtmlParser();
     parser.setConf(conf);
   }
 
   protected Parse parse(byte[] contentBytes) {
-    String dummyUrl = "http://dummy.url/";
+    String dummyUrl = "http://example.com/";
     return parser.getParse(
         new Content(dummyUrl, dummyUrl, contentBytes, "text/html",
             new Metadata(), conf)).get(dummyUrl);
@@ -120,4 +131,17 @@ public class TestHtmlParser {
     }
   }
 
+  @Test
+  public void testResolveBaseUrl() {
+    byte[] contentBytes = resolveBaseUrlTestContent
+        .getBytes(StandardCharsets.UTF_8);
+    // parse using http://example.com/ as "fetch" URL
+    Parse parse = parse(contentBytes);
+    LOG.info(parse.getData().toString());
+    Outlink[] outlinks = parse.getData().getOutlinks();
+    Assert.assertEquals(1, outlinks.length);
+    Assert.assertEquals("http://www.example.com/index.html",
+        outlinks[0].getToUrl());
+  }
+
 }
diff --git a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
index af85480..d409589 100644
--- a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
+++ b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
@@ -259,7 +259,7 @@ public class DOMContentUtils {
   }
 
   /** If Node contains a BASE tag then it's HREF is returned. */
-  URL getBase(Node node) {
+  public String getBase(Node node) {
 
     NodeWalker walker = new NodeWalker(node);
 
@@ -281,10 +281,7 @@ public class DOMContentUtils {
           for (int i = 0; i < attrs.getLength(); i++) {
             Node attr = attrs.item(i);
             if ("href".equalsIgnoreCase(attr.getNodeName())) {
-              try {
-                return new URL(attr.getNodeValue());
-              } catch (MalformedURLException e) {
-              }
+              return attr.getNodeValue();
             }
           }
         }
diff --git a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
index 73cd083..1173504 100644
--- a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
+++ b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
@@ -170,7 +170,12 @@ public class TikaParser implements org.apache.nutch.parse.Parser {
 
     if (!metaTags.getNoFollow()) { // okay to follow links
       ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
-      URL baseTag = utils.getBase(root);
+      URL baseTag = null;
+      try {
+        baseTag = new URL(base, utils.getBase(root));
+      } catch (MalformedURLException e) {
+        baseTag = base;
+      }
       if (LOG.isTraceEnabled()) {
         LOG.trace("Getting links...");
       }
@@ -179,7 +184,7 @@ public class TikaParser implements org.apache.nutch.parse.Parser {
       //utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
       // Get outlinks from Tika
       List<Link> tikaExtractedOutlinks = linkContentHandler.getLinks();
-      utils.getOutlinks(baseTag != null ? baseTag : base, l, tikaExtractedOutlinks);
+      utils.getOutlinks(baseTag, l, root);
       outlinks = l.toArray(new Outlink[l.size()]);
       if (LOG.isTraceEnabled()) {
         LOG.trace("found " + outlinks.length + " outlinks in "
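
Both parsers now resolve the <base href=...> value against the fetch URL via java.net.URL's two-argument constructor, which handles relative and protocol-relative hrefs, and fall back to the fetch URL when the constructor throws MalformedURLException (e.g. when getBase() returns null). A minimal sketch of that resolution, with illustrative URLs:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Demonstrates the resolution the patch relies on: the <base href>
// value is resolved against the fetch URL, so relative and
// protocol-relative hrefs work; a null or invalid href falls back
// to the fetch URL, matching HtmlParser/TikaParser.
public class BaseUrlResolution {

  static URL resolveBase(URL fetchUrl, String baseHref) {
    try {
      return new URL(fetchUrl, baseHref);
    } catch (MalformedURLException e) {
      return fetchUrl; // same fallback as in the patch
    }
  }

  public static void main(String[] args) throws Exception {
    URL fetch = new URL("http://example.com/dir/page.html");
    // protocol-relative, as in the NUTCH-2478 unit test
    System.out.println(resolveBase(fetch, "//www.example.com/"));
    // plain relative href
    System.out.println(resolveBase(fetch, "../other/"));
    // null href (no <base> tag) falls back to the fetch URL
    System.out.println(resolveBase(fetch, null));
  }
}
```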


[nutch] 10/23: make fully configurable


commit e0326de05197f8415eeb750d4d8fff764db87aa9
Author: Nicola Marcacci Rossi <ni...@gmail.com>
AuthorDate: Fri Dec 15 14:18:57 2017 +0100

    make fully configurable
---
 conf/nutch-default.xml                             | 20 ++++++++++++++--
 .../elasticrest/ElasticRestConstants.java          |  2 ++
 .../elasticrest/ElasticRestIndexWriter.java        | 28 ++++++++++++++++++----
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index bcb2e9e..1d9837f 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2122,12 +2122,28 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
         A list of strings denoting the supported languages (e.g. `en,de,fr,it`).
         If this value is empty all documents will be sent to index ${elastic.rest.index}.
         If not empty the Rest client will distribute documents in different indices based on their `lang` property.
-        Indices are named with the following schema: ${elastic.rest.index}_${lang} (e.g. `nutch_de`).
-        Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}_others (e.g. `nutch_others`).
+        Indices are named with the following schema: ${elastic.rest.index}${elastic.rest.separator}${lang} (e.g. `nutch_de`).
+        Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}${elastic.rest.separator}${elastic.rest.sink} (e.g. `nutch_others`).
     </description>
 </property>
 
 <property>
+    <name>elastic.rest.separator</name>
+    <value>_</value>
+    <description>
+        Default value is `_`. Is used only if `elastic.rest.languages` is defined to build the index name (i.e. ${elastic.rest.index}${elastic.rest.separator}${lang}). 
+    </description>
+</property>
+
+<property>
+	<name>elastic.rest.sink</name>
+	<value>others</value>
+	<description>
+		Default value is `others`. Is used only if `elastic.rest.languages` is defined to build the index name where to store documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.separator}${elastic.rest.sink}).
+	</description>
+</property>
+
+<property>
     <name>elastic.rest.type</name>
     <value>doc</value>
     <description>Default type to send documents to.</description>
diff --git a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
index 74f37eb..c0f5fe7 100644
--- a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
+++ b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
@@ -32,4 +32,6 @@ public interface ElasticRestConstants {
   public static final String HOSTNAME_TRUST = ELASTIC_PREFIX + "trustallhostnames";
   
   public static final String LANGUAGES = ELASTIC_PREFIX + "languages";
+  public static final String SEPARATOR = ELASTIC_PREFIX + "separator";
+  public static final String SINK = ELASTIC_PREFIX + "sink";
 }
diff --git a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
index 56cfab1..5e71b3c 100644
--- a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
+++ b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
@@ -67,7 +67,9 @@ public class ElasticRestIndexWriter implements IndexWriter {
       .getLogger(ElasticRestIndexWriter.class);
 
   private static final int DEFAULT_MAX_BULK_DOCS = 250;
-  private static final int DEFAULT_MAX_BULK_LENGTH = 2500500;
+  private static final int DEFAULT_MAX_BULK_LENGTH = 2500500;  
+  private static final String DEFAULT_SEPARATOR = "_";
+  private static final String DEFAULT_SINK = "others";
 
   private JestClient client;
   private String defaultIndex;
@@ -93,6 +95,8 @@ public class ElasticRestIndexWriter implements IndexWriter {
   private BasicFuture<JestResult> basicFuture = null;
   
   private String[] languages = null;
+  private String separator = null;
+  private String sink = null;
 
   @Override
   public void open(JobConf job, String name) throws IOException {
@@ -104,6 +108,8 @@ public class ElasticRestIndexWriter implements IndexWriter {
     https = job.getBoolean(ElasticRestConstants.HTTPS, false);
     trustAllHostnames = job.getBoolean(ElasticRestConstants.HOSTNAME_TRUST, false);
     languages = job.getStrings(ElasticRestConstants.LANGUAGES);
+    separator = job.get(ElasticRestConstants.SEPARATOR, DEFAULT_SEPARATOR);
+    sink = job.get(ElasticRestConstants.SINK, DEFAULT_SINK);
 
     // trust ALL certificates
     SSLContext sslContext = null;
@@ -205,9 +211,9 @@ public class ElasticRestIndexWriter implements IndexWriter {
         }
       }
       if (exists) {
-        index = defaultIndex + "_" + language;
+        index = getLanguageIndexName(language);
       } else {
-        index = defaultIndex + "_others";
+        index = getSinkIndexName();
       }
     } else {
       index = defaultIndex;
@@ -237,9 +243,9 @@ public class ElasticRestIndexWriter implements IndexWriter {
       if (languages != null && languages.length > 0) {
         Bulk.Builder bulkBuilder = new Bulk.Builder().defaultType(defaultType);
         for (String lang : languages) {          
-          bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_" + lang).type(defaultType).build());
+          bulkBuilder.addAction(new Delete.Builder(key).index(getLanguageIndexName(lang)).type(defaultType).build());
         }
-        bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_others").type(defaultType).build());
+        bulkBuilder.addAction(new Delete.Builder(key).index(getSinkIndexName()).type(defaultType).build());
         client.execute(bulkBuilder.build());
       } else {
         client.execute(new Delete.Builder(key).index(defaultIndex)
@@ -359,4 +365,16 @@ public class ElasticRestIndexWriter implements IndexWriter {
   public Configuration getConf() {
     return config;
   }
+
+  private String getLanguageIndexName(String lang) {
+    return getComposedIndexName(defaultIndex, lang);
+  }
+  
+  private String getSinkIndexName() {
+    return getComposedIndexName(defaultIndex, sink);
+  }
+  
+  private String getComposedIndexName(String prefix, String postfix) {
+    return prefix + separator + postfix;
+  }
 }
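
The refactor above replaces the hard-coded "_" and "others" with the configurable elastic.rest.separator and elastic.rest.sink properties; an index name is composed as ${elastic.rest.index}${elastic.rest.separator}${lang}. The composition itself is plain concatenation, sketched here with the defaults (class and method names mirror the diff but the standalone form is illustrative):

```java
// Sketch of the index-name composition introduced above, using the
// default separator "_" and default sink index "others".
public class IndexNameSketch {
  static final String DEFAULT_SEPARATOR = "_";
  static final String DEFAULT_SINK = "others";

  static String getComposedIndexName(String prefix, String separator,
      String postfix) {
    return prefix + separator + postfix;
  }

  public static void main(String[] args) {
    // supported language -> per-language index, e.g. nutch_de
    System.out.println(
        getComposedIndexName("nutch", DEFAULT_SEPARATOR, "de"));
    // unsupported language -> sink index, e.g. nutch_others
    System.out.println(
        getComposedIndexName("nutch", DEFAULT_SEPARATOR, DEFAULT_SINK));
  }
}
```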


[nutch] 20/23: Improve command-line help for URL filter and normalizer checker


commit 9fb5777b209f92ed6e3341980c2abe405b6b0fea
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Sun Dec 17 14:12:14 2017 +0100

    Improve command-line help for URL filter and normalizer checker
---
 src/java/org/apache/nutch/net/URLFilterChecker.java     |  9 ++++++---
 src/java/org/apache/nutch/net/URLNormalizerChecker.java | 10 +++++++---
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/java/org/apache/nutch/net/URLFilterChecker.java b/src/java/org/apache/nutch/net/URLFilterChecker.java
index 429aa9f..ceca301 100644
--- a/src/java/org/apache/nutch/net/URLFilterChecker.java
+++ b/src/java/org/apache/nutch/net/URLFilterChecker.java
@@ -39,8 +39,11 @@ public class URLFilterChecker extends AbstractChecker {
   private URLFilters filters = null;
 
   public int run(String[] args) throws Exception {
-    usage = "Usage: URLFilterChecker [-filterName filterName] (-stdin | -listen <port> [-keepClientCnxOpen]) \n"
-        + "\n\tTool takes a list of URLs, one per line.\n";
+    usage = "Usage: URLFilterChecker [-Dproperty=value]... [-filterName filterName] (-stdin | -listen <port> [-keepClientCnxOpen]) \n"
+        + "\n  -filterName\tURL filter plugin name (eg. urlfilter-regex) to check,"
+        + "\n             \t(if not given all configured URL filters are applied)"
+        + "\n  -stdin     \ttool reads a list of URLs from stdin, one URL per line"
+        + "\n  -listen <port>\trun tool as Telnet server listening on <port>\n";
 
     // Print help when no args given
     if (args.length < 1) {
@@ -55,7 +58,7 @@ public class URLFilterChecker extends AbstractChecker {
       } else if ((numConsumed = super.parseArgs(args, i)) > 0) {
         i += numConsumed - 1;
       } else {
-        System.err.println("ERR: Not a recognized argument: " + args[i]);
+        System.err.println("ERROR: Not a recognized argument: " + args[i]);
         System.err.println(usage);
         System.exit(-1);
       }
diff --git a/src/java/org/apache/nutch/net/URLNormalizerChecker.java b/src/java/org/apache/nutch/net/URLNormalizerChecker.java
index a435cc8..64fae58 100644
--- a/src/java/org/apache/nutch/net/URLNormalizerChecker.java
+++ b/src/java/org/apache/nutch/net/URLNormalizerChecker.java
@@ -38,8 +38,12 @@ public class URLNormalizerChecker extends AbstractChecker {
   URLNormalizers normalizers;
 
   public int run(String[] args) throws Exception {
-    usage = "Usage: URLNormalizerChecker [-normalizer <normalizerName>] [-scope <scope>] (-stdin | -listen <port> [-keepClientCnxOpen])"
-        + "\n\tscope can be one of: default,partition,generate_host_count,fetcher,crawldb,linkdb,inject,outlink\n";
+    usage = "Usage: URLNormalizerChecker [-Dproperty=value]... [-normalizer <normalizerName>] [-scope <scope>] (-stdin | -listen <port> [-keepClientCnxOpen])\n"
+        + "\n  -normalizer\tURL normalizer plugin (eg. urlnormalizer-basic) to check,"
+        + "\n             \t(if not given all configured URL normalizers are applied)"
+        + "\n  -scope     \tone of: default,partition,generate_host_count,fetcher,crawldb,linkdb,inject,outlink"
+        + "\n  -stdin     \ttool reads a list of URLs from stdin, one URL per line"
+        + "\n  -listen <port>\trun tool as Telnet server listening on <port>\n";
 
     // Print help when no args given
     if (args.length < 1) {
@@ -56,7 +60,7 @@ public class URLNormalizerChecker extends AbstractChecker {
       } else if ((numConsumed = super.parseArgs(args, i)) > 0) {
         i += numConsumed - 1;
       } else {
-        System.err.println("ERR: Not a recognized argument: " + args[i]);
+        System.err.println("ERROR: Not a recognized argument: " + args[i]);
         System.err.println(usage);
         System.exit(-1);
       }


[nutch] 01/23: fix for NUTCH-2370 contributed by msharan@usc.edu


commit 34236ffecf478a1776559b0ed8c1ad929483d752
Author: Madhav Sharan <go...@gmail.com>
AuthorDate: Wed Mar 29 18:07:07 2017 -0400

    fix for NUTCH-2370 contributed by msharan@usc.edu
---
 src/java/org/apache/nutch/tools/FileDumper.java | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/java/org/apache/nutch/tools/FileDumper.java b/src/java/org/apache/nutch/tools/FileDumper.java
index 53e6be4..51cc124 100644
--- a/src/java/org/apache/nutch/tools/FileDumper.java
+++ b/src/java/org/apache/nutch/tools/FileDumper.java
@@ -57,6 +57,7 @@ import org.apache.tika.Tika;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
+import org.codehaus.jackson.map.ObjectMapper;
 /**
  * The file dumper tool enables one to reverse generate the raw content from
  * Nutch segment data directories.
@@ -154,6 +155,7 @@ public class FileDumper {
     for (File segment : segmentDirs) {
       LOG.info("Processing segment: [" + segment.getAbsolutePath() + "]");
       DataOutputStream doutputStream = null;
+      Map<String, String> filenameToUrl = new HashMap<String, String>();
 
       File segmentDir = new File(segment.getAbsolutePath(), Content.DIR_NAME);
       File[] partDirs = segmentDir.listFiles(file -> file.canRead() && file.isDirectory());
@@ -242,7 +244,7 @@ public class FileDumper {
                   } else {
                     outputFullPath = String.format("%s/%s", fullDir, DumpFileUtil.createFileName(md5Ofurl, baseName, extension));
                   }
-
+                  filenameToUrl.put(outputFullPath, url);
                   File outputFile = new File(outputFullPath);
 
                   if (!outputFile.exists()) {
@@ -284,6 +286,10 @@ public class FileDumper {
           }
         }
       }
+      //save filenameToUrl in a json file for each segment there is one mapping file 
+      String filenameToUrlFilePath = String.format("%s/%s_filenameToUrl.json", outputDir.getAbsolutePath(), segment.getName() );
+      new ObjectMapper().writeValue(new File(filenameToUrlFilePath), filenameToUrl);
+      
     }
     LOG.info("Dumper File Stats: "
         + DumpFileUtil.displayFileTypes(typeCounts, filteredCounts));
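The patch serializes the per-segment filename-to-URL map with Jackson's `ObjectMapper`. A dependency-free sketch of the same idea, producing the one-mapping-file-per-segment JSON by hand (class name is illustrative, not part of the patch):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the JSON mapping file written per segment: each key is the
// dumped file's path, each value the URL it was fetched from. The real
// code delegates this serialization to Jackson's ObjectMapper.
public class FilenameToUrlJson {
    static String toJson(Map<String, String> filenameToUrl) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : filenameToUrl.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append("}").toString();
    }

    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```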


[nutch] 13/23: scope variables


commit 67dc52cc0f76c51e6fd36dc88f87994c0b2386d8
Author: Nicola Marcacci Rossi <ni...@gmail.com>
AuthorDate: Fri Dec 15 15:05:38 2017 +0100

    scope variables
---
 conf/nutch-default.xml                                     | 14 +++++++-------
 .../indexwriter/elasticrest/ElasticRestConstants.java      |  6 +++---
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index f35e787..8ed0f86 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2116,30 +2116,30 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
 </property>
 
 <property>
-    <name>elastic.rest.languages</name>
+    <name>elastic.rest.index.languages</name>
     <value></value>
     <description>
         A list of strings denoting the supported languages (e.g. `en,de,fr,it`).
         If this value is empty all documents will be sent to index ${elastic.rest.index}.
         If not empty the Rest client will distribute documents in different indices based on their `lang` property.
-        Indices are named with the following schema: ${elastic.rest.index}${elastic.rest.separator}${lang} (e.g. `nutch_de`).
-        Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}${elastic.rest.separator}${elastic.rest.sink} (e.g. `nutch_others`).
+        Indices are named with the following schema: ${elastic.rest.index}${elastic.rest.index.separator}${lang} (e.g. `nutch_de`).
+        Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink} (e.g. `nutch_others`).
     </description>
 </property>
 
 <property>
-    <name>elastic.rest.separator</name>
+    <name>elastic.rest.index.separator</name>
     <value>_</value>
     <description>
-        Default value is `_`. Is used only if `elastic.rest.languages` is defined to build the index name (i.e. ${elastic.rest.index}${elastic.rest.separator}${lang}). 
+        Default value is `_`. Is used only if `elastic.rest.index.languages` is defined to build the index name (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${lang}). 
     </description>
 </property>
 
 <property>
-    <name>elastic.rest.sink</name>
+    <name>elastic.rest.index.sink</name>
     <value>others</value>
     <description>
-        Default value is `others`. Is used only if `elastic.rest.languages` is defined to build the index name where to store documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.separator}${elastic.rest.sink}).
+        Default value is `others`. Is used only if `elastic.rest.index.languages` is defined to build the index name where to store documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink}).
     </description>
 </property>
 
diff --git a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
index c0f5fe7..b36f027 100644
--- a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
+++ b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
@@ -31,7 +31,7 @@ public interface ElasticRestConstants {
   public static final String HTTPS = ELASTIC_PREFIX + "https";
   public static final String HOSTNAME_TRUST = ELASTIC_PREFIX + "trustallhostnames";
   
-  public static final String LANGUAGES = ELASTIC_PREFIX + "languages";
-  public static final String SEPARATOR = ELASTIC_PREFIX + "separator";
-  public static final String SINK = ELASTIC_PREFIX + "sink";
+  public static final String LANGUAGES = INDEX + ".languages";
+  public static final String SEPARATOR = INDEX + ".separator";
+  public static final String SINK = INDEX + ".sink";
 }
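With the renamed keys, a deployment that wants language routing could override them in nutch-site.xml along these lines (values are illustrative):

```xml
<property>
  <name>elastic.rest.index.languages</name>
  <value>en,de,fr,it</value>
</property>
<property>
  <name>elastic.rest.index.separator</name>
  <value>_</value>
</property>
<property>
  <name>elastic.rest.index.sink</name>
  <value>others</value>
</property>
```

With these values a German page would be indexed into `nutch_de` and a page in any other language into `nutch_others`.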


[nutch] 07/23: fix delete


commit 9fcc2a494acfb2b367078dcec84f264e57413315
Author: Nicola Marcacci Rossi <ni...@gmail.com>
AuthorDate: Fri Dec 15 12:05:44 2017 +0100

    fix delete
---
 .../apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
index 34ab661..56cfab1 100644
--- a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
+++ b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
@@ -237,9 +237,9 @@ public class ElasticRestIndexWriter implements IndexWriter {
       if (languages != null && languages.length > 0) {
         Bulk.Builder bulkBuilder = new Bulk.Builder().defaultType(defaultType);
         for (String lang : languages) {          
-          bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_" + lang).build());
+          bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_" + lang).type(defaultType).build());
         }
-        bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_others").build());
+        bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_others").type(defaultType).build());
         client.execute(bulkBuilder.build());
       } else {
         client.execute(new Delete.Builder(key).index(defaultIndex)


[nutch] 09/23: Add tika-config.xml to suppress Tika warnings on stderr


commit 2be2052bc74a28cda9cbe6abf9cf5d2b06afcbc3
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 15 14:13:53 2017 +0100

    Add tika-config.xml to suppress Tika warnings on stderr
---
 conf/nutch-default.xml        |  6 ++++++
 conf/tika-config.xml.template | 20 ++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index a53cf7b..bcb2e9e 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1407,6 +1407,12 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
 -->
 
 <property>
+ <name>tika.config.file</name>
+ <value>tika-config.xml</value>
+ <description>Nutch-specific Tika config file</description>
+</property>
+
+<property>
   <name>tika.uppercase.element.names</name>
   <value>true</value>
   <description>Determines whether TikaParser should uppercase the element name while generating the DOM
diff --git a/conf/tika-config.xml.template b/conf/tika-config.xml.template
new file mode 100644
index 0000000..30af37d
--- /dev/null
+++ b/conf/tika-config.xml.template
@@ -0,0 +1,20 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+<properties>
+    <service-loader initializableProblemHandler="ignore"/>
+</properties>


[nutch] 03/23: - filter out NaN scores which break the quantile calculation


commit 26669eb1f3f75e466eae732e79a4e6e85ea57073
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Mon Dec 11 10:35:46 2017 +0100

    - filter out NaN scores which break the quantile calculation
---
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 27 ++++++++++++++--------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index 117aa7f..af30664 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -203,11 +203,15 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
       output.collect(new Text("retry " + value.getRetriesSinceFetch()),
           COUNT_1);
 
-      NutchWritable score = new NutchWritable(
-          new FloatWritable(value.getScore()));
-      output.collect(new Text("sc"), score);
-      output.collect(new Text("sct"), score);
-      output.collect(new Text("scd"), score);
+      if (Float.isNaN(value.getScore())) {
+        output.collect(new Text("scNaN"), COUNT_1);
+      } else {
+        NutchWritable score = new NutchWritable(
+            new FloatWritable(value.getScore()));
+        output.collect(new Text("sc"), score);
+        output.collect(new Text("sct"), score);
+        output.collect(new Text("scd"), score);
+      }
 
       // fetch time (in minutes to prevent from overflows when summing up)
       NutchWritable fetchTime = new NutchWritable(
@@ -287,7 +291,7 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
           cnt += value;
         }
         output.collect(key, new NutchWritable(new FloatWritable(cnt)));
-      } else if (k.equals("scd") || k.equals("ftd") || k.equals("fid")) {
+      } else if (k.equals("scd")) {
         MergingDigest tdigest = null;
         while (values.hasNext()) {
           Writable value = values.next().get();
@@ -301,10 +305,13 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
               tdigest.add(tdig);
             }
           } else if (value instanceof FloatWritable) {
-            if (tdigest == null) {
-              tdigest = (MergingDigest) TDigest.createMergingDigest(100.0);
+            float val = ((FloatWritable) value).get();
+            if (!Float.isNaN(val)) {
+              if (tdigest == null) {
+                tdigest = (MergingDigest) TDigest.createMergingDigest(100.0);
+              }
+              tdigest.add(val);
             }
-            tdigest.add(((FloatWritable) value).get());
           }
         }
         ByteBuffer tdigestBytes = ByteBuffer.allocate(tdigest.smallByteSize());
@@ -521,6 +528,8 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
           LOG.info("max score:\t" + fvalue);
         } else if (k.equals("sct")) {
           LOG.info("avg score:\t" + (fvalue / totalCnt.get()));
+        } else if (k.equals("scNaN")) {
+          LOG.info("score == NaN:\t" + value);
         } else if (k.equals("ftn")) {
           LOG.info("earliest fetch time:\t" + new Date(1000 * 60 * value));
         } else if (k.equals("ftx")) {
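The reason NaN scores are counted separately (the `scNaN` key) and excluded is that a single NaN corrupts quantile estimation: NaN has no defined ordering, so it breaks both sorting and digest statistics. A dependency-free sketch of the same guard, using a plain sorted list in place of the patch's `MergingDigest` (class name is illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch of the NaN guard: NaN scores are counted on their own (like
// the "scNaN" counter) and filtered out before quantile estimation.
public class NaNSafeQuantile {
    static int nanCount(List<Float> scores) {
        int n = 0;
        for (Float s : scores)
            if (Float.isNaN(s)) n++;
        return n;
    }

    static double quantile(List<Float> scores, double q) {
        List<Float> clean = new ArrayList<>();
        for (Float s : scores)
            if (!Float.isNaN(s)) clean.add(s);   // drop NaN before sorting
        Collections.sort(clean);
        int idx = (int) Math.floor(q * (clean.size() - 1));
        return clean.get(idx);
    }
}
```

For scores `[1.0, NaN, 3.0, 2.0]` this reports one NaN and a median of 2.0 over the three valid values, instead of an undefined result.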


[nutch] 18/23: NUTCH-2478 HTML parser should resolve base URL <base href=...>


commit 8f692d13d45642f8b447d47af796f06487afeec2
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 15 21:35:27 2017 +0100

    NUTCH-2478 HTML parser should resolve base URL <base href=...>
    - finally fix parse-tika:
      - href attribute of base element dropped in DOM
      - need to call tikamd.get("Content-Location")
    - port HTML parser test from parse-html to parse-tika
    - add method to DomUtil which prints DocumentFragment
---
 src/java/org/apache/nutch/util/DomUtil.java            |  9 +++++++++
 .../java/org/apache/nutch/parse/html/HtmlParser.java   | 13 ++++++++-----
 .../org/apache/nutch/parse/html/TestHtmlParser.java    |  2 +-
 .../java/org/apache/nutch/parse/tika/TikaParser.java   | 18 +++++++++++-------
 .../test/org/apache/nutch/tika}/TestHtmlParser.java    | 10 +++++-----
 5 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/src/java/org/apache/nutch/util/DomUtil.java b/src/java/org/apache/nutch/util/DomUtil.java
index e93477a..b4f0eac 100644
--- a/src/java/org/apache/nutch/util/DomUtil.java
+++ b/src/java/org/apache/nutch/util/DomUtil.java
@@ -31,7 +31,9 @@ import javax.xml.transform.dom.DOMSource;
 import javax.xml.transform.stream.StreamResult;
 
 import org.apache.xerces.parsers.DOMParser;
+import org.w3c.dom.DocumentFragment;
 import org.w3c.dom.Element;
+import org.w3c.dom.NodeList;
 import org.xml.sax.InputSource;
 import org.xml.sax.SAXException;
 
@@ -103,4 +105,11 @@ public class DomUtil {
       LOG.error("Error: ", ex);
     }
   }
+
+  public static void saveDom(OutputStream os, DocumentFragment doc) {
+    NodeList docChildren = doc.getChildNodes();
+    for (int i = 0; i < docChildren.getLength(); i++) {
+      saveDom(os, (Element) docChildren.item(i));
+    }
+  }
 }
diff --git a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
index e940eb1..9ed9fa4 100644
--- a/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
+++ b/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
@@ -207,11 +207,14 @@ public class HtmlParser implements Parser {
 
     if (!metaTags.getNoFollow()) { // okay to follow links
       ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
-      URL baseTag = null;
-      try {
-        baseTag = new URL(base, utils.getBase(root));
-      } catch (MalformedURLException e) {
-        baseTag = base;
+      URL baseTag = base;
+      String baseTagHref = utils.getBase(root);
+      if (baseTagHref != null) {
+        try {
+          baseTag = new URL(base, baseTagHref);
+        } catch (MalformedURLException e) {
+          baseTag = base;
+        }
       }
       if (LOG.isTraceEnabled()) {
         LOG.trace("Getting links...");
diff --git a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
index 8fe94e6..a4c8206 100644
--- a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
+++ b/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
@@ -81,7 +81,7 @@ public class TestHtmlParser {
           "\ufeff<!DOCTYPE html>\n<html>\n<head>\n" + encodingTestContent } };
 
   private static final String resolveBaseUrlTestContent = //
-      "<html>\\n<head>\n" + //
+      "<html>\n<head>\n" + //
       "  <title>Test Resolve Base URLs (NUTCH-2478)</title>\n" + //
       "  <base href=\"//www.example.com/\">\n" + //
       "</head>\n<body>\n" + //
diff --git a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
index 1173504..ea864be 100644
--- a/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
+++ b/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
@@ -52,6 +52,7 @@ import org.apache.tika.sax.TeeContentHandler;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 import org.w3c.dom.DocumentFragment;
+import org.w3c.dom.Element;
 import org.xml.sax.ContentHandler;
 
 /**
@@ -170,21 +171,24 @@ public class TikaParser implements org.apache.nutch.parse.Parser {
 
     if (!metaTags.getNoFollow()) { // okay to follow links
       ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
-      URL baseTag = null;
-      try {
-        baseTag = new URL(base, utils.getBase(root));
-      } catch (MalformedURLException e) {
-        baseTag = base;
+      URL baseTag = base;
+      String baseTagHref = tikamd.get("Content-Location");
+      if (baseTagHref != null) {
+        try {
+          baseTag = new URL(base, baseTagHref);
+        } catch (MalformedURLException e) {
+          LOG.trace("Invalid <base href=\"{}\">", baseTagHref);
+        }
       }
       if (LOG.isTraceEnabled()) {
-        LOG.trace("Getting links...");
+        LOG.trace("Getting links (base URL = {}) ...", baseTag);
       }
       
       // pre-1233 outlink extraction
       //utils.getOutlinks(baseTag != null ? baseTag : base, l, root);
       // Get outlinks from Tika
       List<Link> tikaExtractedOutlinks = linkContentHandler.getLinks();
-      utils.getOutlinks(baseTag, l, root);
+      utils.getOutlinks(baseTag, l, tikaExtractedOutlinks);
       outlinks = l.toArray(new Outlink[l.size()]);
       if (LOG.isTraceEnabled()) {
         LOG.trace("found " + outlinks.length + " outlinks in "
diff --git a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java b/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestHtmlParser.java
similarity index 96%
copy from src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
copy to src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestHtmlParser.java
index 8fe94e6..d2bc816 100644
--- a/src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
+++ b/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestHtmlParser.java
@@ -15,7 +15,7 @@
  * limitations under the License.
  */
 
-package org.apache.nutch.parse.html;
+package org.apache.nutch.tika;
 
 import java.lang.invoke.MethodHandles;
 import java.nio.charset.Charset;
@@ -23,7 +23,7 @@ import java.nio.charset.StandardCharsets;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.nutch.metadata.Metadata;
-import org.apache.nutch.parse.html.HtmlParser;
+import org.apache.nutch.parse.tika.TikaParser;
 import org.apache.nutch.parse.Outlink;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.Parser;
@@ -81,7 +81,7 @@ public class TestHtmlParser {
           "\ufeff<!DOCTYPE html>\n<html>\n<head>\n" + encodingTestContent } };
 
   private static final String resolveBaseUrlTestContent = //
-      "<html>\\n<head>\n" + //
+      "<html>\n<head>\n" + //
       "  <title>Test Resolve Base URLs (NUTCH-2478)</title>\n" + //
       "  <base href=\"//www.example.com/\">\n" + //
       "</head>\n<body>\n" + //
@@ -93,8 +93,8 @@ public class TestHtmlParser {
 
   public TestHtmlParser() {
     conf = NutchConfiguration.create();
-    conf.set("plugin.includes", "parse-html");
-    parser = new HtmlParser();
+    conf.set("plugin.includes", "parse-tika");
+    parser = new TikaParser();
     parser.setConf(conf);
   }
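The HtmlParser and TikaParser changes share the same fallback pattern: resolve the `<base href=...>` value relative to the page URL, and keep the page URL when the href is absent or malformed. A standalone sketch (class name hypothetical) using the same `java.net.URL(URL, String)` resolution as the patch:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch of the base-URL fallback both parsers now use.
public class BaseHrefResolution {
    static URL resolveBase(URL pageUrl, String baseHref) {
        if (baseHref == null) {
            return pageUrl;            // no <base href=...> present
        }
        try {
            return new URL(pageUrl, baseHref);
        } catch (MalformedURLException e) {
            return pageUrl;            // invalid href: fall back to page URL
        }
    }
}
```

For the protocol-relative base from the test content, `resolveBase(new URL("http://example.org/dir/page.html"), "//www.example.com/")` resolves to `http://www.example.com/`, keeping the page's protocol but replacing the host.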
 


[nutch] 19/23: fix for NUTCH-2477 (refactor checker classes) contributed by Jurian Broertjes


commit 4da6b19e3b149687c624a996fc065207561217ed
Author: Jurian Broertjes <ju...@openindex.io>
AuthorDate: Tue Dec 12 15:52:59 2017 +0000

    fix for NUTCH-2477 (refactor checker classes) contributed by Jurian Broertjes
---
 .../nutch/indexer/IndexingFiltersChecker.java      | 125 ++-------------
 .../org/apache/nutch/net/URLFilterChecker.java     | 126 +++++----------
 src/java/org/apache/nutch/net/URLFilters.java      |   4 +
 .../org/apache/nutch/net/URLNormalizerChecker.java | 102 +++++-------
 .../org/apache/nutch/util/AbstractChecker.java     | 171 +++++++++++++++++++++
 5 files changed, 269 insertions(+), 259 deletions(-)

diff --git a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
index 05caf5a..5491638 100644
--- a/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
+++ b/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
@@ -17,23 +17,14 @@
 
 package org.apache.nutch.indexer;
 
-import java.io.BufferedReader;
-import java.io.InputStreamReader;
-import java.io.PrintWriter;
 import java.lang.invoke.MethodHandles;
-import java.net.ServerSocket;
-import java.net.Socket;
-import java.net.InetSocketAddress;
-import java.nio.charset.Charset;
 import java.util.HashMap;
 import java.util.Iterator;
 import java.util.List;
 import java.util.Map;
 
-import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;
 import org.apache.nutch.crawl.CrawlDatum;
 import org.apache.nutch.crawl.Inlinks;
@@ -52,6 +43,7 @@ import org.apache.nutch.protocol.ProtocolOutput;
 import org.apache.nutch.scoring.ScoringFilters;
 import org.apache.nutch.util.NutchConfiguration;
 import org.apache.nutch.util.StringUtil;
+import org.apache.nutch.util.AbstractChecker;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -65,41 +57,33 @@ import org.slf4j.LoggerFactory;
  * @author Julien Nioche
  **/
 
-public class IndexingFiltersChecker extends Configured implements Tool {
+public class IndexingFiltersChecker extends AbstractChecker {
 
   protected URLNormalizers normalizers = null;
   protected boolean dumpText = false;
   protected boolean followRedirects = false;
-  protected boolean keepClientCnxOpen = false;
   // used to simulate the metadata propagated from injection
   protected HashMap<String, String> metadata = new HashMap<>();
-  protected int tcpPort = -1;
 
   private static final Logger LOG = LoggerFactory
       .getLogger(MethodHandles.lookup().lookupClass());
 
-  public IndexingFiltersChecker() {
-
-  }
-
   public int run(String[] args) throws Exception {
     String url = null;
-    String usage = "Usage: IndexingFiltersChecker [-normalize] [-followRedirects] [-dumpText] [-md key=value] [-listen <port>] [-keepClientCnxOpen]";
+    usage = "Usage: IndexingFiltersChecker [-normalize] [-followRedirects] [-dumpText] [-md key=value] (-stdin | -listen <port> [-keepClientCnxOpen])";
 
-    if (args.length == 0) {
+    // Print help when no args given
+    if (args.length < 1) {
       System.err.println(usage);
-      return -1;
+      System.exit(-1);
     }
 
+    int numConsumed;
     for (int i = 0; i < args.length; i++) {
       if (args[i].equals("-normalize")) {
         normalizers = new URLNormalizers(getConf(), URLNormalizers.SCOPE_DEFAULT);
-      } else if (args[i].equals("-listen")) {
-        tcpPort = Integer.parseInt(args[++i]);
       } else if (args[i].equals("-followRedirects")) {
         followRedirects = true;
-      } else if (args[i].equals("-keepClientCnxOpen")) {
-        keepClientCnxOpen = true;
       } else if (args[i].equals("-dumpText")) {
         dumpText = true;
       } else if (args[i].equals("-md")) {
@@ -112,104 +96,27 @@ public class IndexingFiltersChecker extends Configured implements Tool {
         } else
           k = nextOne;
         metadata.put(k, v);
+      } else if ((numConsumed = super.parseArgs(args, i)) > 0) {
+        i += numConsumed - 1;
       } else if (i != args.length - 1) {
+        System.err.println("ERR: Not a recognized argument: " + args[i]);
         System.err.println(usage);
         System.exit(-1);
       } else {
-        url =args[i];
+        url = args[i];
       }
     }
     
-    // In listening mode?
-    if (tcpPort == -1) {
-      // No, just fetch and display
-      StringBuilder output = new StringBuilder();
-      int ret = fetch(url, output);
-      System.out.println(output);
-      return ret;
+    if (url != null) {
+      return super.processSingle(url);
     } else {
-      // Listen on socket and start workers on incoming requests
-      listen();
-    }
-    
-    return 0;
-  }
-  
-  protected void listen() throws Exception {
-    ServerSocket server = null;
-
-    try{
-      server = new ServerSocket();
-      server.bind(new InetSocketAddress(tcpPort));
-      LOG.info(server.toString());
-    } catch (Exception e) {
-      LOG.error("Could not listen on port " + tcpPort);
-      System.exit(-1);
-    }
-    
-    while(true){
-      Worker worker;
-      try{
-        worker = new Worker(server.accept());
-        Thread thread = new Thread(worker);
-        thread.start();
-      } catch (Exception e) {
-        LOG.error("Accept failed: " + tcpPort);
-        System.exit(-1);
-      }
-    }
-  }
-  
-  private class Worker implements Runnable {
-    private Socket client;
-
-    Worker(Socket client) {
-      this.client = client;
-      LOG.info(client.toString());
-    }
-
-    public void run() {
-      if (keepClientCnxOpen) {
-        while (true) { // keep connection open until closes
-          readWrite();
-        }
-      } else {
-        readWrite();
-        
-        try { // close ourselves
-          client.close();
-        } catch (Exception e){
-          LOG.error(e.toString());
-        }
-      }
-    }
-    
-    protected void readWrite() {
-      String line;
-      BufferedReader in = null;
-      PrintWriter out = null;
-      
-      try{
-        in = new BufferedReader(new InputStreamReader(client.getInputStream()));
-      } catch (Exception e) {
-        LOG.error("in or out failed");
-        System.exit(-1);
-      }
-
-      try{
-        line = in.readLine();        
-        StringBuilder output = new StringBuilder();
-        fetch(line, output);
-        
-        client.getOutputStream().write(output.toString().getBytes(Charset.forName("UTF-8")));
-      }catch (Exception e) {
-        LOG.error("Read/Write failed: " + e);
-      }
+      // Start listening
+      return super.run();
     }
   }
     
   
-  protected int fetch(String url, StringBuilder output) throws Exception {
+  protected int process(String url, StringBuilder output) throws Exception {
     if (normalizers != null) {
       url = normalizers.normalize(url, URLNormalizers.SCOPE_DEFAULT);
     }
diff --git a/src/java/org/apache/nutch/net/URLFilterChecker.java b/src/java/org/apache/nutch/net/URLFilterChecker.java
index 6fb3cf2..429aa9f 100644
--- a/src/java/org/apache/nutch/net/URLFilterChecker.java
+++ b/src/java/org/apache/nutch/net/URLFilterChecker.java
@@ -21,8 +21,9 @@ import org.apache.nutch.plugin.Extension;
 import org.apache.nutch.plugin.ExtensionPoint;
 import org.apache.nutch.plugin.PluginRepository;
 
-import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.util.ToolRunner;
 
+import org.apache.nutch.util.AbstractChecker;
 import org.apache.nutch.util.NutchConfiguration;
 
 import java.io.BufferedReader;
@@ -33,103 +34,60 @@ import java.io.InputStreamReader;
  * 
  * @author John Xing
  */
-public class URLFilterChecker {
+public class URLFilterChecker extends AbstractChecker {
 
-  private Configuration conf;
+  private URLFilters filters = null;
 
-  public URLFilterChecker(Configuration conf) {
-    this.conf = conf;
-  }
-
-  private void checkOne(String filterName) throws Exception {
-    URLFilter filter = null;
-
-    ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
-        URLFilter.X_POINT_ID);
-
-    if (point == null)
-      throw new RuntimeException(URLFilter.X_POINT_ID + " not found.");
-
-    Extension[] extensions = point.getExtensions();
-
-    for (int i = 0; i < extensions.length; i++) {
-      Extension extension = extensions[i];
-      filter = (URLFilter) extension.getExtensionInstance();
-      if (filter.getClass().getName().equals(filterName)) {
-        break;
-      } else {
-        filter = null;
-      }
-    }
-
-    if (filter == null)
-      throw new RuntimeException("Filter " + filterName + " not found.");
-
-    // jerome : should we keep this behavior?
-    // if (LogFormatter.hasLoggedSevere())
-    // throw new RuntimeException("Severe error encountered.");
-
-    System.out.println("Checking URLFilter " + filterName);
-
-    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
-    String line;
-    while ((line = in.readLine()) != null) {
-      String out = filter.filter(line);
-      if (out != null) {
-        System.out.print("+");
-        System.out.println(out);
-      } else {
-        System.out.print("-");
-        System.out.println(line);
-      }
-    }
-  }
+  public int run(String[] args) throws Exception {
+    usage = "Usage: URLFilterChecker [-filterName filterName] (-stdin | -listen <port> [-keepClientCnxOpen]) \n"
+        + "\n\tTool takes a list of URLs, one per line.\n";
 
-  private void checkAll() throws Exception {
-    System.out.println("Checking combination of all URLFilters available");
-
-    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
-    String line;
-    URLFilters filters = new URLFilters(this.conf);
-
-    while ((line = in.readLine()) != null) {
-      String out = filters.filter(line);
-      if (out != null) {
-        System.out.print("+");
-        System.out.println(out);
-      } else {
-        System.out.print("-");
-        System.out.println(line);
-      }
-    }
-  }
-
-  public static void main(String[] args) throws Exception {
-
-    String usage = "Usage: URLFilterChecker (-filterName filterName | -allCombined) \n"
-        + "Tool takes a list of URLs, one per line, passed via STDIN.\n";
-
-    if (args.length == 0) {
+    // Print help when no args given
+    if (args.length < 1) {
       System.err.println(usage);
       System.exit(-1);
     }
 
-    String filterName = null;
-    if (args[0].equals("-filterName")) {
-      if (args.length != 2) {
+    int numConsumed;
+    for (int i = 0; i < args.length; i++) {
+      if (args[i].equals("-filterName")) {
+        getConf().set("plugin.includes", args[++i]);
+      } else if ((numConsumed = super.parseArgs(args, i)) > 0) {
+        i += numConsumed - 1;
+      } else {
+        System.err.println("ERR: Not a recognized argument: " + args[i]);
         System.err.println(usage);
         System.exit(-1);
       }
-      filterName = args[1];
     }
 
-    URLFilterChecker checker = new URLFilterChecker(NutchConfiguration.create());
-    if (filterName != null) {
-      checker.checkOne(filterName);
+    // Print active filter list
+    filters = new URLFilters(getConf());
+    System.out.print("Checking combination of these URLFilters: ");
+    for (URLFilter filter : filters.getFilters()) {
+      System.out.print(filter.getClass().getSimpleName() + " ");
+    }
+    System.out.println("");
+
+    // Start listening
+    return super.run();
+  }
+
+  protected int process(String line, StringBuilder output) throws Exception {
+    String out = filters.filter(line);
+    if (out != null) {
+      output.append("+");
+      output.append(out);
     } else {
-      checker.checkAll();
+      output.append("-");
+      output.append(line);
     }
+    return 0;
+  }
 
-    System.exit(0);
+  public static void main(String[] args) throws Exception {
+    final int res = ToolRunner.run(NutchConfiguration.create(),
+        new URLFilterChecker(), args);
+    System.exit(res);
   }
 }
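For readers outside the Nutch codebase, the `+`/`-` output convention of the rewritten `process()` method and the logical-AND chaining done by `URLFilters.filter()` can be sketched in plain Java. The class and filter lambdas below are illustrative stand-ins, not Nutch code:

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified stand-in for Nutch's URLFilters: filters are chained with
// logical AND -- the first filter returning null rejects the URL.
public class FilterChainSketch {

    static String filter(List<UnaryOperator<String>> filters, String url) {
        for (UnaryOperator<String> f : filters) {
            url = f.apply(url);
            if (url == null) {
                return null; // rejected by this filter
            }
        }
        return url;
    }

    // Mirrors URLFilterChecker.process(): '+' marks accepted, '-' rejected.
    static String check(List<UnaryOperator<String>> filters, String line) {
        String out = filter(filters, line);
        return (out != null) ? "+" + out : "-" + line;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> filters = List.of(
            u -> u.startsWith("http") ? u : null,  // hypothetical protocol filter
            u -> u.endsWith(".jpg") ? null : u     // hypothetical suffix filter
        );
        System.out.println(check(filters, "http://example.com/"));
        System.out.println(check(filters, "http://example.com/a.jpg"));
    }
}
```

Note that a filter may also rewrite the URL, which is why the accepted line echoes the filtered value rather than the input.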
diff --git a/src/java/org/apache/nutch/net/URLFilters.java b/src/java/org/apache/nutch/net/URLFilters.java
index 3deccca..4f5bf36 100644
--- a/src/java/org/apache/nutch/net/URLFilters.java
+++ b/src/java/org/apache/nutch/net/URLFilters.java
@@ -31,6 +31,10 @@ public class URLFilters {
         URLFilter.class, URLFilter.X_POINT_ID, URLFILTER_ORDER);
   }
 
+  public URLFilter[] getFilters() {
+    return this.filters;
+  }
+
   /** Run all defined filters. Assume logical AND. */
   public String filter(String urlString) throws URLFilterException {
     for (int i = 0; i < this.filters.length; i++) {
diff --git a/src/java/org/apache/nutch/net/URLNormalizerChecker.java b/src/java/org/apache/nutch/net/URLNormalizerChecker.java
index d8f1c6e..a435cc8 100644
--- a/src/java/org/apache/nutch/net/URLNormalizerChecker.java
+++ b/src/java/org/apache/nutch/net/URLNormalizerChecker.java
@@ -21,8 +21,9 @@ import org.apache.nutch.plugin.Extension;
 import org.apache.nutch.plugin.ExtensionPoint;
 import org.apache.nutch.plugin.PluginRepository;
 
-import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.util.ToolRunner;
 
+import org.apache.nutch.util.AbstractChecker;
 import org.apache.nutch.util.NutchConfiguration;
 
 import java.io.BufferedReader;
@@ -31,87 +32,56 @@ import java.io.InputStreamReader;
 /**
  * Checks one given normalizer or all normalizers.
  */
-public class URLNormalizerChecker {
+public class URLNormalizerChecker extends AbstractChecker {
 
-  private Configuration conf;
+  private String scope = URLNormalizers.SCOPE_DEFAULT;
+  URLNormalizers normalizers;
 
-  public URLNormalizerChecker(Configuration conf) {
-    this.conf = conf;
-  }
-
-  private void checkOne(String normalizerName, String scope) throws Exception {
-    URLNormalizer normalizer = null;
-
-    ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(
-        URLNormalizer.X_POINT_ID);
-
-    if (point == null)
-      throw new RuntimeException(URLNormalizer.X_POINT_ID + " not found.");
-
-    Extension[] extensions = point.getExtensions();
-
-    for (int i = 0; i < extensions.length; i++) {
-      Extension extension = extensions[i];
-      normalizer = (URLNormalizer) extension.getExtensionInstance();
-      if (normalizer.getClass().getName().equals(normalizerName)) {
-        break;
-      } else {
-        normalizer = null;
-      }
-    }
-
-    if (normalizer == null)
-      throw new RuntimeException("URLNormalizer " + normalizerName
-          + " not found.");
-
-    System.out.println("Checking URLNormalizer " + normalizerName);
-
-    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
-    String line;
-    while ((line = in.readLine()) != null) {
-      String out = normalizer.normalize(line, scope);
-      System.out.println(out);
-    }
-  }
-
-  private void checkAll(String scope) throws Exception {
-    System.out.println("Checking combination of all URLNormalizers available");
+  public int run(String[] args) throws Exception {
+    usage = "Usage: URLNormalizerChecker [-normalizer <normalizerName>] [-scope <scope>] (-stdin | -listen <port> [-keepClientCnxOpen])"
+        + "\n\tscope can be one of: default,partition,generate_host_count,fetcher,crawldb,linkdb,inject,outlink\n";
 
-    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
-    String line;
-    URLNormalizers normalizers = new URLNormalizers(conf, scope);
-    while ((line = in.readLine()) != null) {
-      String out = normalizers.normalize(line, scope);
-      System.out.println(out);
+    // Print help when no args given
+    if (args.length < 1) {
+      System.err.println(usage);
+      System.exit(-1);
     }
-  }
-
-  public static void main(String[] args) throws Exception {
 
-    String usage = "Usage: URLNormalizerChecker [-normalizer <normalizerName>] [-scope <scope>]"
-        + "\n\tscope can be one of: default,partition,generate_host_count,fetcher,crawldb,linkdb,inject,outlink";
-
-    String normalizerName = null;
-    String scope = URLNormalizers.SCOPE_DEFAULT;
+    int numConsumed;
     for (int i = 0; i < args.length; i++) {
       if (args[i].equals("-normalizer")) {
-        normalizerName = args[++i];
+        getConf().set("plugin.includes", args[++i]);
       } else if (args[i].equals("-scope")) {
         scope = args[++i];
+      } else if ((numConsumed = super.parseArgs(args, i)) > 0) {
+        i += numConsumed - 1;
       } else {
+        System.err.println("ERR: Not a recognized argument: " + args[i]);
         System.err.println(usage);
         System.exit(-1);
       }
     }
 
-    URLNormalizerChecker checker = new URLNormalizerChecker(
-        NutchConfiguration.create());
-    if (normalizerName != null) {
-      checker.checkOne(normalizerName, scope);
-    } else {
-      checker.checkAll(scope);
+    // Print active normalizer list
+    normalizers = new URLNormalizers(getConf(), scope);
+    System.out.print("Checking combination of these URLNormalizers: ");
+    for (URLNormalizer normalizer : normalizers.getURLNormalizers(scope)) {
+      System.out.print(normalizer.getClass().getSimpleName() + " ");
     }
+    System.out.println("");
+
+    // Start listening
+    return super.run();
+  }
 
-    System.exit(0);
+  protected int process(String line, StringBuilder output) throws Exception {
+    output.append(normalizers.normalize(line, scope));
+    return 0;
+  }
+
+  public static void main(String[] args) throws Exception {
+    final int res = ToolRunner.run(NutchConfiguration.create(),
+        new URLNormalizerChecker(), args);
+    System.exit(res);
   }
 }
diff --git a/src/java/org/apache/nutch/util/AbstractChecker.java b/src/java/org/apache/nutch/util/AbstractChecker.java
new file mode 100644
index 0000000..8424879
--- /dev/null
+++ b/src/java/org/apache/nutch/util/AbstractChecker.java
@@ -0,0 +1,171 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.util;
+
+import java.io.BufferedReader;
+import java.io.InputStreamReader;
+import java.io.PrintWriter;
+import java.lang.invoke.MethodHandles;
+import java.net.ServerSocket;
+import java.net.Socket;
+import java.net.InetSocketAddress;
+import java.nio.charset.StandardCharsets;
+
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.util.Tool;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Scaffolding class for the various Checker implementations. Can process cmdline input, stdin and TCP connections.
+ * 
+ * @author Jurian Broertjes
+ */
+public abstract class AbstractChecker extends Configured implements Tool {
+
+  private static final Logger LOG = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+
+  protected boolean keepClientCnxOpen = false;
+  protected int tcpPort = -1;
+  protected boolean stdin = true;
+  protected String usage;
+
+  // Actual function for the processing of a single input
+  protected abstract int process(String line, StringBuilder output) throws Exception;
+
+  protected int parseArgs(String[] args, int i) {
+    if (args[i].equals("-listen")) {
+      tcpPort = Integer.parseInt(args[++i]);
+      return 2;
+    } else if (args[i].equals("-keepClientCnxOpen")) {
+      keepClientCnxOpen = true;
+      return 1;
+    } else if (args[i].equals("-stdin")) {
+      stdin = true;
+      return 1;
+    }
+    return 0;
+  }
+
+  protected int run() throws Exception {
+    // In listening mode?
+    if (tcpPort != -1) {
+      processTCP(tcpPort);
+      return 0;
+    } else if (stdin) {
+      return processStdin();
+    }
+    // Nothing to do?
+    return -1;
+  }
+
+  // Process single input and return
+  protected int processSingle(String input) throws Exception {
+    StringBuilder output = new StringBuilder();
+    int ret = process(input, output);
+    System.out.println(output);
+    return ret;
+  }
+
+  // Read from stdin
+  protected int processStdin() throws Exception {
+    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
+    String line;
+    while ((line = in.readLine()) != null) {
+      StringBuilder output = new StringBuilder();
+      int ret = process(line, output);
+      System.out.println(output);
+    }
+    return 0;
+  }
+
+  // Open TCP socket and process input
+  protected void processTCP(int tcpPort) throws Exception {
+    ServerSocket server = null;
+
+    try {
+      server = new ServerSocket();
+      server.bind(new InetSocketAddress(tcpPort));
+      LOG.info(server.toString());
+    } catch (Exception e) {
+      LOG.error("Could not listen on port " + tcpPort);
+      System.exit(-1);
+    }
+    
+    while (true) {
+      Worker worker;
+      try {
+        worker = new Worker(server.accept());
+        Thread thread = new Thread(worker);
+        thread.start();
+      } catch (Exception e) {
+        LOG.error("Accept failed: " + tcpPort);
+        System.exit(-1);
+      }
+    }
+  }
+
+  private class Worker implements Runnable {
+    private Socket client;
+
+    Worker(Socket client) {
+      this.client = client;
+      LOG.info(client.toString());
+    }
+
+    public void run() {
+      if (keepClientCnxOpen) {
+        while (true) { // keep connection open until closes
+          readWrite();
+        }
+      } else {
+        readWrite();
+        
+        try { // close ourselves
+          client.close();
+        } catch (Exception e) {
+          LOG.error(e.toString());
+        }
+      }
+    }
+    
+    protected void readWrite() {
+      String line;
+      BufferedReader in = null;
+      PrintWriter out = null;
+      
+      try {
+        in = new BufferedReader(new InputStreamReader(client.getInputStream()));
+      } catch (Exception e) {
+        LOG.error("in or out failed");
+        System.exit(-1);
+      }
+
+      try {
+        line = in.readLine();
+        StringBuilder output = new StringBuilder();
+        process(line, output);
+        
+        client.getOutputStream().write(output.toString().getBytes(StandardCharsets.UTF_8));
+      } catch (Exception e) {
+        LOG.error("Read/Write failed: " + e);
+      }
+    }
+  }
+}
\ No newline at end of file
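The argument-handling contract that `AbstractChecker.parseArgs()` introduces -- return the number of array slots consumed, zero for an unrecognized option, with the caller advancing its loop index by `consumed - 1` -- can be exercised standalone. The `ArgParseSketch` class below is a hypothetical simplification without Hadoop's `Tool` machinery:

```java
// Sketch of the parseArgs() contract used by AbstractChecker subclasses:
// each option reports how many array slots it consumed.
public class ArgParseSketch {
    int tcpPort = -1;
    boolean keepClientCnxOpen = false;
    boolean stdin = false;

    int parseArgs(String[] args, int i) {
        if (args[i].equals("-listen")) {
            tcpPort = Integer.parseInt(args[++i]);
            return 2;                  // flag + its value
        } else if (args[i].equals("-keepClientCnxOpen")) {
            keepClientCnxOpen = true;
            return 1;
        } else if (args[i].equals("-stdin")) {
            stdin = true;
            return 1;
        }
        return 0;                      // unrecognized option
    }

    void parse(String[] args) {
        int numConsumed;
        for (int i = 0; i < args.length; i++) {
            if ((numConsumed = parseArgs(args, i)) > 0) {
                i += numConsumed - 1;  // skip the option's value, if any
            } else {
                throw new IllegalArgumentException("Not a recognized argument: " + args[i]);
            }
        }
    }

    public static void main(String[] args) {
        ArgParseSketch s = new ArgParseSketch();
        s.parse(new String[] {"-listen", "8080", "-keepClientCnxOpen"});
        System.out.println(s.tcpPort + " " + s.keepClientCnxOpen);
    }
}
```

Subclasses extend the same pattern by handling their own options first and falling back to `super.parseArgs(args, i)`, as the checker rewrites above do.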

-- 
To stop receiving notification emails like this one, please contact
"commits@nutch.apache.org" <co...@nutch.apache.org>.

[nutch] 22/23: NUTCH-2034 CrawlDB update job to count documents in CrawlDb rejected by URL filters (patch contributed by Luis Lopez)

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit e0a27c7870d632966d584cf45399b98ba77e2bd6
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Sun Dec 17 16:13:09 2017 +0100

    NUTCH-2034 CrawlDB update job to count documents in CrawlDb rejected by URL filters
    (patch contributed by Luis Lopez)
---
 src/java/org/apache/nutch/crawl/CrawlDb.java       | 12 +++++++++++-
 src/java/org/apache/nutch/crawl/CrawlDbFilter.java |  5 ++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/CrawlDb.java b/src/java/org/apache/nutch/crawl/CrawlDb.java
index 080b037..9f37447 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDb.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDb.java
@@ -115,8 +115,9 @@ public class CrawlDb extends NutchTool implements Tool {
     if (LOG.isInfoEnabled()) {
       LOG.info("CrawlDb update: Merging segment data into db.");
     }
+    RunningJob crawlDBJob = null;
     try {
-      JobClient.runJob(job);
+      crawlDBJob = JobClient.runJob(job);
     } catch (IOException e) {
       FileSystem fs = crawlDb.getFileSystem(getConf());
       LockUtil.removeLockFile(fs, lock);
@@ -127,6 +128,15 @@ public class CrawlDb extends NutchTool implements Tool {
     }
 
     CrawlDb.install(job, crawlDb);
+
+    if (filter) {
+      long urlsFiltered = crawlDBJob.getCounters()
+          .findCounter("CrawlDB filter", "URLs filtered").getValue();
+      LOG.info(
+          "CrawlDb update: Total number of existing URLs in CrawlDb rejected by URL filters: {}",
+          urlsFiltered);
+    }
+
     long end = System.currentTimeMillis();
     LOG.info("CrawlDb update: finished at " + sdf.format(end) + ", elapsed: "
         + TimingUtil.elapsedTime(start, end));
diff --git a/src/java/org/apache/nutch/crawl/CrawlDbFilter.java b/src/java/org/apache/nutch/crawl/CrawlDbFilter.java
index 7b2aa80..8b46ecb 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbFilter.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbFilter.java
@@ -111,7 +111,10 @@ public class CrawlDbFilter implements
         url = null;
       }
     }
-    if (url != null) { // if it passes
+    if (url == null) {
+      reporter.getCounter("CrawlDB filter", "URLs filtered").increment(1);
+    } else {
+      // URL has passed filters
       newKey.set(url); // collect it
       output.collect(newKey, value);
     }
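Outside Hadoop, the filter-and-count pattern of this change can be illustrated with a plain map and list standing in for the `Reporter` counters and the `OutputCollector`. Class and filter names below are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Simplified stand-in for the CrawlDbFilter change: a record whose URL is
// rejected (filter chain returns null) is dropped and counted rather than
// silently discarded.
public class FilterCounterSketch {
    final Map<String, Long> counters = new HashMap<>();  // stands in for Reporter counters
    final List<String> collected = new ArrayList<>();    // stands in for OutputCollector

    void map(String url, UnaryOperator<String> filterChain) {
        url = filterChain.apply(url);
        if (url == null) {
            // mirrors reporter.getCounter("CrawlDB filter", "URLs filtered").increment(1)
            counters.merge("URLs filtered", 1L, Long::sum);
        } else {
            collected.add(url);  // URL has passed filters
        }
    }

    public static void main(String[] args) {
        FilterCounterSketch sketch = new FilterCounterSketch();
        UnaryOperator<String> noImages = u -> u.endsWith(".jpg") ? null : u;
        sketch.map("http://example.com/", noImages);
        sketch.map("http://example.com/a.jpg", noImages);
        System.out.println(sketch.counters);
        System.out.println(sketch.collected);
    }
}
```

After the job finishes, `CrawlDb.update()` reads the same counter back via `getCounters().findCounter(...)` and logs the total.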


[nutch] 16/23: NUTCH-2035 urlfilter-regex case insensitive rules


commit e0e06f58015c982700c5ec0a2a4a43dde642f03f
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 15 17:25:49 2017 +0100

    NUTCH-2035 urlfilter-regex case insensitive rules
---
 conf/regex-urlfilter.txt.template | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/conf/regex-urlfilter.txt.template b/conf/regex-urlfilter.txt.template
index 78b2b31..bcf9c87 100644
--- a/conf/regex-urlfilter.txt.template
+++ b/conf/regex-urlfilter.txt.template
@@ -27,7 +27,7 @@
 
 # skip image and other suffixes we can't yet parse
 # for a more extensive coverage use the urlfilter-suffix plugin
--\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
+-(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
 
 # skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]
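The rewritten rule relies on Java's embedded case-insensitivity flag `(?i)`, which applies to the remainder of the pattern, making the per-extension upper-case variants redundant. A quick check (with an abbreviated suffix list; the leading `-` of the rule is Nutch rule syntax meaning "reject", not part of the regex):

```java
import java.util.regex.Pattern;

public class CaseInsensitiveRuleDemo {
    // Abbreviated form of the suffix part of the rewritten urlfilter-regex rule.
    static final Pattern SKIP_SUFFIXES =
        Pattern.compile("(?i)\\.(gif|jpg|png|ico|css|zip|ppt|mpg|xls|gz|exe|jpeg|bmp|js)$");

    static boolean rejected(String url) {
        // find() is enough here because the pattern is anchored with $
        return SKIP_SUFFIXES.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(rejected("http://example.com/logo.GIF"));  // matched despite upper case
        System.out.println(rejected("http://example.com/page.html"));
    }
}
```

Mixed-case suffixes such as `.Jpg`, which the old explicit upper/lower alternation missed entirely, are also caught by the new rule.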


[nutch] 06/23: add languages to default config


commit 5ccebc95b807353c4e4628b576a1c0d17818ab5d
Author: Nicola Marcacci Rossi <ni...@gmail.com>
AuthorDate: Fri Dec 15 09:47:20 2017 +0100

    add languages to default config
---
 conf/nutch-default.xml | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 5e8606f..a53cf7b 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2110,6 +2110,18 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
 </property>
 
 <property>
+    <name>elastic.rest.languages</name>
+    <value></value>
+    <description>
+        A list of strings denoting the supported languages (e.g. `en,de,fr,it`).
+        If this value is empty all documents will be sent to index ${elastic.rest.index}.
+        If not empty the Rest client will distribute documents in different indices based on their `lang` property.
+        Indices are named with the following schema: ${elastic.rest.index}_${lang} (e.g. `nutch_de`).
+        Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}_others (e.g. `nutch_others`).
+    </description>
+</property>
+
+<property>
     <name>elastic.rest.type</name>
     <value>doc</value>
     <description>Default type to send documents to.</description>
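The routing behaviour that `elastic.rest.languages` describes can be sketched as a pure function. This is a simplified reading of the index selection added to `ElasticRestIndexWriter` below, assuming the default `_` separator (a later commit makes the separator and the `others` sink name configurable):

```java
// Sketch of per-language index routing: documents whose `lang` field matches
// a configured language go to <index>_<lang>, everything else to <index>_others;
// with no languages configured, the single default index is used.
public class IndexRouterSketch {
    static String indexFor(String defaultIndex, String[] languages, String lang) {
        if (languages != null && languages.length > 0) {
            for (String l : languages) {
                if (l.equals(lang)) {
                    return defaultIndex + "_" + lang;
                }
            }
            return defaultIndex + "_others"; // unsupported or missing language
        }
        return defaultIndex; // language routing not configured
    }

    public static void main(String[] args) {
        String[] langs = {"en", "de", "fr", "it"};
        System.out.println(indexFor("nutch", langs, "de"));
        System.out.println(indexFor("nutch", langs, "nl"));
        System.out.println(indexFor("nutch", null, "de"));
    }
}
```

A consequence visible in the `delete()` change further down: since the writer cannot know which language index holds a given document, deletes are broadcast to every per-language index plus the `others` sink.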


[nutch] 12/23: fix indentation


commit 52a1c5088203ae4fc733c4a507a8b5014c6b7bc7
Author: Nicola Marcacci Rossi <ni...@gmail.com>
AuthorDate: Fri Dec 15 14:30:02 2017 +0100

    fix indentation
---
 conf/nutch-default.xml | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 1d9837f..f35e787 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -2136,11 +2136,11 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
 </property>
 
 <property>
-	<name>elastic.rest.sink</name>
-	<value>others</value>
-	<description>
-		Default value is `others`. Is used only if `elastic.rest.languages` is defined to build the index name where to store documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.separator}${elastic.rest.sink}).
-	</description>
+    <name>elastic.rest.sink</name>
+    <value>others</value>
+    <description>
+        Default value is `others`. Is used only if `elastic.rest.languages` is defined to build the index name where to store documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.separator}${elastic.rest.sink}).
+    </description>
 </property>
 
 <property>


[nutch] 05/23: fix formatting


commit 153525c49ed24e889073004171912b4fa114cb98
Author: Nicola Marcacci Rossi <ni...@gmail.com>
AuthorDate: Fri Dec 15 09:42:44 2017 +0100

    fix formatting
---
 .../indexwriter/elasticrest/ElasticRestIndexWriter.java  | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
index dc54058..34ab661 100644
--- a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
+++ b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
@@ -235,15 +235,15 @@ public class ElasticRestIndexWriter implements IndexWriter {
   public void delete(String key) throws IOException {
     try {
       if (languages != null && languages.length > 0) {
-    	Bulk.Builder bulkBuilder = new Bulk.Builder().defaultType(defaultType);
-    	for (String lang : languages) {    		  
-    	  bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_" + lang).build());
-    	}
-    	bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_others").build());
-    	client.execute(bulkBuilder.build());
+        Bulk.Builder bulkBuilder = new Bulk.Builder().defaultType(defaultType);
+        for (String lang : languages) {          
+          bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_" + lang).build());
+        }
+        bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_others").build());
+        client.execute(bulkBuilder.build());
       } else {
-    	client.execute(new Delete.Builder(key).index(defaultIndex)
-    	    .type(defaultType).build());
+        client.execute(new Delete.Builder(key).index(defaultIndex)
+          .type(defaultType).build());
       }
     } catch (IOException e) {
       LOG.error(ExceptionUtils.getStackTrace(e));


[nutch] 04/23: Extend indexer-elastic-rest to support languages


commit 194fc37cb5aa2879b279014bbeaf3bd207af85fd
Author: Nicola Marcacci Rossi <ni...@gmail.com>
AuthorDate: Wed Dec 13 16:33:00 2017 +0100

    Extend indexer-elastic-rest to support languages
---
 .../elasticrest/ElasticRestConstants.java          |  4 +-
 .../elasticrest/ElasticRestIndexWriter.java        | 43 +++++++++++++++++-----
 2 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
index 322ff44..74f37eb 100644
--- a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
+++ b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestConstants.java
@@ -30,4 +30,6 @@ public interface ElasticRestConstants {
   public static final String TYPE = ELASTIC_PREFIX + "type";
   public static final String HTTPS = ELASTIC_PREFIX + "https";
   public static final String HOSTNAME_TRUST = ELASTIC_PREFIX + "trustallhostnames";
-}
\ No newline at end of file
+  
+  public static final String LANGUAGES = ELASTIC_PREFIX + "languages";
+}
diff --git a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
index 1364722..dc54058 100644
--- a/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
+++ b/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java
@@ -32,7 +32,6 @@ import org.apache.commons.lang.StringUtils;
 import org.apache.commons.lang3.exception.ExceptionUtils;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.mapred.JobConf;
-import org.apache.http.HttpResponse;
 import org.apache.http.concurrent.BasicFuture;
 import org.apache.http.conn.ssl.DefaultHostnameVerifier;
 import org.apache.http.conn.ssl.NoopHostnameVerifier;
@@ -48,7 +47,6 @@ import org.slf4j.LoggerFactory;
 
 import javax.net.ssl.HostnameVerifier;
 import javax.net.ssl.SSLContext;
-import java.io.BufferedReader;
 import java.io.IOException;
 import java.net.URL;
 import java.security.KeyManagementException;
@@ -58,11 +56,9 @@ import java.security.cert.CertificateException;
 import java.security.cert.X509Certificate;
 import java.util.HashMap;
 import java.util.Map;
-import java.util.MissingResourceException;
 import java.util.HashSet;
 import java.util.Set;
 import java.util.concurrent.ExecutionException;
-import java.util.concurrent.Future;
 
 /**
  */
@@ -80,7 +76,6 @@ public class ElasticRestIndexWriter implements IndexWriter {
   private Configuration config;
 
   private Bulk.Builder bulkBuilder;
-  private Future<HttpResponse> execute;
   private int port = -1;
   private String host = null;
   private Boolean https = null;
@@ -96,6 +91,8 @@ public class ElasticRestIndexWriter implements IndexWriter {
   private boolean createNewBulk = false;
   private long millis;
   private BasicFuture<JestResult> basicFuture = null;
+  
+  private String[] languages = null;
 
   @Override
   public void open(JobConf job, String name) throws IOException {
@@ -106,6 +103,7 @@ public class ElasticRestIndexWriter implements IndexWriter {
     password = job.get(ElasticRestConstants.PASSWORD);
     https = job.getBoolean(ElasticRestConstants.HTTPS, false);
     trustAllHostnames = job.getBoolean(ElasticRestConstants.HOSTNAME_TRUST, false);
+    languages = job.getStrings(ElasticRestConstants.LANGUAGES);
 
     // trust ALL certificates
     SSLContext sslContext = null;
@@ -195,7 +193,26 @@ public class ElasticRestIndexWriter implements IndexWriter {
         bulkLength += fieldValues[0].length();
       }
     }
-    Index indexRequest = new Index.Builder(source).index(defaultIndex)
+    
+    String index;
+    if (languages != null && languages.length > 0) {
+      String language = (String) doc.getFieldValue("lang");
+      boolean exists = false;
+      for (String lang : languages) {
+        if (lang.equals(language)) {
+          exists = true;
+          break;
+        }
+      }
+      if (exists) {
+        index = defaultIndex + "_" + language;
+      } else {
+        index = defaultIndex + "_others";
+      }
+    } else {
+      index = defaultIndex;
+    }
+    Index indexRequest = new Index.Builder(source).index(index)
         .type(type).id(id).build();
 
     // Add this indexing request to a bulk request
@@ -217,13 +234,21 @@ public class ElasticRestIndexWriter implements IndexWriter {
   @Override
   public void delete(String key) throws IOException {
     try {
-      client.execute(new Delete.Builder(key).index(defaultIndex)
-          .type(defaultType).build());
+      if (languages != null && languages.length > 0) {
+    	Bulk.Builder bulkBuilder = new Bulk.Builder().defaultType(defaultType);
+    	for (String lang : languages) {    		  
+    	  bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_" + lang).build());
+    	}
+    	bulkBuilder.addAction(new Delete.Builder(key).index(defaultIndex + "_others").build());
+    	client.execute(bulkBuilder.build());
+      } else {
+    	client.execute(new Delete.Builder(key).index(defaultIndex)
+    	    .type(defaultType).build());
+      }
     } catch (IOException e) {
       LOG.error(ExceptionUtils.getStackTrace(e));
       throw e;
     }
-
   }
 
   @Override

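The per-language index routing added in the hunk above (index name `defaultIndex + "_" + lang` for configured languages, `defaultIndex + "_others"` otherwise, bare `defaultIndex` when no languages are configured) can be sketched standalone. This is a minimal illustration, not the plugin's actual API; the class and method names here are invented:

```java
import java.util.Arrays;

public class IndexRouter {

  /**
   * Mirrors the routing logic from ElasticRestIndexWriter.write():
   * returns "<prefix>_<lang>" for a configured language,
   * "<prefix>_others" for any other (or missing) language, and the
   * bare prefix when no languages are configured at all.
   */
  static String route(String defaultIndex, String[] languages, String docLang) {
    if (languages == null || languages.length == 0) {
      return defaultIndex;
    }
    boolean known = docLang != null
        && Arrays.asList(languages).contains(docLang);
    return defaultIndex + "_" + (known ? docLang : "others");
  }

  public static void main(String[] args) {
    String[] langs = { "en", "de" };
    System.out.println(route("nutch", langs, "en")); // nutch_en
    System.out.println(route("nutch", langs, "fr")); // nutch_others
    System.out.println(route("nutch", null, "fr"));  // nutch
  }
}
```

Note that delete() must fan out the same way: since the document's language is unknown at deletion time, the patch issues one delete per configured language index plus the `_others` index in a single bulk request.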
-- 
To stop receiving notification emails like this one, please contact
"commits@nutch.apache.org" <co...@nutch.apache.org>.

[nutch] 08/23: NUTCH-2439 Upgrade Apache Tika dependency to 1.17

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 42bdc65df4569d66d188ffc9981e6bf7baea45c7
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 15 13:45:50 2017 +0100

    NUTCH-2439 Upgrade Apache Tika dependency to 1.17
---
 ivy/ivy.xml                      |  4 +-
 src/plugin/parse-tika/ivy.xml    |  5 +-
 src/plugin/parse-tika/plugin.xml | 98 ++++++++++++++++++----------------------
 3 files changed, 51 insertions(+), 56 deletions(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index a72a632..f7867c4 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -47,7 +47,7 @@
 		<dependency org="commons-collections" name="commons-collections" rev="3.2.1" conf="*->master" />
 		<dependency org="commons-httpclient" name="commons-httpclient" rev="3.1" conf="*->master" />
 		<dependency org="commons-codec" name="commons-codec" rev="1.10" conf="*->default" />
-		<dependency org="org.apache.commons" name="commons-compress" rev="1.9" conf="*->default" />
+		<dependency org="org.apache.commons" name="commons-compress" rev="1.14" conf="*->default" />
 		<dependency org="org.apache.commons" name="commons-jexl" rev="2.1.1" />
 		<dependency org="com.tdunning" name="t-digest" rev="3.2" />
 
@@ -65,7 +65,7 @@
         <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.7.2" conf="*->default"/>
         <!-- End of Hadoop Dependencies -->
 
-		<dependency org="org.apache.tika" name="tika-core" rev="1.12" />
+		<dependency org="org.apache.tika" name="tika-core" rev="1.17" />
 		<dependency org="com.ibm.icu" name="icu4j" rev="55.1" />
 
 		<dependency org="xerces" name="xercesImpl" rev="2.11.0" />
diff --git a/src/plugin/parse-tika/ivy.xml b/src/plugin/parse-tika/ivy.xml
index a01ec98..24ad25b 100644
--- a/src/plugin/parse-tika/ivy.xml
+++ b/src/plugin/parse-tika/ivy.xml
@@ -36,11 +36,14 @@
   </publications>
 
   <dependencies>
-    <dependency org="org.apache.tika" name="tika-parsers" rev="1.12" conf="*->default">
+    <dependency org="org.apache.tika" name="tika-parsers" rev="1.17" conf="*->default">
       <exclude org="org.apache.tika" name="tika-core" />
       <exclude org="org.apache.httpcomponents" name="httpclient" />
       <exclude org="org.apache.httpcomponents" name="httpcore" />
       <exclude org="org.slf4j" name="slf4j-log4j12" />
+      <exclude org="org.slf4j" name="slf4j-api" />
+      <exclude org="commons-lang" name="commons-lang" />
+      <exclude org="com.google.protobuf" name="protobuf-java" />
     </dependency>
   </dependencies>
   
diff --git a/src/plugin/parse-tika/plugin.xml b/src/plugin/parse-tika/plugin.xml
index 7f14d98..b9055e4 100644
--- a/src/plugin/parse-tika/plugin.xml
+++ b/src/plugin/parse-tika/plugin.xml
@@ -25,95 +25,87 @@
       <library name="parse-tika.jar">
          <export name="*"/>
       </library>
-      <library name="apache-mime4j-core-0.7.2.jar"/>
-      <library name="apache-mime4j-dom-0.7.2.jar"/>
+      <library name="apache-mime4j-core-0.8.1.jar"/>
+      <library name="apache-mime4j-dom-0.8.1.jar"/>
       <library name="asm-5.0.4.jar"/>
-      <library name="aspectjrt-1.8.0.jar"/>
-      <library name="bcmail-jdk15on-1.52.jar"/>
-      <library name="bcpkix-jdk15on-1.52.jar"/>
-      <library name="bcprov-jdk15on-1.52.jar"/>
+      <library name="bcmail-jdk15on-1.54.jar"/>
+      <library name="bcpkix-jdk15on-1.54.jar"/>
+      <library name="bcprov-jdk15on-1.54.jar"/>
       <library name="boilerpipe-1.1.0.jar"/>
       <library name="bzip2-0.9.1.jar"/>
       <library name="c3p0-0.9.1.1.jar"/>
       <library name="cdm-4.5.5.jar"/>
       <library name="commons-codec-1.6.jar"/>
-      <library name="commons-compress-1.10.jar"/>
+      <library name="commons-collections4-4.1.jar"/>
+      <library name="commons-compress-1.14.jar"/>
       <library name="commons-csv-1.0.jar"/>
       <library name="commons-exec-1.3.jar"/>
-      <library name="commons-io-2.4.jar"/>
-      <library name="commons-lang-2.6.jar"/>
-      <library name="commons-logging-1.1.3.jar"/>
-      <library name="commons-logging-api-1.1.jar"/>
-      <library name="commons-vfs2-2.0.jar"/>
-      <library name="cxf-core-3.0.3.jar"/>
-      <library name="cxf-rt-frontend-jaxrs-3.0.3.jar"/>
-      <library name="cxf-rt-rs-client-3.0.3.jar"/>
-      <library name="cxf-rt-transports-http-3.0.3.jar"/>
+      <library name="commons-io-2.5.jar"/>
+      <library name="curvesapi-1.04.jar"/>
+      <library name="cxf-core-3.0.16.jar"/>
+      <library name="cxf-rt-frontend-jaxrs-3.0.16.jar"/>
+      <library name="cxf-rt-rs-client-3.0.16.jar"/>
+      <library name="cxf-rt-transports-http-3.0.16.jar"/>
       <library name="ehcache-core-2.6.2.jar"/>
-      <library name="fontbox-1.8.10.jar"/>
+      <library name="fontbox-2.0.8.jar"/>
       <library name="geoapi-3.0.0.jar"/>
       <library name="grib-4.5.5.jar"/>
-      <library name="gson-2.2.4.jar"/>
+      <library name="gson-2.8.1.jar"/>
       <library name="guava-17.0.jar"/>
-      <library name="httpmime-4.2.6.jar"/>
+      <library name="httpmime-4.5.4.jar"/>
       <library name="httpservices-4.5.5.jar"/>
-      <library name="isoparser-1.0.2.jar"/>
-      <library name="jackcess-2.1.2.jar"/>
-      <library name="jackcess-encrypt-2.1.1.jar"/>
+      <library name="isoparser-1.1.18.jar"/>
+      <library name="jackcess-2.1.8.jar"/>
+      <library name="jackcess-encrypt-2.1.2.jar"/>
+      <library name="jackson-core-2.9.2.jar"/>
       <library name="java-libpst-0.8.1.jar"/>
       <library name="javax.annotation-api-1.2.jar"/>
       <library name="javax.ws.rs-api-2.0.1.jar"/>
       <library name="jcip-annotations-1.0.jar"/>
+      <library name="jcl-over-slf4j-1.7.24.jar"/>
       <library name="jcommander-1.35.jar"/>
-      <library name="jdom-2.0.2.jar"/>
       <library name="jdom2-2.0.4.jar"/>
-      <library name="jempbox-1.8.10.jar"/>
+      <library name="jempbox-1.8.13.jar"/>
       <library name="jhighlight-1.0.2.jar"/>
-      <library name="jj2000-5.2.jar"/>
-      <library name="jmatio-1.0.jar"/>
+      <library name="jmatio-1.2.jar"/>
       <library name="jna-4.1.0.jar"/>
       <library name="joda-time-2.2.jar"/>
-      <library name="json-20140107.jar"/>
+      <library name="json-1.8.jar"/>
       <library name="json-simple-1.1.1.jar"/>
       <library name="jsoup-1.7.2.jar"/>
       <library name="jsr-275-0.9.3.jar"/>
+      <library name="jul-to-slf4j-1.7.24.jar"/>
       <library name="juniversalchardet-1.0.3.jar"/>
       <library name="junrar-0.7.jar"/>
-      <library name="jwnl-1.3.3.jar"/>
-      <library name="maven-scm-api-1.4.jar"/>
-      <library name="maven-scm-provider-svn-commons-1.4.jar"/>
-      <library name="maven-scm-provider-svnexe-1.4.jar"/>
-      <library name="metadata-extractor-2.8.0.jar"/>
+      <library name="metadata-extractor-2.10.1.jar"/>
       <library name="netcdf4-4.5.5.jar"/>
-      <library name="opennlp-maxent-3.0.3.jar"/>
-      <library name="opennlp-tools-1.5.3.jar"/>
-      <library name="pdfbox-1.8.10.jar"/>
-      <library name="plexus-utils-1.5.6.jar"/>
-      <library name="poi-3.13.jar"/>
-      <library name="poi-ooxml-3.13.jar"/>
-      <library name="poi-ooxml-schemas-3.13.jar"/>
-      <library name="poi-scratchpad-3.13.jar"/>
-      <library name="protobuf-java-2.5.0.jar"/>
+      <library name="opennlp-tools-1.8.3.jar"/>
+      <library name="pdfbox-2.0.8.jar"/>
+      <library name="pdfbox-tools-2.0.8.jar"/>
+      <library name="poi-3.17.jar"/>
+      <library name="poi-ooxml-3.17.jar"/>
+      <library name="poi-ooxml-schemas-3.17.jar"/>
+      <library name="poi-scratchpad-3.17.jar"/>
       <library name="quartz-2.2.0.jar"/>
-      <library name="regexp-1.3.jar"/>
       <library name="rome-1.5.1.jar"/>
       <library name="rome-utils-1.5.1.jar"/>
-      <library name="sis-metadata-0.5.jar"/>
-      <library name="sis-netcdf-0.5.jar"/>
-      <library name="sis-referencing-0.5.jar"/>
-      <library name="sis-storage-0.5.jar"/>
-      <library name="sis-utility-0.5.jar"/>
+      <library name="sentiment-analysis-parser-0.1.jar"/>
+      <library name="sis-metadata-0.6.jar"/>
+      <library name="sis-netcdf-0.6.jar"/>
+      <library name="sis-referencing-0.6.jar"/>
+      <library name="sis-storage-0.6.jar"/>
+      <library name="sis-utility-0.6.jar"/>
       <library name="stax2-api-3.1.4.jar"/>
       <library name="tagsoup-1.2.1.jar"/>
-      <library name="tika-parsers-1.12.jar"/>
+      <library name="tika-parsers-1.17.jar"/>
       <library name="udunits-4.5.5.jar"/>
-      <library name="vorbis-java-core-0.6.jar"/>
-      <library name="vorbis-java-tika-0.6.jar"/>
+      <library name="vorbis-java-core-0.8.jar"/>
+      <library name="vorbis-java-tika-0.8.jar"/>
       <library name="woodstox-core-asl-4.4.1.jar"/>
       <library name="xmlbeans-2.6.0.jar"/>
-      <library name="xmlschema-core-2.1.0.jar"/>
-      <library name="xmpcore-5.1.2.jar"/>
-      <library name="xz-1.5.jar"/>
+      <library name="xmlschema-core-2.2.2.jar"/>
+      <library name="xmpcore-5.1.3.jar"/>
+      <library name="xz-1.6.jar"/>
    </runtime>
 
    <requires>


[nutch] 11/23: NUTCH-2480 Upgrade crawler-commons dependency to 0.9

Posted by sn...@apache.org.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit e7b077eeb2d823b3a09259435915ae69b2a3471a
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 15 13:55:58 2017 +0100

    NUTCH-2480 Upgrade crawler-commons dependency to 0.9
---
 ivy/ivy.xml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index f7867c4..7333a19 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -74,7 +74,9 @@
 
 		<dependency org="com.google.guava" name="guava" rev="18.0" />
 
-		<dependency org="com.github.crawler-commons" name="crawler-commons" rev="0.8" />
+		<dependency org="com.github.crawler-commons" name="crawler-commons" rev="0.9">
+			<exclude org="org.apache.tika"/>
+		</dependency>
 
 		<dependency org="com.martinkl.warc" name="warc-hadoop" rev="0.1.0" />
 		


[nutch] 02/23: NUTCH-2474 CrawlDbReader -stats fails with ClassCastException - replace CrawlDbStatCombiner by CrawlDbStatReducer and ensure that data is properly processed independently whether and how often combiner is called - simplify calculation of minimum and maximum

Posted by sn...@apache.org.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit d758a31bbee0807bcbc92a591668076cfa95aeb1
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Dec 8 22:41:05 2017 +0100

    NUTCH-2474 CrawlDbReader -stats fails with ClassCastException
    - replace CrawlDbStatCombiner by CrawlDbStatReducer and ensure
      that data is properly processed independently whether and
      how often combiner is called
    - simplify calculation of minimum and maximum
---
 src/java/org/apache/nutch/crawl/CrawlDbReader.java | 273 +++++++++------------
 1 file changed, 114 insertions(+), 159 deletions(-)

diff --git a/src/java/org/apache/nutch/crawl/CrawlDbReader.java b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
index 42b5a3b..117aa7f 100644
--- a/src/java/org/apache/nutch/crawl/CrawlDbReader.java
+++ b/src/java/org/apache/nutch/crawl/CrawlDbReader.java
@@ -202,14 +202,24 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
       output.collect(new Text("status " + value.getStatus()), COUNT_1);
       output.collect(new Text("retry " + value.getRetriesSinceFetch()),
           COUNT_1);
-      output.collect(new Text("sc"), new NutchWritable(
-          new FloatWritable(value.getScore())));
+
+      NutchWritable score = new NutchWritable(
+          new FloatWritable(value.getScore()));
+      output.collect(new Text("sc"), score);
+      output.collect(new Text("sct"), score);
+      output.collect(new Text("scd"), score);
+
       // fetch time (in minutes to prevent from overflows when summing up)
-      output.collect(new Text("ft"), new NutchWritable(
-          new LongWritable(value.getFetchTime() / (1000 * 60))));
+      NutchWritable fetchTime = new NutchWritable(
+          new LongWritable(value.getFetchTime() / (1000 * 60)));
+      output.collect(new Text("ft"), fetchTime);
+      output.collect(new Text("ftt"), fetchTime);
+
       // fetch interval (in seconds)
-      output.collect(new Text("fi"),
-          new NutchWritable(new LongWritable(value.getFetchInterval())));
+      NutchWritable fetchInterval = new NutchWritable(new LongWritable(value.getFetchInterval()));
+      output.collect(new Text("fi"), fetchInterval);
+      output.collect(new Text("fit"), fetchInterval);
+
       if (sort) {
         URL u = new URL(key.toString());
         String host = u.getHost();
@@ -219,88 +229,6 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
     }
   }
 
-  public static class CrawlDbStatCombiner implements
-      Reducer<Text, NutchWritable, Text, NutchWritable> {
-    LongWritable val = new LongWritable();
-
-    public CrawlDbStatCombiner() {
-    }
-
-    public void configure(JobConf job) {
-    }
-
-    public void close() {
-    }
-
-    private void reduceMinMaxTotal(String keyPrefix, Iterator<NutchWritable> values,
-        OutputCollector<Text, NutchWritable> output, Reporter reporter)
-        throws IOException {
-      long total = 0;
-      long min = Long.MAX_VALUE;
-      long max = Long.MIN_VALUE;
-      while (values.hasNext()) {
-        long cnt = ((LongWritable) values.next().get()).get();
-        if (cnt < min)
-          min = cnt;
-        if (cnt > max)
-          max = cnt;
-        total += cnt;
-      }
-      output.collect(new Text(keyPrefix + "n"),
-          new NutchWritable(new LongWritable(min)));
-      output.collect(new Text(keyPrefix + "x"),
-          new NutchWritable(new LongWritable(max)));
-      output.collect(new Text(keyPrefix + "t"),
-          new NutchWritable(new LongWritable(total)));
-    }
-    
-    private void reduceMinMaxTotalFloat(String keyPrefix, Iterator<NutchWritable> values,
-        OutputCollector<Text, NutchWritable> output, Reporter reporter)
-        throws IOException {
-      double total = 0;
-      float min = Float.MAX_VALUE;
-      float max = Float.MIN_VALUE;
-      TDigest tdigest = TDigest.createMergingDigest(100.0);
-      while (values.hasNext()) {
-        float val = ((FloatWritable) values.next().get()).get();
-        tdigest.add(val);
-        if (val < min)
-          min = val;
-        if (val > max)
-          max = val;
-        total += val;
-      }
-      output.collect(new Text(keyPrefix + "n"),
-          new NutchWritable(new FloatWritable(min)));
-      output.collect(new Text(keyPrefix + "x"),
-          new NutchWritable(new FloatWritable(max)));
-      output.collect(new Text(keyPrefix + "t"),
-          new NutchWritable(new FloatWritable((float) total)));
-      ByteBuffer tdigestBytes = ByteBuffer.allocate(tdigest.smallByteSize());
-      tdigest.asSmallBytes(tdigestBytes);
-      output.collect(new Text(keyPrefix + "d"),
-          new NutchWritable(new BytesWritable(tdigestBytes.array())));
-    }
-
-    public void reduce(Text key, Iterator<NutchWritable> values,
-        OutputCollector<Text, NutchWritable> output, Reporter reporter)
-        throws IOException {
-      val.set(0L);
-      String k = key.toString();
-      if (k.equals("sc")) {
-        reduceMinMaxTotalFloat(k, values, output, reporter);
-      } else if (k.equals("ft") || k.equals("fi")) {
-        reduceMinMaxTotal(k, values, output, reporter);
-      } else {
-        while (values.hasNext()) {
-          LongWritable cnt = (LongWritable) values.next().get();
-          val.set(val.get() + cnt.get());
-        }
-        output.collect(key, new NutchWritable(val));
-      }
-    }
-  }
-
   public static class CrawlDbStatReducer implements
       Reducer<Text, NutchWritable, Text, NutchWritable> {
     public void configure(JobConf job) {
@@ -314,7 +242,8 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
         throws IOException {
 
       String k = key.toString();
-      if (k.equals("T")) {
+      if (k.equals("T") || k.startsWith("status") || k.startsWith("retry")
+          || k.equals("ftt") || k.equals("fit")) {
         // sum all values for this key
         long sum = 0;
         while (values.hasNext()) {
@@ -323,68 +252,59 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
         }
         // output sum
         output.collect(key, new NutchWritable(new LongWritable(sum)));
-      } else if (k.startsWith("status") || k.startsWith("retry")) {
-        LongWritable cnt = new LongWritable();
-        while (values.hasNext()) {
-          LongWritable val = (LongWritable) values.next().get();
-          cnt.set(cnt.get() + val.get());
-        }
-        output.collect(key, new NutchWritable(cnt));
-      } else if (k.equals("scx")) {
-        FloatWritable max = new FloatWritable(Float.MIN_VALUE);
-        while (values.hasNext()) {
-          FloatWritable val = (FloatWritable) values.next().get();
-          if (max.get() < val.get())
-            max.set(val.get());
-        }
-        output.collect(key, new NutchWritable(max));
-      } else if (k.equals("ftx") || k.equals("fix")) {
-        LongWritable cnt = new LongWritable(Long.MIN_VALUE);
-        while (values.hasNext()) {
-          LongWritable val = (LongWritable) values.next().get();
-          if (cnt.get() < val.get())
-            cnt.set(val.get());
-        }
-        output.collect(key, new NutchWritable(cnt));
-      } else if (k.equals("scn")) {
-        FloatWritable min = new FloatWritable(Float.MAX_VALUE);
+      } else if (k.equals("sc")) {
+        float min = Float.MAX_VALUE;
+        float max = Float.MIN_VALUE;
         while (values.hasNext()) {
-          FloatWritable val = (FloatWritable) values.next().get();
-          if (min.get() > val.get())
-            min.set(val.get());
+          float value = ((FloatWritable) values.next().get()).get();
+          if (max < value) {
+            max = value;
+          }
+          if (min > value) {
+            min = value;
+          }
         }
-        output.collect(key, new NutchWritable(min));
-      } else if (k.equals("ftn") || k.equals("fin")) {
-        LongWritable cnt = new LongWritable(Long.MAX_VALUE);
+        output.collect(key, new NutchWritable(new FloatWritable(min)));
+        output.collect(key, new NutchWritable(new FloatWritable(max)));
+      } else if (k.equals("ft") || k.equals("fi")) {
+        long min = Long.MAX_VALUE;
+        long max = Long.MIN_VALUE;
         while (values.hasNext()) {
-          LongWritable val = (LongWritable) values.next().get();
-          if (cnt.get() > val.get())
-            cnt.set(val.get());
+          long value = ((LongWritable) values.next().get()).get();
+          if (max < value) {
+            max = value;
+          }
+          if (min > value) {
+            min = value;
+          }
         }
-        output.collect(key, new NutchWritable(cnt));
+        output.collect(key, new NutchWritable(new LongWritable(min)));
+        output.collect(key, new NutchWritable(new LongWritable(max)));
       } else if (k.equals("sct")) {
-        FloatWritable cnt = new FloatWritable();
-        while (values.hasNext()) {
-          FloatWritable val = (FloatWritable) values.next().get();
-          cnt.set(cnt.get() + val.get());
-        }
-        output.collect(key, new NutchWritable(cnt));
-      } else if (k.equals("ftt") || k.equals("fit")) {
-        LongWritable cnt = new LongWritable();
+        float cnt = 0.0f;
         while (values.hasNext()) {
-          LongWritable val = (LongWritable) values.next().get();
-          cnt.set(cnt.get() + val.get());
+          float value = ((FloatWritable) values.next().get()).get();
+          cnt += value;
         }
-        output.collect(key, new NutchWritable(cnt));
+        output.collect(key, new NutchWritable(new FloatWritable(cnt)));
       } else if (k.equals("scd") || k.equals("ftd") || k.equals("fid")) {
         MergingDigest tdigest = null;
         while (values.hasNext()) {
-          byte[] bytes = ((BytesWritable) values.next().get()).getBytes();
-          MergingDigest tdig = MergingDigest.fromBytes(ByteBuffer.wrap(bytes));
-          if (tdigest == null) {
-            tdigest = tdig;
-          } else {
-            tdigest.add(tdig);
+          Writable value = values.next().get();
+          if (value instanceof BytesWritable) {
+            byte[] bytes = ((BytesWritable) value).getBytes();
+            MergingDigest tdig = MergingDigest
+                .fromBytes(ByteBuffer.wrap(bytes));
+            if (tdigest == null) {
+              tdigest = tdig;
+            } else {
+              tdigest.add(tdig);
+            }
+          } else if (value instanceof FloatWritable) {
+            if (tdigest == null) {
+              tdigest = (MergingDigest) TDigest.createMergingDigest(100.0);
+            }
+            tdigest.add(((FloatWritable) value).get());
           }
         }
         ByteBuffer tdigestBytes = ByteBuffer.allocate(tdigest.smallByteSize());
@@ -455,7 +375,7 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
 	  job.setInputFormat(SequenceFileInputFormat.class);
 
 	  job.setMapperClass(CrawlDbStatMapper.class);
-	  job.setCombinerClass(CrawlDbStatCombiner.class);
+	  job.setCombinerClass(CrawlDbStatReducer.class);
 	  job.setReducerClass(CrawlDbStatReducer.class);
 
 	  FileOutputFormat.setOutputPath(job, tmpFolder);
@@ -486,27 +406,57 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
 			    stats.put(k, value.get());
 			    continue;
 			  }
-			  if (k.equals("scx")) {
-			    FloatWritable fvalue = (FloatWritable) value.get();
-			    if (((FloatWritable) val).get() < fvalue.get())
-			      ((FloatWritable) val).set(fvalue.get());
-        } else if (k.equals("ftx") || k.equals("fix")) {
-          LongWritable lvalue = (LongWritable) value.get();
-          if (((LongWritable) val).get() < lvalue.get())
-            ((LongWritable) val).set(lvalue.get());
-        } else if (k.equals("scn")) {
-          FloatWritable fvalue = (FloatWritable) value.get();
-          if (((FloatWritable) val).get() > fvalue.get())
-            ((FloatWritable) val).set(fvalue.get());
-			  } else if (k.equals("ftn") || k.equals("fin")) {
-          LongWritable lvalue = (LongWritable) value.get();
-				  if (((LongWritable) val).get() > lvalue.get())
-				    ((LongWritable) val).set(lvalue.get());
+			  if (k.equals("sc")) {
+			    float min = Float.MAX_VALUE;
+          float max = Float.MIN_VALUE;
+			    if (stats.containsKey("scn")) {
+			      min = ((FloatWritable) stats.get("scn")).get();
+			    } else {
+			      min = ((FloatWritable) stats.get("sc")).get();
+			    }
+          if (stats.containsKey("scx")) {
+            max = ((FloatWritable) stats.get("scx")).get();
+          } else {
+            max = ((FloatWritable) stats.get("sc")).get();
+          }
+			    float fvalue = ((FloatWritable) value.get()).get();
+			    if (min > fvalue) {
+			      min = fvalue;
+			    }
+          if (max < fvalue) {
+            max = fvalue;
+          }
+          stats.put("scn", new FloatWritable(min));
+          stats.put("scx", new FloatWritable(max));
+        } else if (k.equals("ft") || k.equals("fi")) {
+          long min = Long.MAX_VALUE;
+          long max = Long.MIN_VALUE;
+          String minKey = k + "n";
+          String maxKey = k + "x";
+          if (stats.containsKey(minKey)) {
+            min = ((LongWritable) stats.get(minKey)).get();
+          } else if (stats.containsKey(k)) {
+            min = ((LongWritable) stats.get(k)).get();
+          }
+          if (stats.containsKey(maxKey)) {
+            max = ((LongWritable) stats.get(maxKey)).get();
+          } else if (stats.containsKey(k)) {
+            max = ((LongWritable) stats.get(k)).get();
+          }
+          long lvalue = ((LongWritable) value.get()).get();
+          if (min > lvalue) {
+            min = lvalue;
+          }
+          if (max < lvalue) {
+            max = lvalue;
+          }
+          stats.put(k + "n", new LongWritable(min));
+          stats.put(k + "x", new LongWritable(max));
 			  } else if (k.equals("sct")) {
           FloatWritable fvalue = (FloatWritable) value.get();
           ((FloatWritable) val)
               .set(((FloatWritable) val).get() + fvalue.get());
-        } else if (k.equals("scd") || k.equals("ftd") || k.equals("fid")) {
+        } else if (k.equals("scd")) {
           MergingDigest tdigest = null;
           MergingDigest tdig = MergingDigest.fromBytes(
               ByteBuffer.wrap(((BytesWritable) value.get()).getBytes()));
@@ -529,6 +479,11 @@ public class CrawlDbReader extends Configured implements Closeable, Tool {
 		  }
 		  reader.close();
 	  }
+    // remove score, fetch interval, and fetch time
+    // (used for min/max calculation)
+    stats.remove("sc");
+    stats.remove("fi");
+    stats.remove("ft");
 	  // removing the tmp folder
 	  fileSystem.delete(tmpFolder, true);
 	  return stats;

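The combiner-safety requirement this commit addresses can be illustrated in isolation. A Hadoop combiner may run zero, one, or many times over partial value lists, so an aggregation is only combiner-safe if applying it to pre-combined partials yields the same result as applying it to the raw values. Min/max satisfies this (it is idempotent and associative); plain sums do not unless the combiner emits a distinct "total" key, which is why the patched mapper emits separate `sc`/`sct`/`scd` keys. A minimal sketch (invented class and method names, not Nutch code):

```java
import java.util.Arrays;

public class MinMaxReduce {

  /**
   * Combiner-safe min/max aggregation: because min/max is idempotent
   * and associative, reducing pre-combined partial results gives the
   * same answer as reducing the raw values directly.
   */
  static long[] minMax(long[] values) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long v : values) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    return new long[] { min, max };
  }

  public static void main(String[] args) {
    long[] all = { 7, 3, 9, 1, 5 };
    // Case 1: combiner never ran, reducer sees raw values.
    long[] direct = minMax(all);
    // Case 2: combiner ran once per split; reducer merges the partials.
    long[] left = minMax(new long[] { 7, 3 });
    long[] right = minMax(new long[] { 9, 1, 5 });
    long[] merged = minMax(new long[] { left[0], left[1], right[0], right[1] });
    System.out.println(Arrays.equals(direct, merged)); // true
  }
}
```

This invariance is what lets the patch register the same CrawlDbStatReducer class as both combiner and reducer.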