You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@nutch.apache.org by sn...@apache.org on 2018/11/15 10:33:59 UTC

[nutch] branch master updated (8151237 -> f861c82)

This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


    from 8151237  Merge pull request #387 from sebastian-nagel/NUTCH-2630-fetcher-log-robotstxt-denied
     add 8b7298d  NUTCH-1842: crawl.gen.delay value is read incorrectly from configuration.
     new a6f533d  NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances - lock critical block (conditional creation of plugin instance)   on object cache object
     new 524a594  NUTCH-2630 Fetcher to log skipped records by robots.txt - change required log level to INFO (default) for messages   reporting skipped URLs because of robots.txt rules   (disallow or crawl delay larger than fetcher.max.crawl.delay)
     new 2a3b1d1  NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1 - add work-around to fix downloading of dependency javax.ws.rs-api-*.jar   (need to set property packaging.type=jar)
     new 89b16ce  NUTCH-2652 Fetcher launches more fetch tasks than fetch lists - properly override method getSplits(...) of FileInputFormat
     new a9ea1f1  NUTCH-2655 Update Solr schema.xml for Solr 7.x - add required field types to schema.xml
     new 48e1aef  NUTCH-2659 Add missing Apache license headers
     new d45fb7a  NUTCH-2660 Plugin tests not executed - add missing unit test packages to plugin build.xml - tests of "headings" plugin depend on "lib-nekohtml" - add "protocol-okhttp" to Javadoc API overview - add missing test packages to ant "eclipse" target
     new 2d48152  NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path
     new 31a1ec4  NUTCH-2651 Upgrade to Tika 1.19.1 (from 1.18) - modified work-around to fix downloading of dependency javax.ws.rs-api-*.jar:   define property packaging.type in ivysettings.xml
     new a5df63a  NUTCH-2658 Adding the fields required by the index-links plugin to the schema
     new 93b1a81  NUTCH-2671 Upgrade to ant ivy library - upgrade to 2.5.0-rc1 to address NUTCH-2669
     new 393d3e5  NUTCH-2671 Upgrade to ant ivy library - fix order of ant target dependencies:   "compile-core" must come before "resolve-test"
     new e6a961c  NUTCH-2671 Upgrade to ant ivy library - roll back to 2.4.0 to bring Jenkins build back to normal
     new f861c82  NUTCH-1842: crawl.gen.delay value is read incorrectly from config Merge pull request #393 from YossiTamari/patch-2

The 14 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/java/org/apache/nutch/crawl/Generator.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[nutch] 13/14: NUTCH-2671 Upgrade to ant ivy library - roll back to 2.4.0 to bring Jenkins build back to normal

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit e6a961ce967e94dc7128154b68cfa24fcd4370e9
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Tue Oct 30 17:47:22 2018 +0100

    NUTCH-2671 Upgrade to ant ivy library
    - roll back to 2.4.0 to bring Jenkins build back to normal
---
 default.properties | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/default.properties b/default.properties
index 1423025..bb987d9 100644
--- a/default.properties
+++ b/default.properties
@@ -63,7 +63,7 @@ runtime.dir=./runtime
 runtime.deploy=${runtime.dir}/deploy
 runtime.local=${runtime.dir}/local
 
-ivy.version=2.5.0-rc1
+ivy.version=2.4.0
 ivy.dir=${basedir}/ivy
 ivy.file=${ivy.dir}/ivy.xml
 ivy.jar=${ivy.dir}/ivy-${ivy.version}.jar


[nutch] 12/14: NUTCH-2671 Upgrade to ant ivy library - fix order of ant target dependencies: "compile-core" must come before "resolve-test"

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 393d3e5f96c0f381b904e17e5abcad695f911e5e
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Tue Oct 30 16:45:22 2018 +0100

    NUTCH-2671 Upgrade to ant ivy library
    - fix order of ant target dependencies:
      "compile-core" must come before "resolve-test"
---
 build.xml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/build.xml b/build.xml
index e19179e..37c44b8 100644
--- a/build.xml
+++ b/build.xml
@@ -415,7 +415,7 @@
   <!-- ================================================================== -->
   <!-- Compile test code                                                  -->
   <!-- ================================================================== -->
-  <target name="compile-core-test" depends="init, resolve-test, compile-core" description="--> compile test code">
+  <target name="compile-core-test" depends="init, compile-core, resolve-test" description="--> compile test code">
     <javac
      encoding="${build.encoding}"
      srcdir="${test.src.dir}"


[nutch] 11/14: NUTCH-2671 Upgrade to ant ivy library - upgrade to 2.5.0-rc1 to address NUTCH-2669

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 93b1a8174254de83232be12ac18d99ca4fa83518
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Mon Oct 29 13:41:42 2018 +0100

    NUTCH-2671 Upgrade to ant ivy library
    - upgrade to 2.5.0-rc1 to address NUTCH-2669
---
 .gitignore         | 1 +
 default.properties | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index f44d4e7..732ca05 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,4 +11,5 @@ logs/
 .project
 ivy/ivy-2.3.0.jar
 ivy/ivy-2.4.0.jar
+ivy/ivy-2.5.0-rc1.jar
 naivebayes-model
diff --git a/default.properties b/default.properties
index bb987d9..1423025 100644
--- a/default.properties
+++ b/default.properties
@@ -63,7 +63,7 @@ runtime.dir=./runtime
 runtime.deploy=${runtime.dir}/deploy
 runtime.local=${runtime.dir}/local
 
-ivy.version=2.4.0
+ivy.version=2.5.0-rc1
 ivy.dir=${basedir}/ivy
 ivy.file=${ivy.dir}/ivy.xml
 ivy.jar=${ivy.dir}/ivy-${ivy.version}.jar


[nutch] 09/14: NUTCH-2651 Upgrade to Tika 1.19.1 (from 1.18) - modified work-around to fix downloading of dependency javax.ws.rs-api-*.jar: define property packaging.type in ivysettings.xml

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 31a1ec4bab4a702fa8876926d54b212cc40acbce
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Sun Oct 21 20:49:51 2018 +0200

    NUTCH-2651 Upgrade to Tika 1.19.1 (from 1.18)
    - modified work-around to fix downloading of dependency javax.ws.rs-api-*.jar:
      define property packaging.type in ivysettings.xml
---
 default.properties  | 8 --------
 ivy/ivysettings.xml | 8 ++++++++
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/default.properties b/default.properties
index e6b3f4e..bb987d9 100644
--- a/default.properties
+++ b/default.properties
@@ -77,14 +77,6 @@ ivy.shared.default.root=${ivy.default.ivy.user.dir}/shared
 ivy.shared.default.ivy.pattern=[organisation]/[module]/[revision]/[type]s/[artifact].[ext]
 ivy.shared.default.artifact.pattern=[organisation]/[module]/[revision]/[type]s/[artifact].[ext]
 
-# work-around to fix failing dependency download of
-#  javax.ws.rs-api.jar
-# required by Tika (1.19 and higher)
-# cf. (also affects ant/ivy)
-#  https://github.com/eclipse-ee4j/jaxrs-api/issues/572
-#  https://github.com/gradle/gradle/issues/3065
-packaging.type=jar
-
 #
 # Plugins API
 #
diff --git a/ivy/ivysettings.xml b/ivy/ivysettings.xml
index d9b5044..a2dc700 100644
--- a/ivy/ivysettings.xml
+++ b/ivy/ivysettings.xml
@@ -38,6 +38,14 @@
     value="[organisation]/[module]/[revision]/[module]-[revision](-[classifier])"/>
   <property name="maven2.pattern.ext"
     value="${maven2.pattern}.[ext]"/>
+  <!-- define packaging.type=jar to work around the failing dependency download of
+         javax.ws.rs-api.jar
+       required by Tika (1.19 and higher), cf.
+         https://github.com/eclipse-ee4j/jaxrs-api/issues/572
+         https://github.com/jax-rs/api/pull/576
+  -->
+  <property name="packaging.type"
+    value="jar"/>
   <!-- pull in the local repository -->
   <include url="${ivy.default.conf.dir}/ivyconf-local.xml"/>
   <settings defaultResolver="default"/>


[nutch] 02/14: NUTCH-2630 Fetcher to log skipped records by robots.txt - change required log level to INFO (default) for messages reporting skipped URLs because of robots.txt rules (disallow or crawl delay larger than fetcher.max.crawl.delay)

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 524a59480a3e258a0363faf343fa57875f8f9ea8
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Mon Oct 8 14:50:51 2018 +0200

    NUTCH-2630 Fetcher to log skipped records by robots.txt
    - change required log level to INFO (default) for messages
      reporting skipped URLs because of robots.txt rules
      (disallow or crawl delay larger than fetcher.max.crawl.delay)
---
 src/java/org/apache/nutch/fetcher/FetcherThread.java | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/java/org/apache/nutch/fetcher/FetcherThread.java b/src/java/org/apache/nutch/fetcher/FetcherThread.java
index bfcc374..6ba920e 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherThread.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherThread.java
@@ -302,9 +302,7 @@ public class FetcherThread extends Thread {
             if (!rules.isAllowed(fit.url.toString())) {
               // unblock
               ((FetchItemQueues) fetchQueues).finishFetchItem(fit, true);
-              if (LOG.isDebugEnabled()) {
-                LOG.debug("Denied by robots.txt: {}", fit.url);
-              }
+              LOG.info("Denied by robots.txt: {}", fit.url);
               output(fit.url, fit.datum, null,
                   ProtocolStatus.STATUS_ROBOTS_DENIED,
                   CrawlDatum.STATUS_FETCH_GONE);
@@ -315,7 +313,7 @@ public class FetcherThread extends Thread {
               if (rules.getCrawlDelay() > maxCrawlDelay && maxCrawlDelay >= 0) {
                 // unblock
                 ((FetchItemQueues) fetchQueues).finishFetchItem(fit, true);
-                LOG.debug("Crawl-Delay for {} too long ({}), skipping", fit.url,
+                LOG.info("Crawl-Delay for {} too long ({}), skipping", fit.url,
                     rules.getCrawlDelay());
                 output(fit.url, fit.datum, null,
                     ProtocolStatus.STATUS_ROBOTS_DENIED,


[nutch] 07/14: NUTCH-2660 Plugin tests not executed - add missing unit test packages to plugin build.xml - tests of "headings" plugin depend on "lib-nekohtml" - add "protocol-okhttp" to Javadoc API overview - add missing test packages to ant "eclipse" target

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit d45fb7a659ba29371f171817a6a6de72965189c3
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Wed Oct 17 14:36:58 2018 +0200

    NUTCH-2660 Plugin tests not executed
    - add missing unit test packages to plugin build.xml
    - tests of "headings" plugin depend on "lib-nekohtml"
    - add "protocol-okhttp" to Javadoc API overview
    - add missing test packages to ant "eclipse" target
---
 build.xml                     |  2 ++
 default.properties            |  1 +
 src/plugin/build.xml          |  3 +++
 src/plugin/headings/build.xml | 18 ++++++++++++++++++
 4 files changed, 24 insertions(+)

diff --git a/build.xml b/build.xml
index 785442a..e19179e 100644
--- a/build.xml
+++ b/build.xml
@@ -1061,6 +1061,7 @@
         <source path="${plugins.dir}/feed/src/java/" />
         <source path="${plugins.dir}/feed/src/test/" />
         <source path="${plugins.dir}/headings/src/java/" />
+        <source path="${plugins.dir}/headings/src/test/" />
         <source path="${plugins.dir}/exchange-jexl/src/java/" />
         <source path="${plugins.dir}/index-anchor/src/java/" />
         <source path="${plugins.dir}/index-anchor/src/test/" />
@@ -1104,6 +1105,7 @@
         <source path="${plugins.dir}/parse-html/src/java/" />
         <source path="${plugins.dir}/parse-html/src/test/" />
         <source path="${plugins.dir}/parse-js/src/java/" />
+        <source path="${plugins.dir}/parse-js/src/test/" />
         <source path="${plugins.dir}/parse-metatags/src/java/" />
         <source path="${plugins.dir}/parse-metatags/src/test/" />
         <source path="${plugins.dir}/parse-swf/src/java/" />
diff --git a/default.properties b/default.properties
index 00af414..e6b3f4e 100644
--- a/default.properties
+++ b/default.properties
@@ -101,6 +101,7 @@ plugins.protocol=\
    org.apache.nutch.protocol.http*:\
    org.apache.nutch.protocol.httpclient*:\
    org.apache.nutch.protocol.interactiveselenium*:\
+   org.apache.nutch.protocol.okhttp*:\
    org.apache.nutch.protocol.selenium*:\
    org.apache.nutch.protocol.htmlunit*:\
 
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index d8e2ef5..d8826e8 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -113,9 +113,11 @@
      <ant dir="any23" target="test"/>
      <ant dir="creativecommons" target="test"/>
      <ant dir="feed" target="test"/>
+     <ant dir="headings" target="test"/>
      <ant dir="index-anchor" target="test"/>
      <ant dir="index-basic" target="test"/>
      <!--ant dir="index-geoip" target="test"/-->
+     <ant dir="index-jexl-filter" target="test"/>
      <ant dir="index-links" target="test"/>
      <ant dir="index-more" target="test"/>
      <ant dir="index-replace" target="test"/>
@@ -128,6 +130,7 @@
      <ant dir="mimetype-filter" target="test"/>
      <!--ant dir="parse-ext" target="test"/-->
      <ant dir="parse-html" target="test"/>
+     <ant dir="parse-js" target="test"/>
      <ant dir="parse-metatags" target="test"/>
      <ant dir="parse-swf" target="test"/>
      <ant dir="parse-tika" target="test"/>
diff --git a/src/plugin/headings/build.xml b/src/plugin/headings/build.xml
index d334ad1..29288e1 100644
--- a/src/plugin/headings/build.xml
+++ b/src/plugin/headings/build.xml
@@ -19,4 +19,22 @@
 
   <import file="../build-plugin.xml"/>
 
+  <!-- Build compilation dependencies -->
+  <target name="deps-jar">
+    <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
+  </target>
+
+  <!-- Add compilation dependencies to classpath -->
+  <path id="plugin.deps">
+    <fileset dir="${nutch.root}/build">
+      <include name="**/lib-nekohtml/*.jar" />
+    </fileset>
+  </path>
+
+  <!-- Deploy Unit test dependencies -->
+  <target name="deps-test">
+    <ant target="deploy" inheritall="false" dir="../lib-nekohtml"/>
+    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
+  </target>
+
 </project>


[nutch] 05/14: NUTCH-2655 Update Solr schema.xml for Solr 7.x - add required field types to schema.xml

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit a9ea1f1012f6d1b4296d4728b00cf7498aa05dba
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Mon Oct 15 15:04:01 2018 +0200

    NUTCH-2655 Update Solr schema.xml for Solr 7.x
    - add required field types to schema.xml
---
 conf/schema.xml | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/conf/schema.xml b/conf/schema.xml
index 6e7d5bf..2b095e5 100644
--- a/conf/schema.xml
+++ b/conf/schema.xml
@@ -300,6 +300,19 @@
     <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
     <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
 
+    <!-- required for Solr 6 + 7 compatibility -->
+    <fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
+    <fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>
+    <fieldType name="pint" class="solr.IntPointField" docValues="true"/>
+    <fieldType name="pfloat" class="solr.FloatPointField" docValues="true"/>
+    <fieldType name="plong" class="solr.LongPointField" docValues="true"/>
+    <fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
+    <fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>
+    <fieldType name="pfloats" class="solr.FloatPointField" docValues="true" multiValued="true"/>
+    <fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>
+    <fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>
+    <fieldType name="random" class="solr.RandomSortField" indexed="true"/>
+
     <!-- sortMissingLast and sortMissingFirst attributes are optional attributes are
          currently supported on types that are sorted internally as strings
          and on numeric types.


[nutch] 04/14: NUTCH-2652 Fetcher launches more fetch tasks than fetch lists - properly override method getSplits(...) of FileInputFormat

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 89b16ce29f3bf6618ec2bf9df0807b24c1e40339
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Mon Oct 15 13:44:20 2018 +0200

    NUTCH-2652 Fetcher launches more fetch tasks than fetch lists
    - properly override method getSplits(...) of FileInputFormat
---
 src/java/org/apache/nutch/fetcher/Fetcher.java | 37 +++++++++++++-------------
 1 file changed, 18 insertions(+), 19 deletions(-)

diff --git a/src/java/org/apache/nutch/fetcher/Fetcher.java b/src/java/org/apache/nutch/fetcher/Fetcher.java
index f6584c5..fe9e71e 100644
--- a/src/java/org/apache/nutch/fetcher/Fetcher.java
+++ b/src/java/org/apache/nutch/fetcher/Fetcher.java
@@ -23,28 +23,24 @@ import java.text.SimpleDateFormat;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
-import java.util.Iterator;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Map;
 import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.atomic.AtomicLong;
 
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.JobContext;
 import org.apache.hadoop.mapreduce.Mapper;
-import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
-import org.apache.hadoop.mapreduce.InputSplit;
-import org.apache.hadoop.mapred.FileSplit;
 import org.apache.hadoop.util.StringUtils;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;
@@ -55,6 +51,8 @@ import org.apache.nutch.util.NutchConfiguration;
 import org.apache.nutch.util.NutchJob;
 import org.apache.nutch.util.NutchTool;
 import org.apache.nutch.util.TimingUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * A queue-based fetcher.
@@ -105,19 +103,20 @@ public class Fetcher extends NutchTool implements Tool {
   private static final Logger LOG = LoggerFactory
       .getLogger(MethodHandles.lookup().lookupClass());
 
-  public static class InputFormat extends
-  SequenceFileInputFormat<Text, CrawlDatum> {
-    /** Don't split inputs, to keep things polite. */
-    public InputSplit[] getSplits(JobContext job, int nSplits) throws IOException {
+  public static class InputFormat
+      extends SequenceFileInputFormat<Text, CrawlDatum> {
+    /**
+     * Don't split inputs to keep things polite - a single fetch list must be
+     * processed in one fetcher task. Do not split a fetch lists and assigning
+     * the splits to multiple parallel tasks.
+     */
+    @Override
+    public List<InputSplit> getSplits(JobContext job) throws IOException {
       List<FileStatus> files = listStatus(job);
-      FileSplit[] splits = new FileSplit[files.size()];
-      Iterator<FileStatus> iterator= files.listIterator();
-      int index = 0;
-      while(iterator.hasNext()) {
-        index++;
-        FileStatus cur = iterator.next();
-        splits[index] = new FileSplit(cur.getPath(), 0, cur.getLen(),
-            (String[]) null);
+      List<InputSplit> splits = new ArrayList<>();
+      for (FileStatus cur : files) {
+        splits.add(
+            new FileSplit(cur.getPath(), 0, cur.getLen(), (String[]) null));
       }
       return splits;
     }


[nutch] 14/14: NUTCH-1842: crawl.gen.delay value is read incorrectly from config Merge pull request #393 from YossiTamari/patch-2

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit f861c8203c8544b91e061964441485bd2f6de145
Merge: 8151237 e6a961c
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Thu Nov 15 11:17:37 2018 +0100

    NUTCH-1842: crawl.gen.delay value is read incorrectly from config
    Merge pull request #393 from YossiTamari/patch-2

 src/java/org/apache/nutch/crawl/Generator.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[nutch] 03/14: NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1 - add work-around to fix downloading of dependency javax.ws.rs-api-*.jar (need to set property packaging.type=jar)

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 2a3b1d15fdebe7ada325b9b955c164270a21e127
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Fri Oct 12 13:47:43 2018 +0200

    NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1
    - add work-around to fix downloading of dependency javax.ws.rs-api-*.jar
      (need to set property packaging.type=jar)
---
 default.properties               |   8 +++
 ivy/ivy.xml                      |   2 +-
 src/plugin/parse-tika/ivy.xml    |   2 +-
 src/plugin/parse-tika/plugin.xml | 112 +++++++++++++++++++--------------------
 4 files changed, 66 insertions(+), 58 deletions(-)

diff --git a/default.properties b/default.properties
index d6f606b..00af414 100644
--- a/default.properties
+++ b/default.properties
@@ -77,6 +77,14 @@ ivy.shared.default.root=${ivy.default.ivy.user.dir}/shared
 ivy.shared.default.ivy.pattern=[organisation]/[module]/[revision]/[type]s/[artifact].[ext]
 ivy.shared.default.artifact.pattern=[organisation]/[module]/[revision]/[type]s/[artifact].[ext]
 
+# work-around to fix failing dependency download of
+#  javax.ws.rs-api.jar
+# required by Tika (1.19 and higher)
+# cf. (also affects ant/ivy)
+#  https://github.com/eclipse-ee4j/jaxrs-api/issues/572
+#  https://github.com/gradle/gradle/issues/3065
+packaging.type=jar
+
 #
 # Plugins API
 #
diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 5272de6..f1e4a80 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -65,7 +65,7 @@
 		<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="2.7.4" conf="*->default"/>
 		<!-- End of Hadoop Dependencies -->
 
-		<dependency org="org.apache.tika" name="tika-core" rev="1.18" />
+		<dependency org="org.apache.tika" name="tika-core" rev="1.19.1" />
 		<dependency org="com.ibm.icu" name="icu4j" rev="61.1" />
 
 		<dependency org="xerces" name="xercesImpl" rev="2.11.0" />
diff --git a/src/plugin/parse-tika/ivy.xml b/src/plugin/parse-tika/ivy.xml
index 81e7a80..53c7775 100644
--- a/src/plugin/parse-tika/ivy.xml
+++ b/src/plugin/parse-tika/ivy.xml
@@ -36,7 +36,7 @@
   </publications>
 
   <dependencies>
-    <dependency org="org.apache.tika" name="tika-parsers" rev="1.18" conf="*->default">
+    <dependency org="org.apache.tika" name="tika-parsers" rev="1.19.1" conf="*->default">
       <exclude org="org.apache.tika" name="tika-core" />
       <exclude org="org.apache.httpcomponents" name="httpclient" />
       <exclude org="org.apache.httpcomponents" name="httpcore" />
diff --git a/src/plugin/parse-tika/plugin.xml b/src/plugin/parse-tika/plugin.xml
index 398c0e4..7dbe180 100644
--- a/src/plugin/parse-tika/plugin.xml
+++ b/src/plugin/parse-tika/plugin.xml
@@ -26,76 +26,79 @@
          <export name="*"/>
       </library>
       <!-- dependencies of Tika (tika-parsers) -->
-      <library name="aopalliance-1.0.jar"/>
-      <library name="apache-mime4j-core-0.8.1.jar"/>
-      <library name="apache-mime4j-dom-0.8.1.jar"/>
-      <library name="asm-5.0.4.jar"/>
-      <library name="bcmail-jdk15on-1.54.jar"/>
-      <library name="bcpkix-jdk15on-1.54.jar"/>
-      <library name="bcprov-jdk15on-1.54.jar"/>
+      <library name="activation-1.1.1.jar"/>
+      <library name="apache-mime4j-core-0.8.2.jar"/>
+      <library name="apache-mime4j-dom-0.8.2.jar"/>
+      <library name="asm-6.2.jar"/>
+      <library name="bcmail-jdk15on-1.60.jar"/>
+      <library name="bcpkix-jdk15on-1.60.jar"/>
+      <library name="bcprov-jdk15on-1.60.jar"/>
       <library name="boilerpipe-1.1.0.jar"/>
       <library name="bzip2-0.9.1.jar"/>
       <library name="c3p0-0.9.1.1.jar"/>
       <library name="cdm-4.5.5.jar"/>
-      <library name="commons-codec-1.10.jar"/>
-      <library name="commons-collections4-4.1.jar"/>
-      <library name="commons-compress-1.16.1.jar"/>
-      <library name="commons-csv-1.0.jar"/>
+      <library name="commons-codec-1.11.jar"/>
+      <library name="commons-collections4-4.2.jar"/>
+      <library name="commons-compress-1.18.jar"/>
+      <library name="commons-csv-1.5.jar"/>
       <library name="commons-exec-1.3.jar"/>
       <library name="commons-io-2.6.jar"/>
-      <library name="commons-logging-1.1.3.jar"/>
       <library name="commons-logging-1.2.jar"/>
-      <library name="commons-logging-api-1.1.jar"/>
       <library name="curvesapi-1.04.jar"/>
-      <library name="cxf-core-3.0.16.jar"/>
-      <library name="cxf-rt-frontend-jaxrs-3.0.16.jar"/>
-      <library name="cxf-rt-rs-client-3.0.16.jar"/>
-      <library name="cxf-rt-transports-http-3.0.16.jar"/>
+      <library name="cxf-core-3.2.6.jar"/>
+      <library name="cxf-rt-frontend-jaxrs-3.2.6.jar"/>
+      <library name="cxf-rt-rs-client-3.2.6.jar"/>
+      <library name="cxf-rt-transports-http-3.2.6.jar"/>
       <library name="dec-0.1.2.jar"/>
       <library name="ehcache-core-2.6.2.jar"/>
-      <library name="fontbox-2.0.9.jar"/>
+      <library name="FastInfoset-1.2.13.jar"/>
+      <library name="fontbox-2.0.12.jar"/>
       <library name="geoapi-3.0.1.jar"/>
       <library name="grib-4.5.5.jar"/>
-      <library name="gson-2.8.1.jar"/>
+      <library name="gson-2.8.5.jar"/>
       <library name="guava-17.0.jar"/>
-      <library name="httpmime-4.5.4.jar"/>
+      <library name="httpmime-4.5.6.jar"/>
       <library name="httpservices-4.5.5.jar"/>
-      <library name="isoparser-1.1.18.jar"/>
-      <library name="jackcess-2.1.10.jar"/>
+      <library name="isoparser-1.1.22.jar"/>
+      <library name="istack-commons-runtime-3.0.5.jar"/>
+      <library name="jackcess-2.1.12.jar"/>
       <library name="jackcess-encrypt-2.1.4.jar"/>
-      <library name="jackson-annotations-2.9.5.jar"/>
-      <library name="jackson-core-2.9.5.jar"/>
-      <library name="jackson-databind-2.9.5.jar"/>
-      <library name="jai-imageio-core-1.3.1.jar"/>
+      <library name="jackson-annotations-2.9.6.jar"/>
+      <library name="jackson-core-2.9.6.jar"/>
+      <library name="jackson-databind-2.9.6.jar"/>
+      <library name="jai-imageio-core-1.4.0.jar"/>
       <library name="java-libpst-0.8.1.jar"/>
-      <library name="javax.annotation-api-1.2.jar"/>
-      <library name="javax.ws.rs-api-2.0.1.jar"/>
-      <library name="jbig2-imageio-3.0.0.jar"/>
+      <library name="javax.annotation-api-1.3.jar"/>
+      <library name="javax.ws.rs-api-2.1.jar"/>
+      <library name="jaxb-api-2.3.0.jar"/>
+      <library name="jaxb-core-2.3.0.1.jar"/>
+      <library name="jaxb-runtime-2.3.0.1.jar"/>
+      <library name="jbig2-imageio-3.0.2.jar"/>
       <library name="jcip-annotations-1.0.jar"/>
-      <library name="jcl-over-slf4j-1.7.24.jar"/>
+      <library name="jcl-over-slf4j-1.7.25.jar"/>
       <library name="jcommander-1.35.jar"/>
       <library name="jdom2-2.0.6.jar"/>
-      <library name="jempbox-1.8.13.jar"/>
-      <library name="jhighlight-1.0.2.jar"/>
-      <library name="jmatio-1.2.jar"/>
-      <library name="jna-4.1.0.jar"/>
+      <library name="jempbox-1.8.16.jar"/>
+      <library name="jhighlight-1.0.3.jar"/>
+      <library name="jmatio-1.5.jar"/>
+      <library name="jna-4.3.0.jar"/>
       <library name="joda-time-2.2.jar"/>
       <library name="json-simple-1.1.1.jar"/>
-      <library name="jsoup-1.11.2.jar"/>
-      <library name="jul-to-slf4j-1.7.24.jar"/>
+      <library name="jsoup-1.11.3.jar"/>
+      <library name="jul-to-slf4j-1.7.25.jar"/>
       <library name="juniversalchardet-1.0.3.jar"/>
-      <library name="junrar-0.7.jar"/>
-      <library name="metadata-extractor-2.10.1.jar"/>
+      <library name="junrar-2.0.0.jar"/>
+      <library name="metadata-extractor-2.11.0.jar"/>
       <library name="netcdf4-4.5.5.jar"/>
-      <library name="objenesis-2.6.jar"/>
       <library name="openjson-1.0.10.jar"/>
-      <library name="opennlp-tools-1.8.4.jar"/>
-      <library name="pdfbox-2.0.9.jar"/>
-      <library name="pdfbox-tools-2.0.9.jar"/>
-      <library name="poi-3.17.jar"/>
-      <library name="poi-ooxml-3.17.jar"/>
-      <library name="poi-ooxml-schemas-3.17.jar"/>
-      <library name="poi-scratchpad-3.17.jar"/>
+      <library name="opennlp-tools-1.9.0.jar"/>
+      <library name="parso-2.0.9.jar"/>
+      <library name="pdfbox-2.0.12.jar"/>
+      <library name="pdfbox-tools-2.0.12.jar"/>
+      <library name="poi-4.0.0.jar"/>
+      <library name="poi-ooxml-4.0.0.jar"/>
+      <library name="poi-ooxml-schemas-4.0.0.jar"/>
+      <library name="poi-scratchpad-4.0.0.jar"/>
       <library name="quartz-2.2.0.jar"/>
       <library name="rome-1.5.1.jar"/>
       <library name="rome-utils-1.5.1.jar"/>
@@ -106,23 +109,20 @@
       <library name="sis-referencing-0.8.jar"/>
       <library name="sis-storage-0.8.jar"/>
       <library name="sis-utility-0.8.jar"/>
-      <library name="spring-aop-3.2.16.RELEASE.jar"/>
-      <library name="spring-beans-3.2.16.RELEASE.jar"/>
-      <library name="spring-context-3.2.16.RELEASE.jar"/>
-      <library name="spring-core-3.2.16.RELEASE.jar"/>
-      <library name="spring-expression-3.2.16.RELEASE.jar"/>
-      <library name="stax2-api-3.1.4.jar"/>
+      <library name="stax2-api-4.1.jar"/>
+      <library name="stax-ex-1.7.8.jar"/>
       <library name="tagsoup-1.2.1.jar"/>
-      <library name="tika-parsers-1.18.jar"/>
+      <library name="tika-parsers-1.19.1.jar"/>
+      <library name="txw2-2.3.0.1.jar"/>
       <library name="udunits-4.5.5.jar"/>
       <library name="uimafit-core-2.2.0.jar"/>
       <library name="uimaj-core-2.9.0.jar"/>
       <library name="unit-api-1.0.jar"/>
       <library name="vorbis-java-core-0.8.jar"/>
       <library name="vorbis-java-tika-0.8.jar"/>
-      <library name="woodstox-core-asl-4.4.1.jar"/>
-      <library name="xmlbeans-2.6.0.jar"/>
-      <library name="xmlschema-core-2.2.2.jar"/>
+      <library name="woodstox-core-5.1.0.jar"/>
+      <library name="xmlbeans-3.0.1.jar"/>
+      <library name="xmlschema-core-2.2.3.jar"/>
       <library name="xmpcore-5.1.3.jar"/>
       <library name="xz-1.8.jar"/>
       <!-- end of dependencies of Tika (tika-parsers) -->


[nutch] 01/14: NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances - lock critical block (conditional creation of plugin instance) on object cache object

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit a6f533dfecd688a6c43212b0e826be9a2da5b4ce
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Tue Jul 24 16:19:04 2018 +0200

    NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances
    - lock critical block (conditional creation of plugin instance)
      on object cache object
---
 .../org/apache/nutch/protocol/ProtocolFactory.java | 26 ++++++++++++----------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/src/java/org/apache/nutch/protocol/ProtocolFactory.java b/src/java/org/apache/nutch/protocol/ProtocolFactory.java
index 87944a8..2d20ecd 100644
--- a/src/java/org/apache/nutch/protocol/ProtocolFactory.java
+++ b/src/java/org/apache/nutch/protocol/ProtocolFactory.java
@@ -81,7 +81,7 @@ public class ProtocolFactory {
    * @throws ProtocolNotFound
    *           when Protocol can not be found for url
    */
-  public synchronized Protocol getProtocol(URL url)
+  public Protocol getProtocol(URL url)
       throws ProtocolNotFound {
     ObjectCache objectCache = ObjectCache.get(conf);
     try {
@@ -91,19 +91,21 @@ public class ProtocolFactory {
       }
 
       String cacheId = Protocol.X_POINT_ID + protocolName;
-      Protocol protocol = (Protocol) objectCache.getObject(cacheId);
-      if (protocol != null) {
+      synchronized (objectCache) {
+        Protocol protocol = (Protocol) objectCache.getObject(cacheId);
+        if (protocol != null) {
+          return protocol;
+        }
+
+        Extension extension = findExtension(protocolName);
+        if (extension == null) {
+          throw new ProtocolNotFound(protocolName);
+        }
+
+        protocol = (Protocol) extension.getExtensionInstance();
+        objectCache.setObject(cacheId, protocol);
         return protocol;
       }
-
-      Extension extension = findExtension(protocolName);
-      if (extension == null) {
-        throw new ProtocolNotFound(protocolName);
-      }
-
-      protocol = (Protocol) extension.getExtensionInstance();
-      objectCache.setObject(cacheId, protocol);
-      return protocol;
     } catch (PluginRuntimeException e) {
       throw new ProtocolNotFound(url.toString(), e.toString());
     }


[nutch] 08/14: NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 2d48152db0d032a58ea2324e8b40b6c5c48d7cd6
Author: Jorge Luis Betancourt Gonzalez <jo...@trivago.com>
AuthorDate: Wed Oct 17 18:07:51 2018 +0200

    NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path
---
 .../index-links/src => }/test/org/apache/nutch/parse/TestOutlinks.java    | 0
 1 file changed, 0 insertions(+), 0 deletions(-)

diff --git a/src/plugin/index-links/src/test/org/apache/nutch/parse/TestOutlinks.java b/src/test/org/apache/nutch/parse/TestOutlinks.java
similarity index 100%
rename from src/plugin/index-links/src/test/org/apache/nutch/parse/TestOutlinks.java
rename to src/test/org/apache/nutch/parse/TestOutlinks.java


[nutch] 10/14: NUTCH-2658 Adding the fields required by the index-links plugin to the schema

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit a5df63a3d644e90fb881a0f16c8f29d9320d1de3
Author: Jorge Luis Betancourt <be...@gmail.com>
AuthorDate: Tue Oct 23 22:57:03 2018 +0200

    NUTCH-2658 Adding the fields required by the index-links plugin to the schema
---
 conf/schema.xml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/conf/schema.xml b/conf/schema.xml
index 2b095e5..57a44ac 100644
--- a/conf/schema.xml
+++ b/conf/schema.xml
@@ -398,6 +398,10 @@
     <field name="lastModified" type="date" stored="true" indexed="false"/>
     <field name="date" type="tdate" stored="true" indexed="true"/>
 
+    <!-- fields for index-links -->
+    <field name="inlinks" type="url" stored="true" indexed="true" multiValued="true"/>
+    <field name="outlinks" type="url" stored="true" indexed="true" multiValued="true"/>
+
     <!-- fields for languageidentifier plugin -->
     <field name="lang" type="string" stored="true" indexed="true"/>
 


[nutch] 06/14: NUTCH-2659 Add missing Apache license headers

Posted by sn...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

commit 48e1aef83b94468c9f839cf28b24560bef233780
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Wed Oct 17 14:23:44 2018 +0200

    NUTCH-2659 Add missing Apache license headers
---
 .../org/apache/nutch/indexer/IndexWriterParams.java  | 17 +++++++++++++++++
 .../apache/nutch/scoring/AbstractScoringFilter.java  | 17 +++++++++++++++++
 .../apache/nutch/tools/CommonCrawlFormatWARC.java    | 17 +++++++++++++++++
 src/java/org/apache/nutch/tools/WARCUtils.java       | 17 +++++++++++++++++
 .../nutch/webui/pages/instances/InstancePanel.java   | 17 +++++++++++++++++
 .../nutch/webui/pages/settings/SettingsPage.java     | 17 +++++++++++++++++
 .../parse/headings/TestHeadingsParseFilter.java      | 17 +++++++++++++++++
 src/plugin/index-replace/plugin.xml                  | 16 ++++++++++++++++
 .../nutch/indexwriter/dummy/DummyConstants.java      | 17 +++++++++++++++++
 src/plugin/parse-metatags/plugin.xml                 | 16 ++++++++++++++++
 src/plugin/scoring-depth/build.xml                   | 16 ++++++++++++++++
 src/plugin/scoring-depth/plugin.xml                  | 16 ++++++++++++++++
 .../nutch/scoring/depth/DepthScoringFilter.java      | 17 +++++++++++++++++
 .../scoring/similarity/cosine/package-info.java      | 20 +++++++++++++++++---
 .../apache/nutch/crawl/TODOTestCrawlDbStates.java    | 17 +++++++++++++++++
 15 files changed, 251 insertions(+), 3 deletions(-)

diff --git a/src/java/org/apache/nutch/indexer/IndexWriterParams.java b/src/java/org/apache/nutch/indexer/IndexWriterParams.java
index cc91ec0..952dc9e 100644
--- a/src/java/org/apache/nutch/indexer/IndexWriterParams.java
+++ b/src/java/org/apache/nutch/indexer/IndexWriterParams.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.indexer;
 
 import org.apache.hadoop.util.StringUtils;
diff --git a/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java b/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java
index d74c7fb..cd59274 100644
--- a/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java
+++ b/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.scoring;
 
 import java.util.Collection;
diff --git a/src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java b/src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java
index 6f89b16..27f1198 100644
--- a/src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java
+++ b/src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.tools;
 
 import java.io.ByteArrayInputStream;
diff --git a/src/java/org/apache/nutch/tools/WARCUtils.java b/src/java/org/apache/nutch/tools/WARCUtils.java
index a705ae7..dab3ba7 100644
--- a/src/java/org/apache/nutch/tools/WARCUtils.java
+++ b/src/java/org/apache/nutch/tools/WARCUtils.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.tools;
 
 import java.io.ByteArrayInputStream;
diff --git a/src/java/org/apache/nutch/webui/pages/instances/InstancePanel.java b/src/java/org/apache/nutch/webui/pages/instances/InstancePanel.java
index 5b91b1a..cc54a7b 100644
--- a/src/java/org/apache/nutch/webui/pages/instances/InstancePanel.java
+++ b/src/java/org/apache/nutch/webui/pages/instances/InstancePanel.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.webui.pages.instances;
 
 import org.apache.nutch.webui.model.NutchInstance;
diff --git a/src/java/org/apache/nutch/webui/pages/settings/SettingsPage.java b/src/java/org/apache/nutch/webui/pages/settings/SettingsPage.java
index 2806aa7..baf341c 100644
--- a/src/java/org/apache/nutch/webui/pages/settings/SettingsPage.java
+++ b/src/java/org/apache/nutch/webui/pages/settings/SettingsPage.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.webui.pages.settings;
 
 import java.util.Iterator;
diff --git a/src/plugin/headings/src/test/org/apache/nutch/parse/headings/TestHeadingsParseFilter.java b/src/plugin/headings/src/test/org/apache/nutch/parse/headings/TestHeadingsParseFilter.java
index 125d756..082b5f4 100644
--- a/src/plugin/headings/src/test/org/apache/nutch/parse/headings/TestHeadingsParseFilter.java
+++ b/src/plugin/headings/src/test/org/apache/nutch/parse/headings/TestHeadingsParseFilter.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.parse.headings;
 
 import org.apache.hadoop.conf.Configuration;
diff --git a/src/plugin/index-replace/plugin.xml b/src/plugin/index-replace/plugin.xml
index 3cffe60..29a4344 100644
--- a/src/plugin/index-replace/plugin.xml
+++ b/src/plugin/index-replace/plugin.xml
@@ -1,4 +1,20 @@
 <?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
 <plugin
    id="index-replace"
    name="Replace Indexer"
diff --git a/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyConstants.java b/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyConstants.java
index 7dea970..46d6d45 100644
--- a/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyConstants.java
+++ b/src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyConstants.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.indexwriter.dummy;
 
 public interface DummyConstants {
diff --git a/src/plugin/parse-metatags/plugin.xml b/src/plugin/parse-metatags/plugin.xml
index 07933fa..0d0e73f 100644
--- a/src/plugin/parse-metatags/plugin.xml
+++ b/src/plugin/parse-metatags/plugin.xml
@@ -1,4 +1,20 @@
 <?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
 <plugin
    id="parse-metatags"
    name="MetaTags"
diff --git a/src/plugin/scoring-depth/build.xml b/src/plugin/scoring-depth/build.xml
index 6c041ed..663cd04 100644
--- a/src/plugin/scoring-depth/build.xml
+++ b/src/plugin/scoring-depth/build.xml
@@ -1,4 +1,20 @@
 <?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
 <project name="scoring-depth" default="jar-core">
 
   <import file="../build-plugin.xml"/>
diff --git a/src/plugin/scoring-depth/plugin.xml b/src/plugin/scoring-depth/plugin.xml
index ea57dc6..ce1f9f0 100644
--- a/src/plugin/scoring-depth/plugin.xml
+++ b/src/plugin/scoring-depth/plugin.xml
@@ -1,4 +1,20 @@
 <?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
 <plugin
    id="scoring-depth"
    name="Scoring plugin for depth-limited crawling."
diff --git a/src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/DepthScoringFilter.java b/src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/DepthScoringFilter.java
index 0a0dd27..07e0e3f 100644
--- a/src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/DepthScoringFilter.java
+++ b/src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/DepthScoringFilter.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.scoring.depth;
 
 import java.util.Collection;
diff --git a/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/package-info.java b/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/package-info.java
index 94b8268..49dc835 100644
--- a/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/package-info.java
+++ b/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/package-info.java
@@ -1,7 +1,21 @@
 /**
- * 
- */
-/** Implements the cosine similarity metric for scoring relevant documents 
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
  *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Implements the cosine similarity metric for scoring relevant documents
  */
 package org.apache.nutch.scoring.similarity.cosine;
diff --git a/src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java b/src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
index 730f83d..d16c6bd 100644
--- a/src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
+++ b/src/test/org/apache/nutch/crawl/TODOTestCrawlDbStates.java
@@ -1,3 +1,20 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
 package org.apache.nutch.crawl;
 
 import java.io.IOException;