Posted to commits@nutch.apache.org by sn...@apache.org on 2020/06/18 09:35:28 UTC
[nutch] 02/02: Nutch 1.16 release - update current year in API docs etc. - update version number - add changes / release notes
This is an automated email from the ASF dual-hosted git repository.
snagel pushed a commit to branch branch-1.17
in repository https://gitbox.apache.org/repos/asf/nutch.git
commit 77fa56e34ccd4ecf35f14111a4a3a0e2912e7f29
Author: Sebastian Nagel <sn...@apache.org>
AuthorDate: Wed Jun 17 23:10:35 2020 +0200
Nutch 1.16 release
- update current year in API docs etc.
- update version number
- add changes / release notes
---
CHANGES.txt | 82 +++++++++++++++++++++++++++++++++++++++++++++++++-
NOTICE.txt | 2 +-
conf/nutch-default.xml | 2 +-
default.properties | 4 +--
src/bin/nutch | 2 +-
5 files changed, 86 insertions(+), 6 deletions(-)
diff --git a/CHANGES.txt b/CHANGES.txt
index 3f26a8d..dcdc6e2 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,6 +1,86 @@
# Nutch Change Log
-Nutch 1.17 Development
+Nutch 1.17 Release 18/06/2020 (dd/mm/yyyy)
+Release Report: https://s.apache.org/ovhry
+
+Bug
+
+ [NUTCH-1559] - parse-metatags duplicates extracted metatags
+ [NUTCH-2379] - crawl script dedup's crawldb update is slow
+ [NUTCH-2419] - Some URL filters and normalizers do not respect command-line override for rule file
+ [NUTCH-2507] - NutchTutorial wiki pages as a lot of outdated command line calls when it starts with the solr interaction
+ [NUTCH-2511] - SitemapProcessor limited by http.content.limit
+ [NUTCH-2525] - Metadata indexer cannot handle uppercase parse metadata
+ [NUTCH-2567] - parse-metatags writes all meta tags twice
+ [NUTCH-2720] - ROBOTS metatag ignored when capitalized
+ [NUTCH-2745] - Solr schema.xml not shipped in binary release
+ [NUTCH-2748] - Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb
+ [NUTCH-2751] - nutch clean does not work with secured solr cloud
+ [NUTCH-2753] - Add -listen option to command-line help of CrawlDbReader and LinkDbReader
+ [NUTCH-2754] - fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
+ [NUTCH-2760] - protocol-okhttp: properly record HTTP version in request message header
+ [NUTCH-2761] - ivy jar fails to download
+ [NUTCH-2763] - protocol-okhttp (store.http.headers): add whitespace in status line after status code also when message is empty
+ [NUTCH-2770] - Subcollection logic allows empty string as a whitelist value, thus matching every incoming document.
+ [NUTCH-2778] - indexer-elastic to properly log errors
+ [NUTCH-2787] - CrawlDb JSON dump does not export metadata primitive data types correctly
+ [NUTCH-2789] - Documentation: update links to point to cwiki
+ [NUTCH-2790] - CSVIndexWriter does not escape leading quotes properly
+ [NUTCH-2791] - domainstats, protocolstats and crawlcomplete do not handle GCS URLs
+
+New Feature
+
+ [NUTCH-1863] - Add JSON format dump output to readdb command
+
+Improvement
+
+ [NUTCH-1194] - Generator: CrawlDB lock should be released earlier
+ [NUTCH-2002] - ParserChecker and IndexingFiltersChecker to check robots.txt
+ [NUTCH-2184] - Enable IndexingJob to function with no crawldb
+ [NUTCH-2495] - Use -deleteGone instead of clean job in crawler script while indexing
+ [NUTCH-2496] - Speed up link inversion step in crawling script
+ [NUTCH-2501] - allow to set Java heap size when using crawl script in distributed mode
+ [NUTCH-2649] - Optionally skip TLS/SSL certificate validation for protocol-selenium and protocol-htmlunit
+ [NUTCH-2733] - protocol-okhttp: add support for Brotli compression (Content-Encoding)
+ [NUTCH-2739] - indexer-elastic: Upgrade ES and migrate to REST client
+ [NUTCH-2743] - Add list of Nutch properties (nutch-default.xml) to documentation
+ [NUTCH-2746] - Basic URL normalizer to normalize Unicode domain names
+ [NUTCH-2747] - Replace remaining o.a.commons.logging by org.slf4j
+ [NUTCH-2750] - Improve CrawlDbReader & LinkDbReader reader handling
+ [NUTCH-2752] - indexer-solr: Upgrade to latest Solr version
+ [NUTCH-2755] - Remove obsolete plugin indexer-elastic-rest
+ [NUTCH-2757] - indexer-elastic: add authentication options
+ [NUTCH-2758] - Add plugin READMEs to binary release packages
+ [NUTCH-2759] - bin/crawl: Rename option --num-slaves
+ [NUTCH-2762] - Replace http:// URLs by https:// (build files and documentation)
+ [NUTCH-2767] - Fetcher to stop filling queues skipped due to repeated exceptions
+ [NUTCH-2768] - FetcherThread: unnecessary usage of class casts
+ [NUTCH-2772] - Debugging parse filter to show serialized DOM tree
+ [NUTCH-2773] - SegmentReader (-dump or -get): show HTML content as UTF-8
+ [NUTCH-2774] - Annotate methods implementing the Hadoop API by @Override
+ [NUTCH-2775] - Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay
+ [NUTCH-2776] - Fetcher to temporarily deduplicate followed redirects
+ [NUTCH-2777] - Upgrade to Hadoop 3.1
+ [NUTCH-2779] - Upgrade to Tika 1.24.1
+ [NUTCH-2780] - Upgrade index-solr to use Solr 8.5.1
+ [NUTCH-2781] - Increase default Java heap size
+ [NUTCH-2783] - Use (more) parametrized logging
+ [NUTCH-2784] - Add tool to list Nutch and Hadoop properties
+ [NUTCH-2785] - FreeGenerator: command-line option to define number of generated fetch lists
+ [NUTCH-2788] - ParseData: improve presentation of Metadata in method toString()
+ [NUTCH-2794] - Add additional ciphers to HTTP base's default cipher suite
+
+Test
+
+ [NUTCH-1945] - Test for XLSX parser
+
+Task
+
+ [NUTCH-2434] - Add methods to reset parameters HTMLMetaTags
+
+Sub-task
+
+ [NUTCH-2735] - Update the indexer-solr documentation about the schema.xml usage
Nutch 1.16 Release 02/10/2019 (dd/mm/yyyy)
diff --git a/NOTICE.txt b/NOTICE.txt
index 5b46045..71f29fa 100644
--- a/NOTICE.txt
+++ b/NOTICE.txt
@@ -1,5 +1,5 @@
Apache Nutch
-Copyright 2019 The Apache Software Foundation
+Copyright 2020 The Apache Software Foundation
This product includes software developed by The Apache Software
Foundation (http://www.apache.org/).
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index 23af74b..b7c9570 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -164,7 +164,7 @@
<property>
<name>http.agent.version</name>
- <value>Nutch-1.17-SNAPSHOT</value>
+ <value>Nutch-1.17</value>
<description>A version string to advertise in the User-Agent
header.</description>
</property>
diff --git a/default.properties b/default.properties
index 4181800..960f788 100644
--- a/default.properties
+++ b/default.properties
@@ -14,9 +14,9 @@
# limitations under the License.
name=apache-nutch
-version=1.17-SNAPSHOT
+version=1.17
final.name=${name}-${version}
-year=2019
+year=2020
basedir = ./
src.dir = ./src/java
diff --git a/src/bin/nutch b/src/bin/nutch
index 244d812..57bf970 100755
--- a/src/bin/nutch
+++ b/src/bin/nutch
@@ -60,7 +60,7 @@ done
# if no args specified, show usage
if [ $# = 0 ]; then
- echo "nutch 1.17-SNAPSHOT"
+ echo "nutch 1.17"
echo "Usage: nutch COMMAND [-Dproperty=value]... [command-specific args]..."
echo "where COMMAND is one of:"
echo " readdb read / dump crawl db"