- [nutch] branch master updated: NUTCH-2775 Fetcher to guarantee minimum delay even if robots.txt defines shorter Crawl-delay - guaranteed minimum delay is configured by `fetcher.min.crawl.delay` (default set equal to `fetcher.server.delay`) - posted by sn...@apache.org on 2020/04/10 11:41:56 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2777 - Upgrade to Hadoop 3.1 - posted by sn...@apache.org on 2020/04/10 11:44:53 UTC, 0 replies.
- [nutch] branch master updated (0cd0022 -> 6f51618) - posted by sn...@apache.org on 2020/04/19 09:32:38 UTC, 0 replies.
- [nutch] branch master updated (6f51618 -> dcbb0f2) - posted by sn...@apache.org on 2020/04/21 09:27:42 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2779 Upgrade to Tika 1.24.1 - posted by sn...@apache.org on 2020/04/24 07:08:35 UTC, 0 replies.
- [nutch] branch master updated (49eb1bd -> 52eec66) - posted by sn...@apache.org on 2020/04/28 07:29:27 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2783 Use (more) parametrized logging - replace logging messages with string concatenations by parametrized calls - remove LOG.isInfoEnabled() where parametrized logging is used and no or minor extra calls are done to get logging parameters (similar for other log levels) - replace needless .toString() and Integer.toString(intVal) - posted by sn...@apache.org on 2020/04/28 07:53:22 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2781 Increase default Java heap size - increase default value for NUTCH_HEAPSIZE to 4096 MB (from 1000 MB) - remove -Dmapred.child.java.opts=-Xmx1000m from default options in bin/crawl - posted by sn...@apache.org on 2020/04/28 08:04:36 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2501 allow to set Java heap size when using crawl script in distributed mode - bin/crawl - add hint how to set map and reduce task memory via -D ... options - use -D options for all steps (Nutch tools), fixes NUTCH-2379 - fix quoting of -D options, e.g. -D plugin.includes='protocol-xyz|parse-xyz' - bin/nutch - document that environment variables are only used in local mode - posted by sn...@apache.org on 2020/04/28 08:40:18 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2501 allow to set Java heap size when using crawl script in distributed mode - fix examples of `-D property=value` in bin/crawl: there must be a blank after `-D` because these arguments are first parsed by bin/crawl - posted by sn...@apache.org on 2020/04/28 08:48:37 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2778 indexer-elastic to properly log errors - add log output in BulkProcessor.Listener - do not throw an exception in BulkProcessor.Listener (ignored anyway) - posted by sn...@apache.org on 2020/04/28 15:39:34 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2772 Debugging parse filter to show serialized DOM tree - posted by sn...@apache.org on 2020/04/30 08:28:24 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2776 Fetcher to temporarily deduplicate followed redirects - cache followed redirect targets for a configurable time (`fetcher.redirect.dedupcache.seconds`) - if a redirect target is found in cache it's skipped - posted by sn...@apache.org on 2020/04/30 08:39:34 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2495: Use -deleteGone instead of clean job in crawl script while indexing - posted by sn...@apache.org on 2020/04/30 09:08:06 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2784 Tool to list Nutch properties and configured values - posted by sn...@apache.org on 2020/04/30 09:14:18 UTC, 0 replies.
- [nutch] branch master updated: NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to documentation - modify ant build.xml to copy nutch-default.xml into docs/api/resources/ - adapt XSLT table layout - remove obsolete nutch-conf.xsl - fix typos and normalize spelling in nutch-default.xml - posted by sn...@apache.org on 2020/04/30 09:15:22 UTC, 0 replies.