Posted to dev@nutch.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/01/02 22:55:00 UTC
[jira] [Commented] (NUTCH-2490) Sitemap processing: Sitemap index files not working
[ https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308851#comment-16308851 ]
ASF GitHub Bot commented on NUTCH-2490:
---------------------------------------
mfeltscher opened a new pull request #269: fix for NUTCH-2490 Fix sitemap index file processing
URL: https://github.com/apache/nutch/pull/269
This fixes processing of sitemap index files by removing an unnecessary conditional.
Before:
```bash
$ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
SitemapProcessor: sitemap urls dir: sitemaps.txt
SitemapProcessor: Starting at 2018-01-02 22:44:58
robots.txt whitelist not configured.
SitemapProcessor: Total records rejected by filters: 0
SitemapProcessor: Total sitemaps from HostDb: 0
SitemapProcessor: Total sitemaps from seed urls: 1
SitemapProcessor: Total failed sitemap fetches: 0
SitemapProcessor: Total new sitemap entries added: 0
SitemapProcessor: Finished at 2018-01-02 22:45:02, elapsed: 00:00:03
```
After:
```bash
$ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
SitemapProcessor: sitemap urls dir: sitemaps.txt
SitemapProcessor: Starting at 2018-01-02 22:47:44
robots.txt whitelist not configured.
Parsing sitemap index file: https://filialen.migros.ch/sitemap.xml
Parsing sitemap file: https://filialen.migros.ch/de/sitemap.xml
Parsing sitemap file: https://filialen.migros.ch/fr/sitemap.xml
Parsing sitemap file: https://filialen.migros.ch/it/sitemap.xml
SitemapProcessor: Total records rejected by filters: 0
SitemapProcessor: Total sitemaps from HostDb: 0
SitemapProcessor: Total sitemaps from seed urls: 1
SitemapProcessor: Total failed sitemap fetches: 0
SitemapProcessor: Total new sitemap entries added: 5754
SitemapProcessor: Finished at 2018-01-02 22:47:58, elapsed: 00:00:13
```
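For readers unfamiliar with sitemap index files, the difference between the two runs above comes down to whether the child sitemaps listed in the index are followed. The following is a minimal, self-contained sketch of that idea and not the Nutch `SitemapProcessor` code; the class name `SitemapIndexSketch`, the `collectUrls` helper, the `MAX_DEPTH` guard, and the `example.com` documents are all assumptions made for illustration. It relies only on the sitemap protocol's convention that an index has a `<sitemapindex>` root and a regular sitemap has a `<urlset>` root.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: this is not the Nutch SitemapProcessor implementation.
public class SitemapIndexSketch {

    // Hypothetical guard against index files that reference each other.
    static final int MAX_DEPTH = 3;

    // Hypothetical in-memory "web": sitemap URL -> XML body.
    static final Map<String, String> SAMPLE = Map.of(
        "https://example.com/sitemap.xml",
        "<sitemapindex>"
            + "<sitemap><loc>https://example.com/de/sitemap.xml</loc></sitemap>"
            + "<sitemap><loc>https://example.com/fr/sitemap.xml</loc></sitemap>"
            + "</sitemapindex>",
        "https://example.com/de/sitemap.xml",
        "<urlset><url><loc>https://example.com/de/a</loc></url>"
            + "<url><loc>https://example.com/de/b</loc></url></urlset>",
        "https://example.com/fr/sitemap.xml",
        "<urlset><url><loc>https://example.com/fr/a</loc></url></urlset>");

    /** Collects page URLs from a sitemap, descending into sitemap index files. */
    static List<String> collectUrls(String sitemapUrl,
                                    Function<String, String> fetch,
                                    int depth) throws Exception {
        List<String> pages = new ArrayList<>();
        if (depth > MAX_DEPTH) {
            return pages;
        }
        String xml = fetch.apply(sitemapUrl);
        if (xml == null) {
            return pages; // treated as a failed fetch in this sketch
        }

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations, since sitemaps come from untrusted hosts.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        Element root = doc.getDocumentElement();
        boolean isIndex = "sitemapindex".equals(root.getNodeName());

        NodeList locs = root.getElementsByTagName("loc");
        for (int i = 0; i < locs.getLength(); i++) {
            String loc = locs.item(i).getTextContent().trim();
            if (isIndex) {
                // An index lists further sitemaps: follow each one.
                pages.addAll(collectUrls(loc, fetch, depth + 1));
            } else {
                // A regular sitemap lists page URLs directly.
                pages.add(loc);
            }
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = collectUrls("https://example.com/sitemap.xml", SAMPLE::get, 0);
        System.out.println(pages);
    }
}
```

If the `isIndex` branch is skipped, the index itself contributes no page URLs, which matches the "Total new sitemap entries added: 0" symptom in the first run. Whether that is exactly what the removed conditional did is not stated here, so treat the sketch as an analogy only.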
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
> Sitemap processing: Sitemap index files not working
> ---------------------------------------------------
>
> Key: NUTCH-2490
> URL: https://issues.apache.org/jira/browse/NUTCH-2490
> Project: Nutch
> Issue Type: Bug
> Reporter: Moreno Feltscher
> Assignee: Moreno Feltscher
>
> The [sitemap processing feature](https://wiki.apache.org/nutch/SitemapFeature) does not properly handle sitemap index files due to an unnecessary conditional.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)