Posted to dev@manifoldcf.apache.org by "DK (Jira)" <ji...@apache.org> on 2022/01/22 06:35:00 UTC

[jira] [Commented] (CONNECTORS-1620) Accept Sitemaps with content type application/xml

    [ https://issues.apache.org/jira/browse/CONNECTORS-1620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480352#comment-17480352 ] 

DK commented on CONNECTORS-1620:
--------------------------------

I tested this, and the web crawler connector does not seem to recognize sitemap.xml when it is served with the MIME type text/xml or application/xml. I am on version 2.17. Are there any specifics that need to be considered when configuring the repository connection or the job with the Solr output connector?

Thanks

> Accept Sitemaps with content type application/xml
> -------------------------------------------------
>
>                 Key: CONNECTORS-1620
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1620
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>            Reporter: Markus Schuch
>            Assignee: Markus Schuch
>            Priority: Major
>             Fix For: ManifoldCF 2.14
>
>
> Given an Output Connection that does not accept the MIME type {{application/xml}} for ingestion, it is currently not possible to crawl a sitemap.xml when the web server returns {{application/xml}} as the content type for the sitemap.
> The sitemap is discarded before the links are extracted, because the MIME type {{application/xml}} is not listed in the {{interestingMimeTypeArray}}.
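
The check described in the issue can be pictured with a small Java sketch. This is not the ManifoldCF source: the class name SitemapMimeCheck, the helper methods, and the contents of the array are assumptions made for illustration, and only the idea of an "interesting MIME type" allowlist comes from the issue text. It shows why a response served as application/xml would be dropped before link extraction unless that type is on the list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is not ManifoldCF source code.
class SitemapMimeCheck {
    // Assumed allowlist. The real interestingMimeTypeArray may contain other entries.
    static final String[] INTERESTING_MIME_TYPES = {
        "text/html", "application/xhtml+xml", "text/xml", "application/xml"
    };

    private static final Set<String> INTERESTING =
        new HashSet<>(Arrays.asList(INTERESTING_MIME_TYPES));

    // Strip parameters such as "; charset=UTF-8" and lower-case the media type.
    static String normalize(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // A document whose type is not on the list would be discarded
    // before any links are extracted from it.
    static boolean isInteresting(String contentType) {
        return INTERESTING.contains(normalize(contentType));
    }

    public static void main(String[] args) {
        System.out.println(isInteresting("application/xml; charset=UTF-8"));
        System.out.println(isInteresting("application/pdf"));
    }
}
```

If the web server adds parameters to the Content-Type header, comparing only the normalized media type, as in this sketch, avoids a false mismatch. Whether the connector itself normalizes the header this way is not stated in the issue.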


