Posted to dev@manifoldcf.apache.org by "Markus Schuch (Jira)" <ji...@apache.org> on 2019/08/23 20:16:00 UTC
[jira] [Comment Edited] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline
[ https://issues.apache.org/jira/browse/CONNECTORS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914580#comment-16914580 ]
Markus Schuch edited comment on CONNECTORS-1571 at 8/23/19 8:15 PM:
--------------------------------------------------------------------
This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the MIME type with parameters, so this is no longer a problem.
was (Author: schuchm):
This was fixed with CONNECTORS-1621. The Solr Connector no longer checks the mime type with parameters.
> Web Crawler Connector checks different MIME type than it is sending down the pipeline
> -------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1571
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 2.10
> Reporter: Markus Schuch
> Priority: Minor
>
> The Web Crawler Connector extracts the MIME type from the response Content-Type header.
> It then strips any {{charset=whatever_encoding}} parameter and asks the pipeline, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the resulting MIME type (without the charset) should be ingested.
> When it sends the actual {{RepositoryDocument}}, however, it sets the full MIME type (with the charset) on the document. This is not a major bug, but it is an inconsistency: the HttpPoster of the Solr Output Connector performs a second, "hard" check of the MIME type, which can have a different outcome than the preceding activity check.
> I think this was introduced or (better) revealed with CONNECTORS-1482.
> Example:
> - In my scenario a crawled webpage has Content-Type {{text/html; charset=utf-8}}
> - the {{activities.checkMimeTypeIndexable(contentType);}} is called with {{text/html}}
> - the hard check performed by the Solr Connector is called with {{text/html; charset=utf-8}}
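
The mismatch in the example above can be reproduced in isolation. The following is only a minimal sketch, not ManifoldCF code: the class {{MimeTypeCheckSketch}}, the helpers {{stripParameters}} and {{isAllowed}}, and the allowed-type set are all hypothetical, and the exact-match lookup is an assumption standing in for whatever the real checks do.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: none of these names come from ManifoldCF.
public class MimeTypeCheckSketch {

    // Hypothetical helper: drop parameters such as "; charset=utf-8"
    // and normalize the remaining type/subtype to lower case.
    static String stripParameters(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for an exact-match MIME type check
    // (an assumption about how a "hard" check might behave).
    static boolean isAllowed(Set<String> allowed, String mimeType) {
        return mimeType != null && allowed.contains(mimeType.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("text/html", "text/plain"));
        String header = "text/html; charset=utf-8";

        // First check: performed on the stripped value.
        String stripped = stripParameters(header);
        System.out.println(stripped + " -> " + isAllowed(allowed, stripped));

        // Second check: performed on the full header value, parameters included.
        System.out.println(header + " -> " + isAllowed(allowed, header));
    }
}
```

Under these assumptions the first check accepts the document and the second rejects it. Passing the same stripped value to both checks, or stripping parameters inside the second check, removes the discrepancy.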