Posted to dev@manifoldcf.apache.org by "Markus Schuch (JIRA)" <ji...@apache.org> on 2019/01/14 14:25:00 UTC

[jira] [Created] (CONNECTORS-1571) Web Crawler Connector checks different MIME type than it is sending down the pipeline

Markus Schuch created CONNECTORS-1571:
-----------------------------------------

             Summary: Web Crawler Connector checks different MIME type than it is sending down the pipeline
                 Key: CONNECTORS-1571
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1571
             Project: ManifoldCF
          Issue Type: Bug
          Components: Web connector
            Reporter: Markus Schuch


The Web Crawler Connector extracts the MIME type from the response Content-Type header.
It then strips any {{charset=whatever_encoding}} parameter and asks the pipeline, via {{activities.checkMimeTypeIndexable(contentType);}}, whether the resulting MIME type (without the charset) should be ingested.

When it then sends the actual {{RepositoryDocument}}, it sets the full MIME type (with the charset) on the document. This is not a major bug, but it is a small inconsistency: the HttpPoster of the Solr Output Connector performs a second, "hard" check of the MIME type, and that check can have a different outcome than the preceding check activity.
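
To illustrate the mismatch, here is a minimal, self-contained Java sketch. It is not ManifoldCF code: the class {{MimeTypeCheckSketch}}, the methods {{stripParameters}} and {{hardCheck}}, and the {{INDEXABLE}} set are all hypothetical names, and the "hard" check is modeled as an exact match against a fixed set, which is only an assumption about how such a check might behave.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MimeTypeCheckSketch {
    // Hypothetical list of MIME types the output side accepts (assumption).
    static final Set<String> INDEXABLE =
        new HashSet<>(Arrays.asList("text/html", "text/plain"));

    // Drops everything after the first ';' (for example "charset=UTF-8")
    // and normalizes the remaining type to lower case.
    static String stripParameters(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return null;
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String bare = semicolon >= 0
            ? contentTypeHeader.substring(0, semicolon)
            : contentTypeHeader;
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    // Stand-in for a strict check: exact match against the set (assumption).
    static boolean hardCheck(String mimeType) {
        return mimeType != null && INDEXABLE.contains(mimeType);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=UTF-8";

        // The value that is checked before ingestion: charset stripped.
        String checked = stripParameters(header);
        System.out.println(hardCheck(checked)); // prints "true"

        // The value that is set on the document: full header, charset included.
        System.out.println(hardCheck(header));  // prints "false"
    }
}
```

Under these assumptions the same document passes the first check and fails the second, only because the two checks see different strings. Passing the stripped value to both places would make the outcomes agree.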

I think this was introduced or (better) revealed with CONNECTORS-1482.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)