Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2018/10/17 14:38:00 UTC
[jira] [Updated] (CONNECTORS-1547) No activity record for
excluded documents in WebCrawlerConnector
[ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright updated CONNECTORS-1547:
------------------------------------
Fix Version/s: ManifoldCF 2.12
> No activity record for excluded documents in WebCrawlerConnector
> ----------------------------------------------------------------
>
> Key: CONNECTORS-1547
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.12
>
> Attachments: manifoldcf_local_files.log, manifoldcf_web.log, simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that there is no activity record logged for documents excluded by the Document Filter transformation connector in the WebCrawler connector.
> To reproduce the issue on MCF out of the box :
> Null output connector
> Web repository connector
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) documents
> The simple history does not mention the excluded documents (except for HTML documents). They only have a fetch activity (see simple_history_web.jpg).
> The excluded documents are only visible in the MCF log (with DEBUG verbosity enabled for connectors):
> {code:java}
> Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java l.904 :
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
> fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
> activityResultCode = null;{code}
> The activityResultCode is null.
>
>
> If we configure the same job with a Local File system connector and the same Document Filter transformation connector, the simple history lists all the excluded documents (see simple_history_files.jpg), and the code sets a specific error code with an activity record logged (class FileConnector l. 415) :
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
> {
>   errorCode = activities.EXCLUDED_MIMETYPE;
>   errorDesc = "Excluded because mime type ('"+mimeType+"')";
>   Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
>   activities.noDocument(documentIdentifier,versionString);
>   continue;
> }{code}
>
> So I think the Web Crawler connector should behave the same way as FileConnector and explicitly mention all the documents excluded by the user.
>
> Best regards,
> Olivier
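
A minimal sketch of the behaviour requested above, in Java. It is not the actual WebcrawlerConnector code or the committed fix: the class ExcludedMimeTypeSketch, the FetchStatus holder, the method checkContentType, and all constant values are hypothetical stand-ins assumed for illustration. It only shows the idea of setting a non-null activity result code when a content type is rejected, mirroring the FileConnector snippet quoted in the report.

```java
import java.util.Set;

public class ExcludedMimeTypeSketch {
    // Hypothetical stand-in for activities.EXCLUDED_MIMETYPE; the real value is not shown in the report.
    static final String EXCLUDED_MIMETYPE = "EXCLUDED_MIMETYPE_PLACEHOLDER";
    // Hypothetical signal values; only the names RESULT_NO_DOCUMENT appears in the report.
    static final int RESULT_NO_DOCUMENT = 0;
    static final int RESULT_DOCUMENT = 1;

    // Simplified holder modelled on the fetchStatus fields quoted in the report.
    static final class FetchStatus {
        String contextMessage;
        int resultSignal;
        String activityResultCode;
    }

    // Decide what to record for a fetched URL, given the content types the job accepts.
    static FetchStatus checkContentType(String contentType, Set<String> acceptedTypes) {
        FetchStatus status = new FetchStatus();
        if (!acceptedTypes.contains(contentType)) {
            status.contextMessage = "it had the wrong content type ('" + contentType + "')";
            status.resultSignal = RESULT_NO_DOCUMENT;
            // The requested change: a specific code instead of null, so an activity record can be logged.
            status.activityResultCode = EXCLUDED_MIMETYPE;
        } else {
            status.resultSignal = RESULT_DOCUMENT;
            status.activityResultCode = null; // no exclusion to report
        }
        return status;
    }

    public static void main(String[] args) {
        Set<String> accepted = Set.of("application/msword");
        FetchStatus png = checkContentType("image/png", accepted);
        System.out.println("image/png -> code=" + png.activityResultCode + ", message=" + png.contextMessage);
        FetchStatus doc = checkContentType("application/msword", accepted);
        System.out.println("application/msword -> code=" + doc.activityResultCode);
    }
}
```

With a non-null code in place, the exclusion would no longer depend on DEBUG logging to be visible; how the real connector turns that code into a simple history entry is outside what this sketch covers.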