Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2018/10/17 14:38:00 UTC

[jira] [Updated] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector

     [ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1547:
------------------------------------
    Fix Version/s: ManifoldCF 2.12

> No activity record for excluded documents in WebCrawlerConnector
> ----------------------------------------------------------------
>
>                 Key: CONNECTORS-1547
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.12
>
>         Attachments: manifoldcf_local_files.log, manifoldcf_web.log, simple_history_files.jpg, simple_history_web.jpg
>
>
> Hi,
> I noticed that no activity record is logged for documents excluded by the Document Filter transformation connector when the WebCrawler connector is used.
> To reproduce the issue on an out-of-the-box MCF installation:
> Null output connector 
> Web repository connector 
> Job :
> - DocumentFilter added which only accepts application/msword (doc/docx) documents
> The simple history does not mention the excluded documents (except for HTML documents). They have a fetch activity and nothing else (see simple_history_web.jpg).
> The excluded documents can only be seen in the MCF log (with DEBUG verbosity enabled for connectors):
> {code:java}
> Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png'){code}
> (see manifoldcf_local_files.log)
> The related code is in WebcrawlerConnector.java, line 904:
> {code:java}
> fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
> fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
> activityResultCode = null;{code}
> The activityResultCode is null, so no activity record is logged for the exclusion.
>  
>  
> If we configure the same job with a Local File system connector and the same Document Filter transformation connector, the simple history lists all the excluded documents (see simple_history_files.jpg), and the code sets a specific error code with an activity record logged (class FileConnector, line 415):
> {code:java}
> if (!activities.checkMimeTypeIndexable(mimeType))
> {
>   errorCode = activities.EXCLUDED_MIMETYPE;
>   errorDesc = "Excluded because mime type ('"+mimeType+"')";
>   Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
>   activities.noDocument(documentIdentifier,versionString);
>   continue;
> }{code}
>  
> So I think the Web Crawler connector should behave the same way as FileConnector and explicitly mention all the documents excluded by the user.
>  
> Best regards,
> Olivier
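
The behaviour requested above can be sketched in isolation. This is a minimal, standalone illustration and not ManifoldCF code: the class ExclusionSketch, the ActivityRecord type, the record and checkContentType helpers, the example.com URLs, and the value of the EXCLUDED_MIMETYPE string are all hypothetical names chosen for the example. It also assumes that a null result code means no history entry is written, which matches the symptom described in the report but has not been checked against the connector source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone sketch: none of these types are ManifoldCF classes.
final class ExclusionSketch {

  // Hypothetical result code; the real constant and its value may differ.
  static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

  // Hypothetical stand-in for one line of the simple history.
  static final class ActivityRecord {
    final String url;
    final String resultCode;
    final String description;

    ActivityRecord(String url, String resultCode, String description) {
      this.url = url;
      this.resultCode = resultCode;
      this.description = description;
    }
  }

  // Assumption: a null result code means "record nothing", which matches
  // the symptom described in the report.
  static void record(List<ActivityRecord> history, String url,
                     String resultCode, String description) {
    if (resultCode != null) {
      history.add(new ActivityRecord(url, resultCode, description));
    }
  }

  // Returns true if the document is kept, false if it is excluded.
  // An excluded document always gets a non-null result code, so it
  // always produces a history entry.
  static boolean checkContentType(List<ActivityRecord> history, String url,
                                  String contentType, Set<String> allowedTypes) {
    if (allowedTypes.contains(contentType)) {
      return true;
    }
    record(history, url, EXCLUDED_MIMETYPE,
        "Excluded because mime type ('" + contentType + "')");
    return false;
  }

  public static void main(String[] args) {
    Set<String> allowed = new HashSet<>(Arrays.asList("application/msword"));
    List<ActivityRecord> history = new ArrayList<>();

    checkContentType(history, "https://example.com/logo.png", "image/png", allowed);
    checkContentType(history, "https://example.com/report.doc", "application/msword", allowed);

    // Only the excluded PNG produces an exclusion entry.
    for (ActivityRecord r : history) {
      System.out.println(r.url + " " + r.resultCode + " " + r.description);
    }
  }
}
```

The point of the sketch is the non-null result code on the exclusion path: with a null code the record helper writes nothing, which is the situation the report describes for the web connector.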



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)