Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2010/03/04 11:44:27 UTC

[jira] Created: (CONNECTORS-16) JCIFS connector's document fingerprinting feature is not general enough

JCIFS connector's document fingerprinting feature is not general enough
-----------------------------------------------------------------------

                 Key: CONNECTORS-16
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-16
             Project: Lucene Connector Framework
          Issue Type: Improvement
          Components: Framework agents process, Framework crawler agent, GTS connector, JCIFS connector, LiveLink connector, Lucene/SOLR connector, Meridio connector, RSS connector, SharePoint connector, Web connector
            Reporter: Karl Wright
            Priority: Minor


The JCIFS connector has a feature, called "fingerprinting", which allows it to classify documents according to the ability of the back-end to index their content.  At the moment, this fingerprinter recognizes PDFs, Microsoft Office files, and text files as indexable.  One could imagine, though, that different SOLR plugins, etc. might be able to handle more than that.  Also, other connectors could potentially benefit from similar technology, specifically any connector that deals with binary documents.

One approach to solving this problem would be to remove the feature entirely, and let whatever pipeline exists in SOLR determine indexability after the fact.  The reason this feature was added at MetaCarta, however, is that it may be possible to exclude a useless document without having to fetch the whole thing, and (at least for MetaCarta clients) the number of very large unindexable files was a big concern.

Another approach might be to tie the functionality in with the output connector interface, so that an output connector would (somehow) determine applicability of a document.  This would require some care to make it possible to fingerprint without having to download the entire document, but would otherwise have the correct overall structure.
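
As a rough illustration of fingerprinting "without having to download the entire document", here is a minimal Java sketch that pulls only a bounded prefix from a document stream. The class name PrefixReader, the method readPrefix, and the 512-byte limit are assumptions made for this example; none of them come from the connector code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper, not part of the framework. */
final class PrefixReader {

    /** Reads at most maxBytes from the stream and never consumes more than that. */
    static byte[] readPrefix(InputStream in, int maxBytes) throws IOException {
        byte[] buf = new byte[maxBytes];
        int total = 0;
        // A single read() may return fewer bytes than requested, so loop until
        // the buffer is full or the stream ends.
        while (total < maxBytes) {
            int n = in.read(buf, total, maxBytes - total);
            if (n < 0) {
                break; // end of stream: the document is shorter than maxBytes
            }
            total += n;
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a large remote file; a real connector would open a network stream.
        InputStream doc = new ByteArrayInputStream(new byte[1_000_000]);
        byte[] prefix = readPrefix(doc, 512);
        System.out.println(prefix.length + " bytes read, " + doc.available() + " left unread");
    }
}
```

Whether this actually saves bandwidth depends on the repository protocol letting the caller abandon the stream after the prefix has been read.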



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (CONNECTORS-16) JCIFS connector's document fingerprinting feature is not general enough

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-16:
-------------------------------------

    Assignee: Karl Wright


[jira] Resolved: (CONNECTORS-16) JCIFS connector's document fingerprinting feature is not general enough

Posted by "Karl Wright (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CONNECTORS-16?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-16.
-----------------------------------

    Resolution: Fixed

I opted to move the fingerprinting technology into the output connector logic, so an output connector gets to determine what fingerprinting actually means.  For GTS, the fingerprinting logic is thus preserved, while for SOLR, all documents are passed in and Tika can determine what to do at that point.
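
The shape of that delegation can be sketched roughly as follows. This is an illustration only: the interface IndexabilityCheck, the class OutputPolicies, and both policies are hypothetical names, not the framework's actual output connector API, and the PDF-only rule merely stands in for whatever a selective back-end would really accept.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical hook an output connector might expose; not the framework's real API. */
interface IndexabilityCheck {
    /** Decides from the first bytes of a document whether it is worth fetching in full. */
    boolean isIndexable(byte[] prefix);
}

final class OutputPolicies {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Pass-everything policy, in the spirit of the SOLR case: defer to the downstream pipeline. */
    static final IndexabilityCheck ACCEPT_ALL = prefix -> true;

    /** Selective policy standing in for a pickier back-end; PDF-only is purely illustrative. */
    static final IndexabilityCheck PDF_ONLY = prefix -> {
        if (prefix.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (prefix[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    };

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] other = {'M', 'Z', 0, 0};
        // The repository connector asks whichever policy the output connector supplies.
        System.out.println("accept-all, pdf:   " + ACCEPT_ALL.isIndexable(pdf));
        System.out.println("accept-all, other: " + ACCEPT_ALL.isIndexable(other));
        System.out.println("pdf-only, pdf:     " + PDF_ONLY.isIndexable(pdf));
        System.out.println("pdf-only, other:   " + PDF_ONLY.isIndexable(other));
    }
}
```

The point of the design is that the repository connector no longer needs to know which formats are indexable; it only asks the policy supplied by the output side.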



[jira] Commented: (CONNECTORS-16) JCIFS connector's document fingerprinting feature is not general enough

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CONNECTORS-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841208#action_12841208 ] 

Sami Siren commented on CONNECTORS-16:
--------------------------------------

Apache Tika also has some built-in capabilities to detect the file type from a file's content. The bytes required for detection are usually at the beginning of the file, so only a small portion of the file would be needed to figure out the type.

From TIKA-285:

* The media type registry in Tika was synchronized with the MIME type 
   configuration in the Apache HTTP Server. Tika now knows about 1274 
   different media types and can detect 672 of those using 927 file 
   extension and 280 magic byte patterns.
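
To make the magic-byte idea concrete, here is a small plain-Java sketch of prefix-based type detection. It does not use Tika and is not how Tika is implemented; the class MagicSniffer, its three-entry signature table, and the returned labels are assumptions for illustration. The signatures themselves are the well-known leading bytes of PDF files, ZIP archives, and OLE2 compound files (the container used by legacy Microsoft Office formats).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical detector; a stand-in for a real library such as Tika. */
final class MagicSniffer {
    // Label -> leading bytes. Insertion order is the matching order.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("PDF", new byte[] {'%', 'P', 'D', 'F', '-'});
        SIGNATURES.put("ZIP container", new byte[] {'P', 'K', 3, 4});
        SIGNATURES.put("OLE2 container", new byte[] {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1});
    }

    /** Returns the label of the first signature matching the start of prefix, if any. */
    static Optional<String> detect(byte[] prefix) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(prefix, entry.getValue())) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(detect(sample).orElse("unknown"));
    }
}
```

A container match is only a first cut: telling a newer Office document from an arbitrary ZIP archive, for example, requires looking further into the file.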
