Posted to dev@lucene.apache.org by "Jan Høydahl (Commented JIRA)" <ji...@apache.org> on 2011/12/28 01:14:30 UTC

[jira] [Commented] (SOLR-2416) Solr Cell fails to index Zip file contents

    [ https://issues.apache.org/jira/browse/SOLR-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176371#comment-13176371 ] 

Jan Høydahl commented on SOLR-2416:
-----------------------------------

If we add this, the behavior should probably be parameter-driven. Some questions arise:
a) What to do with metadata? Should metadata for all files in the ZIP be added to the document? What's Tika's default?
b) How do you present the title of such a document, consisting of multiple docs from a ZIP? Each individual document has its own title metadata...
c) Do you always want to traverse all files in the ZIP or only some types?
d) What do you do when a ZIP contains another ZIP?
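
To make questions c) and d) concrete, here is a minimal sketch of what a parameter-driven traversal might look like. It uses only java.util.zip and is not Solr Cell or Tika code; the class ZipWalkSketch, the methods walk and zipOf, and the extensions and maxDepth parameters are all hypothetical names chosen for illustration. It only lists which entries would be selected for extraction and leaves the metadata and title questions in a) and b) open.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: none of these names exist in Solr Cell or Tika.
class ZipWalkSketch {

    /**
     * Returns the paths of entries that would be handed to a content extractor.
     * extensions: lower-case file types to include; an empty set means all (question c).
     * maxDepth:   how many levels of nested ZIPs to open; 0 skips them (question d).
     */
    public static List<String> walk(InputStream in, String prefix, int depth,
                                    Set<String> extensions, int maxDepth) {
        List<String> selected = new ArrayList<>();
        try {
            // Deliberately not closed here: closing would also close the caller's stream.
            ZipInputStream zip = new ZipInputStream(in);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String path = prefix + entry.getName();
                String ext = extensionOf(entry.getName());
                if (ext.equals("zip")) {
                    if (depth < maxDepth) {
                        // Buffers the nested archive in memory; fine for a sketch,
                        // but a real implementation would need a size limit.
                        byte[] nested = zip.readAllBytes();
                        selected.addAll(walk(new ByteArrayInputStream(nested),
                                path + "!/", depth + 1, extensions, maxDepth));
                    }
                } else if (extensions.isEmpty() || extensions.contains(ext)) {
                    selected.add(path);
                }
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return selected;
    }

    /** Lower-cased text after the last dot, or "" if the name has no dot. */
    static String extensionOf(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Demo helper: builds an in-memory ZIP from entry names and contents. */
    public static byte[] zipOf(Map<String, byte[]> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> item : entries.entrySet()) {
                zip.putNextEntry(new ZipEntry(item.getKey()));
                zip.write(item.getValue());
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }
}
```

With an empty extensions set and maxDepth of 1, an archive holding a.txt and inner.zip (which itself holds deep.txt) would yield a.txt and inner.zip!/deep.txt; with maxDepth of 0 the nested archive is skipped entirely. Whether skipped archives should still be indexed by file name is one more choice a parameter would have to cover.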

All in all, perhaps this isn't such a useful feature?
                
> Solr Cell fails to index Zip file contents
> ------------------------------------------
>
>                 Key: SOLR-2416
>                 URL: https://issues.apache.org/jira/browse/SOLR-2416
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler, contrib - Solr Cell (Tika extraction)
>    Affects Versions: 1.4.1
>            Reporter: Jayendra Patil
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2416_ExtractingDocumentLoader.patch
>
>
> Working with the latest Solr trunk code, it seems the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler (TikaEntityProcessor.java) again fail to index zip file contents.
> Only the file names are indexed.
> This issue was addressed some time back, late last year, but seems to have reappeared with the latest code.
> Jira for the Data Import handler part with the patch and the testcase - https://issues.apache.org/jira/browse/SOLR-2332.
