Posted to dev@lucene.apache.org by "Jayendra Patil (JIRA)" <ji...@apache.org> on 2011/01/24 02:32:43 UTC

[jira] Created: (SOLR-2332) TikaEntityProcessor retrieves only File Names from Zip extraction

TikaEntityProcessor retrieves only File Names from Zip extraction
-----------------------------------------------------------------

                 Key: SOLR-2332
                 URL: https://issues.apache.org/jira/browse/SOLR-2332
             Project: Solr
          Issue Type: Bug
          Components: contrib - DataImportHandler
    Affects Versions: 4.0
            Reporter: Jayendra Patil


Extracting a Zip file with TikaEntityProcessor returns only the names of the files it contains.
It does not extract the contents of the files in the Zip.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Updated: (SOLR-2332) TikaEntityProcessor retrieves only File Names from Zip extraction

Posted by "Jayendra Patil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jayendra Patil updated SOLR-2332:
---------------------------------

    Attachment: solr-word.zip
                SOLR-2332.patch

Attached is a patch containing the fix and a test case.
The zip file used by the test is attached as well.

> TikaEntityProcessor retrieves only File Names from Zip extraction
> -----------------------------------------------------------------
>
>                 Key: SOLR-2332
>                 URL: https://issues.apache.org/jira/browse/SOLR-2332
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>    Affects Versions: 4.0
>            Reporter: Jayendra Patil
>         Attachments: SOLR-2332.patch, solr-word.zip
>
>
> Extraction of Zip files using TikaEntityProcessor results in only names of file.
> It does not extract the contents of the Files in the Zip



[jira] [Updated] (SOLR-2332) TikaEntityProcessor retrieves only File Names from Zip extraction

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2332:
---------------------------

    Fix Version/s:     (was: 4.0)

Removing fixVersion=4.0 since there is no evidence that anyone is currently working on this issue. (This can certainly be revisited if volunteers step forward.)

                
> TikaEntityProcessor retrieves only File Names from Zip extraction
> -----------------------------------------------------------------
>
>                 Key: SOLR-2332
>                 URL: https://issues.apache.org/jira/browse/SOLR-2332
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>            Reporter: Jayendra Patil
>         Attachments: SOLR-2332.patch, solr-word.zip
>
>
> Extraction of Zip files using TikaEntityProcessor results in only names of file.
> It does not extract the contents of the Files in the Zip



[jira] [Commented] (SOLR-2332) TikaEntityProcessor retrieves only File Names from Zip extraction

Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210033#comment-13210033 ] 

Lance Norskog commented on SOLR-2332:
-------------------------------------

Unpacking a zip file is a very narrow, focused operation. It could also be done with a separate UpdateRequestHandler that does nothing but unpack zip files. It would use the basic JDK zip file code, not Tika. You would configure the Tika handler beneath it.

Another use case is a zip file full of Solr update XML files, which Tika does not know about. To handle this, you want an UpdateRequestHandler stack like this: zip unpacker -> XmlUpdateRequestHandler
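
As a rough illustration of the "basic JDK zip file code" step only, here is a minimal sketch using java.util.zip.ZipInputStream. The class name ZipUnpackSketch and the methods unpack and sampleZip are hypothetical names made up for this example; they are not part of Solr or of the attached patch, and the sketch does not show how such an unpacker would be wired in as an UpdateRequestHandler. It simply reads each entry's bytes so they could be handed to whatever sits beneath it (Tika, an XML update handler, and so on).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not a Solr class.
public class ZipUnpackSketch {

    /** Reads every non-directory entry of a zip stream into memory, keyed by entry name. */
    public static Map<String, byte[]> unpack(InputStream in) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry a name but no content
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the whole archive
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                contents.put(entry.getName(), out.toByteArray());
            }
        }
        return contents;
    }

    /** Builds a small in-memory zip so the sketch runs without any external file. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("docs/"));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/a.txt"));
            zip.write("first document".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/b.xml"));
            zip.write("<add/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> contents = unpack(new ByteArrayInputStream(sampleZip()));
        for (Map.Entry<String, byte[]> e : contents.entrySet()) {
            // In a real handler stack, each entry's bytes would be passed to the next handler.
            System.out.println(e.getKey() + " -> " + e.getValue().length + " bytes");
        }
    }
}
```

Buffering each entry fully in memory keeps the sketch short; a real handler would more likely stream each entry to the downstream handler.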

                
> TikaEntityProcessor retrieves only File Names from Zip extraction
> -----------------------------------------------------------------
>
>                 Key: SOLR-2332
>                 URL: https://issues.apache.org/jira/browse/SOLR-2332
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>            Reporter: Jayendra Patil
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2332.patch, solr-word.zip
>
>
> Extraction of Zip files using TikaEntityProcessor results in only names of file.
> It does not extract the contents of the Files in the Zip



[jira] Updated: (SOLR-2332) TikaEntityProcessor retrieves only File Names from Zip extraction

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-2332:
---------------------------

    Affects Version/s:     (was: 4.0)
        Fix Version/s: 3.2

I can't find any docs suggesting how exactly TikaEntityProcessor should be expected to deal with zip files, particularly what to expect if a zip file contains multiple documents.

FWIW: TikaEntityProcessor did not exist in Solr 1.4.1, so the behavior currently seen in the 3x branch (and the 3.1rc1 artifacts) is not a regression.

> TikaEntityProcessor retrieves only File Names from Zip extraction
> -----------------------------------------------------------------
>
>                 Key: SOLR-2332
>                 URL: https://issues.apache.org/jira/browse/SOLR-2332
>             Project: Solr
>          Issue Type: Bug
>          Components: contrib - DataImportHandler
>            Reporter: Jayendra Patil
>             Fix For: 3.2
>
>         Attachments: SOLR-2332.patch, solr-word.zip
>
>
> Extraction of Zip files using TikaEntityProcessor results in only names of file.
> It does not extract the contents of the Files in the Zip
