Posted to dev@tika.apache.org by "Manish (JIRA)" <ji...@apache.org> on 2011/04/11 00:49:05 UTC

[jira] [Created] (TIKA-637) Need API to get list of embedded documents

Need API to get list of embedded documents
------------------------------------------

                 Key: TIKA-637
                 URL: https://issues.apache.org/jira/browse/TIKA-637
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.0
            Reporter: Manish


Apache Tika works great for extracting the content and metadata of documents. It would be even better if it had an API that also returned each embedded document's input stream along with its content and metadata.

For example, when extracting a zip file, the output could be a list of <text, metadata, inputstream> entries, one per document, or Tika could invoke a callback for each <text, metadata, inputstream>. The same API could then be used both for text extraction and for pulling individual documents out of container files.

I have already done this for zip and PST, but a standard API would be great.
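
To make the requested callback shape concrete, here is a minimal sketch that uses only java.util.zip, not Tika. The names EmbeddedDocumentCallback, ZipContainerWalker, walk, listDocuments and sampleZip are hypothetical and invented for this illustration; they are not part of any Tika API. Text extraction is left out: the callback receives only a small metadata map and an input stream per entry, and the demo simply decodes each entry as UTF-8. It assumes Java 9 or later for InputStream.readAllBytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical callback invented for this sketch; not a Tika interface. */
interface EmbeddedDocumentCallback {
    void onDocument(Map<String, String> metadata, InputStream content) throws IOException;
}

class ZipContainerWalker {

    /** Invokes the callback once for every file entry in the zip stream. */
    static void walk(InputStream container, EmbeddedDocumentCallback callback) {
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                Map<String, String> metadata = new LinkedHashMap<>();
                metadata.put("name", entry.getName());
                // Buffer the entry so the callback gets an independent stream
                // that it may read or close freely.
                byte[] bytes = zip.readAllBytes();
                callback.onDocument(metadata, new ByteArrayInputStream(bytes));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** List-style variant: collects "name=content" for each entry, decoding as UTF-8. */
    static List<String> listDocuments(byte[] zipBytes) {
        List<String> docs = new ArrayList<>();
        walk(new ByteArrayInputStream(zipBytes), (metadata, content) ->
                docs.add(metadata.get("name") + "="
                        + new String(content.readAllBytes(), StandardCharsets.UTF_8)));
        return docs;
    }

    /** Builds a two-entry zip in memory so the sketch needs no files on disk. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("b.txt"));
            zip.write("world".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(listDocuments(sampleZip()));
    }
}
```

Buffering each entry keeps the sketch simple but gives up streaming; a real implementation for large containers would probably hand the callback a bounded view of the underlying stream instead.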


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-637) Need API to get list of embedded documents

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018170#comment-13018170 ] 

Nick Burch commented on TIKA-637:
---------------------------------

Doesn't org.apache.tika.extractor.ParserContainerExtractor do what you need?

[jira] [Resolved] (TIKA-637) Need API to get list of embedded documents

Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-637.
-----------------------------

    Resolution: Not A Problem

Closing as "Not A Problem", as this is handled by supplying a recursing parser on the ParseContext. For an example, see how the -z option in TikaCLI works.
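
The idea of recursing into embedded documents can be illustrated without Tika. The sketch below is only an analogy built on java.util.zip: when the walker meets a nested .zip entry, it hands that entry back to the same method. That is the general pattern behind a recursing parser, but it is not Tika's code and does not use ParseContext. RecursiveZipWalker, collect, zipOf, sampleNestedZip and the "!/" path separator are all hypothetical names chosen for this example, and Java 9 or later is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class RecursiveZipWalker {

    /** Adds the path of every leaf entry to paths, descending into nested .zip entries. */
    static void collect(InputStream container, String prefix, List<String> paths) {
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                String path = prefix + entry.getName();
                if (path.endsWith(".zip")) {
                    // Buffer the nested container and recurse with the same method.
                    byte[] nested = zip.readAllBytes();
                    collect(new ByteArrayInputStream(nested), path + "!/", paths);
                } else {
                    paths.add(path);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a zip in memory from name/content pairs, in iteration order. */
    static byte[] zipOf(Map<String, byte[]> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                zip.write(e.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** A zip holding one text file and one nested zip with its own text file. */
    static byte[] sampleNestedZip() {
        Map<String, byte[]> inner = new LinkedHashMap<>();
        inner.put("note.txt", "hi".getBytes(StandardCharsets.UTF_8));
        Map<String, byte[]> outer = new LinkedHashMap<>();
        outer.put("readme.txt", "top".getBytes(StandardCharsets.UTF_8));
        outer.put("inner.zip", zipOf(inner));
        return zipOf(outer);
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        collect(new ByteArrayInputStream(sampleNestedZip()), "", paths);
        System.out.println(paths);
    }
}
```

Detecting containers by file extension is a shortcut for the sketch; Tika itself detects types from content, so the comparison holds only for the recursion pattern.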
                
> Need API to get list of embedded documents
> ------------------------------------------
>
>                 Key: TIKA-637
>                 URL: https://issues.apache.org/jira/browse/TIKA-637
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Manish

[jira] [Commented] (TIKA-637) Need API to get list of embedded documents

Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018239#comment-13018239 ] 

Maxim Valyanskiy commented on TIKA-637:
---------------------------------------

The Tika CLI app has a "-z" option that extracts all embedded files to the current directory.
