Posted to dev@tika.apache.org by "Jonathan Koren (JIRA)" <ji...@apache.org> on 2009/06/26 02:26:07 UTC

[jira] Created: (TIKA-252) PackageParser's XHTML should contain metadata of subfiles

PackageParser's XHTML should contain metadata of subfiles
---------------------------------------------------------

                 Key: TIKA-252
                 URL: https://issues.apache.org/jira/browse/TIKA-252
             Project: Tika
          Issue Type: Improvement
            Reporter: Jonathan Koren


Currently PackageParser only sets the Metadata based on the outermost file type.  For instance, a gzipped tar containing PDFs will have Metadata.Content-Type set to application/gzip, and the MIME types of the internal files (the PDFs) will be lost.

It would be nice if the metadata found when parsing the contained PDFs were recoverable.  Perhaps in a sequence like:
<div class="metadata"><span class="METADATA-KEY">METADATA-VALUE</span>...</div> within the <div class="package-file">
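To make the proposed markup concrete, here is a minimal sketch of how one subfile's metadata could be rendered. It is plain Java with no Tika dependency; the class name SubfileMetadataXhtml and the methods render() and escape() are hypothetical names used only for illustration and are not part of PackageParser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: renders one subfile's metadata in the proposed layout.
class SubfileMetadataXhtml {

    // Escape characters that are unsafe in XHTML text and attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // One <span> per metadata entry: class attribute = key, text = value.
    static String render(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder();
        sb.append("<div class=\"package-file\">");
        sb.append("<div class=\"metadata\">");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<span class=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue()))
              .append("</span>");
        }
        sb.append("</div></div>");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");
        System.out.println(render(metadata));
    }
}
```

A real implementation would presumably emit these elements as SAX events through the parser's content handler rather than build a string; the string form is only meant to show the nesting.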


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-252) PackageParser's XHTML should contain metadata of subfiles

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757777#action_12757777 ] 

Ken Krugler commented on TIKA-252:
----------------------------------

I've run into something similar. I recently wrote an mbox parser for Tika, since I need that for my Bixo web crawler.

A single mbox file logically decomposes into multiple documents (one per email). I can and do currently treat it as a single document, where I use XHTML <ul> lists for each message's headers. But it would work better from the client perspective if the metadata being returned by the parse() call could be used as expected - e.g. DublinCore's SUBJECT, DATE, and CREATOR match up with each email's subject, date and author header fields.
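As a rough illustration of the header matching described above, the sketch below copies one message's Subject, Date, and From headers into a per-message map. The class MailHeaderMapping and the target key strings "subject", "date", and "creator" are assumptions made for this example; real code would use the constants that Tika's Metadata class provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps raw email headers to Dublin Core style keys.
class MailHeaderMapping {

    // Returns a new map holding only the headers that have a mapping.
    static Map<String, String> toDublinCore(Map<String, String> headers) {
        Map<String, String> dc = new LinkedHashMap<>();
        copy(headers, "Subject", dc, "subject"); // assumed key name
        copy(headers, "Date", dc, "date");       // assumed key name
        copy(headers, "From", dc, "creator");    // assumed key name
        return dc;
    }

    // Copy a value under a new key, skipping headers that are absent.
    private static void copy(Map<String, String> from, String fromKey,
                             Map<String, String> to, String toKey) {
        String value = from.get(fromKey);
        if (value != null) {
            to.put(toKey, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Subject", "hello");
        headers.put("From", "someone@example.org");
        System.out.println(toDublinCore(headers));
    }
}
```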

An alternative idea is that you could make the parse() API callable multiple times, where it incrementally processes the input stream, and returns a boolean for whether or not additional data remains. The parser becomes more complex, in that it would need to maintain some state (probably in the context param) but it would be a pretty minor change for the caller.
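A sketch of what such an incremental call might look like, in plain Java with no Tika dependency. Everything here is hypothetical: parseNext() stands in for a repeatable parse(), the State class stands in for state kept in the context param, and the mbox handling is deliberately simplified (messages are split on lines starting with "From ", and folded header lines are ignored).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IncrementalMboxSketch {

    // Hypothetical per-stream state, standing in for the "context param".
    static final class State {
        boolean started;   // true once the first "From " separator was consumed
        boolean exhausted; // true once the reader reached end of stream
    }

    // Fills metadata with the next message's headers.
    // Returns true if another message follows, false at end of input.
    static boolean parseNext(BufferedReader in, Map<String, String> metadata,
                             State state) throws IOException {
        metadata.clear();
        if (state.exhausted) {
            return false;
        }
        String line;
        if (!state.started) {
            // Skip anything before the first message separator.
            do {
                line = in.readLine();
            } while (line != null && !line.startsWith("From "));
            if (line == null) {
                state.exhausted = true;
                return false;
            }
            state.started = true;
        }
        boolean inHeaders = true;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("From ")) {
                return true; // separator of the next message was consumed
            }
            if (!inHeaders) {
                continue; // message body: ignored in this sketch
            }
            if (line.isEmpty()) {
                inHeaders = false; // blank line ends the header block
                continue;
            }
            int colon = line.indexOf(':');
            if (colon > 0 && !Character.isWhitespace(line.charAt(0))) {
                metadata.put(line.substring(0, colon).trim(),
                             line.substring(colon + 1).trim());
            }
        }
        state.exhausted = true;
        return false;
    }

    // Caller side: one extra loop around the parse call.
    static List<Map<String, String>> parseAll(String mbox) {
        List<Map<String, String>> messages = new ArrayList<>();
        BufferedReader in = new BufferedReader(new StringReader(mbox));
        State state = new State();
        try {
            boolean more;
            do {
                Map<String, String> metadata = new LinkedHashMap<>();
                more = parseNext(in, metadata, state);
                if (!metadata.isEmpty()) {
                    messages.add(metadata);
                }
            } while (more);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox = "From a@example.org Thu Jan  1 00:00:00 2009\n"
                + "Subject: first\n\nbody\n"
                + "From b@example.org Thu Jan  1 00:00:01 2009\n"
                + "Subject: second\n\nbody\n";
        System.out.println(parseAll(mbox));
    }
}
```

The change on the caller's side is limited to the do/while loop in parseAll(); the parser carries the extra burden of remembering where it stopped.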


> PackageParser's XHTML should contain metadata of subfiles
> ---------------------------------------------------------
>
>                 Key: TIKA-252
>                 URL: https://issues.apache.org/jira/browse/TIKA-252
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Jonathan Koren
>            Priority: Minor
>
> Currently PackageParser only sets the Metadata based on the outermost file type.  For instance, a gzipped tar containing PDFs will have Metadata.Content-Type set to application/gzip, and the MIME types of the internal files (the PDFs) will be lost.
> It would be nice if the metadata found when parsing the contained PDFs were recoverable.  Perhaps in a sequence like:
> <div class="metadata"><span class="METADATA-KEY">METADATA-VALUE</span>...</div> within the <div class="package-file">



[jira] Updated: (TIKA-252) PackageParser's XHTML should contain metadata of subfiles

Posted by "Jonathan Koren (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Koren updated TIKA-252:
--------------------------------

          Component/s: parser
             Priority: Minor  (was: Major)
    Affects Version/s: 0.4

