Posted to dev@tika.apache.org by Jonathan Koren <jo...@soe.ucsc.edu> on 2009/06/19 01:46:19 UTC

metadata and package files

I'm parsing a package file, let's say foo.tar.gz.  AutoDetectParser
does the right thing in the sense that it returns an XHTML document
containing an entry for each file in the tar file inside the gzip
file.  However, the metadata object returned by the top-level
AutoDetectParser contains only the metadata for the outermost
package, i.e. the gzip.  Obviously the metadata for each file within
the tar exists; otherwise PackageParser wouldn't be able to use
AutoDetectParser to correctly chain down within the file.  (That is,
somewhere foo.tar/foo.pdf is tagged as application/pdf to enable
PDFParser to correctly convert it to text.)

Examining the returned XHTML reveals nothing.  It's just a bunch of
<div class="package-entry">s delineating the different entries in the
TAR.  Is there a way to get the metadata for each entry within a
package file, and I'm just missing it?  If not, it seems like
PackageParser could be modified to spit out a bunch of DIVs of the
form: <div class="metadata" name="METADATA-KEY">METADATA-VALUE</div>


--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/

Re: metadata and package files

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Fri, Jun 19, 2009 at 1:46 AM, Jonathan Koren<jo...@soe.ucsc.edu> wrote:
> Is there a way to get the metadata for each entry within a package file,
> and I'm just missing it?

No, it's currently not possible. The rationale is the same as the
reason we don't include the top-level metadata in the XHTML output:
metadata is not really a part of the normal text content of the
document (it's not rendered by default, etc.).

> If not, it seems like PackageParser could be modified
> to spit out a bunch of DIVs of the form: <div class="metadata"
> name="METADATA-KEY">METADATA-VALUE</div>

That would confuse the distinction between metadata and normal
document text, especially when just the character stream is extracted.
If you need access to the entry metadata, it would probably be better
to expose it as attributes of the package-entry div.

BR,

Jukka Zitting
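
As an illustration of the suggestion above, here is a minimal sketch of
how a client could read entry metadata if it were exposed as attributes
of the package-entry div. The attribute names (name, content-type) and
the sample XHTML are assumptions made for this example; nothing in this
thread says PackageParser emits them today. The sketch uses only the
JDK's SAX classes, not Tika itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PackageEntryAttributes {

    /** Collects the attributes of every <div class="package-entry"> element. */
    static class EntryHandler extends DefaultHandler {
        final List<Map<String, String>> entries = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // The default SAXParserFactory is not namespace-aware, so match on qName.
            if ("div".equals(qName) && "package-entry".equals(atts.getValue("class"))) {
                Map<String, String> entry = new LinkedHashMap<>();
                for (int i = 0; i < atts.getLength(); i++) {
                    String name = atts.getQName(i);
                    if (!"class".equals(name)) {
                        // Hypothetical metadata attributes, e.g. name, content-type.
                        entry.put(name, atts.getValue(i));
                    }
                }
                entries.add(entry);
            }
        }
    }

    /** Parses an XHTML string and returns one attribute map per package entry. */
    static List<Map<String, String>> readEntries(String xhtml) throws Exception {
        EntryHandler handler = new EntryHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.entries;
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample input; not actual PackageParser output.
        String xhtml = "<body>"
                + "<div class=\"package-entry\" name=\"foo.pdf\""
                + " content-type=\"application/pdf\"><p>text</p></div>"
                + "<div class=\"package-entry\" name=\"bar.txt\""
                + " content-type=\"text/plain\"><p>more</p></div>"
                + "</body>";
        for (Map<String, String> entry : readEntries(xhtml)) {
            System.out.println(entry);
        }
    }
}
```

Because the values live in attributes, a handler that extracts only the
character stream never sees them, which keeps metadata separate from the
normal document text.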
