Posted to user@tika.apache.org by Paul Jakubik <pa...@purediscovery.com> on 2010/07/12 03:09:48 UTC

I want to capture metadata from individual files in a package

Hi,

I want to be able to parse zip, tar.gz, etc. files and extract metadata from
each file in the package. When I looked through the code, it looks like the
package parser creates a separate metadata object for each file in the
package, and I don't see a way to get to that object.

- Are there any plans for adding the ability to extract metadata from each
file in the package?

Here are the first two ways I thought of that this could be implemented:
- Add metadata to the context, and add a clear method so the user can clear
the metadata after each file in the package is parsed (in the ContentHandler
when a "div" element is closed).
- Write all metadata to the header section of the generated XHTML for each
document.

Will there be a way to get this metadata anytime soon?

Paul

Re: I want to capture metadata from individual files in a package

Posted by Jonathan Koren <jo...@soe.ucsc.edu>.
On Jul 11, 2010, at 6:09 PM, Paul Jakubik wrote:
> Hi,
> 
> I want to be able to parse zip, tar.gz, etc. files and extract metadata from each file in the package. When I looked through the code, it looks like the package parser creates a separate metadata object for each file in the package, and I don't see a way to get to that object.
> 
> - Are there any plans for adding the ability to extract metadata from each file in the package?
> 
> Here are the first two ways I thought of that this could be implemented:
> - Add metadata to the context, and add a clear method so the user can clear the metadata after each file in the package is parsed (in the ContentHandler when a "div" element is closed).
> - Write all metadata to the header section of the generated XHTML for each document.
> 
> Will there be a way to get this metadata anytime soon?


I asked about this same thing almost exactly a year ago.
http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/200906.mbox/%3c3949E4F8-0ACF-4BA4-8FFC-57AF8A783C69@soe.ucsc.edu%3e

and got unceremoniously shot down.
http://mail-archives.apache.org/mod_mbox/lucene-tika-dev/200907.mbox/%3c510143ac0907300409u699a3953t9b2dfbd6bb63367a@mail.gmail.com%3e

	That would confuse the distinction between metadata and normal
	document text, especially when just the character stream is extracted.
	If you need access to the entry metadata, it would probably be better
	to expose it as attributes of the package-entry div.
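
For illustration, here is a minimal sketch of how a consumer could read entry metadata if it were exposed as attributes of the package-entry div, as the reply above suggests. It uses only the JDK's SAX classes. The class name PackageEntryAttributeCollector and the idea that every non-"class" attribute is a metadata field are assumptions made for this example, not existing Tika behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the attributes of every
// <div class="package-entry" ...> element it sees.
class PackageEntryAttributeCollector extends DefaultHandler {

    private final List<Map<String, String>> entries = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware.
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name) || !"package-entry".equals(atts.getValue("class"))) {
            return;
        }
        // Assumption: every attribute other than "class" is entry metadata.
        Map<String, String> meta = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            String key = atts.getLocalName(i).isEmpty() ? atts.getQName(i) : atts.getLocalName(i);
            if (!"class".equals(key)) {
                meta.put(key, atts.getValue(i));
            }
        }
        entries.add(meta);
    }

    // One map per package-entry div, in document order.
    List<Map<String, String>> entries() {
        return entries;
    }
}
```

Because each map is tied to one div, this keeps the association between a contained file and its metadata.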

Still really really want this feature.  

I don't like the idea of writing all the metadata in the HEAD, because then you don't know which file had which metadata.  I guess you could always make the ID attribute of XHTML's META tag correspond to the filename of the contained file, but it feels kind of sloppy to do that.

Mostly, I think the problem comes from trying to shove everything into XHTML instead of just fully embracing XML.  XHTML isn't necessarily the best choice, especially for package files.  If XML were used, there's no reason why you couldn't have something that looked like:
	<FILE>
		<META key="" value="" />
		<CONTENT>
			<FILE>
				<META key="" value="" />
				<CONTENT>
				</CONTENT>
			</FILE>
			<FILE>
				<META key="" value="" />
				<CONTENT>
				</CONTENT>
			</FILE>
		</CONTENT>
	</FILE>

But the XHTML-vs-XML ship has sailed, so there's no point in re-litigating that.  Perhaps it's something to consider for version 2.0.

An alternative way of handling this would be to create a nonrecursive version of AutoDetectParser.  That way, when the parser returned the metadata for the package, it could set a metadata key such as isPackage=TRUE.  The user could then get an iterator over the files contained in the package and manually call AutoDetectParserNonRecursive on each of them, thus getting the metadata as needed.
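
A rough sketch of that iteration, for zip packages only and using just java.util.zip from the JDK.  Both PackageEntryWalker and the EntryParser interface are hypothetical names for this example; EntryParser stands in for the proposed nonrecursive parser, which does not exist in Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class PackageEntryWalker {

    // Hypothetical stand-in for a nonrecursive parser: it looks at one
    // contained file and returns that file's metadata as key/value pairs.
    interface EntryParser {
        Map<String, String> parse(String entryName, InputStream entryData) throws IOException;
    }

    // Walks a zip stream and returns metadata keyed by entry name,
    // in the order the entries appear in the archive.
    static Map<String, Map<String, String>> walk(InputStream zipData, EntryParser parser)
            throws IOException {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // While an entry is open, reads on the ZipInputStream stop at
                // the end of that entry, so the parser sees one file only.
                // The parser must not close the stream it is given.
                result.put(entry.getName(), parser.parse(entry.getName(), zip));
            }
        }
        return result;
    }
}
```

The caller supplies the per-file parser, so each contained file gets its own metadata map and nothing is merged into the package-level metadata.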

--
Jonathan Koren
jonathan@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/