Posted to user@tika.apache.org by Gary McGath <de...@mcgath.com> on 2013/04/01 16:06:45 UTC

Tika metadata values

Is there any documentation of the metadata values (e.g., compression
types) that Tika can return? I've been trying to find them in the source
code and not having much luck; a grep of the whole directory turns up,
for example, the string "lzw" in the test files but nowhere else, and I
know that Tika really does return "lzw" as a compression type.

What I'm trying to do is map the values returned by the Tika API to the
set of strings that's used by another application, to the extent that
it's possible.
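
A minimal sketch of the kind of mapping described above, assuming a
hand-maintained lookup table. The class name, the method name, and the
target term "LZW" are hypothetical placeholders; only the value "lzw" comes
from this thread, and nothing here is part of the Tika API.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: none of these names come from Tika or from the
// target application.
final class CompressionVocabulary {

    // Assumed mapping from values seen in Tika output to the other tool's
    // terms. "lzw" is the only value mentioned in the thread; the target
    // term "LZW" is a placeholder.
    private static final Map<String, String> TIKA_TO_TARGET = Map.of(
            "lzw", "LZW");

    private CompressionVocabulary() {}

    // Returns the target term, or empty when the value has no known mapping,
    // so that unmapped values can be reported instead of guessed.
    static Optional<String> toTarget(String tikaValue) {
        if (tikaValue == null) {
            return Optional.empty();
        }
        String key = tikaValue.trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(TIKA_TO_TARGET.get(key));
    }

    public static void main(String[] args) {
        System.out.println(toTarget("lzw").orElse("(unmapped)"));
        // "jpeg" is an arbitrary example of a value with no table entry.
        System.out.println(toTarget("jpeg").orElse("(unmapped)"));
    }
}
```

Returning an empty Optional for unknown values keeps the "to the extent
that it's possible" case explicit: values without an entry can be collected
and reviewed as they turn up.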

-- 
Gary McGath, Professional Software Developer
http://www.garymcgath.com

Re: Tika metadata values

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 1 Apr 2013, Gary McGath wrote:
>> You can get some idea of the kinds of metadata Tika will return by
>> running the tika-app jar with the --list-met-models option. However, I
>> think that might need a bit of tweaking since the work done to make more
>> of the Tika metadata be properties based.
>
> That's very useful stuff, giving a list of the property names, but I've
> seen properties returned (e.g., "X Resolution") which aren't given
> there. What I'd really like, though, is a way to find out all the values
> that can be returned for a given property; in other words, a controlled
> vocabulary, if there is one. This would let Tika collaborate better with
> other tools.

Tika tries to maintain backwards compatibility between releases. As we've
evolved the API and improved the metadata framework, older keys have been
kept, so you'll see things like a bare "X Resolution" come through, normally
along with a more structured property as well.

For something that's properties based, you should be able to get a fair
idea of the possible values from the property definition and the Javadocs.
Wherever possible, we re-use external metadata standards, so you can often
go and look up their spec to see what you'll get.

For anything that isn't properties based, you'll want to raise an
enhancement request to get it converted!

Nick

Re: Tika metadata values

Posted by Gary McGath <de...@mcgath.com>.
On 4/1/13 10:48 AM, Nick Burch wrote:
> On Mon, 1 Apr 2013, Gary McGath wrote:
>> Is there any documentation of the metadata values (e.g., compression
>> types) that Tika can return?
> 
> Metadata? Or mime type? If I wanted to know if a file was a .tar.gz
> compressed archive, or a .arj one, I'd use the mimetype detection in
> Tika rather than parsing the file to get the metadata.

The MIME type is useful but not sufficient for many purposes. For
characterizing a file in a digital repository, it's desirable to pull
out as much information as possible. The application which I'm working
on pulls characterization information out of a number of tools and tries
to coordinate them under a common vocabulary (which can be like herding
cats).
> 
> You can get some idea of the kinds of metadata Tika will return by
> running the tika-app jar with the --list-met-models option. However, I
> think that might need a bit of tweaking since the work done to make more
> of the Tika metadata be properties based.

That's very useful stuff, giving a list of the property names, but I've
seen properties returned (e.g., "X Resolution") which aren't given
there. What I'd really like, though, is a way to find out all the values
that can be returned for a given property; in other words, a controlled
vocabulary, if there is one. This would let Tika collaborate better with
other tools.

> Ray - don't suppose you fancy doing some work to expose more of the
> great metadata properties work you did via the Tika App? :)


-- 
Gary McGath, Professional Software Developer
http://www.garymcgath.com

Re: Tika metadata values

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 1 Apr 2013, Gary McGath wrote:
> Is there any documentation of the metadata values (e.g., compression
> types) that Tika can return?

Metadata? Or mime type? If I wanted to know if a file was a .tar.gz
compressed archive, or a .arj one, I'd use the mimetype detection in Tika
rather than parsing the file to get the metadata.

You can get some idea of the kinds of metadata Tika will return by running 
the tika-app jar with the --list-met-models option. However, I think that 
might need a bit of tweaking since the work done to make more of the Tika 
metadata be properties based.

Ray - don't suppose you fancy doing some work to expose more of the great 
metadata properties work you did via the Tika App? :)

Nick