Posted to user@tika.apache.org by Gary McGath <de...@mcgath.com> on 2013/04/01 16:06:45 UTC
Tika metadata values
Is there any documentation of the metadata values (e.g., compression
types) that Tika can return? I've been trying to find them in the source
code and not having much luck; a grep of the whole directory turns up,
for example, the string "lzw" in the test files but nowhere else, and I
know that Tika really does return "lzw" as a compression type.
What I'm trying to do is map the values returned by the Tika API to the
set of strings that's used by another application, to the extent that
it's possible.
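
A minimal sketch of the kind of mapping described above, assuming the
values arrive as plain strings. Only "lzw" is taken from this thread; the
class name, the second entry, and every target-vocabulary term are
hypothetical placeholders, and no Tika API is called here.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/**
 * Sketch: translate metadata values reported by one tool into another
 * application's vocabulary. Only "lzw" comes from the thread; the other
 * source value and all target terms are hypothetical placeholders.
 */
public class CompressionVocabulary {

    // Source value (lower-cased) -> term in the hypothetical target vocabulary.
    private static final Map<String, String> TO_TARGET = Map.of(
            "lzw", "LZW",                    // value mentioned in the thread
            "uncompressed", "Uncompressed"   // hypothetical extra entry
    );

    /** Returns the target term, or empty if the value has no known mapping. */
    public static Optional<String> toTarget(String sourceValue) {
        if (sourceValue == null) {
            return Optional.empty();
        }
        // Normalize so that " LZW " and "lzw" are treated alike.
        String key = sourceValue.trim().toLowerCase(Locale.ROOT);
        return Optional.ofNullable(TO_TARGET.get(key));
    }

    public static void main(String[] args) {
        System.out.println(toTarget("lzw").orElse("(unmapped)"));
        System.out.println(toTarget("jpeg2000").orElse("(unmapped)"));
    }
}
```

Returning an empty Optional for unknown values, rather than guessing, keeps
unmapped strings visible so they can be added to the table once the full
set of possible values is known.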
--
Gary McGath, Professional Software Developer
http://www.garymcgath.com
Re: Tika metadata values
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 1 Apr 2013, Gary McGath wrote:
>> You can get some idea of the kinds of metadata Tika will return by
>> running the tika-app jar with the --list-met-models option. However, I
>> think that might need a bit of tweaking since the work done to make more
>> of the Tika metadata be properties based.
>
> That's very useful stuff, giving a list of the property names, but I've
> seen properties returned (e.g., "X Resolution") which aren't given
> there. What I'd really like, though, is a way to find out all the values
> that can be returned for a given property; in other words, a controlled
> vocabulary, if there is one. This would let Tika collaborate better with
> other tools.
Tika tries to maintain backwards compatibility between releases. As we've
evolved the API and improved the metadata framework, you'll still see
things like a bare "X Resolution" come through, normally along with a more
structured type as well.
For something that's properties based, you should be able to get a fair
idea of the possible values from the property definition and the
Javadocs. Wherever possible, we re-use external metadata standards, so you
can often go and look up their spec to see what you'll get.
For anything that isn't properties based, you'll want to raise an
enhancement to get it converted!
Nick
Re: Tika metadata values
Posted by Gary McGath <de...@mcgath.com>.
On 4/1/13 10:48 AM, Nick Burch wrote:
> On Mon, 1 Apr 2013, Gary McGath wrote:
>> Is there any documentation of the metadata values (e.g., compression
>> types) that Tika can return?
>
> Metadata? Or MIME type? If I wanted to know if a file was a .tar.gz
> compressed archive, or a .arj one, I'd use the MIME type detection in
> Tika rather than parsing the file to get the metadata.
The MIME type is useful but not sufficient for many purposes. For
characterizing a file in a digital repository, it's desirable to pull
out as much information as possible. The application which I'm working
on pulls characterization information out of a number of tools and tries
to coordinate them under a common vocabulary (which can be like herding
cats).
>
> You can get some idea of the kinds of metadata Tika will return by
> running the tika-app jar with the --list-met-models option. However, I
> think that might need a bit of tweaking since the work done to make more
> of the Tika metadata be properties based.
That's very useful stuff, giving a list of the property names, but I've
seen properties returned (e.g., "X Resolution") which aren't given
there. What I'd really like, though, is a way to find out all the values
that can be returned for a given property; in other words, a controlled
vocabulary, if there is one. This would let Tika collaborate better with
other tools.
> Ray - don't suppose you fancy doing some work to expose more of the
> great metadata properties work you did via the Tika App? :)
--
Gary McGath, Professional Software Developer
http://www.garymcgath.com
Re: Tika metadata values
Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 1 Apr 2013, Gary McGath wrote:
> Is there any documentation of the metadata values (e.g., compression
> types) that Tika can return?
Metadata? Or MIME type? If I wanted to know if a file was a .tar.gz
compressed archive, or a .arj one, I'd use the MIME type detection in Tika
rather than parsing the file to get the metadata.
You can get some idea of the kinds of metadata Tika will return by running
the tika-app jar with the --list-met-models option. However, I think that
might need a bit of tweaking since the work done to make more of the Tika
metadata be properties based.
Ray - don't suppose you fancy doing some work to expose more of the great
metadata properties work you did via the Tika App? :)
Nick