Posted to user@tika.apache.org by אברהם חיון <av...@gmail.com> on 2014/04/23 10:35:20 UTC

Correct use of Tika's MediaType

I want to use Tika's MediaType class to compare media types.

I first use Tika to detect the MediaType, and then I want to start an action
according to it.

If the MediaType is an XML type I want to do one action; if it is a
compressed file I want to start another.

My problem is that there are many XML types, so how do I check whether a
MediaType is XML?

Here is my previous (before Tika) implementation:

if (contentType.contains("text/xml") ||
    contentType.contains("application/xml") ||
    contentType.contains("application/x-xml") ||
    contentType.contains("application/atom+xml") ||
    contentType.contains("application/rss+xml")) {
        processXML();
}

else if (contentType.contains("application/gzip") ||
    contentType.contains("application/x-gzip") ||
    contentType.contains("application/x-gunzip") ||
    contentType.contains("application/gzipped") ||
    contentType.contains("application/gzip-compressed") ||
    contentType.contains("application/x-compress") ||
    contentType.contains("gzip/document") ||
    contentType.contains("application/octet-stream")) {
        processGzip();
}

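I want to switch it to use Tika, with something like the following:

MediaType mediaType = MediaType.parse(contentType);
// Compare with equals(), not ==: parse() is not guaranteed to return the
// constant instances, and it returns null for an unparseable string.
if (MediaType.APPLICATION_XML.equals(mediaType)) {
    return processXml();
} else if (MediaType.APPLICATION_ZIP.equals(mediaType)
        || MediaType.OCTET_STREAM.equals(mediaType)) {
    return processGzip();
}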

But the problem is that Tika.detect(...) returns many different types that
don't have a MediaType constant.

How can I tell whether a MediaType is some kind of XML, or some kind of
compressed file? I need a "parent" type that covers all of its children,
perhaps a method such as "boolean isXML()" that covers application/xml,
text/xml and application/x-xml, or "boolean isCompress()" that covers all
of the zip and gzip types.

Re: Correct use of Tika's MediaType

Posted by אברהם חיון <av...@gmail.com>.
I have raised issues 1281 and 1282 in the Tika Jira.

The first is for the additional XML type.
The second is for the 4 additional gzip types.



Avi.



Re: Correct use of Tika's MediaType

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 25 Apr 2014, אברהם חיון wrote:
> I pitched the team to drop support for unrecognized (by Tika)
> media-types, and if Tika decides to insert them into its registry then
> we will support them automatically.

If you want additional types supported, please raise a bug in jira, and 
list them there. Someone'll hopefully review and commit them fairly 
quickly from there!

> The GZIP format is as follows in Wikipedia:
> http://en.wikipedia.org/wiki/Gzip
>
> The MediaType according to Wikipedia is application/gzip, while in the TIKA
> DB it is: "*application/x-gzip*" and the "*application/gzip*"  is totally
> left out (not even an alias) !?

Looks like those were only added quite recently, from the date of the RFC. 
I've raised TIKA-1280 to track it

Nick

Re: Correct use of Tika's MediaType

Posted by אברהם חיון <av...@gmail.com>.
Nick, thank you for everything.


I humbly accept all of your comments and will check the mediatype then
recurse through supertypes.
I will also check aliases of my expected Media-types to enhance the
media-type recognition.


I pitched the team to drop support for media types that Tika does not
recognize; if Tika later adds them to its registry, we will support them
automatically.

I still have one question, which might point to a missing media type or
alias in Tika. If that is the case, I will open an issue in Tika's bug
tracker.

The gzip format is described on Wikipedia:
http://en.wikipedia.org/wiki/Gzip

According to Wikipedia the media type is application/gzip, while the Tika
database only has "application/x-gzip"; "application/gzip" is left out
entirely (it is not even an alias).

Is this a bug, or am I missing something?








Re: Correct use of Tika's MediaType

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 24 Apr 2014, אברהם חיון wrote:
>> These two are aliases. You might need to check you're using the 
>> canonical form
>>
> *Can you please elaborate?   What is the difference between the alias and
> the canonical form ?*

From the Tika mimetypes file:

   <mime-type type="application/xml">
     <acronym>XML</acronym>
     <_comment>Extensible Markup Language</_comment>
     <tika:link>http://en.wikipedia.org/wiki/Xml</tika:link>
     <tika:uti>public.xml</tika:uti>
     <alias type="text/xml"/>

So, the official / canonical mimetype is application/xml, while text/xml 
is an alias for it.

MediaTypeRegistry - 
http://tika.apache.org/1.5/api/org/apache/tika/mime/MediaTypeRegistry.html 
- can give you the aliases for a given canonical type. You can use the 
normalize call to turn the alias into the canonical form if needed
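
A minimal sketch of what turning an alias into its canonical form amounts to. It deliberately does not use Tika: AliasTable, addAlias and this normalize method are hypothetical stand-ins written with plain Java collections, and the single alias entry is copied from the mimetypes excerpt above. With Tika itself, the normalize call on MediaTypeRegistry mentioned above does this lookup against the real registry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the alias part of a media type registry.
class AliasTable {
    // Maps an alias (for example "text/xml") to its canonical type.
    private final Map<String, String> aliasToCanonical = new HashMap<>();

    void addAlias(String canonical, String alias) {
        aliasToCanonical.put(alias, canonical);
    }

    // Returns the canonical name for a known alias, or the input unchanged.
    String normalize(String type) {
        return aliasToCanonical.getOrDefault(type, type);
    }

    public static void main(String[] args) {
        AliasTable table = new AliasTable();
        // From the excerpt: <alias type="text/xml"/> under application/xml
        table.addAlias("application/xml", "text/xml");

        System.out.println(table.normalize("text/xml"));          // application/xml
        System.out.println(table.normalize("application/xml"));   // application/xml
        System.out.println(table.normalize("application/x-xml")); // unchanged: not a known alias
    }
}
```

Comparing types only after normalizing both sides means text/xml and application/xml are treated as the same type, while a name the table does not know, such as application/x-xml, simply stays as it is.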


>> Tika doesn't know about this, is it a common alias?
>
> *Not used a lot, but several places list it as an XML type, like here:*
> http://filext.com/file-extension/XML
> *or*
> http://help.dottoro.com/lapuadlp.php

If they're commonly used aliases, please open a jira and suggest them

> *Where should I look to see the right and acceptable mediaType / aliases of
> every format ?*

https://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

Nick

Re: Correct use of Tika's MediaType

Posted by אברהם חיון <av...@gmail.com>.
On Thu, Apr 24, 2014 at 12:11 PM, Nick Burch <ap...@gagravarr.org> wrote:

> On Thu, 24 Apr 2014, אברהם חיון wrote:
>> Here is the simple code (Thank you Nick):
>> List<MediaType> mts = new ArrayList<MediaType>();
>> // All of these should return XML type
>> mts.add(MediaType.parse("text/xml"));
>> mts.add(MediaType.parse("application/xml"));
>
> These two are aliases. You might need to check you're using the canonical
> form

*Can you please elaborate? What is the difference between the alias and
the canonical form?*

>> mts.add(MediaType.parse("application/x-xml"));
>
> Tika doesn't know about this, is it a common alias?

*Not used a lot, but several places list it as an XML type, like here:*
http://filext.com/file-extension/XML
*or*
http://help.dottoro.com/lapuadlp.php

*Where should I look to see the right and acceptable mediaType / aliases of
every format?*

>> mts.add(MediaType.parse("application/atom+xml"));
>> mts.add(MediaType.parse("application/rss+xml"));
>>
>> // All of these should return Compress or ZIP type
>> mts.add(MediaType.parse("application/gzip"));
>> mts.add(MediaType.parse("application/x-gzip"));
>> mts.add(MediaType.parse("application/x-compress"));
>
> None of these is zip! That's application/zip . These are all different
> compression formats to zip

*You are right, my bad.*

>> mts.add(MediaType.parse("application/x-gunzip"));
>> mts.add(MediaType.parse("application/gzipped"));
>> mts.add(MediaType.parse("application/gzip-compressed"));
>> mts.add(MediaType.parse("gzip/document"));
>
> Tika doesn't know about any of those, if they're common you might want to
> suggest them as new aliases and/or new mime types

*They are listed in several places, though I am not sure they are listed in
the "Official" places.*

> Nick

Re: Correct use of Tika's MediaType

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 24 Apr 2014, אברהם חיון wrote:
> Here is the simple code (Thank you Nick):
> List<MediaType> mts = new ArrayList<MediaType>();
> // All of these should return XML type
> mts.add(MediaType.parse("text/xml"));
> mts.add(MediaType.parse("application/xml"));

These two are aliases. You might need to check you're using the canonical 
form

> mts.add(MediaType.parse("application/x-xml"));

Tika doesn't know about this, is it a common alias?

> mts.add(MediaType.parse("application/atom+xml"));
> mts.add(MediaType.parse("application/rss+xml"));

> // All of these should return Compress or ZIP type
> mts.add(MediaType.parse("application/gzip"));
> mts.add(MediaType.parse("application/x-gzip"));
> mts.add(MediaType.parse("application/x-compress"));

None of these is zip! That's application/zip . These are all different 
compression formats to zip

> mts.add(MediaType.parse("application/x-gunzip"));
> mts.add(MediaType.parse("application/gzipped"));
> mts.add(MediaType.parse("application/gzip-compressed"));
> mts.add(MediaType.parse("gzip/document"));

Tika doesn't know about any of those, if they're common you might want to 
suggest them as new aliases and/or new mime types

Nick

Re: Correct use of Tika's MediaType

Posted by אברהם חיון <av...@gmail.com>.
OK, I ran the code.


But the results are not what I expected (from my perspective :-)  ).


Here is the simple code (Thank you Nick):
List<MediaType> mts = new ArrayList<MediaType>();
// All of these should return XML type
mts.add(MediaType.parse("text/xml"));
mts.add(MediaType.parse("application/xml"));
mts.add(MediaType.parse("application/x-xml"));
mts.add(MediaType.parse("application/atom+xml"));
mts.add(MediaType.parse("application/rss+xml"));

// All of these should return Compress or ZIP type
mts.add(MediaType.parse("application/gzip"));
mts.add(MediaType.parse("application/x-gzip"));
mts.add(MediaType.parse("application/x-gunzip"));
mts.add(MediaType.parse("application/gzipped"));
mts.add(MediaType.parse("application/gzip-compressed"));
mts.add(MediaType.parse("application/x-compress"));
mts.add(MediaType.parse("gzip/document"));

AutoDetectParser parser = new AutoDetectParser();
MediaTypeRegistry registry = parser.getMediaTypeRegistry();

for (MediaType mediaType : mts) {
        System.out.println("Original: " + mediaType.toString());
        MediaType supertype = registry.getSupertype(mediaType);
        System.out.println("  supertype: " + supertype);
}


* Please note that I didn't loop/recurse, because each of the types above
has only one parent, so recursing with my types didn't yield different
results
* Please note that I hoped the first group would resolve to
MediaType.APPLICATION_XML
* Please note that I hoped the second group would resolve to
MediaType.APPLICATION_ZIP


The results are as follows:
Original: text/xml
  supertype: text/plain

Original: application/xml
  supertype: text/plain

Original: application/x-xml
  supertype: application/octet-stream

Original: application/atom+xml
  supertype: application/xml

Original: application/rss+xml
  supertype: application/xml

Original: application/gzip
  supertype: application/octet-stream

Original: application/x-gzip
  supertype: application/octet-stream

Original: application/x-gunzip
  supertype: application/octet-stream

Original: application/gzipped
  supertype: application/octet-stream

Original: application/gzip-compressed
  supertype: application/octet-stream

Original: application/x-compress
  supertype: application/octet-stream

Original: gzip/document
  supertype: application/octet-stream





As you can see from the results:
* The first two types report "text/plain" as their supertype, which is no
good for me.
* The third type reports "application/octet-stream", which is not what I
hoped for either.
* All of the compressed types report "application/octet-stream", which
doesn't really help me either.
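
One way to read these results: a single supertype lookup on application/xml gives its parent (text/plain), so an "is this XML?" test has to compare the type itself first and only then climb. The sketch below illustrates that loop under stated assumptions. It does not call Tika; SupertypeWalk, ALIASES, PARENTS and isKindOf are hypothetical names, the parent table is copied from the output above, and the single alias comes from the Tika mimetypes file. Unknown types are assumed to sit directly under application/octet-stream, as the output suggests. In Tika, the corresponding lookups are MediaTypeRegistry's normalize and getSupertype; the registry may also offer a ready-made isInstanceOf helper, which is worth checking in the Javadoc for your version.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a registry lookup; not Tika code.
class SupertypeWalk {
    static final String OCTET_STREAM = "application/octet-stream";

    // alias -> canonical type
    static final Map<String, String> ALIASES = new HashMap<>();
    // child -> parent, copied from the printed results
    static final Map<String, String> PARENTS = new HashMap<>();

    static {
        ALIASES.put("text/xml", "application/xml");
        PARENTS.put("application/xml", "text/plain");
        PARENTS.put("application/atom+xml", "application/xml");
        PARENTS.put("application/rss+xml", "application/xml");
        PARENTS.put("application/x-gzip", OCTET_STREAM);
    }

    // True if type is target itself, or target appears somewhere above it.
    static boolean isKindOf(String type, String target) {
        // Use the canonical form before comparing anything.
        String current = ALIASES.getOrDefault(type, type);
        while (true) {
            if (current.equals(target)) {
                return true;   // hit the type of interest
            }
            if (current.equals(OCTET_STREAM)) {
                return false;  // reached the root without a match
            }
            // Types missing from the table are treated as children of the root.
            current = PARENTS.getOrDefault(current, OCTET_STREAM);
        }
    }

    public static void main(String[] args) {
        System.out.println(isKindOf("text/xml", "application/xml"));             // true
        System.out.println(isKindOf("application/atom+xml", "application/xml")); // true
        System.out.println(isKindOf("application/x-gzip", "application/xml"));   // false
        System.out.println(isKindOf("application/x-xml", "application/xml"));    // false
    }
}
```

The same walk answers the compressed-file question only for types the registry actually knows: in this table application/x-gzip has no gzip-specific parent, so it has to be matched by name rather than through a common ancestor.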




Re: Correct use of Tika's MediaType

Posted by אברהם חיון <av...@gmail.com>.
Thank you Nick.


Using that code I can easily recurse to the parent MediaType.


I wonder what the main parent types are, but I will try it tonight and see
what I get.


I will report my success / failure.



Thanks,
Avi.




Re: Correct use of Tika's MediaType

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 23 Apr 2014, אברהם חיון wrote:
> I need to download the Tika source code for that and I am still at work.

It's all in SVN, so you can just browse it:
http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?view=markup

And view it raw:
http://svn.apache.org/repos/asf/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java

Nick

Re: Correct use of Tika's MediaType

Posted by אברהם חיון <av...@gmail.com>.
Not yet.

I need to download the Tika source code for that and I am still at work.

It just seemed that you are much more familiar with it than I am, so I asked
for some code. I will look into it tonight.


And I thank you for your efforts.



Avi.



Re: Correct use of Tika's MediaType

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 23 Apr 2014, אברהם חיון wrote:
> Nick, can you help me with some code?

Did you try my suggestion?

> Look at the displaySupportedTypes() method of TikaCLI for an example of
> getting the supertype

Nick

Re: Correct use of Tika's MediaType

Posted by אברהם חיון <av...@gmail.com>.
Nick, can you help me with some code?

Can you throw in some pseudo code (I won't hold you to the accuracy) so I
can understand the strategy a little better?



Re: Correct use of Tika's MediaType

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 23 Apr 2014, אברהם חיון wrote:
> If the MediaType is an XML type I want to do one action; if it is a
> compressed file I want to start another.
>
> My problem is that there are many XML types, so how do I check whether a
> MediaType is XML?

Check the supertype, and then recurse checking supertypes of that, until 
you either hit the type of interest, or hit octet stream

Look at the displaySupportedTypes() method of TikaCLI for an example of 
getting the supertype, then recurse on that if you want to keep checking 
up the tree

Nick