Posted to user@tika.apache.org by "Willy T. Koch" <ti...@kochkonsult.no> on 2022/02/10 21:30:37 UTC

Returning file extension alongside mime-type?

Hi,
New Tika user here. We're really impressed by the Tika toolkit and are planning to use it as a Docker service in our case management solution, which is used by the public sector in the Nordics for many different use cases.

For content detection, Tika currently returns the content-type field with the mime type. What we need is a mime-type to file extension lookup, and it seems logical for Tika to return this as well.

After some research, I found some quite extensive lists mapping mime types to file extensions, based on the official IANA list and on the Apache and nginx servers.
Examples:
https://www.iana.org/assignments/media-types/media-types.xhtml

https://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/conf/mime.types

Is this an add-on that could be considered as part of a standard Tika setup? Has this need been discussed before?


Regards,
Willy T. Koch
Norway


Re: Returning file extension alongside mime-type?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 24 Feb 2022, Tim Allison wrote:
> A separate endpoint, then?  That would be cleaner.

We already have some endpoints for mime details; this would be an extension 
of those or a related endpoint. See the earlier thread:
https://lists.apache.org/thread/jlym8ypnrj978hmzjgvkc1fpxnc7g51h

Nick

Re: Returning file extension alongside mime-type?

Posted by Tim Allison <ta...@apache.org>.
A separate endpoint, then?  That would be cleaner.

On Thu, Feb 24, 2022 at 6:31 AM Nick Burch <ap...@gagravarr.org> wrote:
>
> On Tue, 22 Feb 2022, Tim Allison wrote:
> > I guess the question is how far do we want to bake this in?  I could see
> > adding a field for the default extension in the
> > CompositeDetector/DefaultDetector.  This would then be triggered on
> > embedded files, too.  I can't imagine this would add much cost
> > computationally(???), and it would just show up for free all over the
> > place.
>
> Ah, I thought this would be something that required two API hits. Having
> done your detection or parsing, you'd then query a mimetype related API to
> get extra details on the type you were told your file was.
>
> You could also pre-check types you think you'd be interested in, or grab
> all the details on all the types, if you so wanted
>
> Nick

Re: Returning file extension alongside mime-type?

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 22 Feb 2022, Tim Allison wrote:
> I guess the question is how far do we want to bake this in?  I could see 
> adding a field for the default extension in the 
> CompositeDetector/DefaultDetector.  This would then be triggered on 
> embedded files, too.  I can't imagine this would add much cost 
> computationally(???), and it would just show up for free all over the 
> place.

Ah, I thought this would be something that required two API hits. Having 
done your detection or parsing, you'd then query a mimetype related API to 
get extra details on the type you were told your file was.

You could also pre-check types you think you'd be interested in, or grab 
all the details on all the types, if you so wanted

Nick

Re: Returning file extension alongside mime-type?

Posted by Tim Allison <ta...@apache.org>.
I guess the question is how far do we want to bake this in?  I could
see adding a field for the default extension in the
CompositeDetector/DefaultDetector.  This would then be triggered on
embedded files, too.  I can't imagine this would add much cost
computationally(???), and it would just show up for free all over the
place.

It does feel a bit smelly to add this one feature, but I've done worse
in my career. :(

Or, do we want a custom handler/parameter on the detect/ endpoint in
tika-server?

Is the use case that you want to parse the file _and_ get this
information in one go?  Or, are you only running detect on the
main/container file?

On Thu, Feb 17, 2022 at 2:00 PM Nick Burch <ap...@gagravarr.org> wrote:
>
> On Thu, 10 Feb 2022, Nick Burch wrote:
> > On Thu, 10 Feb 2022, Willy T. Koch wrote:
> >> …and calling it as a webservice with Postman/curl.
> >
> > Ah, I think we might not be exposing the full details of the mime types via
> > the server, only details of their parsers and the hierarchy, eg
> > http://localhost:9998/mime-types#audio/vorbis
> >
> > (We have that info in Java, we're just seemingly not making it available)
> >
> >
> > I'm not sure about exposing all the details of all the types by default,
> > but adding a flag and/or a sub-endpoint that would return the full
> > details of a type, including extensions and comments etc, seems OK to
> > me. Thoughts anyone?
>
> Tika devs - any thoughts on this? It's a pretty small code change (we
> already have the data on the mime type!), just need feedback on extending
> the existing API vs adding a new one
>
> Nick

Re: Returning file extension alongside mime-type?

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 8 Mar 2022, Willy T. Koch wrote:
> That’s fantastic, thank you!
>
> Looking forward to testing when the Tika Docker repo is updated with 
> this release.

That may take a few weeks, but if you don't mind building Tika from 
source, you should be able to give it a whirl now. (As far as I'm aware, 
we don't build the docker image from snapshots)

If you check out Tika from source - 
https://tika.apache.org/contribute.html#Source_Code - and build the 
project with Maven, you should then be able to go to the tika-docker/full 
directory and build the Docker image locally

Nick


Re: Returning file extension alongside mime-type?

Posted by "Willy T. Koch" <ti...@kochkonsult.no>.
That’s fantastic, thank you! 

Looking forward to testing when the Tika Docker repo is updated with this release. 

Regards,

Willy T. Koch

On Tue, 8 Mar 2022, at 00:35, Nick Burch wrote:
> On Fri, 18 Feb 2022, Willy T. Koch wrote:
> > On Thu, 17 Feb 2022, at 20:00, Nick Burch wrote:
> >> Tika devs - any thoughts on this? It's a pretty small code change (we
> >> already have the data on the mime type!), just need feedback on extending
> >> the existing API vs adding a new one
> >
> > By also returning the default/most commonly used file extension, Apache 
> > Tika in Docker will be the perfect security companion for SaaS 
> > solutions.
> >
> > To be able to verify all files before they are archived will prevent 
> > different errors down the line, like with PDF conversion and document 
> > production.
> 
> OK, this is now implemented. Should be in 2.3.1 or 2.4, whatever the next 
> release is.
> 
> You will need to make an additional request to 
> /mime-types/{type}/{subtype} eg /mime-types/application/cbor to get the 
> full details on the type. You ought to be able to cache that though in 
> case it helps.
> 
> See https://issues.apache.org/jira/browse/TIKA-3694 for a bit more detail 
> and the example JSON you'll get
> 
> Nick
> 

Re: Returning file extension alongside mime-type?

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 18 Feb 2022, Willy T. Koch wrote:
> On Thu, 17 Feb 2022, at 20:00, Nick Burch wrote:
>> Tika devs - any thoughts on this? It's a pretty small code change (we
>> already have the data on the mime type!), just need feedback on extending
>> the existing API vs adding a new one
>
> By also returning the default/most commonly used file extension, Apache 
> Tika in Docker will be the perfect security companion for SaaS 
> solutions.
>
> To be able to verify all files before they are archived will prevent 
> different errors down the line, like with PDF conversion and document 
> production.

OK, this is now implemented. Should be in 2.3.1 or 2.4, whatever the next 
release is.

You will need to make an additional request to 
/mime-types/{type}/{subtype} eg /mime-types/application/cbor to get the 
full details on the type. You ought to be able to cache that though in 
case it helps.

See https://issues.apache.org/jira/browse/TIKA-3694 for a bit more detail 
and the example JSON you'll get

Nick
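
A minimal Java sketch of the two-request pattern described above, assuming a
tika-server reachable at http://localhost:9998 (the port used elsewhere in
this thread). The class name MimeDetailsCache, its method names and the
Accept header are illustrative assumptions, not part of Tika. The response
body is returned unparsed, since the JSON layout is described in TIKA-3694
rather than in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: looks up per-type details once and then serves them from memory. */
final class MimeDetailsCache {
    private final String base;                    // server base URL, always ending in "/"
    private final Function<URI, String> fetcher;  // turns a URI into a response body
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    MimeDetailsCache(URI serverBase, Function<URI, String> fetcher) {
        String s = serverBase.toString();
        this.base = s.endsWith("/") ? s : s + "/";
        this.fetcher = fetcher;
    }

    /** Builds <base>mime-types/{type}/{subtype} from a value such as "application/cbor". */
    URI detailsUri(String mimeType) {
        // Drop parameters such as "; charset=UTF-8" and normalise the case.
        String bare = mimeType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        // Deliberately strict: only characters that are safe in a URI path.
        if (!bare.matches("[a-z0-9.+_-]+/[a-z0-9.+_-]+")) {
            throw new IllegalArgumentException("Not a type/subtype pair: " + mimeType);
        }
        return URI.create(base + "mime-types/" + bare);
    }

    /** Returns the raw response body for a type, fetching it only on the first request. */
    String details(String mimeType) {
        URI uri = detailsUri(mimeType);
        String key = uri.toString();
        String body = cache.get(key);
        if (body == null) {
            body = fetcher.apply(uri);
            cache.put(key, body);
        }
        return body;
    }

    /** Plain HTTP GET; asking for JSON via the Accept header is an assumption. */
    static String httpGet(URI uri) {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "application/json")
                .GET()
                .build();
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IllegalStateException("HTTP " + response.statusCode() + " for " + uri);
            }
            return response.body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching " + uri, e);
        }
    }
}
```

Against a running server this might be used as
new MimeDetailsCache(URI.create("http://localhost:9998"), MimeDetailsCache::httpGet).details("application/cbor").
Passing the fetcher in as a function keeps the caching logic testable without
a server.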

Re: Returning file extension alongside mime-type?

Posted by "Willy T. Koch" <ti...@kochkonsult.no>.
On Thu, 17 Feb 2022, at 20:00, Nick Burch wrote:
> On Thu, 10 Feb 2022, Nick Burch wrote:
> > On Thu, 10 Feb 2022, Willy T. Koch wrote:
> >> …and calling it as a webservice with Postman/curl.
> >
> > Ah, I think we might not be exposing the full details of the mime types via 
> > the server, only details of their parsers and the hierarchy, eg
> > http://localhost:9998/mime-types#audio/vorbis
> >
> > (We have that info in Java, we're just seemingly not making it available)
> >
> >
> > I'm not sure about exposing all the details of all the types by default, 
> > but adding a flag and/or a sub-endpoint that would return the full 
> > details of a type, including extensions and comments etc, seems OK to 
> > me. Thoughts anyone?
> 
> Tika devs - any thoughts on this? It's a pretty small code change (we 
> already have the data on the mime type!), just need feedback on extending 
> the existing API vs adding a new one
> 
> Nick

By also returning the default/most commonly used file extension, Apache Tika in Docker will be the perfect security companion for SaaS solutions. 

Being able to verify all files before they are archived will prevent various errors down the line, for example in PDF conversion and document production.

Re: Returning file extension alongside mime-type?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 10 Feb 2022, Nick Burch wrote:
> On Thu, 10 Feb 2022, Willy T. Koch wrote:
>> …and calling it as a webservice with Postman/curl.
>
> Ah, I think we might not be exposing the full details of the mime types via 
> the server, only details of their parsers and the hierarchy, eg
> http://localhost:9998/mime-types#audio/vorbis
>
> (We have that info in Java, we're just seemingly not making it available)
>
>
> I'm not sure about exposing all the details of all the types by default, 
> but adding a flag and/or a sub-endpoint that would return the full 
> details of a type, including extensions and comments etc, seems OK to 
> me. Thoughts anyone?

Tika devs - any thoughts on this? It's a pretty small code change (we 
already have the data on the mime type!), just need feedback on extending 
the existing API vs adding a new one

Nick

Re: Returning file extension alongside mime-type?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 10 Feb 2022, Willy T. Koch wrote:
> …and calling it as a webservice with Postman/curl.

Ah, I think we might not be exposing the full details of the mime types 
via the server, only details of their parsers and the hierarchy, eg
http://localhost:9998/mime-types#audio/vorbis

(We have that info in Java, we're just seemingly not making it available)


I'm not sure about exposing all the details of all the types by default, 
but adding a flag and/or a sub-endpoint that would return the full details 
of a type, including extensions and comments etc, seems OK to me. Thoughts 
anyone?

Nick

Re: Returning file extension alongside mime-type?

Posted by "Willy T. Koch" <ti...@kochkonsult.no>.
…and calling it as a webservice with Postman/curl. 

Willy

On Thu, 10 Feb 2022, at 22:43, Willy T. Koch wrote:
> Ah, that’s good news, will look into that!
> 
> I’ve only been using the 2.2.1-full official Tika docker image with default config, only added some more Tesseract languages for OCR. 
> 
> Kind regards
> 
> Willy T. Koch
> 
> 
> On Thu, 10 Feb 2022, at 22:40, Nick Burch wrote:
>> On Thu, 10 Feb 2022, Willy T. Koch wrote:
>> > As for content detection, today the content-type field with mime type is 
>> > returned. What we would need is a mime-type to file extension lookup and 
>> > it seems logical that this was also returned by Tika.
>> 
>> How are you calling Tika? We already have APIs for this. Just ask the 
>> MimeTypes class (available via TikaConfig.getMimeRepository) about a type, 
>> and it'll return the details including the preferred extension and other 
>> possible well-known extensions
>> 
>> Nick
>> 
> 

Re: Returning file extension alongside mime-type?

Posted by "Willy T. Koch" <ti...@kochkonsult.no>.
Ah, that’s good news, will look into that!

I’ve only been using the 2.2.1-full official Tika docker image with default config, only added some more Tesseract languages for OCR. 

Kind regards

Willy T. Koch


On Thu, 10 Feb 2022, at 22:40, Nick Burch wrote:
> On Thu, 10 Feb 2022, Willy T. Koch wrote:
> > As for content detection, today the content-type field with mime type is 
> > returned. What we would need is a mime-type to file extension lookup and 
> > it seems logical that this was also returned by Tika.
> 
> How are you calling Tika? We already have APIs for this. Just ask the 
> MimeTypes class (available via TikaConfig.getMimeRepository) about a type, 
> and it'll return the details including the preferred extension and other 
> possible well-known extensions
> 
> Nick
> 

Re: Returning file extension alongside mime-type?

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 10 Feb 2022, Willy T. Koch wrote:
> As for content detection, today the content-type field with mime type is 
> returned. What we would need is a mime-type to file extension lookup and 
> it seems logical that this was also returned by Tika.

How are you calling Tika? We already have APIs for this. Just ask the 
MimeTypes class (available via TikaConfig.getMimeRepository) about a type, 
and it'll return the details including the preferred extension and other 
possible well-known extensions

Nick
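
For readers calling Tika from Java, the lookup described above could look
roughly like the sketch below. It assumes tika-core is on the classpath and
uses the default TikaConfig. The class ExtensionLookup and its method names
are made up for illustration, and wrapping MimeTypeException in an unchecked
exception is just one possible choice.

```java
import java.util.List;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.mime.MimeTypes;

/** Hypothetical helper around the mime repository mentioned above. */
final class ExtensionLookup {
    // The repository is loaded once from Tika's default configuration.
    private static final MimeTypes REPOSITORY =
            TikaConfig.getDefaultConfig().getMimeRepository();

    /** Preferred extension including the leading dot, or "" if none is known. */
    static String preferredExtension(String mimeType) {
        return lookup(mimeType).getExtension();
    }

    /** All well-known extensions registered for the type. */
    static List<String> knownExtensions(String mimeType) {
        return lookup(mimeType).getExtensions();
    }

    private static MimeType lookup(String mimeType) {
        try {
            return REPOSITORY.forName(mimeType);
        } catch (MimeTypeException e) {
            // Thrown when the name is not a syntactically valid media type.
            throw new IllegalArgumentException("Invalid media type: " + mimeType, e);
        }
    }

    public static void main(String[] args) {
        String type = args.length > 0 ? args[0] : "application/pdf";
        System.out.println(type + " -> " + preferredExtension(type)
                + " " + knownExtensions(type));
    }
}
```

This only helps when Tika is embedded as a library; it does not apply when
Tika is reached purely as a web service.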