Posted to user@tika.apache.org by Peter Conrad <cy...@quisquis.de> on 2022/09/27 07:23:47 UTC

Validate MIME-type

Hello,

I'm working on a server application where clients can upload pieces of
data together with the data's MIME type, and I would like to verify
that the given data is valid in terms of the given type (for a broad
definition of "valid").

I have tried using Tika.detect in various ways, but the results have
not been satisfactory so far. The general problem is that multiple MIME
types might be valid for some given data, whereas Tika will return only
one match.

For example, a piece of HTML source that is valid as "text/html" would
also be valid as "text/plain"; a piece of text with charset US-ASCII
would also be valid with charset UTF-8.

While I've had *some* success with giving the client-provided type as
metadata to Tika.detect(), at least in the text/* case, there are other
cases where not only multiple subtypes may apply but also multiple
supertypes (e.g. the string "P2 2 2 1 0 1 1 0" is valid text/plain but
also valid image/x-portable-graymap; here Tika always returns the
image type, never the text type). Also, using the client-provided
MIME type sometimes leads to false results, e.g. the byte sequence
(1,2,3,4,5) would be accepted as image/gif, which it clearly isn't.

* Is there a way of using Tika to answer the question "is <data> a valid
  instance of <type>"?
* Is there a way to ask Tika "give me all possible <type>s for <data>"
  instead of just "give me the best match"?
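
To make the second question concrete, here is a minimal sketch of what
"all possible types" could look like. It does not use Tika at all: the
class name PlausibleTypes and its three hand-rolled checks are
hypothetical stand-ins, far cruder than real detection, and only cover
the examples from this message.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Tika: collects every type whose rough check accepts the data.
class PlausibleTypes {

    static Set<String> candidates(byte[] data) {
        Set<String> result = new LinkedHashSet<>();
        if (startsWith(data, "GIF87a") || startsWith(data, "GIF89a")) {
            result.add("image/gif");
        }
        // Plain (ASCII) PGM: magic "P2" followed by whitespace.
        if (startsWith(data, "P2") && data.length > 2 && isWhitespace(data[2])) {
            result.add("image/x-portable-graymap");
        }
        // Very rough text check: non-empty and only printable ASCII or whitespace.
        if (data.length > 0 && isPrintableAscii(data)) {
            result.add("text/plain");
        }
        return result;
    }

    static boolean startsWith(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    static boolean isWhitespace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    static boolean isPrintableAscii(byte[] data) {
        for (byte b : data) {
            if (!(isWhitespace(b) || (b >= 0x20 && b < 0x7f))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pgm = "P2 2 2 1 0 1 1 0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(candidates(pgm));                        // [image/x-portable-graymap, text/plain]
        System.out.println(candidates(new byte[] {1, 2, 3, 4, 5})); // []
    }
}
```

The first question then reduces to a membership test: a claimed type is
accepted if it is contained in the returned set.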

Thanks for your suggestions,

	Peter
-- 
Cyrano UG (haftungsbeschränkt)
Alicestr. 102
63263 Neu-Isenburg
Germany

Tel.: +49 6102 821206

Managing Director: Peter Conrad

Offenbach District Court (AG Offenbach)
HRB No. 47931

VAT ID: DE296491819

Re: Validate MIME-type

Posted by Peter Conrad <cy...@quisquis.de>.
Hi,

On Thu, 29 Sep 2022 11:33:26 +0100 (BST),
Nick Burch <ni...@apache.org> wrote:

> Any chance you could write up a bit more about what you're trying to 
> achieve, and what you're trying to protect against?
> 
> It's ApacheCon next week, and we may be able to get a few of us
> together in-person to brainstorm what's possible in this area

the OP pretty much says it all: we have client-provided data and a
client-provided MIME type and want to check if that MIME type is
plausible for the data. We don't require full-fledged
validation/verification; we're happy with plausibility as provided
by e.g. the `file` command (i.e. look for "magic" bytes, apply
heuristics, etc.).
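
As a rough sketch of that level of plausibility (an illustration only,
not Tika code): the class MagicCheck and its small signature table are
hypothetical, and types without a table entry are simply rejected here,
which a real service would probably handle differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper, not part of Tika: checks a claimed type against known leading "magic" bytes.
class MagicCheck {

    // Tiny hand-written signature table; a real one would be far larger.
    static final Map<String, byte[][]> MAGIC = Map.of(
            "image/gif", new byte[][] {ascii("GIF87a"), ascii("GIF89a")},
            "image/png", new byte[][] {{(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}},
            "application/pdf", new byte[][] {ascii("%PDF-")});

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // True if the data starts with one of the signatures registered for the claimed type.
    // Assumption of this sketch: types missing from the table are treated as not plausible.
    static boolean isPlausible(String claimedType, byte[] data) {
        byte[][] signatures = MAGIC.get(claimedType);
        if (signatures == null) {
            return false;
        }
        for (byte[] sig : signatures) {
            if (data.length >= sig.length
                    && Arrays.equals(Arrays.copyOf(data, sig.length), sig)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isPlausible("image/gif", ascii("GIF89a...")));          // true
        System.out.println(isPlausible("image/gif", new byte[] {1, 2, 3, 4, 5}));  // false
        System.out.println(isPlausible("image/png", ascii("alert(1)")));           // false
    }
}
```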

This is just one simple layer of protection. (IIRC there was a bug in an
ancient IE version where the server would send JavaScript with an
image/* MIME type but the browser would still execute the JS. That
kind of thing should be caught.)

Any suggestions would be welcome.

Thanks,
	Peter

Re: Validate MIME-type

Posted by Nick Burch <ni...@apache.org>.
On Thu, 29 Sep 2022, Peter Conrad wrote:
> thanks. That's definitely an improvement. But I think it's not
> sufficient.
>
> AFAICS your code uses "aliases" as in "if it's type X then it can also
> be type Y". However, there are also cases where a specific instance of
> type X can also be type Y, but not all instances of type X. For example,
> the eicar.com antivirus test file is an MS-DOS executable consisting
> purely of ASCII characters, so it would be valid text/plain AND
> application/x-msdownload. But clearly not every text/plain file is a
> valid application/x-msdownload file, nor vice versa, so there can't be
> an alias connecting the two.

Any chance you could write up a bit more about what you're trying to 
achieve, and what you're trying to protect against?

It's ApacheCon next week, and we may be able to get a few of us together 
in-person to brainstorm what's possible in this area

Thanks
Nick

Re: Validate MIME-type

Posted by Peter Conrad <cy...@quisquis.de>.
Hi,

On Tue, 27 Sep 2022 09:35:42 +0200,
Tamás Cservenák <ta...@cservenak.net> wrote:

> See this class, IMO it does exactly what you want:
> https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/java/org/sonatype/nexus/mime/internal/DefaultMimeSupport.java#L138
> 
> Is able to detect several ("unravel" aliases and hierarchy) mime
> types by content or by filename.

thanks. That's definitely an improvement. But I think it's not
sufficient.

AFAICS your code uses "aliases" as in "if it's type X then it can also
be type Y". However, there are also cases where a specific instance of
type X can also be type Y, but not all instances of type X. For example,
the eicar.com antivirus test file is an MS-DOS executable consisting
purely of ASCII characters, so it would be valid text/plain AND
application/x-msdownload. But clearly not every text/plain file is a
valid application/x-msdownload file, nor vice versa, so there can't be
an alias connecting the two.

cu,
	Peter

Re: Validate MIME-type

Posted by Tamás Cservenák <ta...@cservenak.net>.
Well, Nx2 did exactly that :) But times changed with Nx3 :D :D

https://github.com/sonatype/nexus-public/blob/nexus-2.x/components/nexus-core/src/main/java/org/sonatype/nexus/mime/DefaultMimeSupport.java#L158-L172

T

On Tue, Sep 27, 2022 at 11:57 AM Nick Burch <ni...@apache.org> wrote:

> On Tue, 27 Sep 2022, Tamás Cservenák wrote:
> > See this class, IMO it does exactly what you want:
> >
> https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/java/org/sonatype/nexus/mime/internal/DefaultMimeSupport.java#L138
> >
> > Is able to detect several ("unravel" aliases and hierarchy) mime types by
> > content or by filename.
>
> Looks pretty good!
>
> I think there might be a few more cases where you want to check the
> supertype, and possibly some cases where you want to check the supertype
> of the supertype!
>
> Other code you can pinch ideas from is in TikaCLI, e.g. displaySupportedTypes
>
> https://github.com/apache/tika/blob/main/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L854
> and the parent type checking in compareFileMagic
>
> https://github.com/apache/tika/blob/main/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L943
>
> Nick

Re: Validate MIME-type

Posted by Nick Burch <ni...@apache.org>.
On Tue, 27 Sep 2022, Tamás Cservenák wrote:
> See this class, IMO it does exactly what you want:
> https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/java/org/sonatype/nexus/mime/internal/DefaultMimeSupport.java#L138
>
> Is able to detect several ("unravel" aliases and hierarchy) mime types by
> content or by filename.

Looks pretty good!

I think there might be a few more cases where you want to check the 
supertype, and possibly some cases where you want to check the supertype 
of the supertype!
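
A minimal sketch of that "supertype of the supertype" check, with
assumptions stated up front: the PARENT map below is a small
hand-written stand-in for a real type hierarchy, and the class and
method names are made up for illustration. Inside Tika, the hierarchy
would come from MediaTypeRegistry (e.g. its getSupertype method)
instead of a hard-coded map.

```java
import java.util.Map;

// Hypothetical helper: walks a type's chain of supertypes until it matches or runs out.
class SupertypeWalk {

    // Hand-written, illustrative parent table (child -> parent); not read from Tika.
    static final Map<String, String> PARENT = Map.of(
            "text/html", "text/plain",
            "application/xml", "text/plain",
            "image/svg+xml", "application/xml",
            "text/plain", "application/octet-stream");

    // True if 'claimed' equals 'detected' or any of its direct or indirect supertypes.
    static boolean isSameOrSubtypeOf(String detected, String claimed) {
        for (String t = detected; t != null; t = PARENT.get(t)) {
            if (t.equals(claimed)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Detected SVG, claimed text/plain: matches two levels up the chain.
        System.out.println(isSameOrSubtypeOf("image/svg+xml", "text/plain")); // true
        // Detected text/plain, claimed text/html: the walk only goes upwards.
        System.out.println(isSameOrSubtypeOf("text/plain", "text/html"));     // false
    }
}
```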

Other code you can pinch ideas from is in TikaCLI, e.g. displaySupportedTypes:
https://github.com/apache/tika/blob/main/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L854
and the parent type checking in compareFileMagic:
https://github.com/apache/tika/blob/main/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L943

Nick

Re: Validate MIME-type

Posted by Tamás Cservenák <ta...@cservenak.net>.
Howdy,

We did something similar in a Maven repository manager codebase (Nx1/Nx2,
and the same in Nx3), as we had exactly the same requirements:

See this class, IMO it does exactly what you want:
https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/java/org/sonatype/nexus/mime/internal/DefaultMimeSupport.java#L138

It is able to detect several MIME types ("unraveling" aliases and
hierarchy), by content or by filename.

Also, it was important to override some Tika defaults (for example, in the
Maven universe the ".rar" extension usually means a resource-adapter JAR
and not the RAR compression format). That was achieved by augmenting Tika
with rules such as the "built-ins" (the rule set is user-extensible):
https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/resources/builtin-mimetypes.properties
and
https://github.com/sonatype/nexus-public/blob/main/components/nexus-mime/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml

As for your first question, Nx3 does "content validation" like this (using
the class above):
https://github.com/sonatype/nexus-public/blob/main/components/nexus-repository-services/src/main/java/org/sonatype/nexus/repository/mime/DefaultContentValidator.java

HTH
T

