Posted to user@tika.apache.org by Chris Bamford <cb...@mimecast.com> on 2016/11/09 16:42:13 UTC

Mime type matching: tika-mimetypes.xml

Hi, I was wondering exactly what this syntax means in tika-mimetypes.xml:


<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Status:" type="string" offset="0"/>
    …
      <match value="Message-ID:" type="string" offset="0:8192"/>
    </match>
  </magic>
  ...

</mime-type>

Does offset="0:8192" mean match 'Message-ID:' anywhere in the first 8192 bytes?  If so, I'm not sure it is working properly as I have some eml files with this string near the beginning (but not at byte offset 0) where it does not match.  Is there some other logic involved which I am missing?

Thanks,

Chris


Chris Bamford
Lead Software Engineer

CityPoint, One Ropemaker Street, 
London,
EC2Y 9AW.

mobile +44 7860 405292
tel: +44 (0) 207 847 8700
web www.mimecast.com



Re: Mime type matching: tika-mimetypes.xml

Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 9 Nov 2016, Chris Bamford wrote:
> <mime-type type="message/rfc822">
>  <magic priority="50">
>    <match value="Status:" type="string" offset="0"/>
>    …
>      <match value="Message-ID:" type="string" offset="0:8192"/>
>    </match>
>  </magic>
>  ...
>
> </mime-type>
>
> Does offset="0:8192" mean match 'Message-ID:' anywhere in the first 8192 
> bytes?

Yup, that's it. If that is found, and nothing with a priority higher than 
50 also matches, it'll return that type. If something with a higher 
priority also matches, that one wins instead.

(There's also some logic for when the file extension matches a type in the 
same family, e.g. for specialising.)
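
To make the offset range concrete, here is a minimal, self-contained sketch of how offset="0:8192" can be read: the pattern may start at any byte offset from 0 up to 8192. This is not Tika's own code. The class name, the method name, and the sample message are hypothetical, and treating the end of the range as an inclusive start offset is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;

public class OffsetRangeSketch {

    /**
     * Returns true if pattern occurs in data starting at any offset
     * in [begin, end]. The inclusive end is an assumption of this sketch.
     */
    static boolean matchesInRange(byte[] data, byte[] pattern, int begin, int end) {
        for (int start = begin; start <= end; start++) {
            if (start + pattern.length > data.length) {
                return false; // pattern no longer fits in the remaining bytes
            }
            boolean same = true;
            for (int i = 0; i < pattern.length; i++) {
                if (data[start + i] != pattern[i]) { // exact, case-sensitive byte compare
                    same = false;
                    break;
                }
            }
            if (same) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical message whose Message-ID header is not at byte offset 0.
        byte[] eml = ("Received: from example\r\n"
                + "Message-ID: <1@example>\r\n").getBytes(StandardCharsets.US_ASCII);
        byte[] pattern = "Message-ID:".getBytes(StandardCharsets.US_ASCII);

        System.out.println(matchesInRange(eml, pattern, 0, 0));    // like offset="0"
        System.out.println(matchesInRange(eml, pattern, 0, 8192)); // like offset="0:8192"
    }
}
```

With the sample bytes, the first call prints false and the second prints true. The comparison in this sketch is byte-for-byte, so a header spelled "Message-Id:" would not match the pattern "Message-ID:" here. Whether that applies to any particular file can only be checked against the file itself.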

> If so, I'm not sure it is working properly as I have some eml files with 
> this string near the beginning (but not at byte offset 0) where it does 
> not match.  Is there some other logic involved which I am missing?

If you can share a small file that shows it, we can take a look for you.

Nick