Posted to user@tika.apache.org by Chris Bamford <cb...@mimecast.com> on 2016/11/09 16:42:13 UTC
Mime type matching: tika-mimetypes.xml
Hi, I was wondering exactly what this syntax means in tika-mimetypes.xml:
<mime-type type="message/rfc822">
<magic priority="50">
<match value="Status:" type="string" offset="0"/>
…
<match value="Message-ID:" type="string" offset="0:8192"/>
</match>
</magic>
...
</mime-type>
Does offset="0:8192" mean match 'Message-ID:' anywhere in the first 8192 bytes? If so, I'm not sure it is working properly as I have some eml files with this string near the beginning (but not at byte offset 0) where it does not match. Is there some other logic involved which I am missing?
Thanks,
Chris
Re: Mime type matching: tika-mimetypes.xml
Posted by Nick Burch <ap...@gagravarr.org>.
On Wed, 9 Nov 2016, Chris Bamford wrote:
> <mime-type type="message/rfc822">
> <magic priority="50">
> <match value="Status:" type="string" offset="0"/>
> …
> <match value="Message-ID:" type="string" offset="0:8192"/>
> </match>
> </magic>
> ...
>
> </mime-type>
>
> Does offset="0:8192" mean match 'Message-ID:' anywhere in the first 8192
> bytes?
Yup, that's it. If that is found, and nothing with a priority higher
than 50 also matches, it'll return that type. If a higher-priority
magic also matches, that other one will win.
(There's also some logic for when the file extension matches a type in
the same family, e.g. for specialising.)
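
To make the two rules above concrete, here is a minimal Python sketch of how a ranged offset and a priority could combine. It is a simplified model and not Tika's actual code: the `MagicRule` class, the helper names, the inclusive treatment of the upper bound, and the `application/x-hypothetical` rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MagicRule:
    """Hypothetical stand-in for one <match> inside a <magic priority="..."> block."""
    mime_type: str
    priority: int
    value: bytes
    start: int  # first offset at which the value may begin
    end: int    # last offset at which the value may begin (assumed inclusive)


def parse_offset(spec: str) -> Tuple[int, int]:
    """Turn "0" or "0:8192" into a (start, end) pair."""
    if ":" in spec:
        start, end = spec.split(":", 1)
        return int(start), int(end)
    return int(spec), int(spec)


def rule_matches(rule: MagicRule, data: bytes) -> bool:
    """True if rule.value begins at some offset in [rule.start, rule.end]."""
    # bytes.find(sub, lo, hi) only reports hits lying entirely inside data[lo:hi],
    # so hi is extended by the value length to allow a hit beginning at rule.end.
    return data.find(rule.value, rule.start, rule.end + len(rule.value)) != -1


def detect(rules: List[MagicRule], data: bytes) -> Optional[str]:
    """Return the type of the highest-priority matching rule, or None."""
    matching = [r for r in rules if rule_matches(r, data)]
    if not matching:
        return None
    return max(matching, key=lambda r: r.priority).mime_type


rules = [
    MagicRule("message/rfc822", 50, b"Message-ID:", *parse_offset("0:8192")),
    # Made-up competing rule with a higher priority, anchored at offset 0.
    MagicRule("application/x-hypothetical", 60, b"HYPO", *parse_offset("0")),
]

sample = b"Received: from example\r\nMessage-ID: <1@example>\r\n"
print(detect(rules, sample))  # message/rfc822
```

In this model a file that starts with `HYPO` and also contains `Message-ID:` would come back as the made-up higher-priority type, which mirrors the "higher priority wins" behaviour described above. The extension-based specialisation is left out.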
> If so, I'm not sure it is working properly as I have some eml files with
> this string near the beginning (but not at byte offset 0) where it does
> not match. Is there some other logic involved which I am missing?
If you can share a small file that shows it, we can take a look for you.
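
Before sharing a file, it may help to check where the header actually sits in it. The sketch below is a hypothetical diagnostic, not part of Tika: it assumes that a `type="string"` match compares bytes exactly, including case, and reports both the exact and the case-insensitive position of the header within the window. The function name `locate_header` and the sample bytes are made up for illustration.

```python
from typing import Optional, Tuple


def locate_header(
    data: bytes,
    needle: bytes = b"Message-ID:",
    window: int = 8192,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (exact_offset, caseless_offset) for needle within the window.

    Each element is None when the needle does not begin at an offset
    in [0, window].
    """
    # Keep enough bytes for a hit that begins exactly at offset `window`.
    head = data[: window + len(needle)]
    exact = head.find(needle)
    caseless = head.lower().find(needle.lower())
    return (
        exact if exact != -1 else None,
        caseless if caseless != -1 else None,
    )


# In-memory sample; for a real file, read it with open(path, "rb").read().
sample = b"Received: from example\r\nMessage-Id: <1@example>\r\n"
print(locate_header(sample))  # (None, 24)
```

If the first element is `None` while the second is not, the header is present only in a different case (here `Message-Id:`), which under the stated assumption would not satisfy an exact string match.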
Nick