Posted to dev@tika.apache.org by Nick Burch <ni...@apache.org> on 2020/06/09 11:04:46 UTC

Mime type magic and repeated similar blocks - thoughts?

Hi All

At the moment, to detect RFC822 emails, we first check for a bunch of
common header lines right at the start. If none of those match, we check
for a few "could be an unusual header, could be some text" prefixes, and
then check for common headers in a larger area of text below.

For example: the content starts with "Received:", or it starts with "X-"
and has "\nReceived:" somewhere near that. In mime-magic it's
https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L6100
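
To make the two-stage check described above concrete, here is a minimal Java sketch. It is not Tika's detector code: the class name, the header lists and the 1024-byte window are assumptions for illustration only, and the real rules live in tika-mimetypes.xml.

```java
import java.nio.charset.StandardCharsets;

class TwoStageHeaderCheck {
    // Hypothetical lists; the real magic file checks more headers than these.
    static final String[] COMMON_HEADERS = {"Received:", "From:", "Message-ID:"};
    static final String[] UNSURE_PREFIXES = {"X-", "DKIM-", "ARC-"};
    // Assumed size of the "larger area of text" that gets searched.
    static final int WINDOW = 1024;

    static boolean startsWithAny(String text, String[] candidates) {
        for (String c : candidates) {
            if (text.startsWith(c)) return true;
        }
        return false;
    }

    static boolean containsHeaderLine(String text) {
        // A real header has to begin a line, hence the leading "\n".
        for (String h : COMMON_HEADERS) {
            if (text.contains("\n" + h)) return true;
        }
        return false;
    }

    static boolean looksLikeRfc822(byte[] data) {
        int len = Math.min(data.length, WINDOW);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        // Stage 1: a common header right at the start is enough.
        if (startsWithAny(head, COMMON_HEADERS)) return true;
        // Stage 2: an "unsure" prefix only counts if a real header follows nearby.
        return startsWithAny(head, UNSURE_PREFIXES) && containsHeaderLine(head);
    }
}
```

Note that stage 2 shares a single list of real headers across all the unsure prefixes. That shared list is exactly the part that currently has to be repeated by hand in the xml.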

After a recent bug, we now have 3 different "could be a header not sure" 
blocks at the start (X-, DKIM- or ARC-), all with exactly the same block 
of possible real headers below. These need to be kept in sync across the
3 initial matches, and if they drift apart that could cause bugs

Ideally, I'd like to group those three together to avoid that + simplify + 
make it easier to understand


One option might be to make the first bit a regexp, so we can do eg
^((X-)|(DKIM-)|(ARC-)) to match all of them. Not sure if that's clearer,
nor what the performance would be like? Could maybe even then add the other
headers to check in after, if that doesn't make it too hard to understand?
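
As a rough illustration of the regexp idea, the sketch below uses java.util.regex directly rather than Tika's magic matcher. The class name, the method name and the short list of "real" headers are assumptions made up for the example. The prefix pattern is the one above, written with a non-capturing group.

```java
import java.util.regex.Pattern;

class RegexPrefixSketch {
    // Same alternatives as ^((X-)|(DKIM-)|(ARC-)), without capturing groups.
    static final Pattern UNSURE_PREFIX = Pattern.compile("^(?:X-|DKIM-|ARC-)");
    // Hypothetical shortlist of real headers, each required to begin a line.
    static final Pattern REAL_HEADER =
            Pattern.compile("\\n(?:Received|From|Message-ID):");

    static boolean maybeEmail(String head) {
        // lookingAt() only tries a match at the start of the input,
        // so the prefix test never scans further into the text.
        return UNSURE_PREFIX.matcher(head).lookingAt()
                && REAL_HEADER.matcher(head).find();
    }
}
```

With this shape the list of real headers exists only once, whichever of the three prefixes matched.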

Alternately, we could maybe tweak the xml to support an or construct, so 
you could give multiple ones to match at one level with multiple "normal 
or's" below?
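
One way to picture the "or construct" is as a small data model: a group that holds several alternative first matches plus one shared list of child matches. The Java sketch below is purely hypothetical. None of these class names exist in Tika, and it says nothing about how the xml syntax itself would look.

```java
import java.util.Arrays;
import java.util.List;

class OrGroupSketch {
    /** One literal test, either anchored at the start or anywhere in the text. */
    static final class Match {
        final String value;
        final boolean atStart;

        Match(String value, boolean atStart) {
            this.value = value;
            this.atStart = atStart;
        }

        boolean eval(String head) {
            return atStart ? head.startsWith(value) : head.contains(value);
        }
    }

    /** Hypothetical or-group: any alternative AND any of the shared children. */
    static final class OrGroup {
        final List<Match> alternatives;
        final List<Match> children;

        OrGroup(List<Match> alternatives, List<Match> children) {
            this.alternatives = alternatives;
            this.children = children;
        }

        boolean eval(String head) {
            return anyMatch(alternatives, head) && anyMatch(children, head);
        }
    }

    static boolean anyMatch(List<Match> matches, String head) {
        for (Match m : matches) {
            if (m.eval(head)) return true;
        }
        return false;
    }

    /** The three unsure prefixes sharing one (illustrative) list of real headers. */
    static OrGroup emailGroup() {
        return new OrGroup(
                Arrays.asList(new Match("X-", true),
                        new Match("DKIM-", true),
                        new Match("ARC-", true)),
                Arrays.asList(new Match("\nReceived:", false),
                        new Match("\nFrom:", false),
                        new Match("\nMessage-ID:", false)));
    }
}
```

The children are written once and apply to every alternative, so there would be nothing to keep in sync.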

Or something else?

Any thoughts anyone?

Thanks
Nick

Re: Mime type magic and repeated similar blocks - thoughts?

Posted by Tim Allison <ta...@apache.org>.
I like the regex option, and I _think_ that the anchor at the beginning
(along with the lack of backtracking) shouldn't cause horrible performance
degradation.
