Posted to user@tika.apache.org by Jim Idle <ji...@proofpoint.com> on 2018/03/01 05:31:52 UTC
Malware RTF is not detected as RTF
I can open a ticket for this but wanted to just run it by you first.
As explained here: http://www.decalage.info/rtf_tricks (no need to read if you don’t care 😉)
Malicious RTF files take advantage of the fact that Microsoft do not follow their own RTF spec. Specifically, Word et al only looks for the opening sequence:
{\rt
Though the spec says it should be:
{\rtf1
Where 1 is the version number.
Tika fails to identify a malware file starting:
{\rtf1{\pict\jpegblip\picw24\pich24\bin49922
As an RTF file – it says that it is application/octet-stream
Could the Tika detector be modified to just look for {\rt, as the Office tools do?
Cheers,
Jim
RE: Malware RTF is not detected as RTF
Posted by Jim Idle <ji...@proofpoint.com>.
Will do - of course the implementation is down to you guys to do what you think is most sensible without breaking others.
The current detector just looks for {\rtf
If it just made the f optional or did not look for it, then I am pretty certain that it would break nothing, but I would be happy with an artificial mime-type too.
I have worked around it for now, so I can wait for the next release cycle.
I will add an rtf that does not contain malware, for sure. In fact all you need do is use vi to delete the f1 part of any normal rtf magic and you have your test. I will attach it though 😊
Cheers,
Jim
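
[Editorial sketch] The test-file recipe above (delete the "f1" from a normal RTF header) can also be done programmatically. This is a minimal sketch only, assuming the input begins with the standard {\rtf1 header; the class and method names are hypothetical and are not part of Tika.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not a Tika class): builds a harmless "broken" RTF
// fixture by removing the "f1" from a normal {\rtf1 header.
final class BrokenRtfFixture {
    private static final byte[] STRICT_MAGIC =
            "{\\rtf1".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] BROKEN_MAGIC =
            "{\\rt".getBytes(StandardCharsets.US_ASCII);

    // Returns a copy of rtf with the leading {\rtf1 replaced by {\rt.
    static byte[] stripVersionFromMagic(byte[] rtf) {
        if (rtf.length < STRICT_MAGIC.length
                || !Arrays.equals(Arrays.copyOf(rtf, STRICT_MAGIC.length), STRICT_MAGIC)) {
            throw new IllegalArgumentException("input does not start with {\\rtf1");
        }
        int restLength = rtf.length - STRICT_MAGIC.length;
        byte[] out = new byte[BROKEN_MAGIC.length + restLength];
        System.arraycopy(BROKEN_MAGIC, 0, out, 0, BROKEN_MAGIC.length);
        System.arraycopy(rtf, STRICT_MAGIC.length, out, BROKEN_MAGIC.length, restLength);
        return out;
    }

    public static void main(String[] args) {
        byte[] normal = "{\\rtf1\\ansi Hello, world.}".getBytes(StandardCharsets.US_ASCII);
        byte[] broken = stripVersionFromMagic(normal);
        // Prints: {\rt\ansi Hello, world.}
        System.out.println(new String(broken, StandardCharsets.US_ASCII));
    }
}
```

The result contains no payload of any kind, so it should be safe to attach to a ticket as a non-malicious sample.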
> -----Original Message-----
> From: Nick Burch [mailto:apache@gagravarr.org]
> Sent: Thursday, March 1, 2018 21:14
> To: user@tika.apache.org
> Subject: Re: Malware RTF is not detected as RTF
>
> On Thu, 1 Mar 2018, Jim Idle wrote:
> > Malicious RTF files take advantage of the fact that Microsoft do not
> > follow their own RTF spec. Specifically, Word et al only looks for the
> > opening sequence:
> >
> > {\rt
> >
> > Though the spec says it should be:
> >
> > {\rtf1
>
> I don't think that Tika can assume that all RTF users are as broken as Word is!
>
> I'd be tempted to define a new mimetype of application/x-broken-rtf or
> similar, and feed that a lower priority magic for {\rt, with a suitable
> comment/explanation. That way, we won't tell people something is an RTF
> which isn't, but we can help them spot these problematic files
>
> If you could create a small, broken but non-malicious rtf file, then raise an
> enhancement jira + attach, that'd be great!
>
> Nick
Re: Malware RTF is not detected as RTF
Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 1 Mar 2018, Jim Idle wrote:
> Malicious RTF files take advantage of the fact that Microsoft do not
> follow their own RTF spec. Specifically, Word et al only looks for the
> opening sequence:
>
> {\rt
>
> Though the spec says it should be:
>
> {\rtf1
I don't think that Tika can assume that all RTF users are as broken as
Word is!
I'd be tempted to define a new mimetype of application/x-broken-rtf or
similar, and feed that a lower priority magic for {\rt, with a suitable
comment/explanation. That way, we won't tell people something is an RTF
which isn't, but we can help them spot these problematic files
If you could create a small, broken but non-malicious rtf file, then raise
an enhancement jira + attach, that'd be great!
Nick
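
[Editorial sketch] The two-tier idea above can be illustrated in plain Java: try the strict {\rtf magic first, and only then fall back to the looser {\rt magic under a separate type. This is a sketch of the matching order only, not Tika's detector code; the class and method names are hypothetical, and application/x-broken-rtf is simply the name floated in this thread, not a registered type.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of prioritised magic matching (not Tika code).
final class RtfMagicSketch {
    static final String RTF = "application/rtf";
    // Name suggested in this thread; an assumption, not a registered type.
    static final String BROKEN_RTF = "application/x-broken-rtf";
    static final String FALLBACK = "application/octet-stream";

    // True if data begins with the ASCII bytes of prefix.
    private static boolean startsWith(byte[] data, String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.US_ASCII);
        if (data.length < p.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (data[i] != p[i]) {
                return false;
            }
        }
        return true;
    }

    // The strict magic is checked first, so a well-formed file is never
    // reported as broken; the looser magic only catches what is left over.
    static String classify(byte[] head) {
        if (startsWith(head, "{\\rtf")) {
            return RTF;
        }
        if (startsWith(head, "{\\rt")) {
            return BROKEN_RTF;
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(classify("{\\rtf1\\ansi".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(classify("{\\rt\\ansi".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(classify("plain text".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Keeping the looser match under its own type means callers that only trust real RTF see no change, while callers scanning for suspicious files can look for the second type explicitly.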
RE: Malware RTF is not detected as RTF
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Yes. Please do open a ticket, and yes, I have a need to read anything from decalage… he does some amazing work. 😊
I trust you wouldn’t, but please don’t post an actual malware file for us to use in our unit tests. 😉
From: Jim Idle [mailto:jidle@proofpoint.com]
Sent: Thursday, March 1, 2018 12:32 AM
To: user@tika.apache.org
Subject: Malware RTF is not detected as RTF
I can open a ticket for this but wanted to just run it by you first.
As explained here: http://www.decalage.info/rtf_tricks (no need to read if you don’t care 😉)
Malicious RTF files take advantage of the fact that Microsoft do not follow their own RTF spec. Specifically, Word et al only looks for the opening sequence:
{\rt
Though the spec says it should be:
{\rtf1
Where 1 is the version number.
Tika fails to identify a malware file starting:
{\rtf1{\pict\jpegblip\picw24\pich24\bin49922
As an RTF file – it says that it is application/octet-stream
Could the Tika detector be modified to just look for {\rt, as the Office tools do?
Cheers,
Jim