Posted to users@spamassassin.apache.org by Ricky Boone <ri...@gmail.com> on 2022/03/02 22:58:50 UTC

PDFinfo not returning expected producer, creator values

If this is the wrong forum to report this, let me know.

I'm trying to create a couple rules to identify questionable PDFs
(phishing, etc.).  While evaluating the debug output from spamassassin for
the pdfinfo plugin, I noticed that some of the test file attributes aren't
being populated correctly, when comparing against exiftool, Adobe Reader,
Firefox, etc.  The producer and creator fields, specifically, appear to be
left as unknown.

Compared against other emails and PDFs, I get similar results, so I suspect
it's an issue with the plugin or how it is parsing the PDF.  I do have this
example available, however it is malicious (it links to a phishing site),
so I wouldn't want to link to it directly in this thread.

For example:

$ less Invoice0098539.pdf
%PDF-1.4
1 0 obj
<<
/Title (<FE><FF>)
/Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
/Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)
/CreationDate (D:20220302192255Z)
>>
...

$  exiftool Invoice0098539.pdf
ExifTool Version Number         : 12.30
File Name                       : Invoice0098539.pdf
Directory                       : .
File Size                       : 21 KiB
File Modification Date/Time     : 2022:03:02 16:34:04-05:00
File Access Date/Time           : 2022:03:02 16:37:43-05:00
File Inode Change Date/Time     : 2022:03:02 16:34:04-05:00
File Permissions                : -rw-r--r--
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.4
Linearized                      : No
Title                           :
Creator                         : wkhtmltopdf 0.12.5
Producer                        : Qt 4.8.7
Create Date                     : 2022:03:02 19:22:55Z
Page Count                      : 1

$ sa-debug
...
config: fixed relative path: /var/lib/spamassassin/3.004004/updates_spamassassin_org/20_pdfinfo.cf
config: using "/var/lib/spamassassin/3.004004/updates_spamassassin_org/20_pdfinfo.cf" for included file
config: read file /var/lib/spamassassin/3.004004/updates_spamassassin_org/20_pdfinfo.cf
pdfinfo: Identified 1 possible mime parts that need checked for PDF content
pdfinfo: found part, type=1 file=Invoice0098539.pdf cte=base64
pdfinfo: set_tag called for PDFVERSION 1.4
pdfinfo: set_tag called for PDFNAME Invoice0098539.pdf
...
pdfinfo: Filename=Invoice0098539.pdf Total HxW: 560 x 824 (55232 area)
pdfinfo: Filename=Invoice0098539.pdf Title=untitled Author=unknown Producer=unknown Created=20220302192255 Modified=0
pdfinfo: MD5 results for Invoice0098539.pdf - md5=3F6F5C7CB71BDB101BADEF3CFFA9FE63 fuzzy1=32531F1D9420EE5721866DF28A3C6A17 fuzzy2=549DC099D6DFEF65AEA67FA0DF151C14
pdfinfo: set_tag called for PDFPRODUCER unknown
pdfinfo: set_tag called for PDFTITLE untitled
pdfinfo: set_tag called for PDFCREATOR unknown
pdfinfo: set_tag called for PDFAUTHOR unknown
pdfinfo: set_tag called for PDFMD5 32531F1D9420EE5721866DF28A3C6A17
pdfinfo: set_tag called for PDFMD5FUZZY1 32531F1D9420EE5721866DF28A3C6A17
pdfinfo: set_tag called for PDFMD5FUZZY2 549DC099D6DFEF65AEA67FA0DF151C14
pdfinfo: set_tag called for PDFCOUNT 1
pdfinfo: set_tag called for PDFIMGCOUNT 8
pdfinfo: image ratio=0.00103201042873696, min=0.000 max=0.005
pdfinfo: is_empty_body = 23 bytes
pdfinfo: pdf_name_regex hit on Invoice0098539.pdf

Re: PDFinfo not returning expected producer, creator values

Posted by "Kevin A. McGrail" <km...@apache.org>.
I also want to echo Bill's comment: this was a very detailed bug report.

On Fri, Mar 4, 2022, 18:05 Ricky Boone <ri...@gmail.com> wrote:

> Sorry for the late reply, crazy week.
>
> Honestly, I wasn't expecting such a quick and relevant response, so thanks
> and kudos for that.  :)
>
> I'm not currently using trunk, so I will try to patch in the changes
> described during a quiet period over the weekend.  It does look like that
> should do the trick, though.

Re: PDFinfo not returning expected producer, creator values

Posted by Ricky Boone <ri...@gmail.com>.
Sorry for the late reply, crazy week.

Honestly, I wasn't expecting such a quick and relevant response, so thanks
and kudos for that.  :)

I'm not currently using trunk, so I will try to patch in the changes
described during a quiet period over the weekend.  It does look like that
should do the trick, though.

On Thu, Mar 3, 2022 at 1:48 AM Bill Cole <
sausers-20150205@billmail.scconsult.com> wrote:

> On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
> Ricky Boone <ri...@gmail.com>
> is rumored to have said:
>
> > If this is the wrong forum to report this, let me know.
>
> This is fine. I've also documented the fix in our Bugzilla at
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960
>
> If you're running the 'trunk' version out of svn, the fix is in there.
> We do not even have a target date for the next release, but we generally
> do not break 'trunk' if you're feeling adventurous.
>
> If you're a different sort of adventurous, willing to hack on your local
> copy of the code, the fix is to remove these lines (~223-224) which skip
> lines based on an antique assumption:
>
> -      # lines containing high bytes will have no data we need, so save some cycles
> -      next if ($line =~ /[\x80-\xff]/);
>
> Thank you very much for the detailed analysis. I had seen this problem
> on some PDFs but have not had the time to dive into the issue. You
> vastly reduced the pain of fixing it.
>
>
> > I'm trying to create a couple rules to identify questionable PDFs
> > (phishing, etc.).  While evaluating the debug output from spamassassin for
> > the pdfinfo plugin, I noticed that some of the test file attributes aren't
> > being populated correctly, when comparing against exiftool, Adobe Reader,
> > Firefox, etc.  The producer and creator fields, specifically, appear to be
> > left as unknown.
> >
> > Compared against other emails and PDFs, I get similar results, so I suspect
> > it's an issue with the plugin or how it is parsing the PDF.  I do have this
> > example available, however it is malicious (it links to a phishing site),
> > so I wouldn't want to link to it directly in this thread.
> >
> > For example:
> >
> > $ less Invoice0098539.pdf
> > %PDF-1.4
> > 1 0 obj
> > <<
> > /Title (<FE><FF>)
> > /Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
> > /Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)
>
> There's the cause. Apparently the use of UTF-16BE encoding with a
> leading BOM for metadata was not so common when that plugin was written.
> It saw the BOM and assumed the line was binary data.
>
>
> --
> Bill Cole
> bill@scconsult.com or billcole@apache.org
> (AKA @grumpybozo and many *@billmail.scconsult.com addresses)
> Not Currently Available For Hire
>

Re: PDFinfo not returning expected producer, creator values

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 2022-03-02 at 17:58:50 UTC-0500 (Wed, 2 Mar 2022 17:58:50 -0500)
Ricky Boone <ri...@gmail.com>
is rumored to have said:

> If this is the wrong forum to report this, let me know.

This is fine. I've also documented the fix in our Bugzilla at 
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7960

If you're running the 'trunk' version out of svn, the fix is in there. 
We do not even have a target date for the next release, but we generally 
do not break 'trunk' if you're feeling adventurous.

If you're a different sort of adventurous, willing to hack on your local 
copy of the code, the fix is to remove these lines (~223-224) which skip 
lines based on an antique assumption:

-      # lines containing high bytes will have no data we need, so save some cycles
-      next if ($line =~ /[\x80-\xff]/);
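
To illustrate why that check hides the fields, here is a rough sketch in Python (not the plugin's Perl, and not its actual parsing code). The two sample lines are modeled on the info dictionary from the original post; the FIELD pattern and the scan() helper are hypothetical names made up for this example. Only the high-byte test mirrors the removed line.

```python
import re

# Two info-dictionary lines modeled on the sample in the original post.
# \xfe\xff is the UTF-16BE byte order mark that prefixes the Producer value.
lines = [
    b"/Producer (\xfe\xff\x00Q\x00t\x00 \x004\x00.\x008\x00.\x007)",
    b"/CreationDate (D:20220302192255Z)",
]

# Same byte range as the removed Perl check: /[\x80-\xff]/
HIGH_BYTE = re.compile(rb"[\x80-\xff]")

# Hypothetical field matcher for this sketch, not the plugin's own pattern.
FIELD = re.compile(rb"/(Producer|CreationDate)\s*\((.*)\)", re.DOTALL)


def scan(lines, skip_high_bytes):
    """Collect field values, optionally skipping lines with high bytes."""
    found = {}
    for line in lines:
        if skip_high_bytes and HIGH_BYTE.search(line):
            continue  # the old behaviour: treat the line as binary noise
        m = FIELD.search(line)
        if m:
            found[m.group(1).decode("ascii")] = m.group(2)
    return found


print(sorted(scan(lines, skip_high_bytes=True)))   # ['CreationDate']
print(sorted(scan(lines, skip_high_bytes=False)))  # ['CreationDate', 'Producer']
```

With the skip enabled, the Producer line is never examined, which is consistent with Created being populated and Producer showing as unknown in the debug output.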

Thank you very much for the detailed analysis. I had seen this problem 
on some PDFs but have not had the time to dive into the issue. You 
vastly reduced the pain of fixing it.


> I'm trying to create a couple rules to identify questionable PDFs
> (phishing, etc.).  While evaluating the debug output from spamassassin for
> the pdfinfo plugin, I noticed that some of the test file attributes aren't
> being populated correctly, when comparing against exiftool, Adobe Reader,
> Firefox, etc.  The producer and creator fields, specifically, appear to be
> left as unknown.
>
> Compared against other emails and PDFs, I get similar results, so I suspect
> it's an issue with the plugin or how it is parsing the PDF.  I do have this
> example available, however it is malicious (it links to a phishing site),
> so I wouldn't want to link to it directly in this thread.
>
> For example:
>
> $ less Invoice0098539.pdf
> %PDF-1.4
> 1 0 obj
> <<
> /Title (<FE><FF>)
> /Creator (<FE><FF>^@w^@k^@h^@t^@m^@l^@t^@o^@p^@d^@f^@ ^@0^@.^@1^@2^@.^@5)
> /Producer (<FE><FF>^@Q^@t^@ ^@4^@.^@8^@.^@7)

There's the cause. Apparently the use of UTF-16BE encoding with a 
leading BOM for metadata was not so common when that plugin was written. 
It saw the BOM and assumed the line was binary data.
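
For anyone who wants to see the encoding issue in isolation, the following is a minimal Python sketch (again, not the plugin's Perl). It rebuilds the Producer bytes from the sample, shows that both BOM bytes fall in the 0x80-0xFF range, and decodes the value. The decode_pdf_text() helper is a hypothetical name for this example, and the latin-1 fallback is a simplification; whether and how the plugin itself decodes such values is not covered here.

```python
import codecs

# Raw bytes of the /Producer value as shown in the original post:
# a UTF-16BE byte order mark (0xFE 0xFF) followed by "Qt 4.8.7".
raw_producer = codecs.BOM_UTF16_BE + "Qt 4.8.7".encode("utf-16-be")


def decode_pdf_text(value: bytes) -> str:
    """Decode a metadata value, honoring a leading UTF-16BE BOM."""
    if value.startswith(codecs.BOM_UTF16_BE):
        return value[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # Simplified fallback for this sketch only.
    return value.decode("latin-1")


# Both BOM bytes are "high" bytes, so a high-byte test flags the whole line.
print(all(b >= 0x80 for b in codecs.BOM_UTF16_BE))  # True
print(decode_pdf_text(raw_producer))                # Qt 4.8.7
```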


-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire