Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/08/27 12:22:04 UTC

Re: FuzzyOcr 2.3b released, fixes bugs and improves stability

"John D. Hardin" writes:
>On Sat, 26 Aug 2006, Loren Wilton wrote:
>> > That's what I was thinking, and would allow leverage by a lot of
>> > plugins (e.g. the Word plugin I am prepping to start)...
>> >
>> > Create some PerMsgStatus string variable or some such that the body
>> > rules would be run over...
>> 
>> Actually the easy way would probably be to create a new X-Spam
>> header item that rules could run on.
>
>...an X-Spam-mumble header containing the text extracted from an
>attached Word document? That somehow strikes me as a bad idea...

Actually, I think it's quite a good one ;)  Headers give plugins a
good way to offer name=value metadata pairs for rules to match on.
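
A rough sketch of that idea in Python, for illustration only. This is not
SpamAssassin's actual API: MessageStatus, set_metadata, header_rule and the
X-Spam-OCR-Text header name are all hypothetical. A plugin publishes a
name=value pair, and a header-style rule matches on that name alone.

```python
import re

# Hypothetical stand-in for per-message state; NOT SpamAssassin's real API.
class MessageStatus:
    def __init__(self, headers):
        self.headers = dict(headers)   # headers actually present in the mail
        self.metadata = {}             # name=value pairs supplied by plugins

    def set_metadata(self, name, value):
        # In this sketch, metadata is visible to rules only.
        self.metadata[name] = value

    def get(self, name):
        # Plugin metadata is looked up first, then the real headers.
        return self.metadata.get(name, self.headers.get(name, ""))

def header_rule(name, pattern):
    """Build a rule that matches a regex against one named header."""
    regex = re.compile(pattern, re.IGNORECASE)
    return lambda status: bool(regex.search(status.get(name)))

status = MessageStatus({"Subject": "Invoice #437892"})
# Hypothetical header name; an OCR plugin would fill this in.
status.set_metadata("X-Spam-OCR-Text", "buy cheap pills now")

ocr_pills = header_rule("X-Spam-OCR-Text", r"\bcheap\s+pills\b")
subject_pills = header_rule("Subject", r"\bcheap\s+pills\b")
```

Here ocr_pills matches and subject_pills does not, because each rule is tied
to the one named header the text came from.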

The idea of sticking text from OCR'd images into the body is interesting
-- however, I'm not sure it'd be useful in this case. What makes the
rules accurate is not that the text appears *anywhere* in the mail;
it's that the text appears in an OCR'd image.

>> I think it would be easy enough for the plugin to stick text into
>> the body array if it wanted to, and if it ran early enough for
>> that to be useful.  Whether or not the OCR text would be useful for
>> body rules is an entirely different question.
>
>The text within an attached image or document will have verbiage
>similar to the text within a classical spam - the goal, after all, is
>to sell something to the victim.
>
>I can see it now: spammers reduced to sending obfuscated text rendered
>as an animated GIF embedded in a Word document in a Zip file attached
>to an email whose subject is "Invoice #437892" with no body text... :)

and people would still read it ;)

--j.

Re: FuzzyOcr 2.3b released, fixes bugs and improves stability

Posted by "John D. Hardin" <jh...@impsec.org>.
On Sun, 27 Aug 2006, Justin Mason wrote:

> "John D. Hardin" writes:
> >On Sat, 26 Aug 2006, Loren Wilton wrote:
> >> > That's what I was thinking, and would allow leverage by a lot of
> >> > plugins (e.g. the Word plugin I am prepping to start)...
> >> >
> >> > Create some PerMsgStatus string variable or some such that the body
> >> > rules would be run over...
> >> 
> >> Actually the easy way would probably be to create a new X-Spam
> >> header item that rules could run on.
> >
> >...an X-Spam-mumble header containing the text extracted from an
> >attached Word document? That somehow strikes me as a bad idea...
> 
> Actually, I think it's quite a good one ;)  headers provide a
> good way for plugins to offer name=value metadata pairs for rules
> to match on.

Well, yes, so long as the header does not get inserted into the
rewritten message.

However, there is a much richer set of body text rules than header
rules. I think they should be leveraged against the image text (and
attached document text) as well. After all, they are just variant
delivery methods for the same message: BUY MY SHIT^WSTUFF!

> The idea of sticking text from OCR'd images into the body is
> interesting -- however, I'm not sure it'd be useful in this case.
> What makes the rules accurate is not that the text appears
> *anywhere* in the mail; it's that the text appears in an OCR'd
> image.

Okay, how about this: a "variant-encapsulation" object in $PMS where
the text from images/documents is stuffed. The body rules are run
over it, and a multiplier or threshold or some such controls how the
score from the body rules against that block of text is applied to
the message as a whole.
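
A minimal Python sketch of how that might work, under loud assumptions:
the rule list, run_body_rules, score_message, and the multiplier and
threshold parameters are all hypothetical and are not part of
SpamAssassin. It applies both a threshold and a multiplier, which is only
one reading of "multiplier or threshold or some such".

```python
import re

# Hypothetical body rules as (name, pattern, score); not real SA rules.
BODY_RULES = [
    ("CHEAP_PILLS", re.compile(r"\bcheap\s+pills\b", re.I), 2.0),
    ("BUY_NOW", re.compile(r"\bbuy\s+now\b", re.I), 1.0),
]

def run_body_rules(text):
    """Sum the scores of every body rule that hits the given text."""
    return sum(score for _, rx, score in BODY_RULES if rx.search(text))

def score_message(body_text, encapsulated_texts, multiplier=1.5, threshold=1.0):
    """Score the plain body, then each block of extracted text.

    A block's score counts only if it reaches `threshold`, and is then
    scaled by `multiplier` before being added to the message total.
    """
    total = run_body_rules(body_text)
    for text in encapsulated_texts:
        variant_score = run_body_rules(text)
        if variant_score >= threshold:
            total += variant_score * multiplier
    return total
```

With these made-up numbers, an empty body plus an image whose extracted
text hits both rules scores more than the same text would in the body,
which keeps the "it appeared in an image" signal while reusing the body
rules.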

What bothers me is the separate list of simplified matching rules that
FuzzyOCR is using. I think that it would be better in the long run to
leverage the rich set of existing body rules rather than having a
separate set of simple rules.

--
 John Hardin KA7OHZ    ICQ#15735746    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174    pgpk -a jhardin@impsec.org
 key: 0xB8732E79 - 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  People seem to have this obsession with objects and tools as being
  dangerous in and of themselves, as though a weapon will act of its
  own accord to cause harm. A weapon is just a force multiplier. It's
  *humans* that are (or are not) dangerous.
-----------------------------------------------------------------------
 23 days until Talk Like a Pirate day