Posted to users@spamassassin.apache.org by John Thompson <jo...@gmail.com> on 2007/08/31 20:32:58 UTC

FuzzyOcr misses

I've gotten a number of image spams that don't trigger FuzzyOcr at all
for some reason, e.g. http://www.os2.dhs.org/~john/DPO.gif

If I run the email through spamassassin manually, e.g.
"spamassassin -D FuzzyOcr < DPO.eml", there's no indication that
FuzzyOcr found anything at all:

[66152] dbg: FuzzyOcr: Score{autodisable} = 1000
[66152] dbg: FuzzyOcr: Using gifsicle => /usr/local/bin/gifsicle
[66152] dbg: FuzzyOcr: Using giffix => /usr/local/bin/giffix
[66152] dbg: FuzzyOcr: Using giftext => /usr/local/bin/giftext
[66152] dbg: FuzzyOcr: Using gifinter => /usr/local/bin/gifinter
[66152] dbg: FuzzyOcr: Using giftopnm => /usr/local/bin/giftopnm
[66152] dbg: FuzzyOcr: Using jpegtopnm => /usr/local/bin/jpegtopnm
[66152] dbg: FuzzyOcr: Using pngtopnm => /usr/local/bin/pngtopnm
[66152] dbg: FuzzyOcr: Using bmptopnm => /usr/local/bin/bmptopnm
[66152] dbg: FuzzyOcr: Using tifftopnm => /usr/local/bin/tifftopnm
[66152] dbg: FuzzyOcr: Using ppmhist => /usr/local/bin/ppmhist
[66152] dbg: FuzzyOcr: Using pamfile => /usr/local/bin/pamfile
[66152] dbg: FuzzyOcr: Using gocr => /usr/local/bin/gocr
[66152] dbg: FuzzyOcr: Using ocrad => /usr/local/bin/ocrad
[66152] dbg: FuzzyOcr: Loaded <62> words from "/usr/local/etc/mail/spamassassin/FuzzyOcr.words"
[66152] dbg: FuzzyOcr: Using scan: $gocr -i $pfile
[66152] dbg: FuzzyOcr: Using scan: $gocr -l 180 -d 2 -i $pfile
[66152] info: rules: meta test FM_DDDD_TIMES_2 has dependency 'FH_HOST_EQ_D_D_D_D' with a zero score
[66152] info: rules: meta test FM_SEX_HOSTDDDD has dependency 'FH_HOST_EQ_D_D_D_D' with a zero score
[66152] dbg: FuzzyOcr: Saved: /tmp/.spamassassin661528qp1mltmp/raw.eml
[66152] dbg: FuzzyOcr: Wrote: /tmp/.spamassassin661528qp1mltmp/8oNs11_f1_.gif
[66152] dbg: FuzzyOcr: Found: 1 images
[66152] dbg: FuzzyOcr: Errors to: /tmp/.spamassassin661528qp1mltmp/raw.err
[66152] dbg: FuzzyOcr: Analyzing file with content-type="image/gif"
[66152] dbg: FuzzyOcr: pfile => /tmp/.spamassassin661528qp1mltmp/8oNs11_f1_.gif.pnm
[66152] dbg: FuzzyOcr: efile => /tmp/.spamassassin661528qp1mltmp/8oNs11_f1_.gif.err
[66152] dbg: FuzzyOcr: Found GIF header name="8oNs11_f1_.gif"
[66152] dbg: FuzzyOcr: Image is single non-interlaced...
[66152] dbg: FuzzyOcr: Image hashing disabled in configuration, skipping...
[66152] dbg: FuzzyOcr: Trying: $gocr -i $pfile
[66152] dbg: FuzzyOcr: Trying: $gocr -l 180 -d 2 -i $pfile
[66152] dbg: FuzzyOcr: Remove DIR: /tmp/.spamassassin661528qp1mltmp
[66152] dbg: FuzzyOcr: FuzzyOcr ending successfully...

Using spamassassin-3.2.3, FuzzyOcr-3.4, gocr-0.44, ocrad-0.16 on
FreeBSD-6.2. If I use the FuzzyOcr sample image spams, it seems to work.
What gives?
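
One way to triage debug output like the above is to list which scan commands were tried and whether any word match was logged. The sketch below is a hypothetical helper (the name `summarize` and its return shape are not part of FuzzyOcr); it assumes the `Trying:` and `found word` line formats that appear in this thread's debug output.

```python
import re

# Line formats assumed from the debug output quoted in this thread.
TRYING = re.compile(r"dbg: FuzzyOcr: Trying: (.+)$")
FOUND = re.compile(r'info: FuzzyOcr: Scanset "([^"]+)" found word "([^"]+)"')


def summarize(log_text):
    """Return the scan commands tried and the (scanset, word) hits logged."""
    tried, hits = [], []
    for line in log_text.splitlines():
        m = TRYING.search(line)
        if m:
            tried.append(m.group(1).strip())
            continue
        m = FOUND.search(line)
        if m:
            hits.append((m.group(1), m.group(2)))
    return {"tried": tried, "hits": hits}


# A few lines copied from the debug output above.
SAMPLE = """\
[66152] dbg: FuzzyOcr: Found: 1 images
[66152] dbg: FuzzyOcr: Trying: $gocr -i $pfile
[66152] dbg: FuzzyOcr: Trying: $gocr -l 180 -d 2 -i $pfile
[66152] dbg: FuzzyOcr: FuzzyOcr ending successfully...
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

For the run above it reports two gocr commands tried and no hits; no ocrad scan appears in that log at all.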

-- 
John Thompson (john@os2.dhs.org)
Appleton WI USA

Re: FuzzyOcr misses

Posted by John Thompson <jo...@gmail.com>.
René Berber wrote:

> John Thompson wrote:
> 
>> I've gotten a number of image spams that don't trigger FuzzyOcr at all
>> for some reason, e.g. http://www.os2.dhs.org/~john/DPO.gif
> [snip]
>> Using spamassassin-3.2.3, FuzzyOcr-3.4, gocr-0.44, ocrad-0.16 on
>> FreeBSD-6.2. If I use the FuzzyOcr sample image spams, it seems to work.
>> What gives?
> 
> Old FuzzyOcr, and probably old ocrad.

Thanks, René; I'll see if updating helps.

-- 
John Thompson (john@os2.dhs.org)
Appleton WI USA


Re: FuzzyOcr misses

Posted by René Berber <r....@computer.org>.
John Thompson wrote:

> I've gotten a number of image spams that don't trigger FuzzyOcr at all
> for some reason, e.g. http://www.os2.dhs.org/~john/DPO.gif
[snip]
> Using spamassassin-3.2.3, FuzzyOcr-3.4, gocr-0.44, ocrad-0.16 on
> FreeBSD-6.2. If I use the FuzzyOcr sample image spams, it seems to work.
> What gives?

Old FuzzyOcr, and probably old ocrad.

With FuzzyOcr 3.5.1 (plus patched files up to revision 131) and ocrad 0.17 the
image is caught. With ocrad 0.16 the same test gave no result, so thanks for
making me upgrade:

$ spamassassin -x -D FuzzyOcr -t < /c/tmp/Spam\ example.eml
...
[2684] dbg: FuzzyOcr: Exec : /usr/local/bin/ocrad -s5 -i /tmp/.spamassassin2340eonVM9tmp/DPO.gif.pnm
[2684] dbg: FuzzyOcr: Stdout: >/tmp/.spamassassin2340eonVM9tmp/scanset.ocrad-invert.out
[2340] dbg: FuzzyOcr: Saved pid: 2684
[2684] dbg: FuzzyOcr: Stderr: >/tmp/.spamassassin2340eonVM9tmp/scanset.ocrad-invert.err
[2340] dbg: FuzzyOcr: Elapsed [2684]: 1.544600 sec. (/usr/local/bin/ocrad: exit 0)
[2340] dbg: FuzzyOcr: ocrdata=>>Discount Pharmacy Online
[2340] dbg: FuzzyOcr: Special offers: Save up_o 80°/
[2340] dbg: FuzzyOcr: o
[2340] dbg: FuzzyOcr: V#GRA ONLY $2.00
[2340] dbg: FuzzyOcr: CIALIS ONL.Y $2.00
[2340] dbg: FuzzyOcr: SOMA ONLY $2.44
[2340] dbg: FuzzyOcr: ULTRAM ONLY $2.28
[2340] dbg: FuzzyOcr:
[2340] dbg: FuzzyOcr: .. ... ... ... ... ... .-. ... ... ... ... ... ... ... ... -.. ... ... ... .
[2340] dbg: FuzzyOcr:
[2340] dbg: FuzzyOcr: For mo.rY information, Please do not click
[2340] dbg: FuzzyOcr: Just type: wrm.SiDnpleRXZ.org
[2340] dbg: FuzzyOcr: inthe address barofyou browser,then press the Enterkey
[2340] dbg: FuzzyOcr:
[2340] dbg: FuzzyOcr: <<=end
[2340] info: FuzzyOcr: Scanset "ocrad-invert" found word "addressbar" with fuzz of 0.1000
[2340] info: FuzzyOcr: line: "inthe address barofyou browserthen press the enterkey"
[2340] info: FuzzyOcr: Scanset "ocrad-invert" found word "cialis" with fuzz of 0.0000
[2340] info: FuzzyOcr: line: "cialis only oo"
[2340] info: FuzzyOcr: Scanset "ocrad-invert" found word "click" with fuzz of 0.0000
[2340] info: FuzzyOcr: line: "for mory information please do not click"
[2340] info: FuzzyOcr: Scanset "ocrad-invert" found word "offer" with fuzz of 0.0000
[2340] info: FuzzyOcr: line: "special offers save upo ao"
...
Content analysis details:   (11.5 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-1.4 ALL_TRUSTED            Passed through trusted hosts only via SMTP
 0.6 HTML_IMAGE_RATIO_02    BODY: HTML has a low ratio of text to image area
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.5 HTML_IMAGE_ONLY_04     BODY: HTML: images with 0-400 bytes of words
 1.4 SARE_GIF_ATTACH        FULL: Email has a inline gif
 9.5 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            [Words found:]
                            ["addressbar" in 1 lines]
                            ["cialis" in 1 lines]
                            ["click" in 1 lines]
                            ["offer" in 1 lines]
                            ["browser" in 1 lines]
                            ["soma" in 1 lines]
                            ["type" in 1 lines]
                            ["pharmacy" in 1 lines]
                            [(12 word occurrences found)]
...
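
The fuzz values logged above (0.1000 for "addressbar", 0.0000 for "cialis") are consistent with an edit distance divided by the word length. That reading is an assumption, not taken from the FuzzyOcr source; the sketch below illustrates it with a plain approximate-substring search. The names `best_fuzz` and `is_hit` and the `threshold` value are hypothetical, and the line is assumed to be already lowercased and stripped of punctuation as in the "line:" entries.

```python
def best_fuzz(word: str, line: str) -> float:
    """Smallest edit distance between `word` and any substring of `line`,
    divided by len(word). Assumed scoring, not FuzzyOcr's actual code."""
    if not word:
        raise ValueError("word must not be empty")
    # Row 0 is all zeros so a match may start at any position in the line.
    prev = [0] * (len(line) + 1)
    for i, wc in enumerate(word, start=1):
        cur = [i]  # i word characters matched against an empty substring
        for j, lc in enumerate(line, start=1):
            cur.append(min(
                prev[j] + 1,               # word character missing from line
                cur[j - 1] + 1,            # extra character in line
                prev[j - 1] + (wc != lc),  # match or substitution
            ))
        prev = cur
    # The match may end anywhere, so take the minimum of the last row.
    return min(prev) / len(word)


def is_hit(word: str, line: str, threshold: float = 0.25) -> bool:
    # threshold is a hypothetical cut-off chosen for illustration only
    return best_fuzz(word, line) <= threshold


if __name__ == "__main__":
    line = "inthe address barofyou browserthen press the enterkey"
    print(best_fuzz("addressbar", line))          # 0.1 (one stray space)
    print(best_fuzz("cialis", "cialis only oo"))  # 0.0
```

Under this assumption the stray space in "address bar" costs one edit over ten characters, which matches the 0.1000 in the log.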
-- 
René Berber