Posted to users@spamassassin.apache.org by Rejaine Monteiro <re...@bhz.jamef.com.br> on 2007/04/03 23:15:35 UTC

FUZZY_OCR finds words that do not exist in the image

This image at http://rejaine.multiply.com/photos/photo/5/1?
was tagged by FuzzyOcr:

12 FUZZY_OCR              BODY: Mail contains an image with common spam 
text inside
                           Words found:
                           "news" in 5 lines
                           "money" in 1 lines
                           "million" in 1 lines
                           "trade" in 1 lines
                           "levitra" in 1 lines
                           "product" in 1 lines
                           (10 word occurrences found)

But the image does not contain any of the words above.

Any tips?

Re: FUZZY_OCR finds words that do not exist in the image

Posted by Rejaine Monteiro <re...@bhz.jamef.com.br>.
Hmm... OK, I'll upgrade to the latest version (3.5.1) and run more tests.
Thanks!

René Berber wrote:
>> and upgrading to the new version, but I'm
>> already using the latest FuzzyOcr version (2.3b)
>>     
>
> That's not the latest version. Go to the FuzzyOcr page again and read
> carefully what it says.
>   

Re: FUZZY_OCR finds words that do not exist in the image

Posted by René Berber <r....@computer.org>.
Rejaine Monteiro wrote:

[snip]
> I have had other suggestions from other users, but they did not work here,
> such as adjusting focr_threshold, editing the word list, and adjusting the
> factor for individual words (this does not solve my case, because the words
> in this image are NOT on the word list)

The way FuzzyOcr works is that it takes all the text it can find, strips spaces,
and then does a loose match: any sequence of letters that is similar to a listed
word to within a threshold is taken as a match.

That's why a lower threshold gives more accurate matches, but FuzzyOcr still
cannot tell a real word from one that is merely close or is part of another
word. You have to tune the parameters and help in other ways, such as not
scanning large images.
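
As a rough illustration of the matching described above, here is a minimal
Python sketch. It is not FuzzyOcr's actual code: the function names, the
sliding-window comparison, and the "edit distance divided by word length"
definition of fuzz are all assumptions, chosen only because they reproduce the
0.1429 figure reported for "service" in the debug output further down. The
threshold values 0.25 and 0.10 are likewise just example numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def best_fuzz(word: str, line: str) -> float:
    """Smallest edit distance between `word` (non-empty) and any window of
    the same length in the space-stripped line, divided by len(word)."""
    text = line.lower().replace(" ", "")
    n = len(word)
    windows = [text[i:i + n] for i in range(max(1, len(text) - n + 1))]
    return min(levenshtein(word, w) for w in windows) / n


def matches(word: str, line: str, threshold: float) -> bool:
    """A word counts as found when its fuzz is within the threshold."""
    return best_fuzz(word, line) <= threshold


line = "confira alcuns servicos au voce irn emconrrar"
print(round(best_fuzz("service", line), 4))   # 0.1429 ("servico" is 1 edit away)
print(matches("service", line, 0.25))         # True: loose threshold accepts it
print(matches("service", line, 0.10))         # False: stricter threshold rejects it
```

The Portuguese word "servicos" is one substitution away from "service", so a
loose threshold accepts it even though the English word never appears in the
image. Lowering the threshold removes this kind of hit.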

To figure out why your image, which is in Portuguese, produces matches, you have
to run spamassassin in debug mode and look at the detail from FuzzyOcr. Here's
an example from running `spamassassin -x -t -D FuzzyOcr < test.eml`, where
test.eml was made by just pasting your image into a message:

...
[464] dbg: FuzzyOcr: Not enough OCR Hits without space stripping, doing second matching pass...
[464] dbg: FuzzyOcr: Saved pid: 3600
[3600] dbg: FuzzyOcr: Exec : /usr/local/bin/gocr -l 180 -d 2 -i /tmp/.spamassassin464aJYbljtmp/moz-screenshot.jpg.pnm
[3600] dbg: FuzzyOcr: Stdout: >/tmp/.spamassassin464aJYbljtmp/scanset.gocr-180.out
[3600] dbg: FuzzyOcr: Stderr: >/tmp/.spamassassin464aJYbljtmp/scanset.gocr-180.err
[464] dbg: FuzzyOcr: Elapsed [3600]: 0.703125 sec. (/usr/local/bin/gocr: exit 0)
[464] dbg: FuzzyOcr: ocrdata=>>__ _D __
...
[464] dbg: FuzzyOcr: Fique em dia com o que h_ de
[464] dbg: FuzzyOcr: melhor para a sua beleza !
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr: Em com$moraç6o ao Dia Int$maci_I da mulh8r, a
[464] dbg: FuzzyOcr: koga_ Na_p vai _I'__ o C_cuIto MaIs B$_za. Em
[464] dbg: FuzzyOcr: _rc_j com a Jamet. _o ateados tO con_tgs para
[464] dbg: FuzzyOcr: um evento completo, com serviç_ $ cuidodoI espei_s
[464] dbg: FuzzyOcr: _ra s0us cob8los, pge e unhos. oIġm de v__os promoç_es.
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr: coNFIRA AlcuNs sERvIcos au_ vocE IRn EmcoNrRAR
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr: , Escova - An_Iig capilar
[464] dbg: FuzzyOcr: , Querot_n_aç_o - Hidrataç_o
[464] dbg: FuzzyOcr: . Massagem capilar . Higienizoç_o da _t$
[464] dbg: FuzzyOcr: . 0ecoroç_o de unho . Qu__ mossoge
[464] dbg: FuzzyOcr: . Maquiagem . O_entaç_es d$ b$teza
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr: Intgemdo_ em __'_ do C_c_to, 1om mhDr $m cmtoto com
[464] dbg: FuzzyOcr: GjeI_ - RH IgjeIB_bhJom8I.com_t _ concmgr o0 convIle.
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr: Data: _q de morca - S6bada
[464] dbg: FuzzyOcr:
[464] dbg: FuzzyOcr: HOfO'"O' !1h a' I '7h A_Rnujo
[464] dbg: FuzzyOcr: tocal: Ay. Get_llo Va_ga_, $4O ,,
[464] dbg: FuzzyOcr: <<=end
[464] info: FuzzyOcr: Scanset "gocr-180" found word "service" with fuzz of 0.1429
[464] info: FuzzyOcr: line: "confira alcuns servicos au voce irn emconrrar"
...
Content analysis details:   (5.6 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.0 UNPARSEABLE_RELAY      Informational: message has unparseable relay lines
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.0 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 2.9 HTML_IMAGE_ONLY_04     BODY: HTML: images with 0-400 bytes of words
 2.8 FUZZY_OCR              BODY: Mail contains an image with common spam text
                            inside
                            Words found:
                            "service" in 1 lines
                            "service" in 2 lines
                            (3 word occurrences found)

As you can see, with the parameters I'm using and version 3.5.1, only one word
was detected. The score actually reflects a known bug in the current version: it
counted the same word twice (I don't think it appears on two different lines).
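
If the debug output is long, the "found word" lines can be pulled out with a
short script. The following is only a sketch: it assumes the debug output has
been captured as text (spamassassin typically writes debug messages to standard
error), and the regular expression is based solely on the one log line shown
above, so other FuzzyOcr versions may format it differently. The name
parse_hits is made up for this example.

```python
import re

# Pattern modelled on the single "found word" line in the debug output above.
HIT = re.compile(
    r'FuzzyOcr: Scanset "(?P<scanset>[^"]+)" '
    r'found word "(?P<word>[^"]+)" with fuzz of (?P<fuzz>[0-9.]+)'
)


def parse_hits(log_text: str) -> list[tuple[str, str, float]]:
    """Return (scanset, word, fuzz) for every 'found word' line in the log."""
    hits = []
    for line in log_text.splitlines():
        m = HIT.search(line)
        if m:
            hits.append((m.group("scanset"), m.group("word"),
                         float(m.group("fuzz"))))
    return hits


sample = '''[464] dbg: FuzzyOcr: <<=end
[464] info: FuzzyOcr: Scanset "gocr-180" found word "service" with fuzz of 0.1429
[464] info: FuzzyOcr: line: "confira alcuns servicos au voce irn emconrrar"
'''
print(parse_hits(sample))   # [('gocr-180', 'service', 0.1429)]
```

Listing each hit with its fuzz value shows which words are borderline and
would drop out if the threshold were lowered.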

> and upgrading to the new version, but I'm
> already using the latest FuzzyOcr version (2.3b)

That's not the latest version. Go to the FuzzyOcr page again and read carefully
what it says.
-- 
René Berber


Re: FUZZY_OCR finds words that do not exist in the image

Posted by Mikael Syska <mi...@syska.dk>.
Hey,

Try upgrading: http://fuzzyocr.own-hero.net/wiki/Downloads

As the page itself states, 2.3 is deprecated. I guess most people are
using a 3.x version; I'm running 3.5.1 in two places without problems.

// ouT

Rejaine Monteiro wrote:
>
> Sorry for my poor English.
> Maybe I'm not understanding you, but the last message I saw from you was:
> "Put the image on a website and put the link to this list. Otherwise, 
> we're only guessing."
> So I put the image on Multiply.
>
> I have had other suggestions from other users, but they did not work 
> here, such as adjusting focr_threshold, editing the word list, and 
> adjusting the factor for individual words (this does not solve my case, 
> because the words in this image are NOT on the word list), and upgrading 
> to the new version, but I'm already using the latest FuzzyOcr version (2.3b).
>
>
> Evan Platt wrote:
>> At 01:15 PM 4/3/2007, Rejaine Monteiro wrote:
>>
>>> This image at http://rejaine.multiply.com/photos/photo/5/1?
>>> was tagged by FuzzyOcr:
>>>
>>> 12 FUZZY_OCR              BODY: Mail contains an image with common 
>>> spam text inside
>>>                           Words found:
>>>                           "news" in 5 lines
>>>                           "money" in 1 lines
>>>                           "million" in 1 lines
>>>                           "trade" in 1 lines
>>>                           "levitra" in 1 lines
>>>                           "product" in 1 lines
>>>                           (10 word occurrences found)
>>>
>>> But the image does not contain any of the words above.
>>>
>>> Any tips?
>>
>> See the 4 or so answers to this question the last time you asked on 
>> 03/23.
>
>


Re: FUZZY_OCR finds words that do not exist in the image

Posted by Rejaine Monteiro <re...@bhz.jamef.com.br>.
Sorry for my poor English.
Maybe I'm not understanding you, but the last message I saw from you was:
"Put the image on a website and put the link to this list. Otherwise, 
we're only guessing."
So I put the image on Multiply.

I have had other suggestions from other users, but they did not work here, 
such as adjusting focr_threshold, editing the word list, and adjusting the 
factor for individual words (this does not solve my case, because the words 
in this image are NOT on the word list), and upgrading to the new version, 
but I'm already using the latest FuzzyOcr version (2.3b).


Evan Platt wrote:
> At 01:15 PM 4/3/2007, Rejaine Monteiro wrote:
>
>> This image at http://rejaine.multiply.com/photos/photo/5/1?
>> was tagged by FuzzyOcr:
>>
>> 12 FUZZY_OCR              BODY: Mail contains an image with common 
>> spam text inside
>>                           Words found:
>>                           "news" in 5 lines
>>                           "money" in 1 lines
>>                           "million" in 1 lines
>>                           "trade" in 1 lines
>>                           "levitra" in 1 lines
>>                           "product" in 1 lines
>>                           (10 word occurrences found)
>>
>> But the image does not contain any of the words above.
>>
>> Any tips?
>
> See the 4 or so answers to this question the last time you asked on 
> 03/23.

Re: FUZZY_OCR finds words that do not exist in the image

Posted by Evan Platt <ev...@espphotography.com>.
At 01:15 PM 4/3/2007, Rejaine Monteiro wrote:

>This image at http://rejaine.multiply.com/photos/photo/5/1?
>was tagged by FuzzyOcr:
>
>12 FUZZY_OCR              BODY: Mail contains an image with common 
>spam text inside
>                           Words found:
>                           "news" in 5 lines
>                           "money" in 1 lines
>                           "million" in 1 lines
>                           "trade" in 1 lines
>                           "levitra" in 1 lines
>                           "product" in 1 lines
>                           (10 word occurrences found)
>
>But the image does not contain any of the words above.
>
>Any tips?

See the 4 or so answers to this question the last time you asked on 03/23.