Posted to users@spamassassin.apache.org by Pascal Maes <pa...@uclouvain.be> on 2006/11/11 10:22:37 UTC

Questions about FuzzyOCR

Version 2.3b


1) Here is the output of the scanner (gocr -i):

_____

date                             Informations



9- 11-lO06    1O_30   Le __ek-end du 3-4r'11, les adresses de cou  
r_er jlectron_que des jtud_ants non
ri_nscmts j _UCL ont jtj ddsact_vjes. La ra_son est pÄrement  
adm_n_strat_ve et I_je j
Ia caNe j puce. Pour permeNre j ces jtud_ants de rjcupdrer leurs  
messaqes, nous
avons fa_t en soNe qu'_Is pu_ssent encore accjder j leur boîte aux  
leNres jusqu'au
l4.r l 1 ,/lo 06 .
ANent_on, la consuttat_on se fera av_ un cI_ent de messager_e ! 
Thunderb_rd. Eudora,
Outlook.. .7 ou v_a le _IebMa_I ma_s plus v_a le poNa_I .


We get almost the same result with gocr -l 180 -d 2 -i

And FuzzyOCR says:

   13 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                             Words found:
                             "wexe" in 3 lines
                             "alert" in 2 lines
                             "alert" in 2 lines
                             "investor" in 1 lines
                             "trade" in 3 lines
                             (11 word occurrences found)

But I can't find any of these words in the text above!


2) How do I remove an image that has been stored by mistake in the hash
database?

Thanks
--
Pascal




Re: Questions about FuzzyOCR

Posted by decoder <de...@own-hero.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
 
decoder wrote:
> Pascal Maes wrote:
>> Version 2.3b
>
>
>> 1) Here is the output of the scanner (gocr -i):
>
>> _____
>
>> date                             Informations
>
>
>
>> 9- 11-lO06    1O_30   Le __ek-end du 3-4r'11, les adresses de cou
>>  r_er jlectron_que des jtud_ants non ri_nscmts j _UCL ont jtj
>> ddsact_vjes. La ra_son est pÄrement adm_n_strat_ve et I_je j Ia
>> caNe j puce. Pour permeNre j ces jtud_ants de rjcupdrer leurs
>> messaqes, nous avons fa_t en soNe qu'_Is pu_ssent encore accjder
>> j leur boîte aux leNres jusqu'au l4.r l 1 ,/lo 06 . ANent_on, la
>>  consuttat_on se fera av_ un cI_ent de messager_e !Thunderb_rd.
>> Eudora, Outlook.. .7 ou v_a le _IebMa_I ma_s plus v_a le poNa_I .
>>
>
>
>> We get almost the same result with gocr -l 180 -d 2 -i
>
>> And FuzzyOCR says:
>
>> 13 FUZZY_OCR              BODY: Mail contains an image with
>> common spam text inside Words found: "wexe" in 3 lines "alert" in
>> 2 lines "alert" in 2 lines "investor" in 1 lines "trade" in 3
>> lines (11 word occurrences found)
>
>> But I can't find any of these words in the text above!
>
> You can try lowering your fuzz from 0.3 to 0.2. I have no experience
> yet of how the plugin reacts to text in other languages, so that may
> be what produces the false positives.
>> 2) How do I remove an image that has been stored by mistake in the
>> hash database?
> In version 2.3b, there is unfortunately no tool for this yet. But the
> database is just a text file, so you can simply search for the hash
> there and delete its line. Version 3.4.1 brings a tool that removes
> a given hash from the database, but I am still improving it a bit so
> that one can also pass it an image file to look for.
I must correct myself there: passing it an image is already supported :)

Best regards,

Chris

>
> Best regards,
>
> Chris
>> Thanks -- Pascal
>
>
>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
 
iD8DBQFFVeMqJQIKXnJyDxURAhIbAKCpiYddgBqEBZZt1WnM9e4qjkgFfgCePG/R
mWU8mtJuXQlVIHdO90e6xR0=
=hMuz
-----END PGP SIGNATURE-----


Re: Questions about FuzzyOCR

Posted by decoder <de...@own-hero.net>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
 
Pascal Maes wrote:
>
> Version 2.3b
>
>
> 1) Here is the output of the scanner (gocr -i):
>
> _____
>
> date                             Informations
>
>
>
> 9- 11-lO06    1O_30   Le __ek-end du 3-4r'11, les adresses de cou
> r_er jlectron_que des jtud_ants non ri_nscmts j _UCL ont jtj
> ddsact_vjes. La ra_son est pÄrement adm_n_strat_ve et I_je j Ia
> caNe j puce. Pour permeNre j ces jtud_ants de rjcupdrer leurs
> messaqes, nous avons fa_t en soNe qu'_Is pu_ssent encore accjder j
> leur boîte aux leNres jusqu'au l4.r l 1 ,/lo 06 . ANent_on, la
> consuttat_on se fera av_ un cI_ent de messager_e !Thunderb_rd.
> Eudora, Outlook.. .7 ou v_a le _IebMa_I ma_s plus v_a le poNa_I .
>
>
> We get almost the same result with gocr -l 180 -d 2 -i
>
> And FuzzyOCR says:
>
> 13 FUZZY_OCR              BODY: Mail contains an image with common
> spam text inside Words found: "wexe" in 3 lines "alert" in 2 lines
> "alert" in 2 lines "investor" in 1 lines "trade" in 3 lines (11
> word occurrences found)
>
> But I can't find any of these words in the text above!
>
You can try lowering your fuzz from 0.3 to 0.2. I have no experience
yet of how the plugin reacts to text in other languages, so that may
be what produces the false positives.
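
To illustrate why a looser fuzz value can match spam words in garbled
OCR text, here is a minimal Python sketch of approximate substring
matching. It is not the plugin's own code: the function names are
hypothetical, and the rule that turns the fuzz value into an edit
budget (fuzz times word length, rounded down) is an assumption made
purely for illustration.

```python
def min_edits_in_line(word: str, line: str) -> int:
    """Smallest edit distance between `word` and any substring of `line`.

    Uses the standard approximate-substring dynamic programme: the first
    row is all zeros, so a match may start at any position in the line.
    """
    word, line = word.lower(), line.lower()
    prev = [0] * (len(line) + 1)
    for i, wc in enumerate(word, start=1):
        curr = [i]  # matching i word characters against nothing costs i
        for j, lc in enumerate(line, start=1):
            cost = 0 if wc == lc else 1
            curr.append(min(prev[j] + 1,          # drop a word character
                            curr[j - 1] + 1,      # skip a line character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return min(prev)  # a match may also end at any position


def fuzzy_find(word: str, line: str, fuzz: float) -> bool:
    """ASSUMPTION: the edit budget is int(fuzz * len(word))."""
    return min_edits_in_line(word, line) <= int(fuzz * len(word))


if __name__ == "__main__":
    line = "il a investi dans"                   # made-up French sample
    print(fuzzy_find("investor", line, 0.3))     # True: budget of 2 edits
    print(fuzzy_find("investor", line, 0.2))     # False: budget of 1 edit
```

Under this assumed rule, "investi" is two edits away from "investor",
so it counts as a hit at 0.3 but not at 0.2.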
>
> 2) How do I remove an image that has been stored by mistake in the
> hash database?
In version 2.3b, there is unfortunately no tool for this yet. But the
database is just a text file, so you can simply search for the hash
there and delete its line. Version 3.4.1 brings a tool that removes a
given hash from the database, but I am still improving it a bit so
that one can also pass it an image file to look for.
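
The manual "search the hash and delete the line" step could be scripted
roughly as follows. This is only a sketch under assumptions: it
supposes the database holds one entry per line with the hash appearing
somewhere in that line, and the function name and the ".bak" backup
suffix are hypothetical, not part of the plugin.

```python
import shutil


def remove_hash(db_path: str, image_hash: str) -> int:
    """Drop every line of the text database that contains `image_hash`.

    Returns the number of lines removed. A backup copy is written to
    db_path + ".bak" before the file is changed.
    ASSUMPTION: one database entry per line, hash stored as plain text.
    """
    with open(db_path, "r", encoding="utf-8") as fh:
        lines = fh.readlines()

    kept = [ln for ln in lines if image_hash not in ln]
    removed = len(lines) - len(kept)

    if removed:
        shutil.copy2(db_path, db_path + ".bak")  # keep the old version
        with open(db_path, "w", encoding="utf-8") as fh:
            fh.writelines(kept)
    return removed
```

Because this is a plain substring test, pass the complete hash so that
a short fragment does not remove unrelated entries.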

Best regards,

Chris
>
> Thanks -- Pascal
>
>
>

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
 
iD8DBQFFVbIjJQIKXnJyDxURAkYjAJ9iFDj2oFrY+mVMyEBvEusYxxBxFQCgjZoM
SJny4nTsw1G3XgGqBOVl7S8=
=5S1J
-----END PGP SIGNATURE-----