Posted to users@spamassassin.apache.org by Robert S <ro...@gmail.com> on 2007/02/21 00:36:14 UTC

FuzzyOcr - how do I "teach" it?

I have just installed FOCR 3.5.1 with the hashdb option.  I have been
receiving image spams about China Fruits Corporation which are
cleverly designed not to contain words in the words list.  How do I
insert the hash into the database and label this image as spam?

I have tried - unsuccessfully:

fuzzy-find --score=10 --learn-spam --verbose
"367563:437:282:32::49:1:18:17:55642::44:40:7:37:54950::218:144:172:169:1131::96:99:179:107:1094::100:122:122:115:1093::156:136:162:145:1066"
(I got the hash score from running "spamassassin -D < message")

and

fuzzy-find  --score=10 --learn-spam 'notary_public.gif'

I'd like to avoid tampering with the words list to avoid FPs.

Could somebody please tell me where I'm going wrong?

It would be nice if images were automatically stored in the hashdb
as spam whenever SA gives the message a positive score but FOCR does not.

Re: Fwd: FuzzyOcr - how do I "teach" it?

Posted by Brian Wilson <wi...@bubba.org>.

On Fri, 23 Feb 2007, Jorge Valdes wrote:

> Brian Wilson wrote:
>> On Feb 20, 2007, at 6:36 PM, Robert S wrote:
>> 
>>> I have just installed FOCR 3.5.1 with the hashdb option.  I have been
>>> receiving image spams about China Fruits Corporation which are
>>> cleverly designed not to contain words in the words list.  How do I
>>> insert the hash into the database and label this image as spam?
>>> 
>>> I have tried - unsuccessfully:
>>> 
>>> fuzzy-find --score=10 --learn-spam --verbose
>>> "367563:437:282:32::49:1:18:17:55642::44:40:7:37:54950::218:144:172:169:1131::96:99:179:107:1094::100:122:122:115:1093::156:136:162:145:1066" 
>>> (I got the hash score from running "spamassassin -D < message")
>>> 
>>> and
>>> 
>>> fuzzy-find  --score=10 --learn-spam 'notary_public.gif'
>>> 
>>> I'd like to avoid tampering with the words list to avoid FPs.
>>> 
>>> Could somebody please tell me where I'm going wrong.
>>> 
>>> It would be nice if images could be automatically stored in the hashdb
>>> as spam if SA gives them a positive score, but FOCR does not.
>>> 
>> 
>> I have the same problem as you, so you are not alone.  I first deleted the 
>> hash using fuzzy-find to make sure it didn't exist in either hash, then 
>> added it with a score of 10.  I re-ran spamassassin with debug on for 
>> FuzzyOcr and it did not see the entry in the spam db.  I even compared the 
>> hashes and they were the same:
>> 
>> % fuzzy-find --delete 
>> 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410 
>> Img =    278502 292x319x128
>> 
>> % fuzzy-find --learn-spam --score=10 
>> 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410 
>> Img =    278502 292x319x128
>> 
>> Rerun the spam  through SA (China Fruits also: http://bubba.org/spam/)
>> 
>> Adding key to database...
>> [1548] dbg: FuzzyOcr: Not enough OCR Hits without space stripping, doing second matching pass...
>> [1548] info: FuzzyOcr: Message is ham, saving...
>> [1548] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.safe.db" with score "0"
>> [1548] dbg: FuzzyOcr: Digest: 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410
>> 
>> 
>> 
> Remember that in order for things to work right, the safe database is checked 
> first.  The rationale behind this is that if an image "fingerprint" is found 
> here, there is no need to do OCR.  If you already have the image learned as 
> HAM, you must delete it first, then optionally add it to the SPAM database.
>
> Jorge.
>
>

Is that not what I did?  It's obvious to me that since it was learned as 
ham, it must be deleted before being reclassified as spam (which I did). 
Then I re-ran spamassassin and it was again tagged as ham, which it 
shouldn't be since I removed it from ham and reclassified it as spam.
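
The debug output quoted above records which database FuzzyOcr wrote the hash to after the re-run. A minimal Python sketch for pulling those entries out of a saved "spamassassin -D" log follows. The function name find_hash_additions and the regular expression are illustrative assumptions based only on the log lines quoted in this thread; they are not part of FuzzyOcr. Whitespace inside the quoted path is dropped on the assumption that it comes from mail line wrapping.

```python
import re

# Matches the 'Adding Hash to "<db>" with score "<n>"' entries seen in the
# debug output in this thread (pattern is an assumption drawn from that output).
ADD_RE = re.compile(
    r'FuzzyOcr: Adding Hash to\s+"([^"]+)"\s+with score\s+"([^"]+)"'
)


def find_hash_additions(debug_output):
    """Return (db_path, score) pairs for each 'Adding Hash to' log entry."""
    # Collapse all whitespace so entries wrapped across lines match again.
    flat = " ".join(debug_output.split())
    results = []
    for path, score in ADD_RE.findall(flat):
        # Assumption: spaces inside the path are wrapping debris, not real.
        results.append((path.replace(" ", ""), score))
    return results


sample = """[1548] info: FuzzyOcr: Message is ham, saving...
[1548] info: FuzzyOcr: Adding Hash to
"/etc/mail/spamassassin/FuzzyOcr.safe.db" with score "0"
"""

for db_path, score in find_hash_additions(sample):
    print(db_path, score)
```

If the path printed here ends in "safe.db" right after a --learn-spam run, the plugin did not find the learned entry and re-saved the image as ham, which is the behaviour reported in this message.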

Re: Fwd: FuzzyOcr - how do I "teach" it?

Posted by Jorge Valdes <jo...@joval.info>.

Brian Wilson wrote:
> On Feb 20, 2007, at 6:36 PM, Robert S wrote:
>
>> I have just installed FOCR 3.5.1 with the hashdb option.  I have been
>> receiving image spams about China Fruits Corporation which are
>> cleverly designed not to contain words in the words list.  How do I
>> insert the hash into the database and label this image as spam?
>>
>> I have tried - unsuccessfully:
>>
>> fuzzy-find --score=10 --learn-spam --verbose
>> "367563:437:282:32::49:1:18:17:55642::44:40:7:37:54950::218:144:172:169:1131::96:99:179:107:1094::100:122:122:115:1093::156:136:162:145:1066" 
>>
>> (I got the hash score from running "spamassassin -D < message")
>>
>> and
>>
>> fuzzy-find  --score=10 --learn-spam 'notary_public.gif'
>>
>> I'd like to avoid tampering with the words list to avoid FPs.
>>
>> Could somebody please tell me where I'm going wrong.
>>
>> It would be nice if images could be automatically stored in the hashdb
>> as spam if SA gives them a positive score, but FOCR does not.
>>
>
> I have the same problem as you, so you are not alone.  I first deleted 
> the hash using fuzzy-find to make sure it didn't exist in either hash, 
> then added it with a score of 10.  I re-ran spamassassin with debug on 
> for FuzzyOcr and it did not see the entry in the spam db.  I even 
> compared the hashes and they were the same:
>
> % fuzzy-find --delete 
> 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410 
>
> Img =    278502 292x319x128
>
> % fuzzy-find --learn-spam --score=10 
> 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410 
>
> Img =    278502 292x319x128
>
> Rerun the spam  through SA (China Fruits also: http://bubba.org/spam/)
>
> Adding key to database...
> [1548] dbg: FuzzyOcr: Not enough OCR Hits without space stripping, doing second matching pass...
> [1548] info: FuzzyOcr: Message is ham, saving...
> [1548] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.safe.db" with score "0"
> [1548] dbg: FuzzyOcr: Digest: 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410
>
>
>
>
Remember that in order for things to work right, the safe database is 
checked first.  The rationale behind this is that if an image 
"fingerprint" is found here, there is no need to do OCR.  If you already 
have the image learned as HAM, you must delete it first, then optionally 
add it to the SPAM database.

Jorge.
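
The lookup order described in this reply can be sketched as a toy model in Python. This is not FuzzyOcr's code: the function classify_image, the dict-based databases and the ocr_is_spam callable are all hypothetical, and checking the spam database second is an assumption (the reply only states that the safe database is checked first). Saving a ham verdict with score 0 mirrors the "Adding Hash to ... with score "0"" log line quoted in the thread.

```python
def classify_image(digest, safe_db, spam_db, ocr_is_spam):
    """Toy model: return (verdict, source) for an image digest.

    safe_db and spam_db are dicts mapping digest -> score.
    ocr_is_spam is a callable standing in for the OCR pass.
    """
    # The safe database is consulted first; a hit skips OCR entirely.
    if digest in safe_db:
        return ("ham", "safe-db")
    # Assumption: the spam database is consulted next.
    if digest in spam_db:
        return ("spam", "spam-db")
    # Unknown fingerprint: fall back to OCR.
    if ocr_is_spam(digest):
        return ("spam", "ocr")
    # A ham verdict is remembered with score 0, as in the quoted log.
    safe_db[digest] = 0
    return ("ham", "ocr")


def no_ocr_hits(digest):
    # Stand-in for an image whose text evades the word list.
    return False


safe = {"abc": 0}
spam = {"abc": 10}
print(classify_image("abc", safe, spam, no_ocr_hits))  # safe entry wins
del safe["abc"]  # same idea as deleting the ham entry first
print(classify_image("abc", safe, spam, no_ocr_hits))  # spam entry is seen
```

In this model, a digest that is missing from both databases and gets no OCR hits is written back to the safe database, so a learned spam entry only helps if it lands in the database the plugin actually reads.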


Fwd: FuzzyOcr - how do I "teach" it?

Posted by Brian Wilson <wi...@bubba.org>.

On Feb 20, 2007, at 6:36 PM, Robert S wrote:

> I have just installed FOCR 3.5.1 with the hashdb option.  I have been
> receiving image spams about China Fruits Corporation which are
> cleverly designed not to contain words in the words list.  How do I
> insert the hash into the database and label this image as spam?
>
> I have tried - unsuccessfully:
>
> fuzzy-find --score=10 --learn-spam --verbose
> "367563:437:282:32::49:1:18:17:55642::44:40:7:37:54950::218:144:172:169:1131::96:99:179:107:1094::100:122:122:115:1093::156:136:162:145:1066"
> (I got the hash score from running "spamassassin -D < message")
>
> and
>
> fuzzy-find  --score=10 --learn-spam 'notary_public.gif'
>
> I'd like to avoid tampering with the words list to avoid FPs.
>
> Could somebody please tell me where I'm going wrong.
>
> It would be nice if images could be automatically stored in the hashdb
> as spam if SA gives them a positive score, but FOCR does not.
>

I have the same problem as you, so you are not alone.  I first  
deleted the hash using fuzzy-find to make sure it didn't exist in  
either hash, then added it with a score of 10.  I re-ran spamassassin  
with debug on for FuzzyOcr and it did not see the entry in the spam  
db.  I even compared the hashes and they were the same:

% fuzzy-find --delete  
278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410
Img =    278502 292x319x128

% fuzzy-find --learn-spam --score=10  
278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410
Img =    278502 292x319x128
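
Because mail clients wrap these long digests, comparing them by eye is error-prone. The Python sketch below strips wrapping whitespace, splits a digest into its "::"-separated groups, and rebuilds an "Img = ..." line from the first group, which appears to be what fuzzy-find echoes above. The helper names are hypothetical and the meaning of the individual fields is not stated in this thread, so the code treats them only as integers.

```python
def split_digest(digest):
    """Split a digest into '::'-separated groups of integer fields."""
    # Remove spaces and newlines introduced by mail line wrapping.
    compact = "".join(digest.split())
    return [[int(field) for field in group.split(":")]
            for group in compact.split("::")]


def digests_equal(a, b):
    """Compare two digests while ignoring wrapping whitespace."""
    return split_digest(a) == split_digest(b)


def img_line(digest):
    """Rebuild an 'Img = ...' style line from the first group (assumed layout)."""
    first, second, third, fourth = split_digest(digest)[0]
    return f"Img = {first} {second}x{third}x{fourth}"


flat = ("278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148"
        "::247:249:185:241:16996::192:236:242:224:16482"
        "::136:34:15:62:630::108:30:158:68:410")

wrapped = """278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:24
9:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:
410"""

print(digests_equal(flat, wrapped))
print(img_line(wrapped))
```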

Rerun the spam  through SA (China Fruits also: http://bubba.org/spam/)

Adding key to database...
[1548] dbg: FuzzyOcr: Not enough OCR Hits without space stripping, doing second matching pass...
[1548] info: FuzzyOcr: Message is ham, saving...
[1548] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.safe.db" with score "0"
[1548] dbg: FuzzyOcr: Digest: 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410

Re: FuzzyOcr - how do I "teach" it?

Posted by Brian Wilson <wi...@bubba.org>.

On Feb 20, 2007, at 6:36 PM, Robert S wrote:

> I have just installed FOCR 3.5.1 with the hashdb option.  I have been
> receiving image spams about China Fruits Corporation which are
> cleverly designed not to contain words in the words list.  How do I
> insert the hash into the database and label this image as spam?
>
> I have tried - unsuccessfully:
>
> fuzzy-find --score=10 --learn-spam --verbose
> "367563:437:282:32::49:1:18:17:55642::44:40:7:37:54950::218:144:172:169:1131::96:99:179:107:1094::100:122:122:115:1093::156:136:162:145:1066"
> (I got the hash score from running "spamassassin -D < message")
>
> and
>
> fuzzy-find  --score=10 --learn-spam 'notary_public.gif'
>
> I'd like to avoid tampering with the words list to avoid FPs.
>
> Could somebody please tell me where I'm going wrong.
>
> It would be nice if images could be automatically stored in the hashdb
> as spam if SA gives them a positive score, but FOCR does not.

I have the same problem as you, so you are not alone.  I first  
deleted the hash using fuzzy-find to make sure it didn't exist in  
either hash, then added it with a score of 10.  I re-ran spamassassin  
with debug on for FuzzyOcr and it did not see the entry in the spam  
db.  I even compared the hashes and they were the same:

% fuzzy-find --delete  
278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410
Img =    278502 292x319x128

% fuzzy-find --learn-spam  
278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410
Img =    278502 292x319x128

Rerun the spam  through SA (China Fruits also: http://bubba.org/spam/)

Adding key to database...
[1548] dbg: FuzzyOcr: Not enough OCR Hits without space stripping, doing second matching pass...
[1548] info: FuzzyOcr: Message is ham, saving...
[1548] info: FuzzyOcr: Adding Hash to "/etc/mail/spamassassin/FuzzyOcr.safe.db" with score "0"
[1548] dbg: FuzzyOcr: Digest: 278502:292:319:128::203:248:219:231:26298::202:200:236:205:25148::247:249:185:241:16996::192:236:242:224:16482::136:34:15:62:630::108:30:158:68:410