Posted to users@spamassassin.apache.org by Andy Dills <an...@xecu.net> on 2007/01/08 19:46:35 UTC

Re: [Devel-spam] FuzzyOcr 3.5.1 released

On Mon, 8 Jan 2007, Jorge Valdes wrote:

> I do understand that in large environments, optimizations have to be made in
> order not to kill server performance, and expiration is probably something
> that could be done at "more convenient times".  I will commit a script that
> can safely be run as a cronjob soon.

Excellent.

> I understand that the "order" keyword in select is potentially expensive, but
> necessary because matches occur generally towards the most recent entries,
> thus increasing the possibility of a match earlier on.  When your hash count
> is in the thousands, earlier matches mean less queries to the database, and
> potentially faster results.

It's not just the order directive; it's the iteration through the 
entire database.

Consider when the database grows to >50k records. For a new image that 
doesn't have a hash, that's 50k records that must be sorted then sent from 
the DB server to the mail server, then all 50k records must be checked 
against the hash before we decide that we haven't seen this image before. 
That just isn't a workable algorithm. If iteration throughout the entire 
database is a requirement, hashing is a performance hit rather than a 
performance gain.

A better solution might be a separate daemon that holds the hashes in 
memory, to which you submit the hash being considered.
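
As a rough illustration of the in-memory idea (not FuzzyOcr code), here is a minimal Python sketch of the store such a daemon might hold. The class name HashStore, the tuple-of-integers hash format and the per-component tolerance test are all assumptions made for the example; the network side of a daemon is left out.

```python
from collections import OrderedDict


class HashStore:
    """In-memory store of image hashes, most recently seen last.

    Hypothetical model: a hash is a tuple of ints, and two hashes match
    when no component differs by more than `tolerance`.
    """

    def __init__(self, tolerance=0, max_entries=50000):
        self.tolerance = tolerance
        self.max_entries = max_entries
        self._entries = OrderedDict()  # hash -> score, oldest first

    def add(self, image_hash, score):
        # Re-inserting moves the hash to the "most recent" end.
        self._entries.pop(image_hash, None)
        self._entries[image_hash] = score
        # Drop the oldest entries once the cap is exceeded.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)

    def _close(self, a, b):
        return len(a) == len(b) and all(
            abs(x - y) <= self.tolerance for x, y in zip(a, b)
        )

    def lookup(self, image_hash):
        """Return the stored score for a matching hash, or None."""
        # Exact hit: a dictionary lookup, no scan at all.
        if image_hash in self._entries:
            return self._entries[image_hash]
        # Fuzzy hit: scan newest first and stop at the first match.
        for known in reversed(self._entries):
            if self._close(known, image_hash):
                return self._entries[known]
        return None
```

A miss still costs a full scan, but it happens in local memory, with no sorting and no transfer from the DB server.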

Honestly, I have been extremely impressed with the results of having 
hashing turned completely off.

Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---

Re: [Devel-spam] FuzzyOcr 3.5.1 released

Posted by Jorge Valdes <jo...@joval.info>.
Andy Dills wrote:
> On Mon, 8 Jan 2007, Jorge Valdes wrote:
>
>   
>> I do understand that in large environments, optimizations have to be made in
>> order not to kill server performance, and expiration is probably something
>> that could be done at "more convenient times".  I will commit a script that
>> can safely be run as a cronjob soon.
>>     
>
> Excellent.
>
>   
>> I understand that the "order" keyword in select is potentially expensive, but
>> necessary because matches occur generally towards the most recent entries,
>> thus increasing the possibility of a match earlier on.  When your hash count
>> is in the thousands, earlier matches mean less queries to the database, and
>> potentially faster results.
>>     
>
> It's not just the order directive, it's the iteration throughout the 
> entire database.
>
> Consider when the database grows to >50k records. For a new image that 
> doesn't have a hash, that's 50k records that must be sorted then sent from 
> the DB server to the mail server, then all 50k records must be checked 
> against the hash before we decide that we haven't seen this image before. 
> That just isn't a workable algorithm. If iteration throughout the entire 
> database is a requirement, hashing is a performance hit rather than a 
> performance gain.
>
> A better solution might be a separate daemon that holds the hashes in 
> memory, to which you submit the hash being considered.
>
> Honestly, I have been extremely impressed with having hashing turned 
> completely off.
>
> Andy
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---
>   
Right now my DB is ~21K records, and I expire after 21 days... I could 
always reduce the size of the DB by expiring sooner. The default value 
is 35 days, a little over one month (5 full weeks), so tuning this 
value could help you out. Looking at my logs, ~2/3 of the matches 
happen within 24 hours, so keeping hashes for just 24 hours would still 
get me 2/3 of the hits. The images in the other 1/3 of the messages can 
always be rescanned, which in your case will probably be faster than 
looking for the database match.
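
One way to check this trade-off against your own logs is sketched below in Python. It assumes you have already extracted, for each past match, the age in hours of the stored hash at the time it matched; how to pull that out of the logs is not shown, and the function name is made up for the example.

```python
def coverage_by_window(match_ages_hours, windows_hours):
    """Fraction of past matches that each expiry window would have kept."""
    total = len(match_ages_hours)
    if total == 0:
        return {w: 0.0 for w in windows_hours}
    return {
        w: sum(1 for age in match_ages_hours if age <= w) / total
        for w in windows_hours
    }


# Hypothetical ages (hours) of six past matches.
ages = [1, 3, 20, 30, 100, 400]
print(coverage_by_window(ages, [24, 72, 504]))
```

If most of the coverage is already reached at the shortest window, a longer expiry mostly adds records to scan.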

Remember, when working in large environments, optimization of resources 
is key, so here are a few suggestions:

+ expiring the DB after only 1-3 days may be the optimal setting for 
you, since this will reduce the number of records in the DB and still 
reap the benefits of saving the hashes. Check your logs...

+ using BerkeleyDB on a RAM disk will certainly be faster; just make 
sure that the RAM disk will not run out of free space (not generally a 
problem). Also, remember to save a copy from the RAM disk to hard disk 
periodically so that you have a backup in case of a system restart, or 
you will lose the DB.

+ tune your MySQL setup, including but not limited to adding another 
index: 'Hash.check' in order to reduce the sorting times, allocating 
more RAM for sorting, etc.

+ use a dedicated MySQL server if you use the MySQL solution to share 
the database among several SMTP servers, possibly using this server for 
other common tasks as well.
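
The expiry and indexing suggestions above can be sketched as follows. This is only an illustration in Python using the standard sqlite3 module as a stand-in for MySQL; the table name hashes, the columns hash and last_seen, and both function names are hypothetical and do not reflect the real FuzzyOcr schema or its expiry script.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Create the hypothetical hash table plus an index on its timestamp."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY, last_seen REAL)"
    )
    # The index lets expiry and newest-first ordering avoid a full sort.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_hashes_last_seen ON hashes (last_seen)"
    )
    return conn


def expire_old_hashes(conn, max_age_days, now=None):
    """Delete rows not seen within max_age_days; return how many were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("DELETE FROM hashes WHERE last_seen < ?", (cutoff,))
    return cur.rowcount
```

Something along these lines could be called from a cron job at a quiet hour, so the cleanup cost is not paid while mail is being scanned.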

Remember that the solution you implement depends largely on the 
resources you have available. There are other ways to reduce the amount 
of work sent to the plugin: the Botnet plugin helps a lot, and so does 
setting *focr_autodisable_score* to a value more suited to your 
situation (default: 10), since most people set it to a higher value in 
order to test the plugin and never reset it.


Jorge.


Re: [Devel-spam] FuzzyOcr 3.5.1 released

Posted by jdow <jd...@earthlink.net>.
Yup - if you are looking for "within 10 miles" you can perform a raw
comparison by looking at the lat-lon degrees number to remove anything
more than two degrees apart. That knocks down your search by 180 times
in each direction, over 30000:1 savings right there. If you store all
the data as degree and fractional degree you can remove everything more
than a small fraction of a degree apart.

But for the first cut storing everything in the grid square 117 to 118
longitude and 34 to 35 latitude in its own part of the tree structure
allows almost instant selection of "likely" candidates. You could also
use links to store 117 to 118, 34-35 in one box, 117.5-118.5, 34-35 in
another box - noting the overlap in the concept. That means a site right
on a corner or edge of a criterion marker isn't lost. Anything like that
which can be used to reduce the amount of data that needs to be tested
even at the expense of cross-linked trees is a huge savings. You enter
an item into the database once, and that insert performs the searches
for the crude region linkages. Then the searches, the "many" operation,
can proceed quicker because most candidates have already been filtered
out.
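
The grid-square idea above can be sketched roughly as follows. This is a minimal Python illustration, not anyone's production code: the class GridIndex, the one-degree cell size and the helper names are assumptions made for the example. Instead of storing overlapping boxes it checks the eight neighbouring cells at query time, which serves the same purpose of not losing a site that sits on an edge or corner. It assumes the search radius is smaller than one cell and ignores the poles and the 180-degree meridian.

```python
import math
from collections import defaultdict


def cell_of(lat, lon):
    """One-degree grid cell containing the point."""
    return (math.floor(lat), math.floor(lon))


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in statute miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


class GridIndex:
    """Points bucketed by one-degree cell so a search touches few candidates."""

    def __init__(self):
        self._cells = defaultdict(list)  # (lat_cell, lon_cell) -> [(name, lat, lon)]

    def add(self, name, lat, lon):
        # Done once per item: file it under its grid square.
        self._cells[cell_of(lat, lon)].append((name, lat, lon))

    def within(self, lat, lon, miles):
        """Names within `miles` of (lat, lon), assuming miles < one cell width."""
        clat, clon = cell_of(lat, lon)
        found = []
        # Only the home cell and its eight neighbours are examined.
        for dlat in (-1, 0, 1):
            for dlon in (-1, 0, 1):
                bucket = self._cells.get((clat + dlat, clon + dlon), ())
                for name, plat, plon in bucket:
                    if haversine_miles(lat, lon, plat, plon) <= miles:
                        found.append(name)
        return sorted(found)
```

The exact distance is computed only for points in at most nine cells, however many points the index holds in total.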

{^_^}
----- Original Message ----- 
From: "Dan Barker" <db...@visioncomm.net>


> Giampaolo: I hope you succeed.
>
> I've given up hope on convincing folks (Mapquest in particular) that 
> radius
> searches can be indexed. You needn't pull the lat/long of every single 
> entry
> to run the distance function, and then discard the ones too far away. You
> can index on LAT and LONG and structure the query such that only the
> "possible" lat/long values need the distance function (and the rest of the
> record fetched) evaluated.
>
> Just because it's two orders of magnitude more efficient doesn't make
> anybody listen.
>
> Same conversation, different universe!
>
> Dan
>
> -----Original Message-----
> From: Giampaolo Tomassoni [mailto:g.tomassoni@libero.it]
>
> From: Andy Dills [mailto:andy@xecu.net]
>>
>> ...omissis...
>>
>> > I understand that the "order" keyword in select is potentially
>> expensive, but
>> > necessary because matches occur generally towards the most
>> recent entries,
>> > thus increasing the possibility of a match earlier on.  When
>> your hash count
>> > is in the thousands, earlier matches mean less queries to the
>> database, and
>> > potentially faster results.
>>
>> It's not just the order directive, it's the iteration throughout the
>> entire database.
>>
>> Consider when the database grows to >50k records. For a new image that
>> doesn't have a hash, that's 50k records that must be sorted then
>> sent from
>> the DB server to the mail server, then all 50k records must be checked
>> against the hash before we decide that we haven't seen this image before.
>> That just isn't a workable algorithm. If iteration throughout the entire
>> database is a requirement, hashing is a performance hit rather than a
>> performance gain.
>>
>> A better solution might be a separate daemon that holds the hashes in
>> memory, to which you submit the hash being considered.
>
> Other ways could be the ones depicted in my recent post (Message-ID:
> <NB...@libero.it>), in which close 
> images
> are basically clustered together thanks to a surrogate index.
>
> giampaolo
>
>>
>> Honestly, I have been extremely impressed with having hashing turned
>> completely off.
>>
>> Andy
>>
>> ---
>> Andy Dills
>> Xecunet, Inc.
>> www.xecu.net
>> 301-682-9972
>> ---
> 


RE: [Devel-spam] FuzzyOcr 3.5.1 released

Posted by Giampaolo Tomassoni <g....@libero.it>.
From: Dan Barker [mailto:dbarker@visioncomm.net]
> 
> Giampaolo: I hope you succeed.
> 
> I've given up hope on convincing folks (Mapquest in particular) 
> that radius
> searches can be indexed. You needn't pull the lat/long of every 
> single entry
> to run the distance function, and then discard the ones too far away. You
> can index on LAT and LONG and structure the query such that only the
> "possible" lat/long values need the distance function (and the rest of the
> record fetched) evaluated.

Right.


> Just because it's two orders of magnitude more efficient doesn't make
> anybody listen.
>
> Same conversation, different universe!

You mean that it is probably a concept too far away from the origin of someone's comprehensibility space? :)

giampaolo


> Dan
> 
> -----Original Message-----
> From: Giampaolo Tomassoni [mailto:g.tomassoni@libero.it]
> Sent: Monday, January 08, 2007 2:00 PM
> To: devel-spam@lists.own-hero.net; users@spamassassin.apache.org
> Subject: RE: [Devel-spam] FuzzyOcr 3.5.1 released
> 
> 
> From: Andy Dills [mailto:andy@xecu.net]
> >
> > ...omissis...
> >
> > > I understand that the "order" keyword in select is potentially
> > expensive, but
> > > necessary because matches occur generally towards the most
> > recent entries,
> > > thus increasing the possibility of a match earlier on.  When
> > your hash count
> > > is in the thousands, earlier matches mean less queries to the
> > database, and
> > > potentially faster results.
> >
> > It's not just the order directive, it's the iteration throughout the
> > entire database.
> >
> > Consider when the database grows to >50k records. For a new image that
> > doesn't have a hash, that's 50k records that must be sorted then
> > sent from
> > the DB server to the mail server, then all 50k records must be checked
> > against the hash before we decide that we haven't seen this 
> image before.
> > That just isn't a workable algorithm. If iteration throughout the entire
> > database is a requirement, hashing is a performance hit rather than a
> > performance gain.
> >
> > A better solution might be a separate daemon that holds the hashes in
> > memory, to which you submit the hash being considered.
> 
> Other ways could be the ones depicted in my recent post (Message-ID:
> <NB...@libero.it>), in which 
> close images
> are basically clustered together thanks to a surrogate index.
> 
> giampaolo
> 
> >
> > Honestly, I have been extremely impressed with having hashing turned
> > completely off.
> >
> > Andy
> >
> > ---
> > Andy Dills
> > Xecunet, Inc.
> > www.xecu.net
> > 301-682-9972
> > ---
> 
> 


RE: [Devel-spam] FuzzyOcr 3.5.1 released

Posted by Dan Barker <db...@visioncomm.net>.
Giampaolo: I hope you succeed.

I've given up hope on convincing folks (Mapquest in particular) that radius
searches can be indexed. You needn't pull the lat/long of every single entry
to run the distance function, and then discard the ones too far away. You
can index on LAT and LONG and structure the query such that only the
"possible" lat/long values need the distance function (and the rest of the
record fetched) evaluated.
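
A minimal sketch of that kind of query, in Python with the standard sqlite3 module: the table places, its columns and the function names are hypothetical. The WHERE clause narrows the candidates to a latitude/longitude box that an index can serve, and the distance function runs only on the rows that survive. The box is meant to err slightly on the large side; wrap-around at the 180-degree meridian and the poles is ignored.

```python
import math
import sqlite3


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in statute miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def open_places():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE places (name TEXT, lat REAL, lon REAL)")
    # The index lets the range test be answered without reading every row.
    conn.execute("CREATE INDEX idx_places_lat_lon ON places (lat, lon)")
    return conn


def places_within(conn, lat, lon, miles):
    """Names within `miles` of (lat, lon): box filter in SQL, then exact check."""
    dlat = miles / 69.0  # a degree of latitude is roughly 69 miles
    # Degrees of longitude shrink away from the equator, so widen the box
    # using the latitude edge farthest from it.
    edge = math.radians(min(abs(lat) + dlat, 89.0))
    dlon = miles / (69.0 * math.cos(edge))
    rows = conn.execute(
        "SELECT name, lat, lon FROM places "
        "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (lat - dlat, lat + dlat, lon - dlon, lon + dlon),
    )
    return sorted(
        name
        for name, plat, plon in rows
        if miles_between(lat, lon, plat, plon) <= miles
    )
```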

Just because it's two orders of magnitude more efficient doesn't make
anybody listen.

Same conversation, different universe!

Dan

-----Original Message-----
From: Giampaolo Tomassoni [mailto:g.tomassoni@libero.it]
Sent: Monday, January 08, 2007 2:00 PM
To: devel-spam@lists.own-hero.net; users@spamassassin.apache.org
Subject: RE: [Devel-spam] FuzzyOcr 3.5.1 released


From: Andy Dills [mailto:andy@xecu.net]
>
> ...omissis...
>
> > I understand that the "order" keyword in select is potentially
> expensive, but
> > necessary because matches occur generally towards the most
> recent entries,
> > thus increasing the possibility of a match earlier on.  When
> your hash count
> > is in the thousands, earlier matches mean less queries to the
> database, and
> > potentially faster results.
>
> It's not just the order directive, it's the iteration throughout the
> entire database.
>
> Consider when the database grows to >50k records. For a new image that
> doesn't have a hash, that's 50k records that must be sorted then
> sent from
> the DB server to the mail server, then all 50k records must be checked
> against the hash before we decide that we haven't seen this image before.
> That just isn't a workable algorithm. If iteration throughout the entire
> database is a requirement, hashing is a performance hit rather than a
> performance gain.
>
> A better solution might be a separate daemon that holds the hashes in
> memory, to which you submit the hash being considered.

Other ways could be the ones depicted in my recent post (Message-ID:
<NB...@libero.it>), in which close images
are basically clustered together thanks to a surrogate index.

giampaolo

>
> Honestly, I have been extremely impressed with having hashing turned
> completely off.
>
> Andy
>
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---



RE: [Devel-spam] FuzzyOcr 3.5.1 released

Posted by Giampaolo Tomassoni <g....@libero.it>.
From: Andy Dills [mailto:andy@xecu.net]
> 
> ...omissis...
>
> > I understand that the "order" keyword in select is potentially 
> expensive, but
> > necessary because matches occur generally towards the most 
> recent entries,
> > thus increasing the possibility of a match earlier on.  When 
> your hash count
> > is in the thousands, earlier matches mean less queries to the 
> database, and
> > potentially faster results.
> 
> It's not just the order directive, it's the iteration throughout the 
> entire database.
> 
> Consider when the database grows to >50k records. For a new image that 
> doesn't have a hash, that's 50k records that must be sorted then 
> sent from 
> the DB server to the mail server, then all 50k records must be checked 
> against the hash before we decide that we haven't seen this image before. 
> That just isn't a workable algorithm. If iteration throughout the entire 
> database is a requirement, hashing is a performance hit rather than a 
> performance gain.
> 
> A better solution might be a separate daemon that holds the hashes in 
> memory, to which you submit the hash being considered.

Other ways could be the ones depicted in my recent post (Message-ID: <NB...@libero.it>), in which close images are basically clustered together thanks to a surrogate index.

giampaolo

> 
> Honestly, I have been extremely impressed with having hashing turned 
> completely off.
> 
> Andy
> 
> ---
> Andy Dills
> Xecunet, Inc.
> www.xecu.net
> 301-682-9972
> ---