Posted to solr-user@lucene.apache.org by Timo Sirainen <ts...@iki.fi> on 2008/11/23 15:02:16 UTC

Using Solr for indexing emails

Hi,

A while ago I implemented searching emails with Solr for my IMAP server
(www.dovecot.org). Seems to work ok, but now I'm having a bit of trouble
trying to figure out how to implement searching from multiple mailboxes
efficiently. It would be great if someone had suggestions on how to do
things better.

The main problem is that before doing the search, I first have to check
if there are any unindexed messages and then add them to Solr. This is
done using a query like:

 - fl=uid
 - rows=1
 - sort=uid desc
 - q=uidv:<uidvalidity> box:<mailbox> user:<user>

So it returns the highest IMAP UID field (which is an always-ascending
integer) for the given mailbox (you can ignore the uidvalidity). I can
then add all messages with higher UIDs to Solr before doing the actual
search.
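
A minimal sketch of how the lookup above could be assembled as a request URL.
The endpoint in SOLR_SELECT_URL and the function names are assumptions made
for illustration, and the sketch does not escape special characters in the
user or mailbox names, which a real client would need to do.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust to the actual Solr instance.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def highest_uid_params(user, mailbox, uidvalidity):
    """Build the parameters for the "highest indexed UID" lookup."""
    return {
        "fl": "uid",         # only the uid field is needed
        "rows": "1",         # only the top document
        "sort": "uid desc",  # highest UID first
        "q": f"uidv:{uidvalidity} box:{mailbox} user:{user}",
    }


def highest_uid_url(user, mailbox, uidvalidity):
    """Return the full request URL for one mailbox."""
    params = highest_uid_params(user, mailbox, uidvalidity)
    return SOLR_SELECT_URL + "?" + urlencode(params)
```

One such URL is needed per mailbox, which is what makes the multi-mailbox
case expensive.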

When searching multiple mailboxes the above query would have to be sent
to every mailbox separately. That really doesn't seem like the best
solution, especially when there are a lot of mailboxes. But I don't
think Solr has a way to return "highest uid field for each
box:<mailbox>"?

Is that above query even efficient for a single mailbox? I did consider
using separate documents for storing the highest UID for each mailbox,
but that causes annoying desynchronization possibilities. Especially
because currently I can just keep sending documents to Solr without
locking and let it drop duplicates automatically (should be rare). With
per-mailbox highest-uid documents I can't really see a way to do this
without locking or allowing duplicate fields to be added and later some
garbage collection deleting all but the one highest value (annoyingly
complex).

I could of course also keep track of what's indexed on Dovecot's side,
but that could also lead to desynchronization issues and I'd like to
avoid them.

I guess the ideal solution would be if it was somehow possible to create
a SQL-like trigger that updates the per-mailbox highest-uid document
whenever adding a new document with a higher UID value.

Re: Using Solr for indexing emails

Posted by Norberto Meijome <nu...@gmail.com>.
On Tue, 25 Nov 2008 03:59:31 +0200
Timo Sirainen <ts...@iki.fi> wrote:

> > would it be faster to say q=user:<user> AND highestuid:[ * TO *]  ?  
> 
> Now that I read again what fq really did, yes, sounds like you're right.

you may want to compare them both to see which one is better... I just went
from memory :P

> > ( and i
> > guess you'd sort DESC and return 1 record only).  
> 
> No, I'd use the above for getting highestuid value for all mailboxes
> (there should be only one record per mailbox (each mailbox has separate
> uid values -> separate highestuid value)) so I can look at the returned
> highestuid values to see what mailboxes aren't fully indexed yet.

gotcha. It is an interesting use of SOLR, i must say... I for one am not used
to having to deal with up to the second update needs.

good luck,
B

_________________________
{Beto|Norberto|Numard} Meijome

"Never offend people with style when you can offend them with substance."
  Sam Brown

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.

Re: Using Solr for indexing emails

Posted by Timo Sirainen <ts...@iki.fi>.
On Tue, 2008-11-25 at 12:20 +1100, Norberto Meijome wrote:
> > Store the per-mailbox highest indexed UID in a new unique field created
> > like "<user>/<uidvalidity>/<mailbox>". Always update it by deleting the
> > old one first and then adding the new one.
> 
> you mean delete, commit, add, commit? if you replace the record, simply
> submitting the new document and committing would do (of course, you must ensure
> the value of the  uniqueKey field matches, so SOLR replaces the old doc).

Oh, I thought it ignored the new document in that case. Sure, then I
won't do the delete if it gets replaced anyway.

> > So to find out the highest
> > indexed UID for a mailbox just look it up using its unique field. For
> > finding the highest indexed UID for a user's all mailboxes do a single
> > query:
> > 
> >  - fl=highestuid
> >  - q=highestuid:[* TO *]
> >  - fq=user:<user>
> 
> would it be faster to say q=user:<user> AND highestuid:[ * TO *]  ?

Now that I read again what fq really did, yes, sounds like you're right.

> ( and i
> guess you'd sort DESC and return 1 record only).

No, I'd use the above for getting highestuid value for all mailboxes
(there should be only one record per mailbox (each mailbox has separate
uid values -> separate highestuid value)) so I can look at the returned
highestuid values to see what mailboxes aren't fully indexed yet.
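
A rough sketch of that comparison, assuming the highestuid values returned by
Solr and the current highest UID of each mailbox are already available as
plain dictionaries. The function and argument names are hypothetical.

```python
def mailboxes_to_index(indexed_highest, current_highest):
    """Return {mailbox: (first_uid, last_uid)} for mailboxes with unindexed mail.

    indexed_highest: mailbox -> highestuid as returned by Solr
                     (a missing entry means nothing was indexed yet)
    current_highest: mailbox -> highest UID currently present in the mailbox
    """
    pending = {}
    for mailbox, current in current_highest.items():
        indexed = indexed_highest.get(mailbox, 0)  # IMAP UIDs start at 1
        if current > indexed:
            pending[mailbox] = (indexed + 1, current)
    return pending
```

Only the mailboxes that show up in the result need any messages sent to Solr
before the actual search runs.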

Re: Using Solr for indexing emails

Posted by Norberto Meijome <nu...@gmail.com>.
On Mon, 24 Nov 2008 20:21:17 +0200
Timo Sirainen <ts...@iki.fi> wrote:

> I think I gave enough reasons above for why I don't like this
> solution. :) I also don't like adding new shared global state databases
> just for Solr. Solr should be the one shared global state database..

fair enough - it makes more sense to me now :)

[...]
> Store the per-mailbox highest indexed UID in a new unique field created
> like "<user>/<uidvalidity>/<mailbox>". Always update it by deleting the
> old one first and then adding the new one.

you mean delete, commit, add, commit? if you replace the record, simply
submitting the new document and committing would do (of course, you must ensure
the value of the  uniqueKey field matches, so SOLR replaces the old doc).
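
A small sketch of what such a replacement document could look like in Solr's
XML update format. The field names "id" and "user" are assumptions about the
schema ("id" standing in for whatever the uniqueKey field is called), and the
function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def highest_uid_doc(user, uidvalidity, mailbox, highest_uid):
    """Build a Solr XML <add> message for the per-mailbox marker document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        # Assumed uniqueKey field; the same value on every update lets the
        # new document replace the old one.
        "id": f"{user}/{uidvalidity}/{mailbox}",
        "user": user,
        "highestuid": str(highest_uid),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

Posting this message again with a larger highestuid and then committing
should leave a single marker document per mailbox.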

> So to find out the highest
> indexed UID for a mailbox just look it up using its unique field. For
> finding the highest indexed UID for a user's all mailboxes do a single
> query:
> 
>  - fl=highestuid
>  - q=highestuid:[* TO *]
>  - fq=user:<user>

would it be faster to say q=user:<user> AND highestuid:[ * TO *]  ?  ( and i
guess you'd sort DESC and return 1 record only).

> If messages are being simultaneously indexed by multiple processes the
> highest-uid value may sometimes (rarely) be set too low, but that
> doesn't matter. The next search will try to re-add some of the messages
> that were already in the index, but because they'll have the same unique IDs
> as what already exists they won't get added again. The highest-uid
> gets updated and all is well.

B
_________________________
{Beto|Norberto|Numard} Meijome

Mind over matter: if you don't mind, it doesn't matter

Re: Using Solr for indexing emails

Posted by Timo Sirainen <ts...@iki.fi>.
On Tue, 2008-11-25 at 20:45 +0530, Shalin Shekhar Mangar wrote:
> On Mon, Nov 24, 2008 at 11:51 PM, Timo Sirainen <ts...@iki.fi> wrote:
> 
> >
> > DIH seems to be about Solr pulling data into it from an external source.
> > That's not really practical with Dovecot since there's no central
> > repository of any kind of data, so there's no way to know what has
> > changed since last pull.
> 
> 
> Isn't your IMAP server the external data source? DIH can consume from any
> data store. Tools for consuming from databases and files have been written.
> I think it is possible to write one which consumes from IMAP.

Yes, but that would require going through all users' all mailboxes to
find out which ones have new nonindexed messages. The data isn't stored
in any centralized database that would allow quickly returning all
non-indexed messages. Instead for each mailbox it would have to (at
minimum) open and read two files. That won't really scale for large
installations with a huge amount of mailboxes.

(At some point I probably am going to implement something that allows
finding "everyone's all new messages" more easily so that I can
implement replication support, but for now that kind of a change would
be way too much work.)

Re: Using Solr for indexing emails

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Mon, Nov 24, 2008 at 11:51 PM, Timo Sirainen <ts...@iki.fi> wrote:

>
> DIH seems to be about Solr pulling data into it from an external source.
> That's not really practical with Dovecot since there's no central
> repository of any kind of data, so there's no way to know what has
> changed since last pull.


Isn't your IMAP server the external data source? DIH can consume from any
data store. Tools for consuming from databases and files have been written.
I think it is possible to write one which consumes from IMAP.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Using Solr for indexing emails

Posted by Timo Sirainen <ts...@iki.fi>.
On Mon, 2008-11-24 at 14:25 +1100, Norberto Meijome wrote:
> > The main problem is that before doing the search, I first have to check
> > if there are any unindexed messages and then add them to Solr. This is
> > done using a query like:
> >  - fl=uid
> >  - rows=1
> >  - sort=uid desc
> >  - q=uidv:<uidvalidity> box:<mailbox> user:<user>
> 
> So, if I understand correctly, the process is :
> 
> 1. user sends search query Q to search interface
> 2. interface checks highest indexed uidv in SOLR
> 3. checks in IMAP store for mailbox if there are any objects ('emails') newer
> than uidv from 2.
> 4. anything found in 3. is processed, submitted to SOLR, committed.
> 5. interface submits search query Q to index, gets results
> 6. results are presented / returned to user

Right. Except "uid", not "uidv" (uidv = <uidvalidity>; basically, <mailbox>
and <uidvalidity> together uniquely identify a mailbox across
recreations/renames).

> It strikes me that this may work ok in some situations but may not scale. I
> would decouple the {find new documents / submit / commit } process from the
> { search / presentation} layer - ESPECIALLY if you plan to have several
> mailboxes in play now.

The idea was that not all users are searching their mails, especially in
all mailboxes, so there's no point in wasting CPU and disk space on
indexing messages that are never used.

Also nothing prevents the administrator from configuring the kind of a
setup where message indexing is done in the background for all new
messages. But even if this is done, the search *must* find all the
messages that were added recently (even 1 second ago). So this kind of a
check before searching is still a requirement.

Also I hate all kinds of potential desynchronization issues. For example
if Dovecot relied on message saving to add the message to Solr
immediately there wouldn't need to be a way to do the "check what's
missing query". But this kind of a setup breaks easily if

a) Mail delivery crashes in the middle (or power is lost) between saving
message and indexing it to Solr. Now searching Solr will never find the
message.

b) Solr server breaks (e.g. hardware) and the latest changes get lost.
Since only new messages are indexed, you now have a lot of messages that
can never be searched.

Having separate nightly runs of "check what mails aren't indexed" would
work, but as the number of users increases this check becomes longer
and longer. There are installations that have millions of mailboxes..

> > So it returns the highest IMAP UID field (which is an always-ascending
> > integer) for the given mailbox (you can ignore the uidvalidity). I can
> > then add all messages with higher UIDs to Solr before doing the actual
> > search.
> > 
> > When searching multiple mailboxes the above query would have to be sent
> > to every mailbox separately. 
> 
> hmm...not sure what you mean by "query would have to be sent to every
> MAILBOX" ... 

I meant that for each mailbox that needs to be checked a separate Solr
query would have to be sent.

> > That really doesn't seem like the best
> > solution, especially when there are a lot of mailboxes. But I don't
> > think Solr has a way to return "highest uid field for each
> > box:<mailbox>"?
> 
> hmmm... maybe you can use facets on 'box' ... ? though you'd still have to
> query for each box, i think...

I see a lot of detailed documentation about facets in the wiki, but it
didn't really help me understand what the facets are all about.. The
"fq" parameter seemed to be somehow relevant to it. I am actually using
it when doing the actual search query:

 - fl=uid,score
 - rows=<a lot>
 - sort=uid asc
 - q=body:stuff hdr:stuff or any:stuff
 - fq=uidv:<uidvalidity> box:<mailbox> user:<user>

I didn't use fq with the "check what's missing query" because if there
was no q parameter Solr gave an error.
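
One possible way around the missing-q error is sketched below, assuming the
Solr version and query parser in use accept the match-all query "*:*" (worth
verifying before relying on it). The function name is hypothetical.

```python
def missing_check_params(user, mailbox, uidvalidity):
    """Parameters for the "check what's missing" query, filtering via fq."""
    return {
        "fl": "uid",
        "rows": "1",
        "sort": "uid desc",
        "q": "*:*",  # match-all query, so that q is never empty
        "fq": f"uidv:{uidvalidity} box:{mailbox} user:{user}",
    }
```

This keeps the mailbox restriction in fq, the same way the actual search
query above does.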

> > Is that above query even efficient for a single mailbox? 
> 
> i don't think so.

I guess that'll need changing then too.

> >I did consider
> > using separate documents for storing the highest UID for each mailbox,
> > but that causes annoying desynchronization possibilities. Especially
> > because currently I can just keep sending documents to Solr without
> > locking and let it drop duplicates automatically (should be rare). With
> > per-mailbox highest-uid documents I can't really see a way to do this
> > without locking or allowing duplicate fields to be added and later some
> > garbage collection deleting all but the one highest value (annoyingly
> > complex).
> 
> I have a feeling the issues arise from serialising the whole process (as I
> described above... ). It makes more sense (to me)  to implement something
> similar to DIH, where you load data as needed (even a 'delta query', which
> would only return new data). I am not sure whether you could use DIH (RSS
> feed from IMAP store?).

DIH seems to be about Solr pulling data into it from an external source.
That's not really practical with Dovecot since there's no central
repository of any kind of data, so there's no way to know what has
changed since last pull.

> > I could of course also keep track of what's indexed on Dovecot's side,
> > but that could also lead to desynchronization issues and I'd like to
> > avoid them.
> > 
> > I guess the ideal solution would be if it was somehow possible to create
> > a SQL-like trigger that updates the per-mailbox highest-uid document
> > whenever adding a new document with a higher UID value.
> 
> I am not sure how much effort you want to put into this...but I would think
> that writing a lean app that periodically (for a period that makes sense for
> your hardware and user's expectation... 5 minutes? 10?  1? ) crawls the IMAP
> stores for UID, processes them and submits to SOLR, and keeps its own state
> ( dbm or sqlite ) may be a more flexible approach. Or, if dovecot supports this,
> a 'plugin / hook ' that sends a msg to your indexing app every time a new
> document is created.

I think I gave enough reasons above for why I don't like this
solution. :) I also don't like adding new shared global state databases
just for Solr. Solr should be the one shared global state database..

But I did think of a new solution that I guess could work. Or I guess
it's one of the solutions I already thought of but discarded because
I wasn't thinking clearly enough:

Store the per-mailbox highest indexed UID in a new unique field created
like "<user>/<uidvalidity>/<mailbox>". Always update it by deleting the
old one first and then adding the new one. So to find out the highest
indexed UID for a mailbox just look it up using its unique field. For
finding the highest indexed UID for a user's all mailboxes do a single
query:

 - fl=highestuid
 - q=highestuid:[* TO *]
 - fq=user:<user>

If messages are being simultaneously indexed by multiple processes the
highest-uid value may sometimes (rarely) be set too low, but that
doesn't matter. The next search will try to re-add some of the messages
that were already in the index, but because they'll have the same unique IDs
as what already exists they won't get added again. The highest-uid
gets updated and all is well.
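
The self-healing behaviour described above can be illustrated with a toy
in-memory model. This is not the Solr API: ToyIndex only mimics the "same
unique key replaces the old document" rule, and the message key layout
"<user>/<uidvalidity>/<mailbox>/<uid>" is an assumption for the example.

```python
class ToyIndex:
    """In-memory stand-in for an index keyed by a unique ID."""

    def __init__(self):
        self.docs = {}

    def add(self, key, fields):
        self.docs[key] = fields  # same key: the old document is replaced


def index_messages(index, user, uidv, box, uids):
    """Add message documents, then write the highest-uid marker."""
    for uid in uids:
        index.add(f"{user}/{uidv}/{box}/{uid}", {"uid": uid})
    index.add(f"{user}/{uidv}/{box}", {"highestuid": max(uids)})


index = ToyIndex()
# Process A indexes UIDs 1-5, process B indexes 1-3, and B's marker lands last.
index_messages(index, "timo", 1, "INBOX", [1, 2, 3, 4, 5])
index_messages(index, "timo", 1, "INBOX", [1, 2, 3])
stale = index.docs["timo/1/INBOX"]["highestuid"]  # marker is now too low

# The next search re-sends everything above the stale marker.
index_messages(index, "timo", 1, "INBOX", list(range(stale + 1, 6)))
```

After the last call the index still holds one document per message and the
marker is back at the true highest UID, at the cost of re-sending a few
messages.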


Re: Using Solr for indexing emails

Posted by Norberto Meijome <nu...@gmail.com>.
On Sun, 23 Nov 2008 16:02:16 +0200
Timo Sirainen <ts...@iki.fi> wrote:

> Hi,

Hi Timo,

> 
[...]

> The main problem is that before doing the search, I first have to check
> if there are any unindexed messages and then add them to Solr. This is
> done using a query like:
>  - fl=uid
>  - rows=1
>  - sort=uid desc
>  - q=uidv:<uidvalidity> box:<mailbox> user:<user>

So, if I understand correctly, the process is :

1. user sends search query Q to search interface
2. interface checks highest indexed uidv in SOLR
3. checks in IMAP store for mailbox if there are any objects ('emails') newer
than uidv from 2.
4. anything found in 3. is processed, submitted to SOLR, committed.
5. interface submits search query Q to index, gets results
6. results are presented / returned to user
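
The steps above can be sketched as a single function. All four callables are
hypothetical placeholders for whatever the IMAP server and the Solr client
actually provide; the sketch only shows the ordering of the steps.

```python
def search_with_catchup(query, highest_indexed, fetch_newer,
                        index_and_commit, run_search):
    """Bring the index up to date, then run the user's search."""
    last_uid = highest_indexed()          # step 2: ask the index
    new_messages = fetch_newer(last_uid)  # step 3: ask the mail store
    if new_messages:
        index_and_commit(new_messages)    # step 4: submit and commit
    return run_search(query)              # steps 5-6: search, return results
```

Everything before run_search sits on the critical path of the user's query,
which is where the scaling concern comes from.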

It strikes me that this may work ok in some situations but may not scale. I
would decouple the {find new documents / submit / commit } process from the
{ search / presentation} layer - ESPECIALLY if you plan to have several
mailboxes in play now.

> So it returns the highest IMAP UID field (which is an always-ascending
> integer) for the given mailbox (you can ignore the uidvalidity). I can
> then add all messages with higher UIDs to Solr before doing the actual
> search.
> 
> When searching multiple mailboxes the above query would have to be sent
> to every mailbox separately. 

hmm...not sure what you mean by "query would have to be sent to every
MAILBOX" ... 

> That really doesn't seem like the best
> solution, especially when there are a lot of mailboxes. But I don't
> think Solr has a way to return "highest uid field for each
> box:<mailbox>"?

hmmm... maybe you can use facets on 'box' ... ? though you'd still have to
query for each box, i think...

> Is that above query even efficient for a single mailbox? 

i don't think so.

>I did consider
> using separate documents for storing the highest UID for each mailbox,
> but that causes annoying desynchronization possibilities. Especially
> because currently I can just keep sending documents to Solr without
> locking and let it drop duplicates automatically (should be rare). With
> per-mailbox highest-uid documents I can't really see a way to do this
> without locking or allowing duplicate fields to be added and later some
> garbage collection deleting all but the one highest value (annoyingly
> complex).

I have a feeling the issues arise from serialising the whole process (as I
described above... ). It makes more sense (to me)  to implement something
similar to DIH, where you load data as needed (even a 'delta query', which
would only return new data). I am not sure whether you could use DIH (RSS
feed from IMAP store?).

> I could of course also keep track of what's indexed on Dovecot's side,
> but that could also lead to desynchronization issues and I'd like to
> avoid them.
> 
> I guess the ideal solution would be if it was somehow possible to create
> a SQL-like trigger that updates the per-mailbox highest-uid document
> whenever adding a new document with a higher UID value.

I am not sure how much effort you want to put into this...but I would think
that writing a lean app that periodically (for a period that makes sense for
your hardware and user's expectation... 5 minutes? 10?  1? ) crawls the IMAP
stores for UID, processes them and submits to SOLR, and keeps its own state
( dbm or sqlite ) may be a more flexible approach. Or, if dovecot supports this,
a 'plugin / hook ' that sends a msg to your indexing app every time a new
document is created.

I am interested to hear what you decide to go with, and why.

cheers,
B

_________________________
{Beto|Norberto|Numard} Meijome

"All parts should go together without forcing. You must remember that the parts
you are reassembling were disassembled by you. Therefore, if you can't get them
together again, there must be a reason. By all means, do not use hammer." IBM
maintenance manual, 1975