You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Michael Bell <mi...@yahoo.com> on 2007/08/15 20:01:36 UTC

Seeking Advice

We are writing a mail archiving program. Each piece of the message (eg each attachment) is stored separately.

I'll try to keep this short and sweet :)

Currently we index the main header fields, like

subject
sender
recipients (space delimited)

etc.

This stuff is really only needed once per e-mail

We also index the attachment info:

attachment size (changed to a range like "large", "medium", etc)
attachment name
full text index
etc.

This stuff is needed to be distinct for each attachment in the e-mail

Our current algorithm is wasteful, but I see no better way to do it.

In a loop, for each attachment (and once if we have none), we add all the main header stuff and the attachment stuff, as a separate Document per attachment. This is wasteful, because the main header stuff is needlessly repeated.

Now, it would seem better and more efficient to have one Document for the whole e-mail, storing the main header stuff only once, and storing the Attachment stuff as multiple instances of the same field. Lucene supports this.


The problem is then a search on attachment stuff will return cross cartesian results.

Example

 if I have 2 attachments one named A.doc and one B.doc. And A.doc contains the full text "turnip" and B.doc contains the text "dog".


Now if the user enters a search requesting email that contains Attachment name A.Doc, and contents dog, the results will be

For the Per-Document storage:

no results found (correct I'd argue)

For the Single Document storage:

1 result found (because the full text and names of both are stored in the same Document albeit different Field instances)

While tempted by the siren call of the Single Document method, it seems like this would return unexpected results from the users point of view (although one could argue otherwise, since holistically searching the e-mail as a whole it's returning the "right" results.

What do you folks think? Any ideas for a better way to approach this?

Thanks

Mike






       
____________________________________________________________________________________Ready for the edge of your seat? 
Check out tonight's top picks on Yahoo! TV. 
http://tv.yahoo.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Seeking Advice

Posted by "Michael J. Prichard" <mi...@mac.com>.
Hey Michael,

Are you writing this software for yourself or for reselling?  We built 
an email archiving service and we use lucene as our search engine.  We 
approach this a little differently.

BUT, i don't think it is wasteful to index the header information with 
the attachment.  Just don't store the fields.

-Michael



Michael Bell wrote:
> We are writing a mail archiving program. Each piece of the message (eg each attachment) is stored separately.
>
> I'll try to keep this short and sweet :)
>
> Currently we index the main header fields, like
>
> subject
> sender
> recipients (space delimited)
>
> etc.
>
> This stuff is really only needed once per e-mail
>
> We also index the attachment info:
>
> attachment size (changed to a range like "large", "medium", etc)
> attachment name
> full text index
> etc.
>
> This stuff is needed to be distinct for each attachment in the e-mail
>
> Our current algorithm is wasteful, but I see no better way to do it.
>
> In a loop, for each attachment (and once if we have none), we add all the main header stuff and the attachment stuff, as a separate Document per attachment. This is wasteful, because the main header stuff is needlessly repeated.
>
> Now, it would seem better and more efficient to have one Document for the whole e-mail, storing the main header stuff only once, and storing the Attachment stuff as multiple instances of the same field. Lucene supports this.
>
>
> The problem is then a search on attachment stuff will return cross cartesian results.
>
> Example
>
>  if I have 2 attachments one named A.doc and one B.doc. And A.doc contains the full text "turnip" and B.doc contains the text "dog".
>
>
> Now if the user enters a search requesting email that contains Attachment name A.Doc, and contents dog, the results will be
>
> For the Per-Document storage:
>
> no results found (correct I'd argue)
>
> For the Single Document storage:
>
> 1 result found (because the full text and names of both are stored in the same Document albeit different Field instances)
>
> While tempted by the siren call of the Single Document method, it seems like this would return unexpected results from the users point of view (although one could argue otherwise, since holistically searching the e-mail as a whole it's returning the "right" results.
>
> What do you folks think? Any ideas for a better way to approach this?
>
> Thanks
>
> Mike
>
>
>
>
>
>
>        
> ____________________________________________________________________________________Ready for the edge of your seat? 
> Check out tonight's top picks on Yahoo! TV. 
> http://tv.yahoo.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Seeking Advice

Posted by "Michael J. Prichard" <mi...@mac.com>.
I actually know from experience.  Around 20% +/- 5% of emails will have 
attachments.  If that helps.  Again, I say index as much info as you 
can.  Store what you think it necessary.

Erick Erickson wrote:
> Rather than use efficiency arguments to drive the behavior of the
> app, I'd recommend that you define the expected behavior and
> make that behavior happen as necessary.
>
> What would you estimate is the ratio of meta-data to attachments?
> And what is the ratio of documents that have multiple attachments?
> I actually suspect that the number of e-mails that have multiple
> attachments is small enough that storing the meta-data with each
> document would result in a minuscule size increase,
> but you'll only find that by gathering some statistics <G>....
>
> Erick
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Seeking Advice

Posted by Erick Erickson <er...@gmail.com>.
Rather than use efficiency arguments to drive the behavior of the
app, I'd recommend that you define the expected behavior and
make that behavior happen as necessary.

What would you estimate is the ratio of meta-data to attachments?
And what is the ratio of documents that have multiple attachments?
I actually suspect that the number of e-mails that have multiple
attachments is small enough that storing the meta-data with each
document would result in a minuscule size increase,
but you'll only find that by gathering some statistics <G>....

Erick

On 8/15/07, Michael Bell <mi...@yahoo.com> wrote:
>
> We are writing a mail archiving program. Each piece of the message (eg
> each attachment) is stored separately.
>
> I'll try to keep this short and sweet :)
>
> Currently we index the main header fields, like
>
> subject
> sender
> recipients (space delimited)
>
> etc.
>
> This stuff is really only needed once per e-mail
>
> We also index the attachment info:
>
> attachment size (changed to a range like "large", "medium", etc)
> attachment name
> full text index
> etc.
>
> This stuff is needed to be distinct for each attachment in the e-mail
>
> Our current algorithm is wasteful, but I see no better way to do it.
>
> In a loop, for each attachment (and once if we have none), we add all the
> main header stuff and the attachment stuff, as a separate Document per
> attachment. This is wasteful, because the main header stuff is needlessly
> repeated.
>
> Now, it would seem better and more efficient to have one Document for the
> whole e-mail, storing the main header stuff only once, and storing the
> Attachment stuff as multiple instances of the same field. Lucene supports
> this.
>
>
> The problem is then a search on attachment stuff will return cross
> cartesian results.
>
> Example
>
> if I have 2 attachments one named A.doc and one B.doc. And A.doc contains
> the full text "turnip" and B.doc contains the text "dog".
>
>
> Now if the user enters a search requesting email that contains Attachment
> name A.Doc, and contents dog, the results will be
>
> For the Per-Document storage:
>
> no results found (correct I'd argue)
>
> For the Single Document storage:
>
> 1 result found (because the full text and names of both are stored in the
> same Document albeit different Field instances)
>
> While tempted by the siren call of the Single Document method, it seems
> like this would return unexpected results from the users point of view
> (although one could argue otherwise, since holistically searching the e-mail
> as a whole it's returning the "right" results.
>
> What do you folks think? Any ideas for a better way to approach this?
>
> Thanks
>
> Mike
>
>
>
>
>
>
>
> ____________________________________________________________________________________Ready
> for the edge of your seat?
> Check out tonight's top picks on Yahoo! TV.
> http://tv.yahoo.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>