Posted to java-user@lucene.apache.org by 刘庆志 <li...@gmail.com> on 2010/04/28 08:35:34 UTC

how to design Lucene Document and Field to indexing and searching email message and attachments

Hi all,
Our business system generates data structured like email messages: each message has some attachments, so it is convenient to think of our data as email. I need to index and search both the messages and their attachments. When displaying hits, each hit must show two kinds of links: one link to the message itself, and zero to n links to the attachments that match the query criteria.
One possible design: create one Lucene Document per message, with a field that records its id (call it id) and another field that holds the content of all its attachments (call it attachments). In addition, create one Lucene Document per attachment, with a field that records the id of its parent message (call it messageid). A query then retrieves the messages whose own content or attachment content matches the criteria. To generate the links for the matching attachments, we query again, this time restricted to that message's attachments by adding the condition messageid = the id of the message returned by the first query. This design has two obvious disadvantages:
1. Attachments are indexed twice: once in the message Document and once in their own Document.
2. One user query becomes 1+n queries: 1 for the messages and all their attachments, and n to re-query the attachments of each matching message.
Is there a better solution?
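The two-Document design described above can be sketched roughly as follows. This is only a toy in-memory model in plain Java, not the Lucene API: the class MessageAttachmentSketch, the helper methods, the type and content fields, and the attachment id format are all hypothetical names introduced for illustration. Only the field names id, attachments, and messageid come from the description above. It is meant to make the duplicated attachment text and the 1+n query pattern visible.

```java
import java.util.*;

public class MessageAttachmentSketch {
    // Toy stand-in for an index: each "document" is a map of field name -> text.
    static final List<Map<String, String>> INDEX = new ArrayList<>();

    // Builds a document from alternating field names and values.
    static Map<String, String> doc(String... kv) {
        Map<String, String> d = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) d.put(kv[i], kv[i + 1]);
        return d;
    }

    // Indexes one message document plus one document per attachment.
    static void addMessage(String id, String body, String... attachments) {
        // Disadvantage 1: attachment text is copied into the message document.
        INDEX.add(doc("type", "message", "id", id, "content", body,
                      "attachments", String.join(" ", attachments)));
        for (int i = 0; i < attachments.length; i++) {
            INDEX.add(doc("type", "attachment", "id", id + "-" + i,
                          "messageid", id, "content", attachments[i]));
        }
    }

    // Naive whole-word match on one field (assumption: whitespace tokenizing).
    static boolean matches(Map<String, String> d, String field, String term) {
        String value = d.get(field);
        if (value == null) return false;
        return Arrays.asList(value.toLowerCase().split("\\s+")).contains(term.toLowerCase());
    }

    // First query: messages whose own content or attachments field matches.
    static List<String> searchMessages(String term) {
        List<String> ids = new ArrayList<>();
        for (Map<String, String> d : INDEX) {
            if ("message".equals(d.get("type"))
                    && (matches(d, "content", term) || matches(d, "attachments", term))) {
                ids.add(d.get("id"));
            }
        }
        return ids;
    }

    // Follow-up query: matching attachments restricted to one messageid.
    static List<String> searchAttachments(String messageId, String term) {
        List<String> ids = new ArrayList<>();
        for (Map<String, String> d : INDEX) {
            if ("attachment".equals(d.get("type"))
                    && messageId.equals(d.get("messageid"))
                    && matches(d, "content", term)) {
                ids.add(d.get("id"));
            }
        }
        return ids;
    }

    // Disadvantage 2: one user query turns into 1 + n queries.
    static Map<String, List<String>> search(String term) {
        Map<String, List<String>> links = new LinkedHashMap<>();
        for (String id : searchMessages(term)) {
            links.put(id, searchAttachments(id, term));
        }
        return links;
    }

    public static void main(String[] args) {
        addMessage("m1", "quarterly report attached",
                   "report figures for lucene", "unrelated notes");
        addMessage("m2", "lucene upgrade plan");
        System.out.println(search("lucene")); // message id -> matching attachment ids
    }
}
```

Here m1 matches only through its first attachment, so it gets one attachment link, while m2 matches through its own content and gets none. In a real index the same shape would presumably be expressed with Lucene Documents and a query restricted on messageid.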


dazhi

Thanks for any hints!!!

Re: how to design Lucene Document and Field to indexing and searching email message and attachments

Posted by 刘庆志 <li...@gmail.com>.
Erick:
    Thanks for the information.
    I searched for similar questions and found a very similar thread, "Best Practice: emails and file-attachments", from 15 August 2006 in the mailing list archives (http://mail-archives.apache.org/mod_mbox/lucene-java-user/200608.mbox/browser), but that thread has no final answer.

dazhi



----- Original Message ----- 
From: "Erick Erickson" <er...@gmail.com>
To: <ja...@lucene.apache.org>
Sent: Wednesday, April 28, 2010 11:00 PM
Subject: Re: how to design Lucene Document and Field to indexing and searching email message and attachments


This problem has been discussed several times, although I can't
remember the answer. So I'd recommend searching the mail archive
first.

Lucid maintains a searchable archive, see:
http://www.lucidimagination.com/About-Search

HTH
Erick


Re: how to design Lucene Document and Field to indexing and searching email message and attachments

Posted by Erick Erickson <er...@gmail.com>.
This problem has been discussed several times, although I can't
remember the answer. So I'd recommend searching the mail archive
first.

Lucid maintains a searchable archive, see:
http://www.lucidimagination.com/About-Search

HTH
Erick


Re: how to design Lucene Document and Field to indexing and searching email message and attachments

Posted by 刘庆志 <li...@gmail.com>.
Hoss:
    Thanks for your answer, but what does "References: <t2...@mail.gmail.com>" mean?

    I made the mistake of replying to a message in another thread. I realized it when I viewed the mailing list archives on the web. This is my first time using a mailing list, and I'll be careful about this in the future.


dazhi 
  

----- Original Message ----- 
From: "Chris Hostetter" <ho...@fucit.org>
To: <ja...@lucene.apache.org>
Sent: Thursday, April 29, 2010 3:04 AM
Subject: Re: how to design Lucene Document and Field to indexing and searching email message and attachments


> 
> : References: <t2...@mail.gmail.com>
> :     <i2...@mail.gmail.com>
> :     <x2...@mail.gmail.com>
> :     <2E6A89A648463A4EBF093A9062C16683018293DDFECE@SBMAILBOX1.sb.statsbibliotek
> :     et.dk> <s2...@mail.gmail.com>
> :     <p2...@mail.gmail.com>
> : Subject: how to design Lucene Document and Field to indexing and searching
> :     email message and attachments
> 
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
> 
> When starting a new discussion on a mailing list, please do not reply to 
> an existing message, instead start a fresh email.  Even if you change the 
> subject line of your email, other mail headers still track which thread 
> you replied to and your question is "hidden" in that thread and gets less 
> attention.   It makes following discussions in the mailing list archives 
> particularly difficult.
> See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking
> 
> 
> 
> -Hoss
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

Re: how to design Lucene Document and Field to indexing and searching email message and attachments

Posted by Chris Hostetter <ho...@fucit.org>.
: References: <t2...@mail.gmail.com>
:     <i2...@mail.gmail.com>
:     <x2...@mail.gmail.com>
:     <2E6A89A648463A4EBF093A9062C16683018293DDFECE@SBMAILBOX1.sb.statsbibliotek
:     et.dk> <s2...@mail.gmail.com>
:     <p2...@mail.gmail.com>
: Subject: how to design Lucene Document and Field to indexing and searching
:     email message and attachments

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking



-Hoss

