Posted to java-user@lucene.apache.org by miztaken <ju...@gmail.com> on 2008/07/03 13:31:07 UTC

Store/Index Email Address in Lucene

Hi there,
I want to index email addresses in such a way that I can do wildcard, phrase,
and simple searches on them.

Each document will have a string of email addresses, as in the CC and TO
fields of a mail,
e.g. "abc@abc.com; dcd@cbd.com; john hopkings; anything@anything.com"

What is the best way to store them so that I can run these various types of
search on them?

Do I need to split the string into individual addresses first, then split
each address further, and store the pieces in multiple fields?

What is the best way to deal with such a case?

Your help would be much appreciated.

Thank You
miztaken
-- 
View this message in context: http://www.nabble.com/Store-Index-Email-Address-in-Lucene-tp18257247p18257247.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Store/Index Email Address in Lucene

Posted by miztaken <ju...@gmail.com>.
Hi there,
Thanks for the comment.
So basically it would be clumsy to add a new field for each email address,
wouldn't it?

How about getting the unique tokens from the string of email addresses using
the EmailFilter.java class and storing them in a single field?




Jamie-52 wrote:
> 
> Hi miztaken
> 
> Check out:
> 
> http://openmailarchiva.svn.sourceforge.net/viewvc/openmailarchiva/Server/trunk/src/com/stimulus/archiva/search/EmailFilter.java?view=markup
> 
> I think it's what you want.
> 
> Regards,
> 
> Jamie



Re: Store/Index Email Address in Lucene

Posted by Jamie <ja...@stimulussoft.com>.
Hi miztaken

Check out:

http://openmailarchiva.svn.sourceforge.net/viewvc/openmailarchiva/Server/trunk/src/com/stimulus/archiva/search/EmailFilter.java?view=markup

I think it's what you want.

Regards,

Jamie

-- 
Stimulus Software - MailArchiva
Email Archiving And Compliance





RE: Store/Index Email Address in Lucene

Posted by John Griffin <jg...@thebluezone.net>.
Miz,

The StandardAnalyzer recognizes email addresses as they are. That is, it pays
attention to the '@' symbol. Just store an email address in a field and
search it normally.

This assumes you are going to store the different emails in separate fields.
There is an alternative strategy if you need it: create a string consisting
of all the emails separated by whitespace. Make sure the field is tokenized,
and then you only have to search one field for any of the emails.
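
A minimal sketch of the second strategy, in plain Java with no Lucene calls:
it turns the semicolon-separated string from the original question into a
whitespace-separated one that could then be put into a single tokenized field.
The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class AddressListNormalizer {

    // Splits a TO/CC-style value on ';' and trims each entry,
    // dropping entries that are empty after trimming.
    static List<String> splitEntries(String raw) {
        List<String> entries = new ArrayList<String>();
        for (String part : raw.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    // Joins the entries with single spaces, ready for a tokenized field.
    static String toWhitespaceSeparated(String raw) {
        return String.join(" ", splitEntries(raw));
    }
}
```

One thing to keep in mind: once the entries are joined, nothing in the string
marks where one entry ends and the next begins, so a display name such as
"john hopkings" sits directly next to the neighbouring addresses.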

Your call.

John G.



Re: Store/Index Email Address in Lucene

Posted by miztaken <ju...@gmail.com>.
Hi there,
Sorry for the delay.

>Q. Can there be multiple addresses in a single document?
  A. Yes, there can be multiple addresses in a single field of a single
document.

>Q. Do you add any other data to the document that you mean to query for?
  A. Yes, there can be other fields as well, if that is what you were asking.

>Q. Can you tell us how you tokenized it?
  A. I used the EmailFilter.java class from the following link, as provided
by Jamie-52:
http://openmailarchiva.svn.sourceforge.net/viewvc/openmailarchiva/Server/trunk/src/com/stimulus/archiva/search/EmailFilter.java?view=markup
After tokenizing, I joined the tokens with spaces and indexed the result.

>Q. Why do you have to store the original string?
  A. To display the original string to the user. I have to search and
display as well, and I can't display the tokenized strings.
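
The two-field arrangement described in these answers could look roughly like
the sketch below. It is plain Java, not the EmailFilter class and not Lucene
code; the token rule (whole address, plus the parts before and after the '@')
is an assumption made for illustration, and all names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical holder for the two values kept per address string.
final class AddressFields {
    final String original;   // kept verbatim, for display
    final String searchable; // unique tokens joined by spaces, for indexing

    private AddressFields(String original, String searchable) {
        this.original = original;
        this.searchable = searchable;
    }

    static AddressFields from(String raw) {
        // LinkedHashSet keeps first-seen order and drops repeated tokens.
        Set<String> tokens = new LinkedHashSet<String>();
        for (String entry : raw.split(";")) {
            String normalized = entry.trim().toLowerCase(Locale.ROOT);
            for (String word : normalized.split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                tokens.add(word);
                // Assumed rule: also emit the local part and the domain.
                int at = word.indexOf('@');
                if (at > 0 && at < word.length() - 1) {
                    tokens.add(word.substring(0, at));
                    tokens.add(word.substring(at + 1));
                }
            }
        }
        return new AddressFields(raw, String.join(" ", tokens));
    }
}
```

For "john@wherever.com; jack smith" this keeps the input unchanged in
`original`, while `searchable` carries the extra tokens that make searches on
"john" or "wherever.com" alone possible.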

>I'm sorry, but you still told us very little about what it is you try  
>to achieve with this and nothing about your requirements.
>The only general hints I can give you is to read the wiki pages  
>regarding performance:
http://wiki.apache.org/lucene-java/BasicsOfPerformance

Thanks for the link.


I hope that is clear now.


Thank You
miztaken





Re: Store/Index Email Address in Lucene

Posted by Karl Wettin <ka...@gmail.com>.
On 5 Jul 2008, at 03:29, miztaken wrote:

>
> Hi there,
>
> for email addresses string such as "john@wherever.com; jack smith"

Can there be multiple addresses in a single document?
Do you add any other data to the document that you mean to query for?

> I might do wild card search like john* or jack* or john@* for  
> *wherever.com
> for phrase search i can do "jack smith"
> for general search i might do "john@whereever.com"
>
> I tokenized the string and indexed it into single Field using the  
> Java File,

Can you tell us how you tokenized it?

> as mentioned in my earlier post and did searching.
> Its working fine.
>
> I want to know what can be the pitfalls of doing like this:
> One i know is.. for each string, i should maintain two fields.. one  
> to store
> the original string and one that holds tokenized string.

Why do you have to store the original string?

> What else ?


I'm sorry, but you have still told us very little about what you are trying
to achieve with this, and nothing about your requirements.

The only general hint I can give you is to read the wiki page on
performance: http://wiki.apache.org/lucene-java/BasicsOfPerformance



        karl



Re: Store/Index Email Address in Lucene

Posted by miztaken <ju...@gmail.com>.
Hi there,

For an email address string such as "john@wherever.com; jack smith":

For wildcard search I might use john*, jack*, john@*, or *wherever.com.
For phrase search I might use "jack smith".
For general search I might use "john@wherever.com".

I tokenized the string and indexed it into a single field using the Java file
mentioned in my earlier post, then ran searches.
It's working fine.
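
To make the three kinds of search concrete, here is a small plain-Java model
of what each one asks of the token list. It is only a sketch of the matching
rules, not Lucene's implementation, and the class and method names are
hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of term, wildcard, and phrase matching over tokens.
final class TokenSearchModel {

    // General (term) search: some token equals the term exactly.
    static boolean matchesTerm(List<String> tokens, String term) {
        return tokens.contains(term);
    }

    // Trailing wildcard such as john*: some token starts with the prefix.
    static boolean matchesPrefix(List<String> tokens, String prefix) {
        for (String token : tokens) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Leading wildcard such as *wherever.com: some token ends with the suffix.
    static boolean matchesSuffix(List<String> tokens, String suffix) {
        for (String token : tokens) {
            if (token.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    // Phrase search: the words occur as consecutive tokens, in order.
    static boolean matchesPhrase(List<String> tokens, String... words) {
        return Collections.indexOfSubList(tokens, Arrays.asList(words)) >= 0;
    }
}
```

The point of the model is that wildcard patterns are compared against
individual tokens rather than the whole field value, so which searches work
depends on which tokens the analysis step emits.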

I want to know what the pitfalls of this approach might be.
One I know of: for each string I have to maintain two fields, one to store
the original string and one that holds the tokenized string.

What else?

Thank You
miztaken




Re: Store/Index Email Address in Lucene

Posted by Karl Wettin <ka...@gmail.com>.
Please show us a couple of examples from the "various type of search"  
you want to be able to handle. The information you supply says nothing  
about your use cases.

In what way do you want to use phrase queries on email addresses? Do
you want to tokenize parts of a single email address? Or do you want
to run phrase queries on fields that contain multiple addresses, each
indexed as a single token? Perhaps a combination? Something else?


          karl
