Posted to java-user@lucene.apache.org by Phil Whelan <ph...@gmail.com> on 2009/07/30 20:11:44 UTC

indexing multiple email addresses in one field

Hi,

We are developing a very large Lucene index that has a
field of email addresses. (Actually multiple fields with multiple
email addresses, but I'll simplify here.)

Each document will have one "email" field containing multiple email addresses.

I am indexing the email addresses using only WhitespaceAnalyzer, so as to
preserve the exact addresses and store multiple emails for one
document.

Example...
doc.add(new Field("email", "foo@bar.com bar@foo.com com@bar.foo",
Field.Store.YES, Field.Index.ANALYZED ));

Terms for this document will then be...
email:foo@bar.com
email:bar@foo.com
email:com@bar.foo

The problem I am having is that these terms are rarely reused in other
documents. There is little overlap in email usage, and there are a
lot of very long email addresses. Because of this, the number of
terms in my index is very big, and I think it is causing performance
issues and bloating the index.

I think I'm not using Lucene optimally here.


A couple of questions...

1) Is there a way I can analyze these emails down to smaller terms but
still search for the exact email address? For instance, if I used a
different analyzer and broke these down to the terms "foo", "bar", and
"com", is Lucene able to find "email:foo@bar.com" without matching
"email:com@foo.bar"?

2) Does Lucene retain the positional information of tokens in the
index? Knowing this will help me answer question 1.
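
To make question 1 concrete, here is a toy sketch of the behaviour I am
hoping for. It is plain Java, not Lucene code and not Lucene's API: the
class EmailPhraseSketch, its methods, and the ADDRESS_GAP value are all
made up for illustration. It assumes the index keeps token positions
(which is exactly what question 2 asks), splits each address on "@" and
".", and leaves a position gap between addresses so that a phrase cannot
match across two neighbouring addresses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy positional index for one field of one document (illustration only).
class EmailPhraseSketch {
    // Hypothetical gap left between addresses so phrases cannot span two of them.
    private static final int ADDRESS_GAP = 10;

    // term -> positions at which the term occurs in the field
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextPosition = 0;

    // Split one address (or one query) into its parts: "foo@bar.com" -> foo, bar, com.
    static List<String> parts(String address) {
        List<String> out = new ArrayList<>();
        for (String part : address.trim().split("[@.]")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    // Index a whitespace-separated list of addresses, recording a position per part.
    void addField(String text) {
        for (String address : text.trim().split("\\s+")) {
            if (address.isEmpty()) {
                continue;
            }
            for (String part : parts(address)) {
                postings.computeIfAbsent(part, k -> new ArrayList<>()).add(nextPosition++);
            }
            nextPosition += ADDRESS_GAP;
        }
    }

    // True if the query's parts occur at consecutive positions, in order.
    boolean matchesPhrase(String query) {
        List<String> terms = parts(query);
        if (terms.isEmpty()) {
            return false;
        }
        List<Integer> starts = postings.get(terms.get(0));
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean allAdjacent = true;
            for (int i = 1; i < terms.size(); i++) {
                List<Integer> positions = postings.get(terms.get(i));
                if (positions == null || !positions.contains(start + i)) {
                    allAdjacent = false;
                    break;
                }
            }
            if (allAdjacent) {
                return true;
            }
        }
        return false;
    }

    // Number of distinct terms stored for the field.
    int distinctTerms() {
        return postings.size();
    }
}
```

With the example field above, this sketch stores only three distinct
terms ("foo", "bar", "com"), matches "foo@bar.com", and does not match
"com@foo.bar". One limitation I can already see: because "@" and "." are
both discarded, "foo@bar.com" and "foo.bar@com" would be
indistinguishable.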

Thanks,
Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: ThreadedIndexWriter vs. IndexWriter

Posted by Phil Whelan <ph...@gmail.com>.

Hi Jibo,

Have you tried optimizing both indexes? I do not know anything about the
implementation of ThreadedIndexWriter, but if they both optimize down
to the same size, it could just mean that the index written by
ThreadedIndexWriter is not as optimized to begin with.

Thanks,
Phil

On Fri, Jul 31, 2009 at 11:38 AM, Jibo John<ji...@mac.com> wrote:
> The number of docs in the index is the same in both cases (200,000).
> I haven't altered the benchmark/ code, but I used a profiler to verify that
> the Benchmark main thread is closed only after all other threads are closed.
>
> Thanks,
> -Jibo


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Michael McCandless <lu...@mikemccandless.com>.

Whoops, sorry for the confusion!

Mike

On Sat, Aug 1, 2009 at 1:03 PM, Phil Whelan<ph...@gmail.com> wrote:
> Hi Mike,
>
> It's Jibo, not me, having the problem. But thanks for the link. I was
> interested to look at the code. Will be buying the book soon.
>
> Phil
>
> On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> (Please note that ThreadedIndexWriter is source code available with
>> the upcoming revision to Lucene in Action.)
>>
>> Phil, is it possible you are using an older version of the book's
>> source code?  In particular, can you check whether your version of
>> ThreadedIndexWriter.java has this:
>>
>>  public void close(boolean doWait) throws CorruptIndexException, IOException {
>>    finish();
>>    super.close(doWait);
>>  }
>>
>> (I vaguely remember that being missing from earlier releases, which
>> could explain what you're seeing).  If you are missing that, can you
>> download the current code from http://www.manning.com/hatcher3 and try
>> again?
>>
>> If that's not the problem... can you post the benchmark alg you are
>> using in each case?
>>
>> Mike
>


Re: ThreadedIndexWriter vs. IndexWriter

Posted by Phil Whelan <ph...@gmail.com>.

Hi Mike,

It's Jibo, not me, having the problem. But thanks for the link. I was
interested to look at the code. Will be buying the book soon.

Phil

On Sat, Aug 1, 2009 at 2:08 AM, Michael McCandless
<lu...@mikemccandless.com> wrote:
>
> (Please note that ThreadedIndexWriter is source code available with
> the upcoming revision to Lucene in Action.)
>
> Phil, is it possible you are using an older version of the book's
> source code?  In particular, can you check whether your version of
> ThreadedIndexWriter.java has this:
>
>  public void close(boolean doWait) throws CorruptIndexException, IOException {
>    finish();
>    super.close(doWait);
>  }
>
> (I vaguely remember that being missing from earlier releases, which
> could explain what you're seeing).  If you are missing that, can you
> download the current code from http://www.manning.com/hatcher3 and try
> again?
>
> If that's not the problem... can you post the benchmark alg you are
> using in each case?
>
> Mike
