Posted to java-user@lucene.apache.org by ga...@gsk.com on 2003/03/27 21:32:07 UTC

Very large queries?

Caveat:  I have not installed Lucene or begun to experiment with it
yet.  I have scanned the FAQ, but don't see anything that addresses this
question.  Pardon the somewhat slow buildup to the question below, but I
want to set the context.

I am developing an application for 'text mining' adverse event reports in
the pharmaceutical industry.  The querying will be driven by
'dictionaries', 'thesauri',  'taxonomies' or 'ontologies' (pick your
favorite) of drug names, compounds, and medical conditions.  These thesauri
are quite large.  For example, our drug name thesaurus is on the order of
60,000+ terms.

I was planning on using Verity to accomplish my first approach to shallow
text mining since Verity is our corporate-wide search engine technology and
it supports a number of relevant features (including 'topic sets' for
representing the taxonomies).  However, Verity imposes restrictions on the
size of topic sets that currently prohibit me from using it with our large
taxonomies.  It is not obvious that they will be able to fix this problem
in the timeframe I need.  Thus I am turning to other alternatives, and
Lucene appears to be one.

So given that context, my question is this:  Does anyone on this list have
experience attempting to use very large queries (potentially thousands or
tens of thousands of terms) in Lucene?  Does anyone have any knowledge of
design or implementation details that would inhibit the use of such
queries?  Does anyone have any idea of what the performance would be like
in retrieving via such queries?

--------------------------------------
Gary H. Merrill
Director and Principal Scientist, New Applications
Data Exploration Sciences
GlaxoSmithKline Inc.
(919) 483-8456




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: Very large queries?

Posted by Serge Knystautas <se...@lokitech.com>.

Gary,

I don't fully understand how you use your drug thesauri, but my approach
would be to build your thesauri into an Analyzer.  This would allow you,
during analysis, to coerce the various terms to single meanings, somewhat
akin to how a stemmer works.
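
A minimal sketch of that idea in plain Java, for illustration only.  It
does not use the Lucene Analyzer API; the class ThesaurusNormalizer, its
methods, and the sample drug names are all hypothetical.  It only shows
the mapping step an Analyzer built on a thesaurus would presumably
perform: every known variant of a term is replaced by one canonical form,
much as a stemmer replaces inflected forms by a stem.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: maps thesaurus variants to one canonical term.
class ThesaurusNormalizer {
    // variant (lowercased) -> canonical term (lowercased)
    private final Map<String, String> canonicalByVariant = new HashMap<>();

    // Register a canonical term together with its variants.
    void addEntry(String canonical, String... variants) {
        String canon = canonical.toLowerCase(Locale.ROOT);
        canonicalByVariant.put(canon, canon);
        for (String variant : variants) {
            canonicalByVariant.put(variant.toLowerCase(Locale.ROOT), canon);
        }
    }

    // Lowercase the text, split it on anything that is not a letter or digit,
    // and replace each token that is a known variant by its canonical term.
    List<String> normalize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a delimiter
            }
            tokens.add(canonicalByVariant.getOrDefault(token, token));
        }
        return tokens;
    }

    public static void main(String[] args) {
        ThesaurusNormalizer normalizer = new ThesaurusNormalizer();
        // Illustrative sample entry only.
        normalizer.addEntry("acetaminophen", "paracetamol", "tylenol");
        System.out.println(normalizer.normalize("Patient took Tylenol and paracetamol."));
        // prints [patient, took, acetaminophen, and, acetaminophen]
    }
}
```

This sketch handles single-word terms only; multi-word drug names would
need phrase matching on top of it.  Applying the same mapping to the
query text would keep index and query terms consistent.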

As for size, we're currently using Lucene to index about 100 megs of 
data, and lookup performance is blindingly fast.  Indexing takes a while, but
that's as much because of how we calculate the 20+ fields we're indexing 
on for each Document.

Can you give more specifics on the type of data you'd index, the query 
you'd want to run, and the desired result of the query?

-- 
Serge Knystautas
President
Lokitech >> software . strategy . design >> http://www.lokitech.com
p. 301.656.5501
e. sergek@lokitech.com

Re: Very large queries?

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.

On Thu, 27 Mar 2003 gary.h.merrill@gsk.com wrote:

> Caveat:  I have not installed Lucene or begun to experiment with it
> yet.  I have scanned the FAQ, but don't see anything that addresses this
> question.  Pardon the somewhat slow buildup to the question below, but I
> want to set the context.
>
> I am developing an application for 'text mining' adverse event reports in
> the pharmaceutical industry.  The querying will be driven by
> 'dictionaries', 'thesauri',  'taxonomies' or 'ontologies' (pick your
> favorite) of drug names, compounds, and medical conditions.  These thesauri
> are quite large.  For example, our drug name thesaurus is on the order of
> 60,000+ terms.

These terms are not equivalent, so it's not clear exactly what you mean
here.

> I was planning on using Verity to accomplish my first approach to shallow
> text mining since Verity is our corporate-wide search engine technology and
> it supports a number of relevant features (including 'topic sets' for
> representing the taxonomies).  However, Verity imposes restrictions on the
> size of topic sets that currently prohibit me from using it with our large
> taxonomies.  It is not obvious that they will be able to fix this problem
> in the timeframe I need.  Thus I am turning to other alternatives, and
> Lucene appears to be one.
>
> So given that context, my question is this:  Does anyone on this list have
> experience attempting to use very large queries (potentially thousands or
> tens of thousands of terms) in Lucene?  Does anyone have any knowledge of
> design or implementation details that would inhibit the use of such
> queries?  Does anyone have any idea of what the performance would be like
> in retrieving via such queries?

I do not have experience with such queries, so I can't speak to that
question directly.  However, I don't understand what the purpose of such a
query would be in the first place.  What are the documents that you are
indexing, and what information need are you trying to address?

Regards,

Joshua O'Madadhain

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.


RE: Very large queries?

Posted by Alex Murzaku <li...@lissus.com>.

Is this some kind of topic categorization in which you "query" the
taxonomy with documents? If yes, I have done something similar with
e-mail categorization and it worked fine (even though it sounds like
abuse). Also, you could decrease the number of query terms by submitting
only the list of unique terms and corresponding frequencies as weights.
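
A rough sketch of that last suggestion in plain Java, under some
assumptions: the class WeightedQueryBuilder and its methods are
hypothetical (not Lucene classes), the input is tokenized naively, and
the output uses the query parser's term^boost notation for the weights.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Lucene: collapses repeated terms into weighted unique terms.
class WeightedQueryBuilder {
    // Count how often each term occurs, keeping first-seen order.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by a leading delimiter
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Emit each unique term once; a term seen more than once gets its count as a boost.
    static String toQueryString(Map<String, Integer> counts) {
        StringJoiner query = new StringJoiner(" ");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            int frequency = entry.getValue();
            query.add(frequency > 1 ? entry.getKey() + "^" + frequency : entry.getKey());
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String text = "rash nausea rash headache rash nausea";
        System.out.println(toQueryString(termFrequencies(text)));
        // prints rash^3 nausea^2 headache
    }
}
```

Six input terms become three query terms here.  Because the tokens are
lowercased and stripped of punctuation, no query-syntax escaping is
needed in this sketch; real input may need it.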
