Posted to java-user@lucene.apache.org by Chris Sibert <ch...@comcast.net> on 2003/09/03 20:42:48 UTC

Re: Lucene features

Lucene Users List <lu...@jakarta.apache.org>

> > I am wondering if Lucene is the way to go for my project.
>
>      Probably.  Tell us a little about your project.

It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
in size. They don't ever change, and are on a CD-ROM. Each file contains a
bunch of small documents. I just create one index for all 4 of them. These
documents are for an association that I belong to - they contain a history
of the association's documents - and my application allows you to search
them.

They are actually currently indexed by an application called 'Sonar', by
Virginia Systems. But I REALLY didn't like using their user interface -
blech - so I decided to write a new interface for my own use. But Sonar
costs some real bucks to be able to develop against their search API, so I
found Lucene, and decided to go with it.

Here are the search features that 'Sonar' has :
  Boolean Searching
  Proximity Searching
  Wild Card Searching
  Field/Block Searching
  Relevancy Ranking / Date Ranking
  List of Occurrences in Context

  Phonetic Searching
  Synonyms/Concepts
  Relational Searching
  Associated Words
  Drill Down Search Narrowing

I think that Lucene has all the features in the first group. How does it
stack up against the second group ?

I'm writing the whole thing in Swing, which has been time consuming, and so
have invested quite a bit of time into this project. But I'm seeing the end
of the tunnel, and want to make sure that I'm going down the right path
before I spend too much more time on it.


----- Original Message ----- 
From: "Steven J. Owens" <pu...@darksleep.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, September 03, 2003 1:34 AM
Subject: Re: Lucene features


> On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:
> > I am wondering if Lucene is the way to go for my project.
>
>      Probably.  Tell us a little about your project.
>
> > I don't know what other search engines are available out there,
>
>      Lucene isn't a search engine _application_, it's a search engine
> _API_.  Lucene gives you what you need in order to build the search
> engine you want, instead of spending gobs of time trying to figure out
> the 10,000 options available for a search engine application, or
> trying to warp somebody else's ideas of what you need to meet what you
> really need.
>
> > and how Lucene stacks up against them.
>
>      Pretty well, if you're willing to put a (very) little time and
> energy into building the application you need.  I know.  I've done
> it.
>
> > I am wondering if Lucene has a full set of searching features,
> > comparable to what I might find in a reasonably priced commercial
> > package.
>
>      There is no comparison :-).  Lucene is a fundamentally decent
> piece of technology.  This puts it head and shoulders above most
> commercial packages.
>
>      Specifically, the Lucene search engine API is blindingly fast at
> searching and at indexing, and comes with several built-in packages to
> provide several of the commonly needed functions (like a web search
> engine style query language parser).
>
>      Additionally, a wide variety of people have been down this road
> and done a wide variety of things with Lucene, so you're likely to be
> able to find examples, in the Lucene sandbox or in the lucene-user
> archives, of how to do whatever it is you want to do.
>
> > Anyone with a solid knowledge of Lucene care to make me feel warm
> > and fuzzy about my decision so far to use Lucene ?
>
>      Tell us a little more about your project requirements and I'll
> tell you enough specifics to give you a warm and fuzzy feeling.
> Lucene isn't perfect for _everything_ (and anybody who claims that a
> given technology *is* perfect for _everything_ is lying).  But it's
> quite good for a number of things.
>
> -- 
> Steven J. Owens
> puff@darksleep.com
>
> "I'm going to make broad, sweeping generalizations and strong,
>  declarative statements, because otherwise I'll be here all night and
>  this document will be four times longer and much less fun to read.
>  Take it all with a grain of salt." - Me at http://darksleep.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>

Re: Lucene features

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Sibert wrote:
> I'm not sure what all of the 'advanced features' were, either.
> 
> Phonetic Searching - probably not important to this application.
> 

Phonetic searching may be achieved by writing your own Analyzer, which, 
instead of (or more probably, along with) the plain tokens, provides their 
phonetic codes, e.g. Double Metaphone for English, or the less useful 
but more familiar Soundex. Phonetic searching increases recall but 
lowers precision, especially if you use a stemmer before phonetic encoding...

One trick to consider if using phonetic encoding is to keep a 
histogram of the original words that have been mapped to each 
phonetic code. Then, if a query fails to provide satisfactory results, 
you can offer a useful suggestion: the most frequent term in the 
histogram that has the same phonetic code as the term in the query.
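
A minimal sketch of the histogram trick described above, in plain Java. It does not use any Lucene API: the class name PhoneticHistogram and its methods are hypothetical, and classic Soundex is used only because it is short enough to show in full (Double Metaphone would be the better choice for real data).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: maps words to Soundex codes and
// remembers how often each original word produced each code.
final class PhoneticHistogram {
    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    // phonetic code -> (original word -> number of occurrences)
    private final Map<String, Map<String, Integer>> byCode = new HashMap<>();

    // Classic four-character Soundex code; returns "" if the word has no ASCII letters.
    public static String soundex(String word) {
        String w = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (w.isEmpty()) return "";
        StringBuilder out = new StringBuilder().append(w.charAt(0));
        char prev = CODES.charAt(w.charAt(0) - 'A');
        for (int i = 1; i < w.length() && out.length() < 4; i++) {
            char c = w.charAt(i);
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != prev) out.append(code);
            // H and W do not separate letters with the same code; vowels do.
            if (c != 'H' && c != 'W') prev = code;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Call once per token while indexing.
    public void add(String word) {
        byCode.computeIfAbsent(soundex(word), k -> new HashMap<>())
              .merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum);
    }

    // Most frequent indexed word sharing the query term's code, or null if none.
    public String suggest(String queryTerm) {
        Map<String, Integer> counts = byCode.get(soundex(queryTerm));
        if (counts == null) return null;
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

After adding "Robert" twice and "Rupert" once, a query for the misspelling "Robbert" maps to the same code and would be answered with the suggestion "robert".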

> Synonym searching might be desirable, but now that I'm thinking about it,
> also likely not important.
> 
> Associated Words - sounds very interesting, like 'gold' might return 'metal'
> also, etc.

If you plan on using just English text, you may want to check the 
excellent (and free) WordNet database, which also offers an API - both for 
query expansion and for finding "associated words" (synsets?), or 
hypernyms like in your example.

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

Re: Lucene features

Posted by Otis Gospodnetic <ot...@yahoo.com>.

That would be nice.  Contributions are always welcome.

Otis

--- Chris Sibert <ch...@comcast.net> wrote:
> Thanks for all the replies. I feel reassured with using Lucene. If I end up
> doing anything with the application that I'm writing, I would like to look
> at contributing some documentation of Lucene's features, and what it has to
> offer.
> 
> ----- Original Message ----- 
> From: "Leo Galambos" <Le...@seznam.cz>
> To: "Lucene Users List" <lu...@jakarta.apache.org>
> Sent: Thursday, September 11, 2003 4:57 PM
> Subject: Re: Lucene features
> 
> 
> > Doug Cutting wrote:
> >
> > >
> > > I have some extensions to Lucene that I've not yet committed which make
> > > it possible to easily define synthetic IndexReaders (not currently
> > > supported).  So you could do things that way, once I check these in.
> > > But is this really better than just ANDing the clauses together?  It
> > > would take some big experiments to know, but my guess is that it
> > > doesn't make much difference to compute a "local" IDF for such things.
> >
> >
> > In this case, I think that the operator would be evaluated as "an
> > implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p )/2 )^(1/p)).
> > Obviously, you have to use a filter to filter out false hits (in case
> > of q1->q2, the formula is true when q1 is false, so it is not what you
> > really need), but it is not an issue with the auxiliary index. On the
> > other hand, it is a feeling and it needs a test, you are right.
> >
> > Leo
> >

Re: Lucene features

Posted by Chris Sibert <ch...@comcast.net>.

Thanks for all the replies. I feel reassured with using Lucene. If I end up
doing anything with the application that I'm writing, I would like to look
at contributing some documentation of Lucene's features, and what it has to
offer.

----- Original Message ----- 
From: "Leo Galambos" <Le...@seznam.cz>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, September 11, 2003 4:57 PM
Subject: Re: Lucene features


> Doug Cutting wrote:
>
> >
> > I have some extensions to Lucene that I've not yet committed which make
> > it possible to easily define synthetic IndexReaders (not currently
> > supported).  So you could do things that way, once I check these in.
> > But is this really better than just ANDing the clauses together?  It
> > would take some big experiments to know, but my guess is that it
> > doesn't make much difference to compute a "local" IDF for such things.
>
>
> In this case, I think that the operator would be evaluated as "an
> implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p )/2 )^(1/p)).
> Obviously, you have to use a filter to filter out false hits (in case
> of q1->q2, the formula is true when q1 is false, so it is not what you
> really need), but it is not an issue with the auxiliary index. On the
> other hand, it is a feeling and it needs a test, you are right.
>
> Leo

Re: Lucene features

Posted by Leo Galambos <Le...@seznam.cz>.

Doug Cutting wrote:

>
> I have some extensions to Lucene that I've not yet committed which make 
> it possible to easily define synthetic IndexReaders (not currently 
> supported).  So you could do things that way, once I check these in. 
> But is this really better than just ANDing the clauses together?  It 
> would take some big experiments to know, but my guess is that it 
> doesn't make much difference to compute a "local" IDF for such things.

In this case, I think that the operator would be evaluated as "an 
implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p )/2 )^(1/p)). 
Obviously, you have to use a filter to filter out false hits (in case 
of q1->q2, the formula is true when q1 is false, so it is not what you 
really need), but it is not an issue with the auxiliary index. On the 
other hand, it is a feeling and it needs a test, you are right.
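
As a hedged illustration, the expression quoted above can be evaluated directly. It has the same form as the equal-weight p-norm conjunction of the extended Boolean model. The class and method names below are hypothetical, and q1 and q2 are assumed to be scores already normalized to [0, 1]; this is not how Lucene combines clauses.

```java
// Sketch only: evaluates the expression from the message for two normalized scores.
final class PNormSketch {
    // Computes 1 - (((1-q1)^p + (1-q2)^p) / 2)^(1/p), for q1, q2 in [0,1] and p >= 1.
    public static double combine(double q1, double q2, double p) {
        double mean = (Math.pow(1 - q1, p) + Math.pow(1 - q2, p)) / 2.0;
        return 1.0 - Math.pow(mean, 1.0 / p);
    }

    public static void main(String[] args) {
        // Larger p pulls the result toward the weaker of the two scores.
        for (double p : new double[] {1, 2, 10, 50}) {
            System.out.println("p=" + p + " -> " + combine(0.9, 0.2, p));
        }
    }
}
```

With p = 1 the result is the plain average of the two scores; as p grows it approaches min(q1, q2), i.e. a strict Boolean AND. Note that the expression is symmetric in q1 and q2, so on its own it does not distinguish q1->q2 from q2->q1.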

Leo

Re: Lucene features

Posted by Doug Cutting <cu...@lucene.com>.

Leo Galambos wrote:
> Example: I use this notation: inverted_list_term:{list of W values, "-" 
> denotes W=0, for 12 documents in a collection}
> A:{23[16]------27}
> B:{--[38]--------}
> C:{18[2-]45239812}
> If your first query is B, the subset of documents (denoted by brackets - 
> namely, the 3rd and 4th doc) is selected, and if your second query is "A 
> C", then you cannot use global IDFs, because in the subset, the IDF 
> factors are different. Globally, A is the better discriminator, but in the 
> subset, C is better. This fact is then reflected in the hit list you 
> generate, and I guess the quality will also be affected by this.
> 
> The example shows, that you would rather export the subset to an 
> auxiliary index (RAMDirectory?) and then use this structure instead of 
> the original index. Obviously, it will solve the issue of speed you 
> mentioned.
> 
> Unfortunately, I am not sure, if you can export the inverted lists when 
> you read them. In egothor, I would use a listener in Rider class, in 
> Lucene, I would have to rewrite some classes and it could be a real 
> problem. Maybe, there is a solution I do not see...

I have some extensions to Lucene that I've not yet committed which make 
it possible to easily define synthetic IndexReaders (not currently 
supported).  So you could do things that way, once I check these in. 
But is this really better than just ANDing the clauses together?  It 
would take some big experiments to know, but my guess is that it doesn't 
make much difference to compute a "local" IDF for such things.

Doug

wildcard search

Posted by Gregor Heinrich <Gr...@igd.fhg.de>.

Hi,

when querying with a wildcard, e.g., "project*", QueryParser (v1.3rc2-dev)
seems to treat the * as "at least one character" (RE=/.+/), i.e., the
query finds documents containing "projection" and "projective" (no stemming
involved) but not "project". Does anyone have an explanation?

Not sure if I have stepped across a bug or for some reason misconfigured
Lucene. (The implementation uses MultiFieldQueryParser.)
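
To make the two possible readings of the trailing * concrete, here is a small sketch using java.util.regex only. It does not call QueryParser or any other Lucene class; WildcardSemantics and its zeroAllowed flag are hypothetical names, and the term is assumed to end in a single *.

```java
import java.util.regex.Pattern;

// Sketch only: contrasts "zero or more" (.*) with "one or more" (.+) for a trailing wildcard.
final class WildcardSemantics {
    // wildcardTerm is assumed to end with a single '*', e.g. "project*".
    public static boolean matches(String wildcardTerm, String word, boolean zeroAllowed) {
        String prefix = wildcardTerm.substring(0, wildcardTerm.length() - 1);
        String regex = Pattern.quote(prefix) + (zeroAllowed ? ".*" : ".+");
        return Pattern.matches(regex, word);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"project", "projection", "projective"}) {
            System.out.println(word
                + "  .*: " + matches("project*", word, true)
                + "  .+: " + matches("project*", word, false));
        }
    }
}
```

Under the .+ reading, "project" itself is not matched while "projection" and "projective" are, which is the behaviour reported above.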

Thanks for comments,

Gregor

Re: Lucene features

Posted by Leo Galambos <Le...@seznam.cz>.

Erik Hatcher wrote:

> On Friday, September 5, 2003, at 07:45  PM, Leo Galambos wrote:
>
>>> And for the second time today.... QueryFilter.  It allows narrowing 
>>> the documents queried to only the documents from a previous Query.
>>
>>
>>
>> I guess, it would not be an ideal solution - the first query does two 
>> things a) it selects a subset from the corpus; b) it assigns a 
>> relevance to each document of this subset. Your solution omits the 
>> second point. It implies, the solution will not return "good hit 
>> lists", because you will not consider the information value of the 
>> first query which was given to you by a user.
>
>
> Yes, you're right.  Getting the scores of a second query based on the 
> scores of the first query is probably not trivial, but probably 
> possible with Lucene.  And that combined with a QueryFilter would do 
> the trick I suspect.  Somehow the scores of the first query could be 
> remembered and used as a boost (or other type of factor) on the scores of 
> the second query.

Well, I do not want to be a pessimist, but the boost vector is not a 
good solution due to CWI statistics. On the other hand, it is much 
better than the simple QueryFilter which, in fact, works as 0/1 boost.

Example: I use this notation: inverted_list_term:{list of W values, "-" 
denotes W=0, for 12 documents in a collection}
A:{23[16]------27}
B:{--[38]--------}
C:{18[2-]45239812}
If your first query is B, the subset of documents (denoted by brackets - 
namely, the 3rd and 4th doc) is selected, and if your second query is "A 
C", then you cannot use global IDFs, because in the subset, the IDF 
factors are different. Globally, A is the better discriminator, but in the 
subset, C is better. This fact is then reflected in the hit list you 
generate, and I guess the quality will also be affected by this.
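
The numbers in the example above can be checked with a short sketch. The postings strings are copied from the example with the brackets removed; idf is taken here as plain log(N/df), which is an assumption for illustration and not Lucene's exact formula. LocalIdfDemo is a hypothetical name.

```java
// Sketch only: compares document frequencies over the whole collection and over the subset.
final class LocalIdfDemo {
    // One character per document; '-' means W=0. Documents 3 and 4 are indexes 2 and 3.
    static final String A = "2316------27";
    static final String B = "--38--------";
    static final String C = "182-45239812";

    // Counts documents in [from, to) where the term has a non-zero weight.
    public static int docFreq(String postings, int from, int to) {
        int df = 0;
        for (int i = from; i < to; i++) {
            if (postings.charAt(i) != '-') df++;
        }
        return df;
    }

    // log(N/df) over documents [from, to); df is assumed to be non-zero in that range.
    public static double idf(String postings, int from, int to) {
        return Math.log((double) (to - from) / docFreq(postings, from, to));
    }

    public static void main(String[] args) {
        System.out.println("global: idf(A)=" + idf(A, 0, 12) + " idf(C)=" + idf(C, 0, 12));
        System.out.println("subset: idf(A)=" + idf(A, 2, 4) + " idf(C)=" + idf(C, 2, 4));
    }
}
```

Over all 12 documents, A occurs in 6 and C in 11, so A has the higher idf. Within the two documents selected by B, A occurs in both and C in only one, so the ordering flips.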

The example shows that you would do better to export the subset to an 
auxiliary index (RAMDirectory?) and then use this structure instead of 
the original index. Obviously, that would also solve the speed issue you 
mentioned.

Unfortunately, I am not sure whether you can export the inverted lists as 
you read them. In egothor, I would use a listener in the Rider class; in 
Lucene, I would have to rewrite some classes, and that could be a real 
problem. Maybe there is a solution I do not see...

Your turn ;-)
Cheers,
Leo

>
> Am I off base here?
>
>> Thus I think, Chris would implement something more complex than 
>> QueryFilter. If not, the results will be poorer than with the 
>> commercial packages he may get. He could use a different model where 
>> "AND" is not an associative operator (i.e. some modification of the 
>> extended Boolean model). It implies, he would implement it in 
>> Similarity.java (if I remember that class name correctly).
>
>
> Right... but you'd still need the filtering capability as well, I 
> would think - at least for performance reasons.
>
>     Erik

Re: Lucene features

Posted by Leo Galambos <Le...@seznam.cz>.

Doug Cutting wrote:

> Erik Hatcher wrote:
>
>> Yes, you're right.  Getting the scores of a second query based on the 
>> scores of the first query is probably not trivial, but probably 
>> possible with Lucene.  And that combined with a QueryFilter would do 
>> the trick I suspect.  Somehow the scores of the first query could be 
>> remembered and used as a boost (or other type of factor) on the scores 
>> of the second query.
>
>
> Why not just AND together the first and second query?  That way 
> they're both incorporated in the ranking.  Filters are good when you 
> don't want it to affect the ranking, and also when the first query is 
> a criterion that you'll reuse for many queries (e.g., 
> language=french), since the bit vectors can be cached (as by 
> QueryFilter).


You probably missed the start of our discussion - we are talking about 
this: "q1 -> q2" which means "NOT q1 OR q2", versus "q2 -> q1" which 
means "q1 OR NOT q2". It causes the issue, and it also shows why you 
cannot use the simple "AND", because "q1 AND q2" != "NOT q1 OR q2" != 
"q1 OR NOT q2".

Leo

BTW: I didn't see the logic formulas for many years, so it is without 
any guarantee ;-)
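
The logic formulas above are easy to verify with a four-row truth table. The sketch below is plain Java with hypothetical names and treats q1 and q2 as strict Boolean values, ignoring scores.

```java
// Sketch only: prints AND next to both implications for every combination of q1 and q2.
final class ImplicationTable {
    // a -> b is defined as (NOT a) OR b.
    public static boolean implies(boolean a, boolean b) {
        return !a || b;
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        System.out.println("q1    q2    AND   q1->q2 q2->q1");
        for (boolean q1 : values) {
            for (boolean q2 : values) {
                System.out.printf("%-5b %-5b %-5b %-6b %b%n",
                        q1, q2, q1 && q2, implies(q1, q2), implies(q2, q1));
            }
        }
    }
}
```

The three columns agree only when both q1 and q2 are true. In particular q1->q2 holds whenever q1 is false, which is why a filter is still needed to discard those hits.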

Re: Lucene features

Posted by Doug Cutting <cu...@lucene.com>.

Erik Hatcher wrote:
> Yes, you're right.  Getting the scores of a second query based on the 
> scores of the first query is probably not trivial, but probably possible 
> with Lucene.  And that combined with a QueryFilter would do the trick I 
> suspect.  Somehow the scores of the first query could be remembered and 
> used as a boost (or other type of factor) on the scores of the second query.

Why not just AND together the first and second query?  That way they're 
both incorporated in the ranking.  Filters are good when you don't want 
it to affect the ranking, and also when the first query is a criterion 
that you'll reuse for many queries (e.g., language=french), since the 
bit vectors can be cached (as by QueryFilter).
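
A rough sketch of the caching idea, using java.util.BitSet rather than Lucene's own classes. CachedFilterSketch and its methods are hypothetical; the point is only that a reusable criterion can be stored as one bit per document and intersected with the hits of each later query without touching the ranking.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a cache of named bit vectors, one bit per document id.
final class CachedFilterSketch {
    private final Map<String, BitSet> cache = new HashMap<>();

    // Stores the documents matching a reusable criterion such as language=french.
    public void cacheFilter(String name, BitSet matchingDocs) {
        cache.put(name, (BitSet) matchingDocs.clone());
    }

    // Keeps only the query hits that the named filter allows; scores are not modelled here.
    public BitSet apply(String name, BitSet queryHits) {
        BitSet result = (BitSet) queryHits.clone();
        BitSet filter = cache.get(name);
        if (filter != null) {
            result.and(filter);
        }
        return result;
    }
}
```

If the cached filter holds documents 1, 3 and 5 and a query hits documents 0 to 3, apply returns documents 1 and 3. ANDing the two queries together instead would also change the scores, which is the trade-off described above.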

Doug

Re: Lucene features

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Friday, September 5, 2003, at 07:45  PM, Leo Galambos wrote:
>> And for the second time today.... QueryFilter.  It allows narrowing 
>> the documents queried to only the documents from a previous Query.
>
>
> I guess, it would not be an ideal solution - the first query does two 
> things a) it selects a subset from the corpus; b) it assigns a 
> relevance to each document of this subset. Your solution omits the 
> second point. It implies, the solution will not return "good hit 
> lists", because you will not consider the information value of the 
> first query which was given to you by a user.

Yes, you're right.  Getting the scores of a second query based on the 
scores of the first query is probably not trivial, but probably 
possible with Lucene.  And that combined with a QueryFilter would do 
the trick I suspect.  Somehow the scores of the first query could be 
remembered and used as a boost (or other type of factor) on the scores of 
the second query.
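
One way to picture the "remember the first scores and use them as a boost" idea is sketched below in plain Java. Everything here is an assumption made for illustration: the scores are plain maps from document id to score, the boost is the first score raised to a hypothetical weight w, and none of it reflects how Lucene computes or exposes scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch only: re-ranks the second query's hits using the remembered scores of the first.
final class DrillDownBoost {
    // Keeps documents found by both queries, ordered by second * first^w, best first.
    public static List<Integer> rank(Map<Integer, Double> first,
                                     Map<Integer, Double> second,
                                     double w) {
        List<Integer> docs = new ArrayList<>();
        for (Integer doc : second.keySet()) {
            if (first.containsKey(doc)) {
                docs.add(doc); // filtering step: must be a hit of the first query
            }
        }
        docs.sort(Comparator.comparingDouble(
                (Integer d) -> second.get(d) * Math.pow(first.get(d), w)).reversed());
        return docs;
    }
}
```

With w = 0 this degenerates to a pure filter (the first query only selects documents), and with w = 1 both queries contribute equally. Values in between let the first query influence the order less than the second.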

Am I off base here?

> Thus I think, Chris would implement something more complex than 
> QueryFilter. If not, the results will be poorer than with the 
> commercial packages he may get. He could use a different model where 
> "AND" is not an associative operator (i.e. some modification of the 
> extended Boolean model). It implies, he would implement it in 
> Similarity.java (if I remember that class name correctly).

Right... but you'd still need the filtering capability as well, I would 
think - at least for performance reasons.

	Erik

Re: Lucene features

Posted by Leo Galambos <Le...@seznam.cz>.

>> But Drill Down searching is very desirable. It's where you're able to 
>> search
>> within the results of a previous search. I'm assuming that I'll have to
>> implement that myself, by keeping a copy of the previous Hits list, 
>> and only
>> returning results that are in both lists.
>
>
> And for the second time today.... QueryFilter.  It allows narrowing 
> the documents queried to only the documents from a previous Query.


I guess it would not be an ideal solution - the first query does two 
things: a) it selects a subset from the corpus; b) it assigns a relevance 
to each document of this subset. Your solution omits the second point. 
This implies that the solution will not return "good hit lists", because 
you will not consider the information value of the first query which was 
given to you by the user.

For instance, "neologism" > "George Bush" (1st>2nd query) would return 
a different order of hits than "George Bush" > "neologism". Other 
examples: "Prague Berlin" > "flight" (I must go there, and I prefer an 
airplane) versus "flight" > "Prague Berlin" (I must fly, and I prefer 
Berlin).

Thus I think Chris would have to implement something more complex than 
QueryFilter. If not, the results will be poorer than with the commercial 
packages he may get. He could use a different model where "AND" is not 
an associative operator (i.e. some modification of the extended Boolean 
model). This implies he would implement it in Similarity.java (if I 
remember that class name correctly).

Leo

Re: Lucene features

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Friday, September 5, 2003, at 02:36  PM, Chris Sibert wrote:
> Synonym searching might be desirable, but now that I'm thinking about 
> it,
> also likely not important.

This could be done with a custom Analyzer.

> Associated Words - sounds very interesting, like 'gold' might return 
> 'metal'
> also, etc.

How is that different from Synonym searching?

> But Drill Down searching is very desirable. It's where you're able to 
> search
> within the results of a previous search. I'm assuming that I'll have to
> implement that myself, by keeping a copy of the previous Hits list, 
> and only
> returning results that are in both lists.

And for the second time today.... QueryFilter.  It allows narrowing the 
documents queried to only the documents from a previous Query.

	Erik

Re: Lucene features

Posted by Chris Sibert <ch...@comcast.net>.

I'm not sure what all of the 'advanced features' were, either.

Phonetic Searching - probably not important to this application.

Synonym searching might be desirable, but now that I'm thinking about it,
also likely not important.

Associated Words - sounds very interesting, like 'gold' might return 'metal'
also, etc.

But Drill Down searching is very desirable. It's where you're able to search
within the results of a previous search. I'm assuming that I'll have to
implement that myself, by keeping a copy of the previous Hits list, and only
returning results that are in both lists.

Thanks very much for your reply.

----- Original Message ----- 
From: "Steven J. Owens" <pu...@darksleep.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, September 04, 2003 3:02 AM
Subject: Re: Lucene features


> On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> > Lucene Users List <lu...@jakarta.apache.org>
> > > > I am wondering if Lucene is the way to go for my project.
> > >      Probably.  Tell us a little about your project.
> >
> > It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> > in size. They don't ever change, and are on a CD-ROM. Each file contains a
> > bunch of small documents. I just create one index for all 4 of them. These
> > documents are for an association that I belong to - they contain a history
> > of the association's documents - and my application allows you to search
> > them.
>
>      Well, aside from your concerns about the second list, Lucene
> seems perfect for your needs.  You'd parse apart the four big files
> into a bunch of small documents, then parse those small documents and
> create lucene Documents, containing Fields, and add them to the index.
>
> > They are actually currently indexed by an application called
> > 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> > user interface - blech - so I decided to write a new interface for
> > my own use. But Sonar costs some real bucks to be able to develop
> > against their search API, so I found Lucene, and decided to go with
> > it.
> >
> > Here are the search features that 'Sonar' has :
> >   Boolean Searching
> >   Proximity Searching
> >   Wild Card Searching
> >   Field/Block Searching
>
>      I'm not sure what Field/Block means.  Boolean, Proximity and
> WildCard, are pretty typical in Lucene searches.  You should probably
> take a look at the Query Parser syntax docs:
>
>      http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
>
>
> >   Relevancy Ranking / Date Ranking
>
>      Lucene search results are typically ranked by relevance, and you
> can tweak the search to adjust this (there's a fair bit of discussion
> of this in the lucene-user archives; good keywords to look for are
> "slop" and "boost").
>
>      Sorting output by date might take some finesse.  I haven't played
> with sorting by date, but I'd expect to handle that by directly
> instantiating a QueryTerm to indicate the date issues.
>
> >   List of Occurrences in Context
>
>      I assume here that you mean displaying the results with a little
> snapshot of the text around it.  There have been discussions about how
> best to do this (often focused around highlighting the search terms in
> the displayed text) on the lucene-users list.  Check the list archive.
>
> >   Phonetic Searching
>
>      I'd guess you need to build this one yourself, perhaps by using a
> soundex algorithm when indexing the original data files.
>
> >   Synonyms/Concepts
>
>      Likewise... you'd need to come up with some sort of ontology of
> synonyms and concepts, then parse the fields you're indexing and
> generate a synonym/concept field that you'd add to the lucene
> Document.
>
> >   Relational Searching
> >   Associated Words
> >   Drill Down Search Narrowing
>
>      I'm not sure what these three mean.
>
> > I think that Lucene has all the features in the first group. How does it
> > stack up against the second group ?
>
>      I'm afraid I haven't been too helpful here.  Perhaps if you
> clarify what the above mean, folks can post about how to implement it
> in Lucene.
>
> > I'm writing the whole thing in Swing, which has been time consuming,
> > and so have invested quite a bit of time into this project. But I'm
> > seeing the end of the tunnel, and want to make sure that I'm going
> > down the right path before I spend too much more time on it.
>
>      It sounds like you ought to at least seriously consider using
> Lucene, if you can find or implement equivalent features, or decide
> you can live without them.
>
> -- 
> Steven J. Owens
> puff@darksleep.com
>
> "I'm going to make broad, sweeping generalizations and strong,
>  declarative statements, because otherwise I'll be here all night and
>  this document will be four times longer and much less fun to read.
>  Take it all with a grain of salt." - Me at http://darksleep.com

Re: Lucene features

Posted by "Steven J. Owens" <pu...@darksleep.com>.

On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> Lucene Users List <lu...@jakarta.apache.org>
> > > I am wondering if Lucene is the way to go for my project.
> >      Probably.  Tell us a little about your project.
> 
> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> in size. They don't ever change, and are on a CD-ROM. Each file contains a
> bunch of small documents. I just create one index for all 4 of them. These
> documents are for an association that I belong to - they contain a history
> of the association's documents - and my application allows you to search
> them.

     Well, aside from your concerns about the second list, Lucene
seems perfect for your needs.  You'd parse apart the four big files
into a bunch of small documents, then parse those small documents and
create lucene Documents, containing Fields, and add them to the index.
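
The splitting step might look roughly like the sketch below. The layout of the four files is not described in this thread, so the separator (a line of five "=" characters) and the "first line is the title" rule are pure assumptions, as are the field names. The result is a list of plain field maps that would then be turned into Lucene Documents; no Lucene calls are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: cuts one big text into small documents with named fields.
final class DocumentSplitter {
    // Assumed layout: documents separated by a line "=====", first line of each is its title.
    public static List<Map<String, String>> split(String fileText, String source) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (String chunk : fileText.split("(?m)^=====[ \\t]*$")) {
            String text = chunk.trim();
            if (text.isEmpty()) continue; // skip empty chunks around separators
            int nl = text.indexOf('\n');
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("source", source);
            fields.put("title", nl < 0 ? text : text.substring(0, nl).trim());
            fields.put("body", nl < 0 ? "" : text.substring(nl + 1).trim());
            docs.add(fields);
        }
        return docs;
    }
}
```

For files approaching 100MB it would be kinder to memory to read line by line and emit a document at each separator instead of holding the whole file in one String; the in-memory version is used here only to keep the sketch short.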
 
> They are actually currently indexed by an application called
> 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> user interface - blech - so I decided to write a new interface for
> my own use. But Sonar costs some real bucks to be able to develop
> against their search API, so I found Lucene, and decided to go with
> it.
> 
> Here are the search features that 'Sonar' has :
>   Boolean Searching
>   Proximity Searching
>   Wild Card Searching
>   Field/Block Searching

     I'm not sure what Field/Block means.  Boolean, Proximity and
WildCard, are pretty typical in Lucene searches.  You should probably
take a look at the Query Parser syntax docs:

     http://jakarta.apache.org/lucene/docs/queryparsersyntax.html


>   Relevancy Ranking / Date Ranking

     Lucene search results are typically ranked by relevance, and you
can tweak the search to adjust this (there's a fair bit of discussion
of this in the lucene-user archives; good keywords to look for are
"slop" and "boost").

     Sorting output by date might take some finesse.  I haven't played
with sorting by date, but I'd expect to handle that by directly
instantiating a QueryTerm to indicate the date issues.

>   List of Occurrences in Context

     I assume here that you mean displaying the results with a little
snapshot of the text around it.  There have been discussions about how
best to do this (often focused around highlighting the search terms in
the displayed text) on the lucene-users list.  Check the list archive.
 
>   Phonetic Searching

     I'd guess you need to build this one yourself, perhaps by using a
soundex algorithm when indexing the original data files.

>   Synonyms/Concepts

     Likewise... you'd need to come up with some sort of ontology of
synonyms and concepts, then parse the fields you're indexing and
generate a synonym/concept field that you'd add to the lucene
Document.
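
As a rough illustration of generating such a synonym/concept field, the sketch below looks each token up in a small hand-made thesaurus and concatenates the matches. The thesaurus contents, the tokenization rule and the class name SynonymField are all assumptions; the returned string is what one would store in an extra field next to the original text.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch only: builds the text of an extra "synonyms" field from a tiny thesaurus.
final class SynonymField {
    // Returns the synonyms of every known token in the text, separated by spaces.
    public static String expand(String text, Map<String, List<String>> thesaurus) {
        StringBuilder out = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            for (String synonym : thesaurus.getOrDefault(token, Collections.emptyList())) {
                if (out.length() > 0) out.append(' ');
                out.append(synonym);
            }
        }
        return out.toString();
    }
}
```

With a thesaurus entry mapping "gold" to "metal" and "bullion", the text "Gold prices rose" yields the field value "metal bullion", so a search for "metal" in that field would find the document.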

>   Relational Searching
>   Associated Words
>   Drill Down Search Narrowing

     I'm not sure what these three mean.

> I think that Lucene has all the features in the first group. How does it
> stack up against the second group ?

     I'm afraid I haven't been too helpful here.  Perhaps if you
clarify what the above mean, folks can post about how to implement it
in Lucene.

> I'm writing the whole thing in Swing, which has been time consuming,
> and so have invested quite a bit of time into this project. But I'm
> seeing the end of the tunnel, and want to make sure that I'm going
> down the right path before I spend too much more time on it.

     It sounds like you ought to at least seriously consider using
Lucene, if you can find or implement equivalent features, or decide
you can live without them.

-- 
Steven J. Owens
puff@darksleep.com

"I'm going to make broad, sweeping generalizations and strong,
 declarative statements, because otherwise I'll be here all night and
 this document will be four times longer and much less fun to read.
 Take it all with a grain of salt." - Me at http://darksleep.com