Posted to java-user@lucene.apache.org by Chris Sibert <ch...@comcast.net> on 2003/09/03 20:42:48 UTC
Re: Lucene features
Lucene Users List <lu...@jakarta.apache.org>
> > I am wondering if Lucene is the way to go for my project.
>
> Probably. Tell us a little about your project.
It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
in size. They don't ever change, and are on a CD-ROM. Each file contains a
bunch of small documents. I just create one index for all 4 of them. These
documents are for an association that I belong to - they contain a history
of the association's documents - and my application allows you to search
them.
They are actually currently indexed by an application called 'Sonar', by
Virginia Systems. But I REALLY didn't like using their user interface -
blech - so I decided to write a new interface for my own use. But Sonar costs
some real bucks to be able to develop against their search API, so I found
Lucene, and decided to go with it.
Here are the search features that 'Sonar' has :
Boolean Searching
Proximity Searching
Wild Card Searching
Field/Block Searching
Relevancy Ranking / Date Ranking
List of Occurrences in Context
Phonetic Searching
Synonyms/Concepts
Relational Searching
Associated Words
Drill Down Search Narrowing
I think that Lucene has all the features in the first group. How does it
stack up against the second group ?
I'm writing the whole thing in Swing, which has been time consuming, and so
have invested quite a bit of time into this project. But I'm seeing the end
of the tunnel, and want to make sure that I'm going down the right path
before I spend too much more time on it.
----- Original Message -----
From: "Steven J. Owens" <pu...@darksleep.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, September 03, 2003 1:34 AM
Subject: Re: Lucene features
> On Wed, Sep 03, 2003 at 12:45:25AM -0400, Chris Sibert wrote:
> > I am wondering if Lucene is the way to go for my project.
>
> Probably. Tell us a little about your project.
>
> > I don't know what other search engines are available out there,
>
> Lucene isn't a search engine _application_, it's a search engine
> _API_. Lucene gives you what you need in order to build the search
> engine you want, instead of spending gobs of time trying to figure out
> the 10,000 options available for a search engine application, or
> trying to warp somebody else's ideas of what you need to meet what you
> really need.
>
> > and how Lucene stacks up against them.
>
> Pretty well, if you're willing to put a (very) little time and
> energy into to building the application you need. I know. I've done
> it.
>
> > I am wondering if Lucene has a full set of searching features,
> > comparable to what I might find in a reasonably priced commercial
> > package.
>
> There is no comparison :-). Lucene is a fundamentally decent
> piece of technology. This puts it head and shoulders above most
> commercial packages.
>
> Specifically, the Lucene search engine API is blindingly fast at
> searching and at indexing, and comes with several built-in packages to
> provide several of the commonly needed functions (like a web search
> engine style query language parser).
>
> Additionally, a wide variety of people have been down this road
> and done a wide variety of things with Lucene, so you're likely to be
> able to find examples, in the Lucene sandbox or in the lucene-user
> archives, of how to do whatever it is you want to do.
>
> > Anyone with a solid knowledge of Lucene care to make me feel warm
> > and fuzzy about my decision so far to use Lucene ?
>
> Tell us a little more about your project requirements and I'll
> tell you enough specifics to give you a warm and fuzzy feeling.
> Lucene isn't perfect for _everything_ (and anybody who claims that a
> given technology *is* perfect for _everything_ is lying). But it's
> quite good for a number of things.
>
> --
> Steven J. Owens
> puff@darksleep.com
>
> "I'm going to make broad, sweeping generalizations and strong,
> declarative statements, because otherwise I'll be here all night and
> this document will be four times longer and much less fun to read.
> Take it all with a grain of salt." - Me at http://darksleep.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
Re: Lucene features
Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Sibert wrote:
> I'm not sure what all of the 'advanced features' were also.
>
> Phonetic Searching - probably not important to this application.
>
Phonetic searching may be achieved by writing your own Analyzer, which
instead of (or more probably, along with) the plain tokens provides their
phonetic codes, e.g. Double Metaphone for English, or the less useful
but more familiar Soundex. Phonetic searching increases recall but
lowers precision, especially if you use a stemmer before phonetic encoding...
One trick to consider if using phonetic encoding is to keep a histogram
of the original words that have been mapped to each phonetic code. Then,
if a query fails to provide satisfactory results, you can offer a useful
suggestion: the most frequent term in the histogram whose phonetic code
equals that of the term in the query.
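A minimal sketch of the histogram trick, in plain Python rather than as a
Lucene Analyzer. It uses standard American Soundex as the phonetic code; the
function names (soundex, build_histogram, suggest) are made up for this
illustration and are not part of Lucene.

```python
from collections import Counter, defaultdict

# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            encoded += code
        # h and w do not separate two consonants with the same digit
        if c not in "hw":
            previous = code
    return (encoded + "000")[:4]


def build_histogram(words):
    """Map each phonetic code to a count of the original words behind it."""
    histogram = defaultdict(Counter)
    for word in words:
        histogram[soundex(word)][word.lower()] += 1
    return histogram


def suggest(histogram, query_word):
    """Most frequent indexed word sharing the query word's code, or None."""
    counts = histogram.get(soundex(query_word))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


histogram = build_histogram(["Robert", "Robert", "Rupert", "Smith"])
print(suggest(histogram, "Robbert"))
```

In a real index the histogram would be filled while the documents are being
tokenized, so it costs little extra at indexing time.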
> Synonym searching might be desirable, but now that I'm thinking about it,
> also likely not important.
>
> Associated Words - sounds very interesting, like 'gold' might return 'metal'
> also, etc.
If you plan on using just English text, you may want to check the
excellent (and free) WordNet database, which also offers an API - both for
query expansion and for finding "associated words" (synsets?), or
hypernyms like in your example.
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
Re: Lucene features
Posted by Otis Gospodnetic <ot...@yahoo.com>.
That would be nice. Contributions are always welcome.
Otis
--- Chris Sibert <ch...@comcast.net> wrote:
> Thanks for all the replies. I feel reassured with using Lucene. If I end up
> doing anything with the application that I'm writing, I would like to look
> at contributing some documentation of Lucene's features, and what it has to
> offer.
>
> ----- Original Message -----
> From: "Leo Galambos" <Le...@seznam.cz>
> To: "Lucene Users List" <lu...@jakarta.apache.org>
> Sent: Thursday, September 11, 2003 4:57 PM
> Subject: Re: Lucene features
>
> > Doug Cutting wrote:
> >
> > > I have some extensions to Lucene that I've not yet committed which make
> > > it possible to easily define synthetic IndexReaders (not currently
> > > supported). So you could do things that way, once I check these in.
> > > But is this really better than just ANDing the clauses together? It
> > > would take some big experiments to know, but my guess is that it
> > > doesn't make much difference to compute a "local" IDF for such things.
> >
> > In this case, I think that the operator would be evaluated as "an
> > implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p )/2 )^(1/p)).
> > Obviously, you have to use a filter to filter out false hits (in case
> > of q1->q2, the formula is true when q1 is false, so it is not what you
> > really need), but it is not an issue with the auxiliary index. On the
> > other hand, it is a feeling and it needs a test, you are right.
> >
> > Leo
Re: Lucene features
Posted by Chris Sibert <ch...@comcast.net>.
Thanks for all the replies. I feel reassured with using Lucene. If I end up
doing anything with the application that I'm writing, I would like to look
at contributing some documentation of Lucene's features, and what it has to
offer.
----- Original Message -----
From: "Leo Galambos" <Le...@seznam.cz>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, September 11, 2003 4:57 PM
Subject: Re: Lucene features
> Doug Cutting wrote:
>
> >
> > I have some extensions to Lucene that I've not yet committed which make
> > it possible to easily define synthetic IndexReaders (not currently
> > supported). So you could do things that way, once I check these in.
> > But is this really better than just ANDing the clauses together? It
> > would take some big experiments to know, but my guess is that it
> > doesn't make much difference to compute a "local" IDF for such things.
>
>
> In this case, I think that the operator would be evaluated as "an
> implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p )/2 )^(1/p)).
> Obviously, you have to use a filter to filter out false hits (in case
> of q1->q2, the formula is true when q1 is false, so it is not what you
> really need), but it is not an issue with the auxiliary index. On the
> other hand, it is a feeling and it needs a test, you are right.
>
> Leo
Re: Lucene features
Posted by Leo Galambos <Le...@seznam.cz>.
Doug Cutting wrote:
>
> I have some extensions to Lucene that I've not yet committed which make
> it possible to easily define synthetic IndexReaders (not currently
> supported). So you could do things that way, once I check these in.
> But is this really better than just ANDing the clauses together? It
> would take some big experiments to know, but my guess is that it
> doesn't make much difference to compute a "local" IDF for such things.
In this case, I think that the operator would be evaluated as "an
implication" and not "AND" (=1-(((1-q1)^p+(1-q2)^p )/2 )^(1/p)).
Obviously, you have to use a filter to filter out false hits (in case
of q1->q2, the formula is true when q1 is false, so it is not what you
really need), but it is not an issue with the auxiliary index. On the
other hand, it is a feeling and it needs a test, you are right.
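To make the formula concrete, here is a small sketch in plain Python. The
AND form is the formula quoted above, with q1 and q2 as term weights in
[0, 1]. The OR form is the usual p-norm counterpart from the extended
Boolean model, and reading "q1 -> q2" as "(NOT q1) OR q2" with NOT x = 1 - x
is an assumption made for this illustration; none of it is Lucene code.

```python
def pnorm_and(q1, q2, p):
    """p-norm AND: 1 - (((1-q1)^p + (1-q2)^p) / 2)^(1/p)."""
    return 1.0 - (((1.0 - q1) ** p + (1.0 - q2) ** p) / 2.0) ** (1.0 / p)


def pnorm_or(q1, q2, p):
    """p-norm OR: ((q1^p + q2^p) / 2)^(1/p)."""
    return ((q1 ** p + q2 ** p) / 2.0) ** (1.0 / p)


def pnorm_implies(q1, q2, p):
    """q1 -> q2 evaluated as (NOT q1) OR q2 (assumed reading)."""
    return pnorm_or(1.0 - q1, q2, p)


# A document matching neither query still gets a high implication score,
# which is why the false hits have to be filtered out afterwards.
print(pnorm_and(0.0, 0.0, 2))
print(pnorm_implies(0.0, 0.0, 2))
```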
Leo
Re: Lucene features
Posted by Doug Cutting <cu...@lucene.com>.
Leo Galambos wrote:
> Example: I use this notation: inverted_list_term:{list of W values, "-"
> denotes W=0, for 12 documents in a collection}
> A:{23[16]------27}
> B:{--[38]--------}
> C:{18[2-]45239812}
> If your first query is B, the subset of documents (denoted by brackets -
> namely, the 3rd and 4th doc) is selected, and if your second query is "A
> C", then you cannot use global IDFs, because in the subset, the IDF
> factors are different. Globally, A is a better discriminator, but in the
> subset, C is better. This fact is then reflected by the hit list you
> generate, and I guess, the quality will be also affected by this.
>
> The example shows, that you would rather export the subset to an
> auxiliary index (RAMDirectory?) and then use this structure instead of
> the original index. Obviously, it will solve the issue of speed you
> mentioned.
>
> Unfortunately, I am not sure, if you can export the inverted lists when
> you read them. In egothor, I would use a listener in Rider class, in
> Lucene, I would have to rewrite some classes and it could be a real
> problem. Maybe, there is a solution I do not see...
I have some extensions to Lucene that I've not yet committed which make
it possible to easily define synthetic IndexReaders (not currently
supported). So you could do things that way, once I check these in.
But is this really better than just ANDing the clauses together? It
would take some big experiments to know, but my guess is that it doesn't
make much difference to compute a "local" IDF for such things.
Doug
wildcard search
Posted by Gregor Heinrich <Gr...@igd.fhg.de>.
Hi,
when querying with a wildcard, e.g., "project*", QueryParser (v1.3rc2-dev)
seems to treat the * as "at least one character" (RE=/.+/), i.e., the
query finds documents containing "projection" and "projective" (no stemming
involved) but not "project". Does anyone have an explanation?
Not sure if I have stepped across a bug or for some reason misconfigured
Lucene. (The implementation uses MultiFieldQueryParser.)
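The difference between the two readings of * can be reproduced outside
Lucene with ordinary regular expressions. This is only a sketch of the
observed behaviour in plain Python, not Lucene's implementation; the helper
names and the term list are made up for the illustration.

```python
import re


def wildcard_to_regex(pattern, star):
    """Translate a wildcard pattern, replacing each * with the regex `star`."""
    return "".join(star if ch == "*" else re.escape(ch) for ch in pattern)


def matching_terms(pattern, terms, star=".*"):
    """Return the terms that the wildcard pattern matches in full."""
    regex = re.compile(wildcard_to_regex(pattern, star))
    return [term for term in terms if regex.fullmatch(term)]


terms = ["project", "projection", "projective", "product"]
# * as "at least one character": "project" itself is missed
print(matching_terms("project*", terms, star=".+"))
# * as "zero or more characters": "project" is found as well
print(matching_terms("project*", terms, star=".*"))
```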
Thanks for comments,
Gregor
Re: Lucene features
Posted by Leo Galambos <Le...@seznam.cz>.
Erik Hatcher wrote:
> On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:
>
>>> And for the second time today.... QueryFilter. It allows narrowing
>>> the documents queried to only the documents from a previous Query.
>>
>>
>>
>> I guess, it would not be an ideal solution - the first query does two
>> things a) it selects a subset from the corpus; b) it assigns a
>> relevance to each document of this subset. Your solution omits the
>> second point. It implies, the solution will not return "good hit
>> lists", because you will not consider the information value of the
>> first query which was given to you by a user.
>
>
> Yes, you're right. Getting the scores of a second query based on the
> scores of the first query is probably not trivial, but probably
> possible with Lucene. And that combined with a QueryFilter would do
> the trick I suspect. Somehow the scores of the first query could be
> remembered and used as a boost (or other type of factor) the scores of
> the second query.
Well, I do not want to be a pessimist, but the boost vector is not a
good solution due to CWI statistics. On the other hand, it is much
better than the simple QueryFilter which, in fact, works as 0/1 boost.
Example: I use this notation: inverted_list_term:{list of W values, "-"
denotes W=0, for 12 documents in a collection}
A:{23[16]------27}
B:{--[38]--------}
C:{18[2-]45239812}
If your first query is B, the subset of documents (denoted by brackets -
namely, the 3rd and 4th doc) is selected, and if your second query is "A
C", then you cannot use global IDFs, because in the subset, the IDF
factors are different. Globally, A is a better discriminator, but in the
subset, C is better. This fact is then reflected in the hit list you
generate, and I guess the quality will also be affected by this.
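The numbers in the example can be checked with a short sketch in plain
Python. It parses the inverted-list notation used above and compares IDF
over the whole collection with IDF over the bracketed subset. The textbook
form log(N / df) is assumed here; Lucene's own Similarity uses a different
variant, and the function names are made up for the illustration.

```python
import math


def parse_postings(text):
    """Parse e.g. '23[16]------27' into (weights, bracketed doc indexes)."""
    weights, bracketed, inside = [], set(), False
    for ch in text:
        if ch == "[":
            inside = True
        elif ch == "]":
            inside = False
        else:
            if inside:
                bracketed.add(len(weights))
            weights.append(0 if ch == "-" else int(ch))
    return weights, bracketed


def idf(weights, docs):
    """log(N / df) over the given documents; infinite if the term is absent."""
    df = sum(1 for doc in docs if weights[doc] > 0)
    return math.log(len(docs) / df) if df else float("inf")


a, subset = parse_postings("23[16]------27")
c, _ = parse_postings("18[2-]45239812")
collection = range(len(a))

# Globally A is the rarer (more discriminating) term ...
print(idf(a, collection), idf(c, collection))
# ... but inside the subset selected by B the roles are swapped.
print(idf(a, subset), idf(c, subset))
```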
The example shows, that you would rather export the subset to an
auxiliary index (RAMDirectory?) and then use this structure instead of
the original index. Obviously, it will solve the issue of speed you
mentioned.
Unfortunately, I am not sure, if you can export the inverted lists when
you read them. In egothor, I would use a listener in Rider class, in
Lucene, I would have to rewrite some classes and it could be a real
problem. Maybe, there is a solution I do not see...
Your turn ;-)
Cheers,
Leo
>
> Am I off base here?
>
>> Thus I think, Chris would implement something more complex than
>> QueryFilter. If not, the results will be poorer than with the
>> commercial packages he may get. He could use a different model where
>> "AND" is not an associative operator (i.e. some modification of the
>> extended Boolean model). It implies, he would implement it in
>> Similarity.java (if I remember that class name correctly).
>
>
> Right... but you'd still need the filtering capability as well, I
> would think - at least for performance reasons.
>
> Erik
Re: Lucene features
Posted by Leo Galambos <Le...@seznam.cz>.
Doug Cutting wrote:
> Erik Hatcher wrote:
>
>> Yes, you're right. Getting the scores of a second query based on the
>> scores of the first query is probably not trivial, but probably
>> possible with Lucene. And that combined with a QueryFilter would do
>> the trick I suspect. Somehow the scores of the first query could be
>> remembered and used as a boost (or other type of factor) the scores
>> of the second query.
>
>
> Why not just AND together the first and second query? That way
> they're both incorporated in the ranking. Filters are good when you
> don't want it to affect the ranking, and also when the first query is
> a criterion that you'll reuse for many queries (e.g.,
> language=french), since the bit vectors can be cached (as by
> QueryFilter).
You probably missed the start of our discussion - we are talking about
this: "q1 -> q2" which means "NOT q1 OR q2", versus "q2 -> q1" which
means "q1 OR NOT q2". It causes the issue, and it also shows why you
cannot use the simple "AND", because "q1 AND q2" != "NOT q1 OR q2" !=
"q1 OR NOT q2".
Leo
BTW: I didn't see the logic formulas for many years, so it is without
any guarantee ;-)
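The formulas are easy to check with a truth table. A small sketch in plain
Python (the function names are chosen for the illustration only):

```python
from itertools import product


def implies(a, b):
    """Material implication: a -> b is (NOT a) OR b."""
    return (not a) or b


def truth_table():
    """Rows of (q1, q2, q1 AND q2, q1 -> q2, q2 -> q1)."""
    rows = []
    for q1, q2 in product([False, True], repeat=2):
        rows.append((q1, q2, q1 and q2, implies(q1, q2), implies(q2, q1)))
    return rows


for row in truth_table():
    print(row)
```

The three columns agree only when q1 and q2 are both true; for every other
combination at least one of them differs from the rest, so the three
operators really are distinct.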
Re: Lucene features
Posted by Doug Cutting <cu...@lucene.com>.
Erik Hatcher wrote:
> Yes, you're right. Getting the scores of a second query based on the
> scores of the first query is probably not trivial, but probably possible
> with Lucene. And that combined with a QueryFilter would do the trick I
> suspect. Somehow the scores of the first query could be remembered and
> used as a boost (or other type of factor) the scores of the second query.
Why not just AND together the first and second query? That way they're
both incorporated in the ranking. Filters are good when you don't want
it to affect the ranking, and also when the first query is a criterion
that you'll reuse for many queries (e.g., language=french), since the
bit vectors can be cached (as by QueryFilter).
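The distinction can be seen in a toy model, sketched here in plain Python
rather than with Lucene classes. Each query result is a dict of doc id to
score. Adding the two scores for AND is an assumption made for the
illustration and is not Lucene's scoring formula; the function names are
likewise made up.

```python
def and_query(first, second):
    """Keep docs matching both queries; both scores feed the ranking."""
    return {doc: first[doc] + second[doc] for doc in first.keys() & second.keys()}


def filtered_query(allowed_docs, second):
    """Keep docs of the second query that pass the filter; scores unchanged."""
    return {doc: score for doc, score in second.items() if doc in allowed_docs}


def ranked(scores):
    """Doc ids ordered by descending score, ties broken by doc id."""
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


first = {1: 0.9, 2: 0.1, 3: 0.5}
second = {1: 0.2, 2: 0.8, 4: 0.7}

# The filter is just a set of doc ids, so it can be built once and reused.
cached_filter = frozenset(first)

print(ranked(and_query(first, second)))
print(ranked(filtered_query(cached_filter, second)))
```

With these made-up scores the two approaches select the same documents but
order them differently, because only the AND version lets the first query
influence the ranking.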
Doug
Re: Lucene features
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Friday, September 5, 2003, at 07:45 PM, Leo Galambos wrote:
>> And for the second time today.... QueryFilter. It allows narrowing
>> the documents queried to only the documents from a previous Query.
>
>
> I guess, it would not be an ideal solution - the first query does two
> things a) it selects a subset from the corpus; b) it assigns a
> relevance to each document of this subset. Your solution omits the
> second point. It implies, the solution will not return "good hit
> lists", because you will not consider the information value of the
> first query which was given to you by a user.
Yes, you're right. Getting the scores of a second query based on the
scores of the first query is probably not trivial, but probably
possible with Lucene. And that combined with a QueryFilter would do
the trick I suspect. Somehow the scores of the first query could be
remembered and used as a boost (or other type of factor) for the scores of
the second query.
Am I off base here?
> Thus I think, Chris would implement something more complex than
> QueryFilter. If not, the results will be poorer than with the
> commercial packages he may get. He could use a different model where
> "AND" is not an associative operator (i.e. some modification of the
> extended Boolean model). It implies, he would implement it in
> Similarity.java (if I remember that class name correctly).
Right... but you'd still need the filtering capability as well, I would
think - at least for performance reasons.
Erik
Re: Lucene features
Posted by Leo Galambos <Le...@seznam.cz>.
>> But Drill Down searching is very desirable. It's where you're able to
>> search
>> within the results of a previous search. I'm assuming that I'll have to
>> implement that myself, by keeping a copy of the previous Hits list,
>> and only
>> returning results that are in both lists.
>
>
> And for the second time today.... QueryFilter. It allows narrowing
> the documents queried to only the documents from a previous Query.
I guess, it would not be an ideal solution - the first query does two
things a) it selects a subset from the corpus; b) it assigns a relevance
to each document of this subset. Your solution omits the second point.
It implies, the solution will not return "good hit lists", because you
will not consider the information value of the first query which was
given to you by a user.
For instance, "neologism" > "George Bush" (1st>2nd query) would return
a different order of hits than "George Bush" > "neologism". Other
examples: "Prague Berlin" > "flight" (I must go there, and I prefer an
airplane) versus "flight" > "Prague Berlin" (I must fly, and I prefer
Berlin).
Thus I think, Chris would implement something more complex than
QueryFilter. If not, the results will be poorer than with the commercial
packages he may get. He could use a different model where "AND" is not
an associative operator (i.e. some modification of the extended Boolean
model). It implies, he would implement it in Similarity.java (if I
remember that class name correctly).
Leo
Re: Lucene features
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Friday, September 5, 2003, at 02:36 PM, Chris Sibert wrote:
> Synonym searching might be desirable, but now that I'm thinking about
> it,
> also likely not important.
This could be done with a custom Analyzer.
> Associated Words - sounds very interesting, like 'gold' might return
> 'metal'
> also, etc.
How is that different from Synonym searching?
> But Drill Down searching is very desirable. It's where you're able to
> search
> within the results of a previous search. I'm assuming that I'll have to
> implement that myself, by keeping a copy of the previous Hits list,
> and only
> returning results that are in both lists.
And for the second time today.... QueryFilter. It allows narrowing the
documents queried to only the documents from a previous Query.
Erik
Re: Lucene features
Posted by Chris Sibert <ch...@comcast.net>.
I'm not sure what all of the 'advanced features' were also.
Phonetic Searching - probably not important to this application.
Synonym searching might be desirable, but now that I'm thinking about it,
also likely not important.
Associated Words - sounds very interesting, like 'gold' might return 'metal'
also, etc.
But Drill Down searching is very desirable. It's where you're able to search
within the results of a previous search. I'm assuming that I'll have to
implement that myself, by keeping a copy of the previous Hits list, and only
returning results that are in both lists.
Thanks very much for your reply.
----- Original Message -----
From: "Steven J. Owens" <pu...@darksleep.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Thursday, September 04, 2003 3:02 AM
Subject: Re: Lucene features
> On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> > Lucene Users List <lu...@jakarta.apache.org>
> > > > I am wondering if Lucene is the way to go for my project.
> > > Probably. Tell us a little about your project.
> >
> > It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> > in size. They don't ever change, and are on a CD-ROM. Each file contains a
> > bunch of small documents. I just create one index for all 4 of them. These
> > documents are for an association that I belong to - they contain a history
> > of the association's documents - and my application allows you to search
> > them.
>
> Well, aside from your concerns about the second list, Lucene
> seems perfect for your needs. You'd parse apart the four big files
> into a bunch of small documents, then parse those small documents and
> create lucene Documents, containing Fields, and add them to the index.
>
> > They are actually currently indexed by an application called
> > 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> > user interface - blech - so I decided to write a new interface for
> > my own use. But Sonar costs some real bucks to be able to develop
> > against their search API, so I found Lucene, and decided to go with
> > it.
> >
> > Here are the search features that 'Sonar' has :
> > Boolean Searching
> > Proximity Searching
> > Wild Card Searching
> > Field/Block Searching
>
> I'm not sure what Field/Block means. Boolean, Proximity and
> WildCard, are pretty typical in Lucene searches. You should probably
> take a look at the Query Parser syntax docs:
>
> http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
>
>
> > Relevancy Ranking / Date Ranking
>
> Lucene search results are typically ranked by relevance, and you
> can tweak the search to adjust this (there's a fair bit of discussion
> of this in the lucene-user archives; good keywords to look for are
> "slop" and "boost").
>
> Sorting output by date might take some finesse. I haven't played
> with sorting by date, but I'd expect to handle that by directly
> instantiating a QueryTerm to indicate the date issues.
>
> > List of Occurrences in Context
>
> I assume here that you mean displaying the results with a little
> snapshot of the text around it. There have been discussions about how
> best to do this (often focused around highlighting the search terms in
> the displayed text) on the lucene-users list. Check the list archive.
>
> > Phonetic Searching
>
> I'd guess you need to build this one yourself, perhaps by using a
> soundex algorithm when indexing the original data files.
>
> > Synonyms/Concepts
>
> Likewise... you'd need to come up with some sort of ontology of
> synonyms and concepts, then parse the fields you're indexing and
> generate a synonym/concept field that you'd add to the lucene
> Document.
>
> > Relational Searching
> > Associated Words
> > Drill Down Search Narrowing
>
> I'm not sure what these three mean.
>
> > I think that Lucene has all the features in the first group. How does it
> > stack up against the second group ?
>
> I'm afraid I haven't been too helpful here. Perhaps if you
> clarify what the above mean, folks can post about how to implement it
> in Lucene.
>
> > I'm writing the whole thing in Swing, which has been time consuming,
> > and so have invested quite a bit of time into this project. But I'm
> > seeing the end of the tunnel, and want to make sure that I'm going
> > down the right path before I spend too much more time on it.
>
> It sounds like you ought to at least seriously consider using
> Lucene, if you can find or implement equivalent features, or decide
> you can live without them.
>
> --
> Steven J. Owens
> puff@darksleep.com
>
> "I'm going to make broad, sweeping generalizations and strong,
> declarative statements, because otherwise I'll be here all night and
> this document will be four times longer and much less fun to read.
> Take it all with a grain of salt." - Me at http://darksleep.com
Re: Lucene features
Posted by "Steven J. Owens" <pu...@darksleep.com>.
On Wed, Sep 03, 2003 at 02:42:48PM -0400, Chris Sibert wrote:
> Lucene Users List <lu...@jakarta.apache.org>
> > > I am wondering if Lucene is the way to go for my project.
> > Probably. Tell us a little about your project.
>
> It's pretty basic. I'm just indexing 4 large text files, ranging up to 100MB
> in size. They don't ever change, and are on a CD-ROM. Each file contains a
> bunch of small documents. I just create one index for all 4 of them. These
> documents are for an association that I belong to - they contain a history
> of the association's documents - and my application allows you to search
> them.
Well, aside from your concerns about the second list, Lucene
seems perfect for your needs. You'd parse apart the four big files
into a bunch of small documents, then parse those small documents and
create lucene Documents, containing Fields, and add them to the index.
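The splitting step might look roughly like this, sketched in plain Python
rather than against the Lucene API. How the small documents are delimited
inside the big files is not stated in this thread, so the "=====" separator
line and the title/body fields are purely hypothetical; each resulting dict
stands in for one Lucene Document with two Fields.

```python
import io


def make_document(lines):
    """Build one small document: first line as title, the rest as body."""
    return {"title": lines[0], "body": "\n".join(lines[1:])}


def split_documents(lines, separator="====="):
    """Yield small documents from an iterable of lines.

    Assumes (hypothetically) that a line equal to `separator` ends a document.
    """
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == separator:
            if current:
                yield make_document(current)
            current = []
        else:
            current.append(line)
    if current:
        yield make_document(current)


sample = io.StringIO("Minutes 1990\nFirst body line\n=====\nMinutes 1991\nSecond body\n")
for document in split_documents(sample):
    print(document["title"])
```

Because the function takes any iterable of lines, a 100MB file can be
streamed through it without being loaded into memory at once.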
> They are actually currently indexed by an application called
> 'Sonar', by Virginia Systems. But I REALLY didn't like using their
> user interface - blech - so I decided to write a new interface for
> my own use. But Sonar costs some real bucks to be able to develop
> against their search API, so I found Lucene, and decided to go with
> it.
>
> Here are the search features that 'Sonar' has :
> Boolean Searching
> Proximity Searching
> Wild Card Searching
> Field/Block Searching
I'm not sure what Field/Block means. Boolean, Proximity and
WildCard, are pretty typical in Lucene searches. You should probably
take a look at the Query Parser syntax docs:
http://jakarta.apache.org/lucene/docs/queryparsersyntax.html
> Relevancy Ranking / Date Ranking
Lucene search results are typically ranked by relevance, and you
can tweak the search to adjust this (there's a fair bit of discussion
of this in the lucene-user archives; good keywords to look for are
"slop" and "boost").
Sorting output by date might take some finesse. I haven't played
with sorting by date, but I'd expect to handle that by directly
instantiating a QueryTerm to indicate the date issues.
> List of Occurrences in Context
I assume here that you mean displaying the results with a little
snapshot of the text around it. There have been discussions about how
best to do this (often focused around highlighting the search terms in
the displayed text) on the lucene-users list. Check the list archive.
> Phonetic Searching
I'd guess you need to build this one yourself, perhaps by using a
soundex algorithm when indexing the original data files.
> Synonyms/Concepts
Likewise... you'd need to come up with some sort of ontology of
synonyms and concepts, then parse the fields you're indexing and
generate a synonym/concept field that you'd add to the lucene
Document.
> Relational Searching
> Associated Words
> Drill Down Search Narrowing
I'm not sure what these three mean.
> I think that Lucene has all the features in the first group. How does it
> stack up against the second group ?
I'm afraid I haven't been too helpful here. Perhaps if you
clarify what the above mean, folks can post about how to implement it
in Lucene.
> I'm writing the whole thing in Swing, which has been time consuming,
> and so have invested quite a bit of time into this project. But I'm
> seeing the end of the tunnel, and want to make sure that I'm going
> down the right path before I spend too much more time on it.
It sounds like you ought to at least seriously consider using
Lucene, if you can find or implement equivalent features, or decide
you can live without them.
--
Steven J. Owens
puff@darksleep.com
"I'm going to make broad, sweeping generalizations and strong,
declarative statements, because otherwise I'll be here all night and
this document will be four times longer and much less fun to read.
Take it all with a grain of salt." - Me at http://darksleep.com