You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "Ananth T. Sarathy" <an...@gmail.com> on 2006/04/11 23:19:25 UTC

Lucene Seaches VS. Relational database Queries

H,
 We have made documents out of the rows in our database and one of the team
is suggesting that we abandon some of our database queries and instead use
lucene. I think there are some fundamental problems with this especially
when it comes to association tables (where there is a 1 one to many
relation, we basically take all the values in the associated table and store
them in one field).  One of the problems I see would be with values that
over lap (Example, name where one name is Jason Bateman, and one is Jason
Bateman Black, and it would be hard to replicate the Discrete Search for
Jason Bateman. Also I think there would be some issues with insuring that
all updates, adds and deletes were properly synced with the index, as well
as the possibility of duplicate rows in the index. Can people out there help
with any other pros and cons to this approach?

--
Ananth T Sarathy

Re: Lucene Seaches VS. Relational database Queries

Posted by Paul Elschot <pa...@xs4all.nl>.

On Saturday 15 April 2006 03:36, Jeryl Cook wrote:
> 
> Im the co-worker who suggested to Ananth( I've think we have been debating
> this for 3 days now,from the post it seems he is winning :)... )
> 
> Anway, as Ananth stated I suggested this because I am wondering if  lucene
> could solve a bottle neck query that is taking a deathly long time to
> complete(read-only)....and the orginal design actually generated a threaded
> 60+ queries on the database to return results per userThread who hit our
> website for this view..., I know that this will kill our server when
> user-load increases...i know that lucene is built for speed and can handle a
> very large number of peopel searching(we are using singleton Searcher), and

One way to have more queries per second with a singleton Searcher is
by merging the retrievals of documents for multiple queries.
This will increase query throughput (less disk head movement) but it will
also increase the response time for the individual queries.

> the (threaded)results will be the "hits" returned from lucene.. , also this
> query will NOT be executed by any user in a text field , but rather in our
> application code only when user selects differnt parts of the site...if all
> values in this 1:n relationship we are trying to query in lucene then the
> "application-provided" query will return accurate results.  

To follow 1:n relationships avoid using Hits, use your own HitCollector
instead. From application code, try and use TermDocs from the index
reader.
 
> we are using Quartz, and not creating threads in servlets...
> 
> FINAL SOLUTION MAYBE?:
> if our client EVER gives us a requirement that says we must have accurate
> text-searching even if somthing on our index for  1:  "Jason" and "Jason
> Black" relationship, then we should just simply say we cannot implement this
> because  lucene search will yield inaccurate results correct???????
> 
> comments?

Assuming I understand the problem correctly, one can solve this by
indexing such fields twice: once as keyword to search for the specific
individual, and once with indexed terms to search for name(s). 
In both fields one could use an extra word from a relational db,
for example a client id.

Regards,
Paul Elschot


> View this message in context: 
http://www.nabble.com/Lucene-Seaches-VS.-Relational-database-Queries-t1434583.html#a3925693
> Sent from the Lucene - Java Users forum at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene Seaches VS. Relational database Queries

Posted by Jeryl Cook <tw...@hotmail.com>.

Im the co-worker who suggested to Ananth( I've think we have been debating
this for 3 days now,from the post it seems he is winning :)... )

Anway, as Ananth stated I suggested this because I am wondering if  lucene
could solve a bottle neck query that is taking a deathly long time to
complete(read-only)....and the orginal design actually generated a threaded
60+ queries on the database to return results per userThread who hit our
website for this view..., I know that this will kill our server when
user-load increases...i know that lucene is built for speed and can handle a
very large number of peopel searching(we are using singleton Searcher), and
the (threaded)results will be the "hits" returned from lucene.. , also this
query will NOT be executed by any user in a text field , but rather in our
application code only when user selects differnt parts of the site...if all
values in this 1:n relationship we are trying to query in lucene then the
"application-provided" query will return accurate results.  

we are using Quartz, and not creating threads in servlets...

FINAL SOLUTION MAYBE?:
if our client EVER gives us a requirement that says we must have accurate
text-searching even if somthing on our index for  1:  "Jason" and "Jason
Black" relationship, then we should just simply say we cannot implement this
because  lucene search will yield inaccurate results correct???????

comments?
--
View this message in context: http://www.nabble.com/Lucene-Seaches-VS.-Relational-database-Queries-t1434583.html#a3925693
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene Seaches VS. Relational database Queries

Posted by "Ananth T. Sarathy" <an...@gmail.com>.

Sorry, hit submit in mid email

Ok,
 Some of the stuff makes  some sense. I was a little loopy from lack of
sleep and some of these solutions don't really cover my concerns....

Let's take this movie example. If each member of a production Crew can have
multiple titles that come from a lookup table of Distinct Jobs

Titles
Assistant Producer
Producer
Executive Producer
Director
Director Trainee
Stunt Director

In the Database there would be an Association Table Linking each Crew member
to the titles they had

Crew_Titles
Crew_ID   Title
1             Director
 1             Producer
1             Director
2             Stunt Director
2             Producer
3             Director Trainee
3             Assistant Producer
4             Producer
5             Executive Producer

So when I turn a Crew Member into a java object, each producer has a
collection of Titles. I want one field that is searchable so if I am looking
for "Director" all directors. I take the collection of title and make a
String out of them so for Crew Member 1 the Title String will "Executive
Producer Director". This is so that on an Text Search box people can find
Crew Members by title with a natural language search.

Now, what we want to do is switch to out a database Query to Lucene Search.
Is there anyway to get the Equivalent of
select count(distinct Crew_ID) from Crew_TItles where Title="Producer" which
would give you a result of 3

If you do that as a Lucene Search You will get hits for  Producer,Executive
Producer, and Assistant Producer. You would get 5 hits instead of three.  We
can't make it a key word field since it is a multi value field, and we need
all the values searchable (IE Producer Director should return for a search
of Producer or Director).  Also we can't really add Special Characters the
Certain Titles that are contained in other titles since we want to use it as
a query field for users searching for Crew Members. We use the hits to get
Counts of unique Crew Members, and then Use a Stored Field called Crew_ID to
pull the appropriate Crew Member from the Database.

Also, I assume this wouldn't be a problem if all titles were Unique in a way
that none are contained in other Title (I.E If there is A Title Director,
there would be Director Trainee or Stunt Director).

Also, I assume that I would have to be Responsible to ensure that the Index
Contained one and only one Document per Crew ID, as well as ensuring
Crew_IDs that were not in the Database anymore were removed from the Index.
Now, we have been Using Lucene for over two years and I am aware of most of
the things in regards to creating and maintaining the Index. My Issue is
that I have never seen Lucene as a replacement for a Database, but rather a
supplemental tool that allows for better native language search.   We have
had some performance issues with our Database, and thus the Discussion of
changing it out to use Lucene. I am of the Opinion that the Database and
Lucene are two separate things, and that The Database is always going to be
accurately reflect whatever data we have (whether or not the data is correct
it's still what we would use) and the Lucene Index, no matter what steps you
take to ensure synchronization can be incorrect and there for the most
accurate results will always come from the DB, and we need to continue
trying to performance tune the DB.

Other issues I would see that wouldn't allow for replacing of DB queries
would include multi sort (IE order by Date1, Date2, Name), as well as range
searches, unless the Too ManyOpenClauses has been fixed in 1.9.

Hopefully this explains what we are trying to do here. Please let me know
where I have gone wrong and what else I have missed (pros and con)

Much Thanks,
Ananth

On 4/13/06, Ananth T. Sarathy <an...@gmail.com> wrote:
>
>  Ok,
>  Some of the stuff makes  some sense. I was a little loopy from lack of
> sleep and some of these solutions don't really cover my concerns....
>
>
> Let's take this movie example. If each member of a production Crew can
> have multiple titles that come from a lookup table of Distinct Jobs
>
> Titles
> Assistant Producer
> Producer
> Executive Producer
> Director
> Director Trainee
> Stunt Director
>
> In the Database there would be a Assocation Table Linking each Crew member
> the titles they had
>
> Crew_Titles
> Crew_ID   Title
> 1             Producer
> 1
>
>  On 4/12/06, Nadav Har'El <NY...@il.ibm.com> wrote:
> >
> > Chris Hostetter <ho...@fucit.org> wrote on 12/04/2006 01:41:37
> > AM:
> > > : them in one field).  One of the problems I see would be with values
> > that
> > > : over lap (Example, name where one name is Jason Bateman, and one is
> > Jason
> > > : Bateman Black, and it would be hard to replicate the Discrete Search
> >
> > for
> > >
> > > they way field values are "analyzed" is extremely configurable -- down
> > to
> > > the individual field level.  Which means that while you can have an
> > actor
> > > field where you can do loose text searching for "bateman" and get back
> >
> > > movies staring "Jason Bateman" and "Jason Bateman Black" (and even
> > Guido
> > > Batemans" if you use stemming) you can also have another field using a
> > > KeywordAnalyzer such that a record with teh values "Jason Bateman" and
> >
> > > "Jack Black" will only be matched if hte user searches for "Jason
> > Bateman"
> > > or "Jack Black" ... searching for "Bateman Jack" or "Black Jason" will
> > not
> > > work.
> >
> > Another possible trick is to have one field, but mark its end with
> > special
> > tokens, say "^" and "$", so that "Jason Bateman" gets indexed as four
> > tokens:
> >      ^ Jason Bateman $
> > Then, if you want to search for the name Jason Bateman and that name
> > only,
> > just search for the phrase "^ Jason Bateman $" - and only this entry
> > will
> > match. (you can also continue to search this field normally)
> >
> > If you'll think about this, you'll notice that you don't actually need
> > the beginning-of-field marker ("^") because it's easy to recognize the
> > beginning of a field because the position there is 0. Unfortunately,
> > I don't know how to match position 0 using the standard QueryParser,
> > but you can do it with the SpanFirstQuery: for example if we index
> > Jason Bateman as the three tokens
> >      Jason Bateman $
> > then we can search for it using something like
> >      SpanQuery[] terms = {
> >            new SpanTermQuery(new Term("actor", "Jason")),
> >            new SpanTermQuery(new Term("actor", "Bateman")),
> >            new SpanTermQuery(new Term("actor", "$")) };
> >      new SpanFirstQuery(new SpanNearQuery(terms, 0, true), 3);
> > (or something like that... I didn't test this)
> >
> >
> > --
> > Nadav Har'El
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>
>
> --
>
> Ananth T Sarathy
>

--
Ananth T Sarathy

Re: Lucene Seaches VS. Relational database Queries

Posted by "Ananth T. Sarathy" <an...@gmail.com>.

Ok,
 Some of the stuff makes  some sense. I was a little loopy from lack of
sleep and some of these solutions don't really cover my concerns....


Let's take this movie example. If each member of a production Crew can have
multiple titles that come from a lookup table of Distinct Jobs

Titles
Assistant Producer
Producer
Executive Producer
Director
Director Trainee
Stunt Director

In the Database there would be a Assocation Table Linking each Crew member
the titles they had

Crew_Titles
Crew_ID   Title
1             Producer
1

On 4/12/06, Nadav Har'El <NY...@il.ibm.com> wrote:
>
> Chris Hostetter <ho...@fucit.org> wrote on 12/04/2006 01:41:37
> AM:
> > : them in one field).  One of the problems I see would be with values
> that
> > : over lap (Example, name where one name is Jason Bateman, and one is
> Jason
> > : Bateman Black, and it would be hard to replicate the Discrete Search
> for
> >
> > they way field values are "analyzed" is extremely configurable -- down
> to
> > the individual field level.  Which means that while you can have an
> actor
> > field where you can do loose text searching for "bateman" and get back
> > movies staring "Jason Bateman" and "Jason Bateman Black" (and even Guido
> > Batemans" if you use stemming) you can also have another field using a
> > KeywordAnalyzer such that a record with teh values "Jason Bateman" and
> > "Jack Black" will only be matched if hte user searches for "Jason
> Bateman"
> > or "Jack Black" ... searching for "Bateman Jack" or "Black Jason" will
> not
> > work.
>
> Another possible trick is to have one field, but mark its end with special
> tokens, say "^" and "$", so that "Jason Bateman" gets indexed as four
> tokens:
>      ^ Jason Bateman $
> Then, if you want to search for the name Jason Bateman and that name only,
> just search for the phrase "^ Jason Bateman $" - and only this entry will
> match. (you can also continue to search this field normally)
>
> If you'll think about this, you'll notice that you don't actually need
> the beginning-of-field marker ("^") because it's easy to recognize the
> beginning of a field because the position there is 0. Unfortunately,
> I don't know how to match position 0 using the standard QueryParser,
> but you can do it with the SpanFirstQuery: for example if we index
> Jason Bateman as the three tokens
>      Jason Bateman $
> then we can search for it using something like
>      SpanQuery[] terms = {
>            new SpanTermQuery(new Term("actor", "Jason")),
>            new SpanTermQuery(new Term("actor", "Bateman")),
>            new SpanTermQuery(new Term("actor", "$")) };
>      new SpanFirstQuery(new SpanNearQuery(terms, 0, true), 3);
> (or something like that... I didn't test this)
>
>
> --
> Nadav Har'El
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


--
Ananth T Sarathy

Re: Lucene Seaches VS. Relational database Queries

Posted by Nadav Har'El <NY...@il.ibm.com>.

Chris Hostetter <ho...@fucit.org> wrote on 12/04/2006 01:41:37 AM:
> : them in one field).  One of the problems I see would be with values
that
> : over lap (Example, name where one name is Jason Bateman, and one is
Jason
> : Bateman Black, and it would be hard to replicate the Discrete Search
for
>
> they way field values are "analyzed" is extremely configurable -- down to
> the individual field level.  Which means that while you can have an actor
> field where you can do loose text searching for "bateman" and get back
> movies staring "Jason Bateman" and "Jason Bateman Black" (and even Guido
> Batemans" if you use stemming) you can also have another field using a
> KeywordAnalyzer such that a record with teh values "Jason Bateman" and
> "Jack Black" will only be matched if hte user searches for "Jason
Bateman"
> or "Jack Black" ... searching for "Bateman Jack" or "Black Jason" will
not
> work.

Another possible trick is to have one field, but mark its end with special
tokens, say "^" and "$", so that "Jason Bateman" gets indexed as four
tokens:
      ^ Jason Bateman $
Then, if you want to search for the name Jason Bateman and that name only,
just search for the phrase "^ Jason Bateman $" - and only this entry will
match. (you can also continue to search this field normally)

If you'll think about this, you'll notice that you don't actually need
the beginning-of-field marker ("^") because it's easy to recognize the
beginning of a field because the position there is 0. Unfortunately,
I don't know how to match position 0 using the standard QueryParser,
but you can do it with the SpanFirstQuery: for example if we index
Jason Bateman as the three tokens
      Jason Bateman $
then we can search for it using something like
      SpanQuery[] terms = {
            new SpanTermQuery(new Term("actor", "Jason")),
            new SpanTermQuery(new Term("actor", "Bateman")),
            new SpanTermQuery(new Term("actor", "$")) };
      new SpanFirstQuery(new SpanNearQuery(terms, 0, true), 3);
(or something like that... I didn't test this)


--
Nadav Har'El


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene Seaches VS. Relational database Queries

Posted by Chris Hostetter <ho...@fucit.org>.

1) An inverted full text index is not a replacment for a relational
database.

2) many people think they need a relational database, when all they really
need is a well designed full text index.

To get to some of your specific questions...

: them in one field).  One of the problems I see would be with values that
: over lap (Example, name where one name is Jason Bateman, and one is Jason
: Bateman Black, and it would be hard to replicate the Discrete Search for

they way field values are "analyzed" is extremely configurable -- down to
the individual field level.  Which means that while you can have an actor
field where you can do loose text searching for "bateman" and get back
movies staring "Jason Bateman" and "Jason Bateman Black" (and even Guido
Batemans" if you use stemming) you can also have another field using a
KeywordAnalyzer such that a record with teh values "Jason Bateman" and
"Jack Black" will only be matched if hte user searches for "Jason Bateman"
or "Jack Black" ... searching for "Bateman Jack" or "Black Jason" will not
work.

furthermore, as you learn more about lucene you'll find things like
positionIncrimentGap, PhraseQueries and "slop" which will show you how you
can support queries like "Jason Black" matching movies staring "Jason
Bateman Black" or "Jason Black" but not movies staring both "Jason
Bateman" and "Jack Black" (unless you want them to)

: Jason Bateman. Also I think there would be some issues with insuring that
: all updates, adds and deletes were properly synced with the index, as well
: as the possibility of duplicate rows in the index. Can people out there help
: with any other pros and cons to this approach?

concern about replication is a perfectly valid one ... but it's an issue
that also plauges RDBMs ... i've yet to see a database that could garuntee
100% that every slave would surface newly repliated data from the master
at the *exact* same time.

I would suggest you take a look at the Solr project...

	http://incubator.apache.org/solr/

...it's very easy to get the demo up and running and to play with things
like positionIncrimentGap and various analyzers (all without ever writting
any lines of code) and Solr has scripts to make master=>slave replication
easy on machines that support rsync and hard links (ie: Unix/Linux)


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org