You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Vikas Khengare <Vi...@symantec.com> on 2006/01/17 11:52:00 UTC

Why we use Lucene for Database search like Oracle / Sybase ?

Hi Friends,
 
        I have very basic question that
 
    1] Why we use Lucene for Database search like Oracle / Sybase ?
    2] For that first we have to convert all records one bye one in
string then build lucene document and then index it ?
 
Thanks
 
From,
[ vikas_khengare@symantec.com ]

Re: Why we use Lucene for Database search like Oracle / Sybase ?

Posted by Kan Deng <ka...@yahoo.com>.

1. The conventional database uses B+tree as the
indexing mechanism, while search engine uses
inverted-index. 

   When user needs to update the data frequently, then
B+tree is a better choice. However, for search engine,
the data and index doesn't change too often. 

   Inverted indexes are tables. For example, .tis
index file is a table of columns including "Field
name", "Field value" and "doc freq". However, there
are no B+tree associated with those tables. 

   Inverted index is less space consuming comparing
with B+tree index. Imagine there is a table which has
a ID column. Suppose there are N rows in this table,
the size of B+tree is O(Nlog(N)), but the size of
inverted index is O(N). 

2. Searching:

   Conventional RDBMS searches among the B+tree. But a
search engine "hops" along the inverted index, which
is sorted. Suppose a field value of inverted index is
"0, 3, 4, 6, 20, 29, 39, 60, 202", to search for a
given value say 6, the search engine may start with
"0", then hop with a fixed length, say 4, to "20",
then to "202". If nesssary, it "hops" backforwards but
with shrinked pace.

   A search engine assumes the inverted indexes are
sorted. This is a strong assumption, especially it is
very hard to maintain if the user can update the data
thus index at any time. B+tree doesn't comply to this
strong assumption. 

   When the dataset and index is small, B+tree is
faster than inverted index search. However, with
gigabytes, inverted index search tends to be faster,
because inverted index is smaller in size, thus less
disk IO required. 


3. Compression. 

  Lucene compressed the data in its inverted indexes,
because it assumes the indexes do not change very
frequently. However, B+tree doesn't compress, because
it doesn't assume the same stability of its indexes. 


4. Inverted index doesn't required fix-length of
columns. 

  Conventional RDBMS assumes that every row of a table
must be of the same columns. When one row may have
some extra columns, a workaround is to use "flex".
However, each Document of Lucene may be of different
fields. 

  
5. Inverted index is convenient for alias/synonyms. 

  Since inverted indexes refer to the original dataset
by offset pointers, it is convenient to inject alias
and synonym into the inverted indexes, as long as they
point to the same offset of the original dataset. 

  However, it is not very convenient to do the same
job with conventional database enpowered by B+tree. 

6. Ranking. 

  A search engine's implementation makes it convenient
to rank the search result. But with conventional
database, it is not so convenient. 

7. Search engine doesn't bother SQL language. 

  Usually conventional database suppose SQL language
to make it convenient for user to organize data, set
up index, query, etc. However, SQL's convenience comes
with price, because RDBMS engine has to handle the
compilation and figure out execution plan for most
queries. 

  However, SQL language is not mandatory for RDBMS,
the embedded database like BerkeleyDB (Sleepycat)
doesn't support SQL, therefore, it is faster than
using SQL. 

  Lucene doesn't support SQL-like language. But it is
possible to do so if people like SQL's convenient. 


In summary, for many applications, search engine and
database are competitive solutions. One has to
consider in depth to choose either search engine or
database, and in some cases, the border is blurred. 


Kan



__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Why we use Lucene for Database search like Oracle / Sybase ?

Posted by Chris Lu <ch...@gmail.com>.

1) Because database mainly uses B-tree to index records' some columns.
Lucene or other full text search software uses to analyze the content
of each document and saved to inverted index. "Lucene's index falls
into the family of indexes known as an inverted index. This is because
it can list, for a term, the documents that contain it. This is the
inverse of the natural relationship, in which documents list terms."

Oracle has it's full-text search, ultrasearch. But you need to pay a
lot extra for it.

2) It'll work. But that will lose the unique information for each
column. And you may want to assign different weight to each filed.
I(working for DBSight) recommend you to take a look at
http://www.DBSight.net. It will help you to understand Lucene with
databases better.

Chris Lu
--------------
Full-Text Search on Any Databases
http://www.dbsight.net

On 1/17/06, Vikas Khengare <Vi...@symantec.com> wrote:
> Hi Friends,
>
>         I have very basic question that
>
>     1] Why we use Lucene for Database search like Oracle / Sybase ?
>     2] For that first we have to convert all records one bye one in
> string then build lucene document and then index it ?
>
> Thanks
>
> From,
> [ vikas_khengare@symantec.com ]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org