You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by shailesh kumar <sh...@yahoo.com> on 2006/01/11 13:23:25 UTC

BTree

Does Lucene use a BTree kind of structure for storing the index (atleast in the memory)  .? or is it just a list. Based on the file format in the index directory ( where in the terms are are lexicographically sorted in one of the files ) I   am not sure if BTree is used. ( Because constructing a Btree from a lexicographicaly sorted list    will be difficult. ) 
 
 Shailesh
 

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Re: BTree

Posted by Daniel Naber <lu...@danielnaber.de>.

On Donnerstag 12 Januar 2006 05:47, shailesh kumar wrote:

> I had   looked at the document you had listed as well as used a  Hex
> editor to look at the segment files. .That is how I came to know about
> the lexicographic sorting. But was not sure if BTree is used.  If I
> understand correctly a Binary tree  (i.e each node only  2 children) or
>  a high order Balanced tree (where in a range of values are stored in
> the node and each node can have more than 2 children)  is the best way
> to search.

The list of terms is a list on disk. Every n-th (128th I think) entry is 
kept in memory as an index to the list on disk. Both lists are sorted to 
allow binary searching.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: BTree

Posted by Kan Deng <ka...@yahoo.com>.

Thanks, Yonik. 

TermInfosReader is exactly the class I am looking for.


Kan


--- Yonik Seeley <ys...@gmail.com> wrote:

> On 1/12/06, Kan Deng <ka...@yahoo.com> wrote:
> > Many thanks, Doug.
> >
> > A quick question, which class implements the
> following
> > logic?
> 
> It looks to me like
> org.apache.lucene.index.TermInfosReader
> 
> -Yonik
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail:
> java-user-help@lucene.apache.org
> 
> 


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: BTree

Posted by Yonik Seeley <ys...@gmail.com>.

On 1/12/06, Kan Deng <ka...@yahoo.com> wrote:
> Many thanks, Doug.
>
> A quick question, which class implements the following
> logic?

It looks to me like org.apache.lucene.index.TermInfosReader

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: BTree

Posted by Kan Deng <ka...@yahoo.com>.

Many thanks, Doug. 

A quick question, which class implements the following
logic? 

  org.apache.lucene.search.IndexSearcher? 

> For access, Lucene is equivalent to a B-Tree 
> with all but the leaves cached in memory, so
> that accesses require only a single disk access.


thanks,
Kan



--- Doug Cutting <cu...@apache.org> wrote:

> B-Tree's are best for random, incremental updates. 
> They require 
> log_b(N) disk accesses for inserts, deletes and
> accesses, where b is the 
> number of entries per page, and N is the total
> number of entries in the 
> tree.  But that's too slow for text indexing. 
> Rather Lucene uses a 
> combination of file sorting and merging to update
> indexes, which is much 
> faster than a B-tree would be.  For access, Lucene
> is equivalent to a 
> B-Tree with all but the leaves cached in memory, so
> that accesses 
> require only a single disk access.
> 
> Slides 5 and 6 of the following presentation discuss
> this a bit:
> 
>
http://www.research.ibm.com/haifa/Workshops/ir2005/papers/DougCutting-Haifa05.pdf
> 
> Doug
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail:
> java-user-help@lucene.apache.org
> 
> 


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: BTree

Posted by Doug Cutting <cu...@apache.org>.

B-Tree's are best for random, incremental updates.  They require 
log_b(N) disk accesses for inserts, deletes and accesses, where b is the 
number of entries per page, and N is the total number of entries in the 
tree.  But that's too slow for text indexing.  Rather Lucene uses a 
combination of file sorting and merging to update indexes, which is much 
faster than a B-tree would be.  For access, Lucene is equivalent to a 
B-Tree with all but the leaves cached in memory, so that accesses 
require only a single disk access.

Slides 5 and 6 of the following presentation discuss this a bit:

http://www.research.ibm.com/haifa/Workshops/ir2005/papers/DougCutting-Haifa05.pdf

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: BTree

Posted by Kan Deng <ka...@yahoo.com>.

After reading into the source code, I think Lucene
doeesn't use B+tree or other tree structure for index.

A possible reason is that, since Lucene aims at
handling gigabytes  , it has to be cautious about the
index file's size. B+tree may grow rapidly when the
number of leaves grows. Hence, B+tree doesn't scale
well with gigabytes data. 

Seems to me Lucene's search is implemented by
"hopping" along the sorted index. 

For example, suppose there is a field "first name"
whose values are the first names of all California
residents. Those values are listed sequentially but
sorted. Given a certain query, the search can start
from the first name, if the first name is smaller than
the query, hop to the 10'th one, if still smaller,
continue hop forward, otherwise, hop back. so on. 

By this, we don't need B+tree. Of course, there are
tricks to improve the hopping algorithm. However, my
doubt is that the search performance may not so
competitive as B+tree. 

Is my understanding correct?

Kan

--- Kan Deng <ka...@yahoo.com> wrote:

> 
> I have similar problem about the internal indexing
> data structure
> 
> According to Paolo Ferragina of Univ Pisa, B+tree
> with
> cluster is best for sorting. However, referring to
> the
> implementation of
> org.apache.lucene.search.IndexSearch, it looks like
> the impl doesn't take B+tree, never mention cluster.
> 
> 
> Can anyone explain this issue in more depth?
> 
> thanks,
> Kan
> 
> 
> 
> --- shailesh kumar <sh...@yahoo.com> wrote:
> 
> > I had   looked at the document you had listed as
> > well as used a  Hex editor to look at the segment
> > files. .That is how I came to know about the
> > lexicographic sorting. But was not sure if BTree
> is
> > used.  If I understand correctly a Binary tree 
> (i.e
> > each node only  2 children) or  a high order
> > Balanced tree (where in a range of values are
> stored
> > in the node and each node can have more than 2
> > children)  is the best way to search.
> >   So wantted to know if Lucene implements it that
> > way  ( if not in the data storage of the index in
> > the file, atleast in the memory during lookups??)
> >   
> >   Shailesh
> >  
> > 
> >  
> > 
> > 
> > 		
> > ---------------------------------
> > Yahoo! Photos
> >  Ring in the New Year with Photo Calendars. Add
> > photos, events, holidays, whatever.
> 
> 
> __________________________________________________
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam
> protection around 
> http://mail.yahoo.com 
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail:
> java-user-help@lucene.apache.org
> 
> 

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: BTree

Posted by Kan Deng <ka...@yahoo.com>.

I have similar problem about the internal indexing
data structure

According to Paolo Ferragina of Univ Pisa, B+tree with
cluster is best for sorting. However, referring to the
implementation of
org.apache.lucene.search.IndexSearch, it looks like
the impl doesn't take B+tree, never mention cluster. 

Can anyone explain this issue in more depth?

thanks,
Kan



--- shailesh kumar <sh...@yahoo.com> wrote:

> I had   looked at the document you had listed as
> well as used a  Hex editor to look at the segment
> files. .That is how I came to know about the
> lexicographic sorting. But was not sure if BTree is
> used.  If I understand correctly a Binary tree  (i.e
> each node only  2 children) or  a high order
> Balanced tree (where in a range of values are stored
> in the node and each node can have more than 2
> children)  is the best way to search.
>   So wantted to know if Lucene implements it that
> way  ( if not in the data storage of the index in
> the file, atleast in the memory during lookups??)
>   
>   Shailesh
>  
> 
>  
> 
> 
> 		
> ---------------------------------
> Yahoo! Photos
>  Ring in the New Year with Photo Calendars. Add
> photos, events, holidays, whatever.


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: BTree

Posted by shailesh kumar <sh...@yahoo.com>.

I had   looked at the document you had listed as well as used a  Hex editor to look at the segment files. .That is how I came to know about the lexicographic sorting. But was not sure if BTree is used.  If I understand correctly a Binary tree  (i.e each node only  2 children) or  a high order Balanced tree (where in a range of values are stored in the node and each node can have more than 2 children)  is the best way to search.
  So wantted to know if Lucene implements it that way  ( if not in the data storage of the index in the file, atleast in the memory during lookups??)
  
  Shailesh
 

 


		
---------------------------------
Yahoo! Photos
 Ring in the New Year with Photo Calendars. Add photos, events, holidays, whatever.

Re: BTree

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jan 11, 2006, at 7:23 AM, shailesh kumar wrote:
> Does Lucene use a BTree kind of structure for storing the index  
> (atleast in the memory)  .? or is it just a list. Based on the file  
> format in the index directory ( where in the terms are are  
> lexicographically sorted in one of the files ) I   am not sure if  
> BTree is used. ( Because constructing a Btree from a  
> lexicographicaly sorted list    will be difficult. )

The index structure is documented here:

	http://lucene.apache.org/java/docs/fileformats.html



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org