You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by shailesh kumar <sh...@yahoo.com> on 2006/01/11 13:23:25 UTC
BTree
Does Lucene use a BTree kind of structure for storing the index (atleast in the memory) .? or is it just a list. Based on the file format in the index directory ( where in the terms are are lexicographically sorted in one of the files ) I am not sure if BTree is used. ( Because constructing a Btree from a lexicographicaly sorted list will be difficult. )
Shailesh
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
Re: BTree
Posted by Daniel Naber <lu...@danielnaber.de>.
On Donnerstag 12 Januar 2006 05:47, shailesh kumar wrote:
> I had looked at the document you had listed as well as used a Hex
> editor to look at the segment files. .That is how I came to know about
> the lexicographic sorting. But was not sure if BTree is used. If I
> understand correctly a Binary tree (i.e each node only 2 children) or
> a high order Balanced tree (where in a range of values are stored in
> the node and each node can have more than 2 children) is the best way
> to search.
The list of terms is a list on disk. Every n-th (128th I think) entry is
kept in memory as an index to the list on disk. Both lists are sorted to
allow binary searching.
Regards
Daniel
--
http://www.danielnaber.de
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: BTree
Posted by Kan Deng <ka...@yahoo.com>.
Thanks, Yonik.
TermInfosReader is exactly the class I am looking for.
Kan
--- Yonik Seeley <ys...@gmail.com> wrote:
> On 1/12/06, Kan Deng <ka...@yahoo.com> wrote:
> > Many thanks, Doug.
> >
> > A quick question, which class implements the
> following
> > logic?
>
> It looks to me like
> org.apache.lucene.index.TermInfosReader
>
> -Yonik
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail:
> java-user-help@lucene.apache.org
>
>
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: BTree
Posted by Yonik Seeley <ys...@gmail.com>.
On 1/12/06, Kan Deng <ka...@yahoo.com> wrote:
> Many thanks, Doug.
>
> A quick question, which class implements the following
> logic?
It looks to me like org.apache.lucene.index.TermInfosReader
-Yonik
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: BTree
Posted by Kan Deng <ka...@yahoo.com>.
Many thanks, Doug.
A quick question, which class implements the following
logic?
org.apache.lucene.search.IndexSearcher?
> For access, Lucene is equivalent to a B-Tree
> with all but the leaves cached in memory, so
> that accesses require only a single disk access.
thanks,
Kan
--- Doug Cutting <cu...@apache.org> wrote:
> B-Tree's are best for random, incremental updates.
> They require
> log_b(N) disk accesses for inserts, deletes and
> accesses, where b is the
> number of entries per page, and N is the total
> number of entries in the
> tree. But that's too slow for text indexing.
> Rather Lucene uses a
> combination of file sorting and merging to update
> indexes, which is much
> faster than a B-tree would be. For access, Lucene
> is equivalent to a
> B-Tree with all but the leaves cached in memory, so
> that accesses
> require only a single disk access.
>
> Slides 5 and 6 of the following presentation discuss
> this a bit:
>
>
http://www.research.ibm.com/haifa/Workshops/ir2005/papers/DougCutting-Haifa05.pdf
>
> Doug
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail:
> java-user-help@lucene.apache.org
>
>
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: BTree
Posted by Doug Cutting <cu...@apache.org>.
B-Tree's are best for random, incremental updates. They require
log_b(N) disk accesses for inserts, deletes and accesses, where b is the
number of entries per page, and N is the total number of entries in the
tree. But that's too slow for text indexing. Rather Lucene uses a
combination of file sorting and merging to update indexes, which is much
faster than a B-tree would be. For access, Lucene is equivalent to a
B-Tree with all but the leaves cached in memory, so that accesses
require only a single disk access.
Slides 5 and 6 of the following presentation discuss this a bit:
http://www.research.ibm.com/haifa/Workshops/ir2005/papers/DougCutting-Haifa05.pdf
Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: BTree
Posted by Kan Deng <ka...@yahoo.com>.
After reading into the source code, I think Lucene
doeesn't use B+tree or other tree structure for index.
A possible reason is that, since Lucene aims at
handling gigabytes , it has to be cautious about the
index file's size. B+tree may grow rapidly when the
number of leaves grows. Hence, B+tree doesn't scale
well with gigabytes data.
Seems to me Lucene's search is implemented by
"hopping" along the sorted index.
For example, suppose there is a field "first name"
whose values are the first names of all California
residents. Those values are listed sequentially but
sorted. Given a certain query, the search can start
from the first name, if the first name is smaller than
the query, hop to the 10'th one, if still smaller,
continue hop forward, otherwise, hop back. so on.
By this, we don't need B+tree. Of course, there are
tricks to improve the hopping algorithm. However, my
doubt is that the search performance may not so
competitive as B+tree.
Is my understanding correct?
Kan
--- Kan Deng <ka...@yahoo.com> wrote:
>
> I have similar problem about the internal indexing
> data structure
>
> According to Paolo Ferragina of Univ Pisa, B+tree
> with
> cluster is best for sorting. However, referring to
> the
> implementation of
> org.apache.lucene.search.IndexSearch, it looks like
> the impl doesn't take B+tree, never mention cluster.
>
>
> Can anyone explain this issue in more depth?
>
> thanks,
> Kan
>
>
>
> --- shailesh kumar <sh...@yahoo.com> wrote:
>
> > I had looked at the document you had listed as
> > well as used a Hex editor to look at the segment
> > files. .That is how I came to know about the
> > lexicographic sorting. But was not sure if BTree
> is
> > used. If I understand correctly a Binary tree
> (i.e
> > each node only 2 children) or a high order
> > Balanced tree (where in a range of values are
> stored
> > in the node and each node can have more than 2
> > children) is the best way to search.
> > So wantted to know if Lucene implements it that
> > way ( if not in the data storage of the index in
> > the file, atleast in the memory during lookups??)
> >
> > Shailesh
> >
> >
> >
> >
> >
> >
> > ---------------------------------
> > Yahoo! Photos
> > Ring in the New Year with Photo Calendars. Add
> > photos, events, holidays, whatever.
>
>
> __________________________________________________
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam
> protection around
> http://mail.yahoo.com
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail:
> java-user-help@lucene.apache.org
>
>
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: BTree
Posted by Kan Deng <ka...@yahoo.com>.
I have similar problem about the internal indexing
data structure
According to Paolo Ferragina of Univ Pisa, B+tree with
cluster is best for sorting. However, referring to the
implementation of
org.apache.lucene.search.IndexSearch, it looks like
the impl doesn't take B+tree, never mention cluster.
Can anyone explain this issue in more depth?
thanks,
Kan
--- shailesh kumar <sh...@yahoo.com> wrote:
> I had looked at the document you had listed as
> well as used a Hex editor to look at the segment
> files. .That is how I came to know about the
> lexicographic sorting. But was not sure if BTree is
> used. If I understand correctly a Binary tree (i.e
> each node only 2 children) or a high order
> Balanced tree (where in a range of values are stored
> in the node and each node can have more than 2
> children) is the best way to search.
> So wantted to know if Lucene implements it that
> way ( if not in the data storage of the index in
> the file, atleast in the memory during lookups??)
>
> Shailesh
>
>
>
>
>
>
> ---------------------------------
> Yahoo! Photos
> Ring in the New Year with Photo Calendars. Add
> photos, events, holidays, whatever.
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: BTree
Posted by shailesh kumar <sh...@yahoo.com>.
I had looked at the document you had listed as well as used a Hex editor to look at the segment files. .That is how I came to know about the lexicographic sorting. But was not sure if BTree is used. If I understand correctly a Binary tree (i.e each node only 2 children) or a high order Balanced tree (where in a range of values are stored in the node and each node can have more than 2 children) is the best way to search.
So wantted to know if Lucene implements it that way ( if not in the data storage of the index in the file, atleast in the memory during lookups??)
Shailesh
---------------------------------
Yahoo! Photos
Ring in the New Year with Photo Calendars. Add photos, events, holidays, whatever.
Re: BTree
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jan 11, 2006, at 7:23 AM, shailesh kumar wrote:
> Does Lucene use a BTree kind of structure for storing the index
> (atleast in the memory) .? or is it just a list. Based on the file
> format in the index directory ( where in the terms are are
> lexicographically sorted in one of the files ) I am not sure if
> BTree is used. ( Because constructing a Btree from a
> lexicographicaly sorted list will be difficult. )
The index structure is documented here:
http://lucene.apache.org/java/docs/fileformats.html
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org