Posted to java-user@lucene.apache.org by Prasenjit Mukherjee <pr...@aol.com> on 2006/03/29 06:57:55 UTC

Data structure of a Lucene Index

It seems to me that Lucene doesn't use a B-tree for its index storage.
Is there any paper/article that explains the theory behind the data
structure of a single index (segment)?  I am not referring to the merge
algorithm; I am curious about the storage structure of a single
optimized Lucene index.

Any pointer is greatly appreciated.
--Prasen

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Data structure of a Lucene Index

Posted by Prasenjit Mukherjee <pr...@aol.com>.
I think Doug's paper (specifically the "Seek versus Transfer" section)
is the closest I could get. A slightly more detailed explanation can be
found in Yates' book on Information Retrieval.  I agree with Dmitry: a
detailed explanation (or even pointers to some existing article) would
be beneficial to all of us.

--prasen

------------------------------------------------------------


I talked about this a bit in a presentation at Haifa last year:

http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf

See the section on "Seek versus Transfer".

Doug


Dmitry Goldenberg wrote:

>Ideally, I'd love to see an article explaining both in detail: the index structure as well as the merge algorithm...
>
>________________________________
>
>From: Prasenjit Mukherjee [mailto:prasenjitm@aol.com]
>Sent: Tue 3/28/2006 11:57 PM
>To: java-user@lucene.apache.org
>Subject: Data structure of a Lucene Index
>
>
>
>It seems to me that Lucene doesn't use a B-tree for its index storage.
>Is there any paper/article that explains the theory behind the data
>structure of a single index (segment)?  I am not referring to the merge
>algorithm; I am curious about the storage structure of a single
>optimized Lucene index.
>
>Any pointer is greatly appreciated.
>--Prasen


RE: Data structure of a Lucene Index

Posted by Dmitry Goldenberg <dm...@weblayers.com>.
Ideally, I'd love to see an article explaining both in detail: the index structure as well as the merge algorithm...

________________________________

From: Prasenjit Mukherjee [mailto:prasenjitm@aol.com]
Sent: Tue 3/28/2006 11:57 PM
To: java-user@lucene.apache.org
Subject: Data structure of a Lucene Index



It seems to me that Lucene doesn't use a B-tree for its index storage.
Is there any paper/article that explains the theory behind the data
structure of a single index (segment)?  I am not referring to the merge
algorithm; I am curious about the storage structure of a single
optimized Lucene index.

Any pointer is greatly appreciated.
--Prasen



Re: Data structure of a Lucene Index

Posted by Prasenjit Mukherjee <pr...@aol.com>.
I have already gone through the file formats. What I was looking for is
the underlying theory behind the chosen file formats. I am sure those
file formats were decided based on some theoretical principles.

--prasen

erik@ehatchersolutions.com wrote:

>
> On Mar 28, 2006, at 11:57 PM, Prasenjit Mukherjee wrote:
>
>> It seems to me that Lucene doesn't use a B-tree for its index
>> storage. Is there any paper/article that explains the theory behind
>> the data structure of a single index (segment)?  I am not referring
>> to the merge algorithm; I am curious about the storage structure of
>> a single optimized Lucene index.
>>
>> Any pointer is greatly appreciated.
>
>
> How about this for starters?
>
>    http://lucene.apache.org/java/docs/fileformats.html


Re: Data structure of a Lucene Index

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Mar 28, 2006, at 11:57 PM, Prasenjit Mukherjee wrote:

> It seems to me that Lucene doesn't use a B-tree for its index
> storage. Is there any paper/article that explains the theory behind
> the data structure of a single index (segment)?  I am not referring
> to the merge algorithm; I am curious about the storage structure of
> a single optimized Lucene index.
>
> Any pointer is greatly appreciated.

How about this for starters?

	http://lucene.apache.org/java/docs/fileformats.html
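
As a rough picture of what that file-format document describes, a
segment can be thought of as a sorted term dictionary pointing to
per-term posting lists, written once and never updated in place, which
is why no B-tree is needed. The sketch below is a toy illustration under
that assumption, not Lucene's code: ToySegment, docsFor, and the
in-memory arrays are hypothetical names, and the real on-disk format
adds a term index, frequencies, positions, and compressed integers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy read-only "segment": a sorted term array plus delta-encoded postings. */
class ToySegment {
    private final String[] terms; // sorted, so lookup is a binary search
    private final int[][] deltas; // deltas[i] holds doc ID gaps for terms[i]

    /** Builds the segment in one pass from documents given as token arrays. */
    ToySegment(String[][] docs) {
        // TreeMap keeps terms sorted; doc IDs arrive in increasing order.
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String token : docs[docId]) {
                List<Integer> list =
                        postings.computeIfAbsent(token, k -> new ArrayList<>());
                // Record each document at most once per term.
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId);
                }
            }
        }
        terms = new String[postings.size()];
        deltas = new int[postings.size()][];
        int i = 0;
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            terms[i] = e.getKey();
            List<Integer> ids = e.getValue();
            int[] gaps = new int[ids.size()];
            int prev = 0;
            for (int j = 0; j < gaps.length; j++) {
                // Store the gap to the previous doc ID instead of the ID itself.
                gaps[j] = ids.get(j) - prev;
                prev = ids.get(j);
            }
            deltas[i++] = gaps;
        }
    }

    /** Returns the doc IDs containing the term, or an empty array if absent. */
    int[] docsFor(String term) {
        int pos = Arrays.binarySearch(terms, term);
        if (pos < 0) {
            return new int[0];
        }
        int[] gaps = deltas[pos];
        int[] ids = new int[gaps.length];
        int docId = 0;
        for (int j = 0; j < gaps.length; j++) {
            docId += gaps[j]; // undo the delta encoding
            ids[j] = docId;
        }
        return ids;
    }
}
```

Because the terms and postings are laid out in sorted order, such a
structure can be written and read sequentially, and a new batch of
documents becomes a new segment rather than an in-place update.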





Re[2]: Implemented subclasses of Similarity class in Lucene

Posted by Charlie <ch...@gmail.com>.
Hi Edgar,

Are there any technical reports explaining your design and
implementation of LM on Lucene?  Or which source files exactly make up
the "LM extension"?
-- 
Best regards,
 Charlie


---
Friday, May 26, 2006, 7:36:14 AM, you wrote:

> Hi Edgar,
> While doing the integration/updating for Lucene 1.9, could you be more
> open and clear about the design so that people can
> 1)Understand it,
> 2)Extend it,

> Just a recommendation.

> Cheers,
> Murat

> Edgar Meij wrote:

>> Hi Ganesh,
>> 
>> We have developed a Language Modeling extension to Lucene at the
>> University of Amsterdam. It can be found here:
>> 
>> http://ilps.science.uva.nl/Resources/#lm-lucen
>> 
>> It was built around Lucene 1.4.3, so it isn't source compatible with
>> the latest Lucene version. We are currently working on
>> integrating/updating it to Lucene 1.9.
>> 
>> Best,
>> 
>> Edgar Meij
>> 
>> 
>> On 3/31/06, Ganesh Ramakrishnan
>> <ga...@yahoo.com> wrote:
>> 
>>> Hi
>>>
>>> Is anyone aware of subclasses of the Similarity class in Lucene? Two
>>> subclasses are: DefaultSimilarity and SimilarityDelegator. Are any
>>> other implemented subclasses of Similarity, developed by anyone else,
>>> available on the web?  For example, Language Model based similarity,
>>> or Okapi-BM similarity, or different TFIDF weighting schemes for
>>> similarity.
>>>
>>>   If so, can you point me to them?
>>>
>>>   Thanks and regards,
>>>   Ganesh.
>>>






Re: Implemented subclasses of Similarity class in Lucene

Posted by Murat Yakici <mu...@cis.strath.ac.uk>.
Hi Edgar,
While doing the integration/updating for Lucene 1.9, could you be more
open and clear about the design so that people can
1) Understand it,
2) Extend it.

Just a recommendation.

Cheers,
Murat

Edgar Meij wrote:

> Hi Ganesh,
> 
> We have developed a Language Modeling extension to Lucene at the
> University of Amsterdam. It can be found here:
> 
> http://ilps.science.uva.nl/Resources/#lm-lucen
> 
> It was built around Lucene 1.4.3, so it isn't source compatible with
> the latest Lucene version. We are currently working on
> integrating/updating it to Lucene 1.9.
> 
> Best,
> 
> Edgar Meij
> 
> 
> On 3/31/06, Ganesh Ramakrishnan <ga...@yahoo.com> wrote:
> 
>> Hi
>>
>> Is anyone aware of subclasses of the Similarity class in Lucene? Two
>> subclasses are: DefaultSimilarity and SimilarityDelegator. Are any
>> other implemented subclasses of Similarity, developed by anyone else,
>> available on the web?  For example, Language Model based similarity,
>> or Okapi-BM similarity, or different TFIDF weighting schemes for
>> similarity.
>>
>>   If so, can you point me to them?
>>
>>   Thanks and regards,
>>   Ganesh.
>>


Re: Implemented subclasses of Similarity class in Lucene

Posted by Edgar Meij <ed...@gmail.com>.
Hi Ganesh,

We have developed a Language Modeling extension to Lucene at the
University of Amsterdam. It can be found here:

http://ilps.science.uva.nl/Resources/#lm-lucen

It was built around Lucene 1.4.3, so it isn't source compatible with
the latest Lucene version. We are currently working on
integrating/updating it to Lucene 1.9.

Best,

Edgar Meij


On 3/31/06, Ganesh Ramakrishnan <ga...@yahoo.com> wrote:
> Hi
>
> Is anyone aware of subclasses of the Similarity class in Lucene? Two subclasses are: DefaultSimilarity and SimilarityDelegator. Are any other implemented subclasses of Similarity, developed by anyone else, available on the web?  For example, Language Model based similarity, or Okapi-BM similarity, or different TFIDF weighting schemes for similarity.
>
>   If so, can you point me to them?
>
>   Thanks and regards,
>   Ganesh.
>


-- 
'An approximate answer to the right question is worth a great deal
more than a precise answer to the wrong question'



Implemented subclasses of Similarity class in Lucene

Posted by Ganesh Ramakrishnan <ga...@yahoo.com>.
Hi

Is anyone aware of subclasses of the Similarity class in Lucene? Two subclasses are: DefaultSimilarity and SimilarityDelegator. Are any other implemented subclasses of Similarity, developed by anyone else, available on the web?  For example, Language Model based similarity, or Okapi-BM similarity, or different TFIDF weighting schemes for similarity.
  
  If so, can you point me to them?
  
  Thanks and regards,
  Ganesh.
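
To make "different TFIDF weighting schemes" concrete, here is a minimal
sketch of two textbook variants behind one small interface. Everything
in it is an assumption for illustration: TermWeighting, SqrtTfLogIdf,
LogTfLogIdf, and ToyScorer are hypothetical names and do not reproduce
Lucene's Similarity API, which has more hooks (length norms, coord,
query norms) than are shown here.

```java
/** Hypothetical scoring hooks; this is NOT Lucene's Similarity class. */
interface TermWeighting {
    double tf(int freq);                   // weight for a within-document count
    double idf(int docFreq, int numDocs);  // weight for term rarity
}

/** Square-root term frequency with a smoothed logarithmic idf. */
class SqrtTfLogIdf implements TermWeighting {
    public double tf(int freq) {
        return Math.sqrt(freq);
    }

    public double idf(int docFreq, int numDocs) {
        // The +1 in the denominator avoids division by zero for unseen terms.
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }
}

/** Log-scaled term frequency with a plain logarithmic idf. */
class LogTfLogIdf implements TermWeighting {
    public double tf(int freq) {
        return freq > 0 ? 1.0 + Math.log(freq) : 0.0;
    }

    public double idf(int docFreq, int numDocs) {
        return docFreq > 0 ? Math.log((double) numDocs / docFreq) : 0.0;
    }
}

/** Sums tf * idf over the query terms that matched a document. */
final class ToyScorer {
    private ToyScorer() {}

    static double score(TermWeighting w, int[] freqs, int[] docFreqs, int numDocs) {
        double sum = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            sum += w.tf(freqs[i]) * w.idf(docFreqs[i], numDocs);
        }
        return sum;
    }
}
```

Swapping the TermWeighting passed to ToyScorer.score changes the ranking
formula without touching the scoring loop, which is the general idea
behind plugging in an alternative similarity.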
			

Re: Data structure of a Lucene Index

Posted by Doug Cutting <cu...@apache.org>.
I talked about this a bit in a presentation at Haifa last year:

http://www.haifa.ibm.com/Workshops/ir2005/papers/DougCutting-Haifa05.pdf

See the section on "Seek versus Transfer".

Doug

Prasenjit Mukherjee wrote:
> It seems to me that Lucene doesn't use a B-tree for its index storage.
> Is there any paper/article that explains the theory behind the data
> structure of a single index (segment)?  I am not referring to the merge
> algorithm; I am curious about the storage structure of a single
> optimized Lucene index.
> 
> Any pointer is greatly appreciated.
> --Prasen
> 