You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Harshvardhan Ojha <oj...@gmail.com> on 2014/02/13 05:27:31 UTC

Algorithm of retrieving docs

Hi All,

I have a question regarding retrieval of documents by lucene.
I know lucene uses many files on disk to keep documents, each comprising
fields in it, and uses many IR algorithms, and inverted index to match
documents.

My question is :
1. How lucene stores these documents inside file system and gets it so fast?
2. Does lucene uses any Hashing algorithm to get docs in O(1) ? If not
which DS is         used by lucene ?
3. Except id provided by us at the time of indexing, is there any other
unique identifier       which is assigned by lucene to its documents ?

I will appreciate If someone can provide me with source file names to study
these algorithms in detail.

Regards
Harshvardhan Ojha

Re: Algorithm of retrieving docs

Posted by Harshvardhan Ojha <oj...@gmail.com>.

Thanks Michael for help, this helped me with my problem.

Regards
Harshvardhan Ojha


On Thu, Feb 13, 2014 at 8:51 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> The bloom filter is only used by the postings format wrapper, and
> we've had mixed results on whether it helps performance or not (seems
> to depend heavily on the exact usage).
>
> We have bit set / iterator abstractions (oal.util.Bits,
> oal.search.DocIdSet/Iterator) to manage "sets" of documents, but most
> implementations don't use a hash set under the hood.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 13, 2014 at 7:11 AM, Harshvardhan Ojha
> <oj...@gmail.com> wrote:
> > Hi Mike/Mikhail,
> >
> > Don't you guys think org.apache.lucene.codecs.bloom.FuzzySet.java,
> > contains(BytesRef value) methods returns probablity of having a field,
> and
> > it is a place where we are using hashing ?
> >
> > Are there any other place in source which when given with document id,
> could
> > determine by calculating its hash and say if document with this id is
> > present or not in a single lookup O(1) ?
> >
> > Regards
> > Harshvardhan Ojha
> >
> >
> > On Thu, Feb 13, 2014 at 5:11 PM, Michael McCandless
> > <lu...@mikemccandless.com> wrote:
> >>
> >> Lucene only assigns its int docID during indexing.
> >>
> >> Retrieving a previously stored document is a O(1), but that involves a
> >> disk seek which can be very costly when the page is not in the OS's IO
> >> cache.  Lucene does not do any caching itself (relies on the OS
> >> instead).
> >>
> >> Have a look at the current default stored fields codec format:
> >>
> >>
> lucene/core/src/java/org/apache/lucene/codec/lucene41/Lucene41StoredFieldsFormat
> >> for details.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
> >> <oj...@gmail.com> wrote:
> >> > Hi All,
> >> >
> >> > I have a question regarding retrieval of documents by lucene.
> >> > I know lucene uses many files on disk to keep documents, each
> comprising
> >> > fields in it, and uses many IR algorithms, and inverted index to match
> >> > documents.
> >> >
> >> > My question is :
> >> > 1. How lucene stores these documents inside file system and gets it so
> >> > fast?
> >> > 2. Does lucene uses any Hashing algorithm to get docs in O(1) ? If not
> >> > which
> >> > DS is         used by lucene ?
> >> > 3. Except id provided by us at the time of indexing, is there any
> other
> >> > unique identifier       which is assigned by lucene to its documents ?
> >> >
> >> > I will appreciate If someone can provide me with source file names to
> >> > study
> >> > these algorithms in detail.
> >> >
> >> > Regards
> >> > Harshvardhan Ojha
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Algorithm of retrieving docs

Posted by Michael McCandless <lu...@mikemccandless.com>.

The bloom filter is only used by the postings format wrapper, and
we've had mixed results on whether it helps performance or not (seems
to depend heavily on the exact usage).

We have bit set / iterator abstractions (oal.util.Bits,
oal.search.DocIdSet/Iterator) to manage "sets" of documents, but most
implementations don't use a hash set under the hood.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 13, 2014 at 7:11 AM, Harshvardhan Ojha
<oj...@gmail.com> wrote:
> Hi Mike/Mikhail,
>
> Don't you guys think org.apache.lucene.codecs.bloom.FuzzySet.java,
> contains(BytesRef value) methods returns probablity of having a field, and
> it is a place where we are using hashing ?
>
> Are there any other place in source which when given with document id, could
> determine by calculating its hash and say if document with this id is
> present or not in a single lookup O(1) ?
>
> Regards
> Harshvardhan Ojha
>
>
> On Thu, Feb 13, 2014 at 5:11 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>>
>> Lucene only assigns its int docID during indexing.
>>
>> Retrieving a previously stored document is a O(1), but that involves a
>> disk seek which can be very costly when the page is not in the OS's IO
>> cache.  Lucene does not do any caching itself (relies on the OS
>> instead).
>>
>> Have a look at the current default stored fields codec format:
>>
>> lucene/core/src/java/org/apache/lucene/codec/lucene41/Lucene41StoredFieldsFormat
>> for details.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
>> <oj...@gmail.com> wrote:
>> > Hi All,
>> >
>> > I have a question regarding retrieval of documents by lucene.
>> > I know lucene uses many files on disk to keep documents, each comprising
>> > fields in it, and uses many IR algorithms, and inverted index to match
>> > documents.
>> >
>> > My question is :
>> > 1. How lucene stores these documents inside file system and gets it so
>> > fast?
>> > 2. Does lucene uses any Hashing algorithm to get docs in O(1) ? If not
>> > which
>> > DS is         used by lucene ?
>> > 3. Except id provided by us at the time of indexing, is there any other
>> > unique identifier       which is assigned by lucene to its documents ?
>> >
>> > I will appreciate If someone can provide me with source file names to
>> > study
>> > these algorithms in detail.
>> >
>> > Regards
>> > Harshvardhan Ojha
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Algorithm of retrieving docs

Posted by Harshvardhan Ojha <oj...@gmail.com>.

Hi Mike/Mikhail,

Don't you guys
think org.apache.lucene.codecs.bloom.FuzzySet.java, contains(BytesRef
value) methods returns probablity of having a field, and it is a place
where we are using hashing ?

Are there any other place in source which when given with document id,
could determine by calculating its hash and say if document with this id is
present or not in a single lookup O(1) ?

Regards
Harshvardhan Ojha


On Thu, Feb 13, 2014 at 5:11 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Lucene only assigns its int docID during indexing.
>
> Retrieving a previously stored document is a O(1), but that involves a
> disk seek which can be very costly when the page is not in the OS's IO
> cache.  Lucene does not do any caching itself (relies on the OS
> instead).
>
> Have a look at the current default stored fields codec format:
>
> lucene/core/src/java/org/apache/lucene/codec/lucene41/Lucene41StoredFieldsFormat
> for details.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
> <oj...@gmail.com> wrote:
> > Hi All,
> >
> > I have a question regarding retrieval of documents by lucene.
> > I know lucene uses many files on disk to keep documents, each comprising
> > fields in it, and uses many IR algorithms, and inverted index to match
> > documents.
> >
> > My question is :
> > 1. How lucene stores these documents inside file system and gets it so
> fast?
> > 2. Does lucene uses any Hashing algorithm to get docs in O(1) ? If not
> which
> > DS is         used by lucene ?
> > 3. Except id provided by us at the time of indexing, is there any other
> > unique identifier       which is assigned by lucene to its documents ?
> >
> > I will appreciate If someone can provide me with source file names to
> study
> > these algorithms in detail.
> >
> > Regards
> > Harshvardhan Ojha
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Algorithm of retrieving docs

Posted by Michael McCandless <lu...@mikemccandless.com>.

Lucene only assigns its int docID during indexing.

Retrieving a previously stored document is a O(1), but that involves a
disk seek which can be very costly when the page is not in the OS's IO
cache.  Lucene does not do any caching itself (relies on the OS
instead).

Have a look at the current default stored fields codec format:
lucene/core/src/java/org/apache/lucene/codec/lucene41/Lucene41StoredFieldsFormat
for details.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 12, 2014 at 11:27 PM, Harshvardhan Ojha
<oj...@gmail.com> wrote:
> Hi All,
>
> I have a question regarding retrieval of documents by lucene.
> I know lucene uses many files on disk to keep documents, each comprising
> fields in it, and uses many IR algorithms, and inverted index to match
> documents.
>
> My question is :
> 1. How lucene stores these documents inside file system and gets it so fast?
> 2. Does lucene uses any Hashing algorithm to get docs in O(1) ? If not which
> DS is         used by lucene ?
> 3. Except id provided by us at the time of indexing, is there any other
> unique identifier       which is assigned by lucene to its documents ?
>
> I will appreciate If someone can provide me with source file names to study
> these algorithms in detail.
>
> Regards
> Harshvardhan Ojha
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org