Posted to solr-user@lucene.apache.org by Harshvardhan Ojha <oj...@gmail.com> on 2014/02/13 09:56:26 UTC

Algorithm for retrieving documents

Hi All,

I have a question about how Lucene retrieves documents.
I know Lucene keeps documents, each made up of fields, in many files on
disk, and that it uses an inverted index and various IR algorithms to match
documents.

My questions are:
1. How does Lucene store these documents in the file system and retrieve
them so fast?
2. Does Lucene use a hashing algorithm to get docs in O(1)? If not, which
data structure does it use?
3. Apart from the id we provide at indexing time, does Lucene assign any
other unique identifier to its documents?

I would appreciate it if someone could point me to the source files where I
can study these algorithms in detail.

Regards
Harshvardhan Ojha

Re: Algorithm for retrieving documents

Posted by Harshvardhan Ojha <oj...@gmail.com>.

Hi Mikhail,

Don't you think the contains(BytesRef value) method in
org.apache.lucene.codecs.bloom.FuzzySet returns the probability of a field
value being present, and that this is a place where hashing is used?

Is there any other place in the source which, given a document id, can
compute its hash and tell in a single O(1) lookup whether a document with
that id is present?
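
To make the question concrete, here is a minimal sketch of a Bloom-filter-style
membership test in plain Java. It is not Lucene's FuzzySet code: the class
name, the two hash seeds and the bit-set size are all assumptions chosen for
illustration. What it shows is that such a filter can only answer "definitely
not present" or "maybe present", so it can help skip a lookup but cannot
replace one.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical Bloom-filter sketch; not Lucene's FuzzySet implementation. */
final class TinyBloomFilter {
    enum Result { NO, MAYBE }

    private final BitSet bits;
    private final int size;

    TinyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Simple seeded hash over the key bytes (illustrative only).
    private int hash(String key, int seed) {
        int h = seed;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = h * 31 + (b & 0xff);
        }
        return Math.floorMod(h, size);
    }

    // Sets one bit per hash function for the given key.
    void add(String key) {
        bits.set(hash(key, 17));
        bits.set(hash(key, 101));
    }

    // NO is certain; MAYBE can be a false positive caused by collisions.
    Result contains(String key) {
        boolean allSet = bits.get(hash(key, 17)) && bits.get(hash(key, 101));
        return allSet ? Result.MAYBE : Result.NO;
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(1024);
        filter.add("1");
        System.out.println(filter.contains("1"));
        System.out.println(new TinyBloomFilter(1024).contains("1"));
    }
}
```

A MAYBE answer still has to be confirmed against the real index, which is why
a structure like this is not a full O(1) "is this id present" lookup.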

Regards
Harshvardhan Ojha


On Thu, Feb 13, 2014 at 4:07 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Harshvardhan,
>
> There is almost nothing like this in bare Lucene; the closest analogy is
> http://wiki.apache.org/solr/SolrCaching#documentCache

Re: Algorithm for retrieving documents

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Harshvardhan,

There is almost nothing like this in bare Lucene; the closest analogy is
http://wiki.apache.org/solr/SolrCaching#documentCache


On Thu, Feb 13, 2014 at 1:46 PM, Harshvardhan Ojha <
ojha.harshvardhan@gmail.com> wrote:

> Hi Mikhail,
>
> Thanks for sharing this nice link. I am fairly comfortable with searching
> in Lucene; this is a very beginner-level question about storage, mainly the
> hashing part (storage and retrieval).
> Which data structure (I don't currently know) is used to store documents and
> to compute the hash again to get a document back?
>
> Let me put it very clearly:
> If I know the document I am looking for is id:1, and there is no other
> query, then knowing this much about the doc there should ideally be no
> searching at all (even though it was indexed), only fast retrieval.
>
> Let me know if you want me to clarify the question.
>
> Regards
> Harshvardhan Ojha



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: Algorithm for retrieving documents

Posted by Harshvardhan Ojha <oj...@gmail.com>.

Hi Mikhail,

Thanks for sharing this nice link. I am fairly comfortable with searching
in Lucene; this is a very beginner-level question about storage, mainly the
hashing part (storage and retrieval).
Which data structure (I don't currently know) is used to store documents and
to compute the hash again to get a document back?

Let me put it very clearly:
If I know the document I am looking for is id:1, and there is no other
query, then knowing this much about the doc there should ideally be no
searching at all (even though it was indexed), only fast retrieval.
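
The id:1 case can be sketched in plain Java as below. This is a toy in-memory
model, not Lucene's API: TinyIndex, add and getByTerm are hypothetical names,
and the TreeMap merely stands in for a sorted term dictionary (Lucene keeps
terms in sorted order rather than in a hash table). It illustrates the usual
two-step path: the id is looked up as a term, which yields an internal
document number, and that number then addresses the stored fields directly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model of an id lookup; not Lucene's API. */
final class TinyIndex {
    // Sorted term dictionary: term -> internal document numbers (postings).
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();
    // Stored fields, addressed by internal document number.
    private final List<Map<String, String>> stored = new ArrayList<>();

    /** Adds a document and returns the internal number assigned to it. */
    int add(Map<String, String> fields) {
        int docNum = stored.size();
        stored.add(fields);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String term = e.getKey() + ":" + e.getValue();
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docNum);
        }
        return docNum;
    }

    /** Looks up an exact term such as "id:1" and loads the first match, or null. */
    Map<String, String> getByTerm(String term) {
        List<Integer> docs = postings.get(term); // ordered lookup, O(log n) here
        if (docs == null) {
            return null;
        }
        return stored.get(docs.get(0));          // direct access by document number
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(Map.of("id", "1", "title", "first"));
        index.add(Map.of("id", "2", "title", "second"));
        System.out.println(index.getByTerm("id:1").get("title"));
    }
}
```

In this model the number returned by add plays the role of the extra
identifier asked about in the original question: it is assigned by the index,
not by the user, and is what the stored-field access is keyed on.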

Let me know if you want me to clarify the question.

Regards
Harshvardhan Ojha

On Thu, Feb 13, 2014 at 2:53 PM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Hello
>
> I think you can start from
> http://www.lucenerevolution.org/2013/What-is-in-a-lucene-index

Re: Algorithm for retrieving documents

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Hello

I think you can start from
http://www.lucenerevolution.org/2013/What-is-in-a-lucene-index



On Thu, Feb 13, 2014 at 12:56 PM, Harshvardhan Ojha <
ojha.harshvardhan@gmail.com> wrote:

> Hi All,
>
> I have a question about how Lucene retrieves documents.
> I know Lucene keeps documents, each made up of fields, in many files on
> disk, and that it uses an inverted index and various IR algorithms to match
> documents.
>
> My questions are:
> 1. How does Lucene store these documents in the file system and retrieve
> them so fast?
> 2. Does Lucene use a hashing algorithm to get docs in O(1)? If not, which
> data structure does it use?
> 3. Apart from the id we provide at indexing time, does Lucene assign any
> other unique identifier to its documents?
>
> I would appreciate it if someone could point me to the source files where I
> can study these algorithms in detail.
>
> Regards
> Harshvardhan Ojha
>


