
Posted to java-user@lucene.apache.org by Ting Yao <ti...@gmail.com> on 2016/06/01 03:32:14 UTC

Re: How can Docvalues so efficient

Sorry for the delay, and thank you for your answers.
Can I understand it like this?
Because doc values are stored on disk, when the searcher needs a few values
of a field, it reads them from disk. Lucene stores the start position of
*every* field, so a read can seek to that position and then proceed
sequentially. That is why reading the doc values of a field from disk is
fast.
But I still have a question.
In my opinion, the field data (we call it uninverted index data) could be
stored on disk as <doc_id -- field data> in doc id order. When we need the
values of most fields at once, wouldn't that layout be more efficient, as
long as the field data is not very big? If the data is stored in DocValues,
the number of disk reads is higher than with that layout.
Do I understand it right?
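
To make the access pattern above concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class ForwardOnlyColumnRead, its methods, and the sample data are made up for illustration, and an in-memory array stands in for one field's doc values file. It only shows that when matching doc ids arrive in increasing order, the reader never moves backwards through the column, and that a query with few matches skips over larger gaps.

```java
public class ForwardOnlyColumnRead {

    // Hypothetical stand-in for one field's doc values:
    // column[docId] holds the value of that field for document docId.
    static long sumMatching(long[] column, int[] matchingDocIds) {
        long sum = 0;
        int lastDoc = -1;
        for (int doc : matchingDocIds) {
            // Matching docs are assumed to arrive in strictly increasing order,
            // so every read is at a higher position than the previous one.
            if (doc <= lastDoc) {
                throw new IllegalArgumentException("doc ids must be strictly increasing");
            }
            sum += column[doc];
            lastDoc = doc;
        }
        return sum;
    }

    // Largest number of documents skipped between two consecutive reads
    // (counting from the start of the column for the first match).
    static int largestSkip(int[] matchingDocIds) {
        int largest = 0;
        int lastDoc = -1;
        for (int doc : matchingDocIds) {
            largest = Math.max(largest, doc - lastDoc - 1);
            lastDoc = doc;
        }
        return largest;
    }

    public static void main(String[] args) {
        long[] price = {10, 20, 30, 40, 50, 60, 70, 80}; // one value per doc id
        int[] matches = {1, 2, 6};                       // doc ids in increasing order
        System.out.println("sum of matching values: " + sumMatching(price, matches));
        System.out.println("largest skip: " + largestSkip(matches));
    }
}
```

A real doc values file is compressed and read through the filesystem cache, so actual costs depend on much more than the order of reads.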

2016-05-30 19:01 GMT+08:00 Adrien Grand <jp...@gmail.com>:

> When executing queries, Lucene has an abstraction called Scorer, which is
> responsible for returning matching documents in doc id order. Since doc
> values are stored on disk in doc id order, reads are sequential. There is
> an adversarial case when few documents match, since you might need to jump
> over large numbers of doc ids in order to reach the next matching one, but
> queries that match few documents should be very fast anyway.
>
> On Mon, 30 May 2016 at 12:52, Ting Yao <ti...@gmail.com> wrote:
>
> > Thank you very much for answering me.
> > But could you explain how Lucene reads the doc values files sequentially?
> >
> > 2016-05-30 18:15 GMT+08:00 Adrien Grand <jp...@gmail.com>:
> >
> > > Doc values indeed need to be read from disk. However, the fact that Lucene
> > > reads the doc values files sequentially (disks perform better at sequential
> > > access than random access) and that the filesystem cache helps keep hot
> > > regions of the doc values files in memory usually helps keep performance
> > > close to what we would get if the data was stored in memory.
> > >
> > > On Mon, 30 May 2016 at 12:01, Ting Yao <ti...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >     I have been reading the Lucene source code recently, and we also use
> > > > Elasticsearch as our search engine. As far as I know, Elasticsearch
> > > > performance is pretty good, and Elasticsearch is based on Lucene. So I am
> > > > wondering how it can search words so fast when the field data
> > > > (uninverted index) is stored on disk.
> > > >     DocValues make access to field values fast. From my perspective,
> > > > it is of course fast when only a few values of a field are read. But when
> > > > a few fields need to be accessed, I think it is no longer fast, because
> > > > when a field is accessed, all of its doc values need to be read with MMap,
> > > > so the system needs to read from disk to load the data.
> > > >     So could anyone help me understand how DocValues work?
> > > >
> > > > Echo Yao
> > > >
> > >
> >
> >
> >
> > --
> > Echo Yao
> >
>



-- 
Echo Yao

Re: How can Docvalues so efficient

Posted by Adrien Grand <jp...@gmail.com>.

On Wed, 1 June 2016 at 05:32, Ting Yao <ti...@gmail.com> wrote:

> In my opinion, the field data (we call it uninverted index data) could be
> stored on disk as <doc_id -- field data> in doc id order. When we need the
> values of most fields at once, wouldn't that layout be more efficient, as
> long as the field data is not very big? If the data is stored in DocValues,
> the number of disk reads is higher than with that layout.
> Do I understand it right?
>

Sorry, but I do not understand the question. Are you suggesting a way to be
more efficient with sparse doc values
(https://issues.apache.org/jira/browse/LUCENE-7253), or maybe to use
columnar storage for stored fields as well?
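
For readers following the thread, the trade-off between the two layouts can be sketched with a deliberately crude cost model in plain Java. Nothing here is Lucene code: the class RowVsColumnLayout, its methods, and the fixed 8 bytes per value are assumptions made for illustration, and the model ignores compression, block sizes, and the filesystem cache. It assumes a row layout has to read a matching document's whole row, while a column layout reads only the fields it needs, but from one separate region per field.

```java
public class RowVsColumnLayout {

    // Assumed fixed width of every value; real encodings vary.
    static final int BYTES_PER_VALUE = 8;

    // Row layout (<doc_id -- all field data>): each matching document's
    // whole row is read, whatever number of fields the query needs.
    static long rowLayoutBytes(int matchingDocs, int totalFields) {
        return (long) matchingDocs * totalFields * BYTES_PER_VALUE;
    }

    // Column layout (one region per field): only the needed fields are read.
    static long columnLayoutBytes(int matchingDocs, int fieldsNeeded) {
        return (long) matchingDocs * fieldsNeeded * BYTES_PER_VALUE;
    }

    // Number of separate regions read in parallel, in doc id order.
    static int rowLayoutRegions() {
        return 1;
    }

    static int columnLayoutRegions(int fieldsNeeded) {
        return fieldsNeeded;
    }

    public static void main(String[] args) {
        int matchingDocs = 1000;
        int totalFields = 20;
        for (int fieldsNeeded : new int[] {1, 3, totalFields}) {
            System.out.println(fieldsNeeded + " field(s) needed:"
                + " row layout = " + rowLayoutBytes(matchingDocs, totalFields) + " bytes in "
                + rowLayoutRegions() + " region,"
                + " column layout = " + columnLayoutBytes(matchingDocs, fieldsNeeded) + " bytes in "
                + columnLayoutRegions(fieldsNeeded) + " region(s)");
        }
    }
}
```

Under these assumptions the column layout touches fewer bytes whenever only some fields are needed. The two layouts read the same amount when every field is needed, at which point the row layout does so from a single region.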