Posted to dev@lucene.apache.org by Tommaso Teofili <to...@gmail.com> on 2017/05/18 09:03:54 UTC

enhancing data locality wrt certain document clusters

Hi all,

I am working on a use case where my Lucene index stores documents composed
of (relatively short) text and binary values. At retrieval time I need to
retrieve documents that belong to a set of cluster values (e.g. facets).
In that context I was wondering if and how it would be possible to make it
more likely that documents (and their associated docValues) belonging to the
same cluster fall into the same segment.
That would give higher storage locality [1] and presumably better
performance (given that docs belonging to the same cluster are retrieved
together most of the time in my use case).
At first I looked into extending the DV format, but that is segment
agnostic, so I am thinking of coming up with a merge policy that
produces segments whose docs belong to the same cluster with high
probability.
Any other ideas / suggestions?

Regards,
Tommaso

[1] : https://en.wikipedia.org/wiki/Locality_of_reference

Re: enhancing data locality wrt certain document clusters

Posted by Tommaso Teofili <to...@gmail.com>.
p.s.
Adrien, are there any docs / references on how to implement index-time
sorting for versions prior to 6.2 and LUCENE-6766?


Re: enhancing data locality wrt certain document clusters

Posted by Tommaso Teofili <to...@gmail.com>.
Thanks Adrien, that sounds like a good suggestion, I'll try it out.
Another approach might be to use separate per-cluster indexes, where one
can to some extent control the number of segments. However, that probably
wouldn't scale with lots of clusters (and sounds weird too).

Regards,
Tommaso



Re: enhancing data locality wrt certain document clusters

Posted by Adrien Grand <jp...@gmail.com>.
You can't make documents more likely to be in the same segment. However, I'm
thinking you could use index sorting to make documents of the same cluster
closer to each other on a per-segment basis?
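
To illustrate why sorting within a segment helps locality, here is a minimal
sketch that does not use Lucene at all. It models a segment as a list of
cluster labels in docID order and counts how many contiguous runs of docIDs
must be visited to read every document of one cluster. The class and method
names (ClusterLocalitySketch, runsFor, sortedByCluster) are hypothetical and
exist only for this illustration. In Lucene 6.2 and later, the corresponding
setting is, to my knowledge, IndexWriterConfig.setIndexSort(Sort) with a sort
on the field holding the cluster value; the toy code below only mimics its
effect on document order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ClusterLocalitySketch {

    // Counts the contiguous runs of documents belonging to `cluster`.
    // Each run is one stretch of adjacent docIDs, so fewer runs means
    // the cluster's documents sit closer together in the segment.
    static int runsFor(List<String> segment, String cluster) {
        int runs = 0;
        boolean inRun = false;
        for (String label : segment) {
            boolean match = label.equals(cluster);
            if (match && !inRun) {
                runs++;
            }
            inRun = match;
        }
        return runs;
    }

    // Returns a copy of the segment with documents ordered by cluster label,
    // which is what sorting a segment on the cluster field would produce.
    static List<String> sortedByCluster(List<String> segment) {
        List<String> copy = new ArrayList<>(segment);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    public static void main(String[] args) {
        // Cluster label of each document, in insertion (docID) order.
        List<String> segment = Arrays.asList("b", "a", "c", "a", "b", "a");

        System.out.println(runsFor(segment, "a"));                  // prints 3
        System.out.println(runsFor(sortedByCluster(segment), "a")); // prints 1
    }
}
```

After sorting, each cluster occupies a single contiguous docID range within
the segment, so its stored fields and doc values are read from adjacent
positions. The documents are still spread across segments, as noted above;
only the layout inside each segment improves.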
