Posted to java-user@lucene.apache.org by Shouvik Bardhan <sb...@gisfederal.com> on 2014/09/19 06:23:58 UTC

Quickest way to collect one field from the searched docs....

Pardon the length of the question. I have an index with 100 million docs
(Lucene, not Solr), and term queries (A*, A AND B* type queries) return
pretty quickly (2-4 secs). I pick the Lucene docIds up pretty quickly
with a collector. This is good for us since we take the docIds and do
further filtering based on another database we maintain whose record ids
match the stored Lucene doc ids, and we are able to do what we want. I
know that depending on the Lucene doc id value is not a good thing, since
after delete/merge/optimize the doc ids may change, and if that were to
happen, our other datastore would no longer line up with the Lucene index and
chaos would ensue. Thus we do not optimize the index, etc.

My question is: what is the fastest way to gather one field value from the
docs that match the query? Is there any way to do this as
fast as (or at least not much slower than) I am able to collect the Lucene
docids? I want to get away from depending on the "lucene docids not
changing" if possible.

Thanks for any suggestions.
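
The worry about doc ids shifting can be illustrated with a toy model. The sketch below is not Lucene code: it assumes a "doc id" is simply a document's position in a list, and that a merge drops deleted documents and packs the survivors together. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a document's "doc id" is its position in the list.
// None of these names come from Lucene.
class DocIdShiftSketch {

    // Simulates a merge: deleted documents are dropped and the remaining
    // ones are packed together, so later documents move to lower positions.
    static List<String> mergeDroppingDeleted(List<String> docs, boolean[] deleted) {
        List<String> merged = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            if (!deleted[docId]) {
                merged.add(docs.get(docId));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("rec-A", "rec-B", "rec-C", "rec-D");
        boolean[] deleted = {false, true, false, false}; // "rec-B" was deleted

        List<String> merged = mergeDroppingDeleted(docs, deleted);

        // The same position now refers to a different record.
        System.out.println("before merge, doc 2 = " + docs.get(2));
        System.out.println("after merge,  doc 2 = " + merged.get(2));
    }
}
```

In this model, an external table keyed by position 2 would silently point at the wrong record after the merge, which is the failure described above.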

Re: Quickest way to collect one field from the searched docs....

Posted by Shouvik Bardhan <sb...@gisfederal.com>.
I will take a look at DocField. Thanks for the suggestion.


On Fri, Sep 19, 2014 at 6:30 PM, Neil Bacon <Ne...@nicta.com.au> wrote:

> Hi
> Have you looked at DocFieldValue / DocField? It's fast for this use case.
> Regards
> Neil
>
> Sent from my mobile doovalaki

Re: Quickest way to collect one field from the searched docs....

Posted by Neil Bacon <Ne...@nicta.com.au>.
Hi
Have you looked at DocFieldValue / DocField? It's fast for this use case.
Regards
Neil

Sent from my mobile doovalaki
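
"DocFieldValue / DocField" above presumably refers to Lucene's doc values feature; that reading is an assumption. In Lucene 4.x the related pieces are NumericDocValuesField at index time and a per-segment getNumericDocValues lookup at search time. The sketch below does not call Lucene at all. It only models why a column of per-document values is cheap to read: each hit costs one array access instead of loading a stored document. All names in it are hypothetical.

```java
import java.util.Arrays;

// Simplified model of column-style per-document values. NOT the Lucene API.
class ColumnLookupSketch {

    // columnPerSegment[s][localDoc] holds the external record id of the
    // document at segment-local position localDoc in segment s.
    // matchesPerSegment[s] lists the segment-local doc ids that matched.
    static long[] collectRecordIds(long[][] columnPerSegment, int[][] matchesPerSegment) {
        int total = 0;
        for (int[] matches : matchesPerSegment) {
            total += matches.length;
        }
        long[] out = new long[total];
        int n = 0;
        for (int s = 0; s < columnPerSegment.length; s++) {
            long[] column = columnPerSegment[s]; // looked up once per segment
            for (int localDoc : matchesPerSegment[s]) {
                out[n++] = column[localDoc];     // one array read per hit
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] columns = {
            {1001L, 1002L, 1003L}, // segment 0
            {2001L, 2002L}         // segment 1
        };
        int[][] matches = {
            {0, 2},                // hits in segment 0
            {1}                    // hits in segment 1
        };
        System.out.println(Arrays.toString(collectRecordIds(columns, matches)));
    }
}
```

Because the collected values are the external record ids themselves, the result no longer depends on Lucene doc ids staying stable across merges.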

On 20/09/2014 6:44 am, Shouvik Bardhan <sb...@gisfederal.com> wrote:
Sujit, thanks for the response. I have already done what you said. My issue
is that after setting up the data in the Lucene index and the DB, when a query
comes in and, say, matches 25 million docs in Lucene, I need to get all
25 million values of this field (record_id in your example) quickly.
Currently, I can get all those Lucene doc ids (even 25 million of
them) very fast. But I don't know a way to get one field (record_id) from all
the matched documents (when 25 million docs have matched) that fast.

thanks
Shouvik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Quickest way to collect one field from the searched docs....

Posted by Shouvik Bardhan <sb...@gisfederal.com>.
Sujit, thanks for the response. I have already done what you said. My issue
is that after setting up the data in the Lucene index and the DB, when a query
comes in and, say, matches 25 million docs in Lucene, I need to get all
25 million values of this field (record_id in your example) quickly.
Currently, I can get all those Lucene doc ids (even 25 million of
them) very fast. But I don't know a way to get one field (record_id) from all
the matched documents (when 25 million docs have matched) that fast.

thanks
Shouvik

On Fri, Sep 19, 2014 at 2:26 PM, Sujit Pal <su...@comcast.net> wrote:

> Hi Shouvik, not sure if you have already considered this, but you could put
> the database primary key for the record into the index, i.e., reverse your
> insert to do the DB first, get the record_id, and then add it to the Lucene
> index as a "record_id" field. During retrieval you can minimize the network
> traffic by setting the field list to only this record_id.
>
> -sujit

Re: Quickest way to collect one field from the searched docs....

Posted by Sujit Pal <su...@comcast.net>.
Hi Shouvik, not sure if you have already considered this, but you could put
the database primary key for the record into the index, i.e., reverse your
insert to do the DB first, get the record_id, and then add it to the Lucene
index as a "record_id" field. During retrieval you can minimize the network
traffic by setting the field list to only this record_id.

-sujit
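
The insert order suggested above can be sketched roughly as follows. This is a toy model using in-memory collections in place of a real database and a real Lucene index; the class, its methods, and the "body" field are hypothetical. Only the "record_id" field name comes from the message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-ins for the database and the index; not real DB or Lucene calls.
class DbFirstIndexingSketch {
    private long nextRecordId = 1;
    final Map<Long, String> database = new HashMap<>();
    final List<Map<String, String>> index = new ArrayList<>();

    // Step 1: write to the DB and obtain its primary key.
    // Step 2: index the document with that key in a "record_id" field.
    long add(String body) {
        long recordId = nextRecordId++;
        database.put(recordId, body);

        Map<String, String> doc = new HashMap<>();
        doc.put("record_id", Long.toString(recordId));
        doc.put("body", body);
        index.add(doc);
        return recordId;
    }

    // Retrieval reads only the "record_id" field of each matching document.
    List<Long> matchingRecordIds(String term) {
        List<Long> ids = new ArrayList<>();
        for (Map<String, String> doc : index) {
            if (doc.get("body").contains(term)) {
                ids.add(Long.parseLong(doc.get("record_id")));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        DbFirstIndexingSketch store = new DbFirstIndexingSketch();
        store.add("apple pie");
        store.add("banana bread");
        store.add("apple tart");
        System.out.println(store.matchingRecordIds("apple"));
    }
}
```

The point of the ordering is that the key shared by both stores is generated by the database, so it stays valid however the index renumbers its documents internally.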


On Thu, Sep 18, 2014 at 9:23 PM, Shouvik Bardhan <sb...@gisfederal.com>
wrote:

> Pardon the length of the question. I have an index with 100 million docs
> (Lucene, not Solr), and term queries (A*, A AND B* type queries) return
> pretty quickly (2-4 secs). I pick the Lucene docIds up pretty quickly
> with a collector. This is good for us since we take the docIds and do
> further filtering based on another database we maintain whose record ids
> match the stored Lucene doc ids, and we are able to do what we want. I
> know that depending on the Lucene doc id value is not a good thing, since
> after delete/merge/optimize the doc ids may change, and if that were to
> happen, our other datastore would no longer line up with the Lucene index and
> chaos would ensue. Thus we do not optimize the index, etc.
>
> My question is: what is the fastest way to gather one field value from the
> docs that match the query? Is there any way to do this as
> fast as (or at least not much slower than) I am able to collect the Lucene
> docids? I want to get away from depending on the "lucene docids not
> changing" if possible.
>
> Thanks for any suggestions.
>