Posted to solr-user@lucene.apache.org by "Petersen, Robert" <ro...@mail.rakuten.com> on 2013/12/17 18:29:27 UTC

solr as nosql - pulling all docs vs deep paging limitations

Hi solr users,

We have a new use case where we need to make a pile of data available as XML to a client, and I was thinking we could easily put all this data into a Solr collection so the client could just do a star search and page through all the results to obtain the data we need to give them.  Then I remembered that we currently don't allow deep paging in our existing search indexes, as performance declines the deeper you go.  Is this still the case?

If so, is there another approach to make all the data in a collection easily available for retrieval?  The only thing I can think of is to query our DB for all the unique IDs of all the documents in the collection and then pull the documents out in small groups with successive queries like 'UniqueIdField:(id1 OR id2 OR ... OR idn)' and 'UniqueIdField:(idn+1 OR idn+2 OR ... etc)', which doesn't seem like a very good approach, because the DB might have been updated with new data that hasn't been indexed yet, so not all the ids might be in there (which may or may not matter, I suppose).

Then I was thinking we could have a field with an incrementing numeric value which could be used to perform range queries as a substitute for paging through everything, i.e. queries like 'IncrementalField:[1 TO 100]' and 'IncrementalField:[101 TO 200]', but this would be difficult to maintain as we update the index, unless we reindex the entire collection every time we update any docs at all.

Is this perhaps not a good use case for Solr?  Should I use something else, or is there another approach that would work here to allow a client to pull groups of docs in a collection through the REST API until the client has gotten them all?

Thanks
Robi


Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Michael Della Bitta <mi...@appinions.com>.
Us too. That's going to be huge for us!



On Wed, Dec 18, 2013 at 9:55 AM, Mikhail Khludnev <
mkhludnev@griddynamics.com> wrote:

> Aha! SOLR-5244 is a particular case which I'm asking about. I wonder who
> else considers it useful?
> (I'm sorry if I hijacked the thread)

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Aha! SOLR-5244 is a particular case which I'm asking about. I wonder who
else considers it useful?
(I'm sorry if I hijacked the thread)
On 18.12.2013 at 5:41, "Joel Bernstein" <jo...@gmail.com> wrote:

> They are for different use cases. Hoss's approach, I believe, focuses on
> deep paging of ranked search results. SOLR-5244 focuses on the batch export
> of an entire unranked search result in binary format. It's basically a very
> efficient bulk extract for Solr.

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Joel Bernstein <jo...@gmail.com>.
They are for different use cases. Hoss's approach, I believe, focuses on
deep paging of ranked search results. SOLR-5244 focuses on the batch export
of an entire unranked search result in binary format. It's basically a very
efficient bulk extract for Solr.


On Tue, Dec 17, 2013 at 6:51 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Joel - can you please elaborate a bit on how this compares with Hoss'
> approach?  Complementary?



-- 
Joel Bernstein
Search Engineer at Heliosearch

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Otis Gospodnetic <ot...@gmail.com>.
Joel - can you please elaborate a bit on how this compares with Hoss'
approach?  Complementary?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Dec 17, 2013 at 6:45 PM, Joel Bernstein <jo...@gmail.com> wrote:

> SOLR-5244 is also working in this direction. This focuses on efficient
> binary extract of entire search results.

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Joel Bernstein <jo...@gmail.com>.
SOLR-5244 is also working in this direction. This focuses on efficient
binary extract of entire search results.


On Tue, Dec 17, 2013 at 2:33 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hoss is working on it. Search for deep paging or cursor in JIRA.
>
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/


-- 
Joel Bernstein
Search Engineer at Heliosearch

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hoss is working on it. Search for deep paging or cursor in JIRA.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Dec 17, 2013 12:30 PM, "Petersen, Robert" <
robert.petersen@mail.rakuten.com> wrote:


Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Chris Hostetter <ho...@fucit.org>.
: One question that I was never sure about when trying to do things like this --
: is this going to end up blowing the query and/or document caches if used on a
: live Solr?  By filling up those caches with the results of the 'bulk' export?
: If so, is there any way to avoid that? Or does it probably not really matter?

  q={!cache=false}...


-Hoss
http://www.lucidworks.com/
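Hoss's answer prepends the {!cache=false} local param to the main query so a one-off bulk export doesn't churn the caches of a live index. A minimal sketch of building such request params (the helper name and defaults below are illustrative, not from the thread):

```python
def bulk_export_params(query, fields="*", rows=500):
    """Build Solr /select params for a bulk export that bypasses caching.

    Prepending the {!cache=false} local param tells Solr not to cache this
    query's result set, so the export doesn't evict entries that live
    search traffic depends on.
    """
    return {
        "q": "{!cache=false}" + query,  # local param applies to the main query
        "fl": fields,                   # which stored fields to return
        "rows": rows,                   # batch size per request
    }

print(bulk_export_params("*:*")["q"])  # prints {!cache=false}*:*
```

Each page of the export would pass these params to /select along with whatever paging mechanism (start/rows, id ranges, or cursorMark) is in use.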

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Jonathan Rochkind <ro...@jhu.edu>.
On 12/17/13 1:16 PM, Chris Hostetter wrote:
> As i mentioned in the blog above, as long as you have a uniqueKey field
> that supports range queries, bulk exporting of all documents is fairly
> trivial by sorting on your uniqueKey field and using an fq that also
> filters on your uniqueKey field modify the fq each time to change the
> lower bound to match the highest ID you got on the previous "page".

Aha, very nice suggestion; I hadn't thought of this when I was trying
to figure out decent ways to 'fetch all documents matching a query' for
some bulk offline processing.

One question that I was never sure about when trying to do things like 
this -- is this going to end up blowing the query and/or document caches 
if used on a live Solr?  By filling up those caches with the results of 
the 'bulk' export?  If so, is there any way to avoid that? Or does it 
probably not really matter?

Jonathan

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
On Wed, Dec 18, 2013 at 8:03 PM, Chris Hostetter
<ho...@fucit.org>wrote:

> :
> : What about SELECT * FROM WHERE ... like misusing Solr? I'm sure you've
> : been asked many times about that.
> : What if a client doesn't need to rank results at all, but is just requesting
> : an unordered filtering result like they are used to in an RDBMS?
> : Do you feel it will never be considered a reasonable use case for Solr, or
> : is there a well-known approach for dealing with it?
>
> If you don't care about ordering, then the approach I described (either
> using SOLR-5463, or just using a sort by uniqueKey with increasing
> range filters on the id) should work fine -- the fact that they come back
> sorted by id is just an implementation detail that makes it possible to
> batch the records

From the functional standpoint it's true, but performance might matter in
such edge cases, e.g. I wonder why the priority queue is needed even if we
request sort=_docid_.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics


Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Chris Hostetter <ho...@fucit.org>.
: 
: What about SELECT * FROM WHERE ... like misusing Solr? I'm sure you've been
: asked many times about that.
: What if a client doesn't need to rank results at all, but is just requesting
: an unordered filtering result like they are used to in an RDBMS?
: Do you feel it will never be considered a reasonable use case for Solr, or
: is there a well-known approach for dealing with it?

If you don't care about ordering, then the approach I described (either 
using SOLR-5463, or just using a sort by uniqueKey with increasing 
range filters on the id) should work fine -- the fact that they come back 
sorted by id is just an implementation detail that makes it possible to 
batch the records (the same way most SQL databases will likely give you 
back the docs based on whatever primary key index you have)

I think the key difference between approaches like SOLR-5244 vs the cursor 
work in SOLR-5463 is that SOLR-5244 is really targeted at dumping all 
data about all docs from a core (matching the query) in a single 
request/response -- for something like SolrCloud, the client would 
manually need to hit each shard (but as I understand it from the 
description, that's kind of the point: it's aiming to be a very low-level 
bulk export).  With the cursor approach in SOLR-5463, we do 
aggregation across all shards, and we support arbitrary sorts, and you can 
control the batch size from the client and iterate over multiple 
request/responses of that size.  If there are any network hiccups, you can 
re-do a request.  If you process half the docs that match (in a 
particular order) and then decide "I've got all the docs I need for my 
purposes", you can stop requesting the continuation of that cursor.



-Hoss
http://www.lucidworks.com/
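The cursor approach described above can be driven from the client as a simple loop: start with cursorMark=*, read nextCursorMark from each response, and stop when the mark stops changing. The sketch below is illustrative only; the HTTP call is replaced by an in-memory stub (solr_request and make_fake_solr are hypothetical names, not part of SOLR-5463):

```python
def iterate_with_cursor(solr_request, q="*:*", rows=100, sort="id asc"):
    """Yield every matching doc by walking a cursor to exhaustion."""
    cursor = "*"  # "*" starts a fresh cursor
    while True:
        resp = solr_request({"q": q, "rows": rows, "sort": sort,
                             "cursorMark": cursor})
        for doc in resp["response"]["docs"]:
            yield doc
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged mark means no more results
            return
        cursor = next_cursor

def make_fake_solr(docs):
    """In-memory stand-in for Solr, just to exercise the loop above."""
    def solr_request(params):
        start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
        page = docs[start:start + params["rows"]]
        end = start + len(page)
        return {"response": {"docs": page},
                "nextCursorMark": str(end) if page else params["cursorMark"]}
    return solr_request

docs = [{"id": "doc%d" % i} for i in range(7)]
fetched = list(iterate_with_cursor(make_fake_solr(docs), rows=3))
print(len(fetched))  # prints 7
```

Because each request/response cycle is resumable, a network hiccup just means re-sending the last request with the same cursorMark.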

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Chris Hostetter <ho...@fucit.org>.
: You can do range queries without an upper bound and just limit the number of
: results. Then you look at the last result to obtain the new lower bound.

exactly.  instead of this:

   First: q=foo&start=0&rows=$ROWS
   After: q=foo&start=$X&rows=$ROWS

...where $ROWS is how big a batch of docs you can handle at one time, 
and you increase the value of $X by the value of $ROWS on each successive 
request, you can just do this...

   First: q=foo&start=0&rows=$ROWS&sort=id+asc
   After: q=foo&start=0&rows=$ROWS&sort=id+asc&fq=id:{$X TO *]

...where $X is whatever the "last" id you got on the previous page.

Or: you try out the patch in SOLR-5463 and do something like this...

   First: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=*
   After: q=foo&start=0&rows=$ROWS&sort=id+asc&cursorMark=$X

...where $X is whatever "nextCursorMark" you got from the previous page.



-Hoss
http://www.lucidworks.com/
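The second pattern above (sort by the uniqueKey and advance the fq lower bound each page) is what SQL folks call keyset pagination, and it works even when ids are sparse or non-contiguous. Below is an illustrative sketch with the HTTP call stubbed out; solr_request and make_fake_solr are hypothetical, and ids are assumed to sort correctly as strings:

```python
def iterate_by_id_range(solr_request, q="foo", rows=100):
    """Yield all matching docs using an advancing id-range filter."""
    last_id = None
    while True:
        params = {"q": q, "start": 0, "rows": rows, "sort": "id asc"}
        if last_id is not None:
            # exclusive lower bound: only docs after the last id we saw
            params["fq"] = "id:{%s TO *]" % last_id
        docs = solr_request(params)["response"]["docs"]
        if not docs:
            return
        for doc in docs:
            yield doc
        last_id = docs[-1]["id"]

def make_fake_solr(docs):
    """In-memory stand-in for Solr, just to exercise the loop above."""
    docs = sorted(docs, key=lambda d: d["id"])
    def solr_request(params):
        fq = params.get("fq")
        lower = fq[len("id:{"):-len(" TO *]")] if fq else None
        matching = [d for d in docs if lower is None or d["id"] > lower]
        return {"response": {"docs": matching[:params["rows"]]}}
    return solr_request

docs = [{"id": "sku%03d" % i} for i in range(10)]
ids = [d["id"] for d in iterate_by_id_range(make_fake_solr(docs), rows=4)]
print(len(ids))  # prints 10
```

Unlike start/rows paging, each request stays equally cheap no matter how deep the export goes, because Solr only has to collect the top rows docs above the moving lower bound.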

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Jens Grivolla <j+...@grivolla.net>.
You can do range queries without an upper bound and just limit the 
number of results. Then you look at the last result to obtain the new 
lower bound.

-- Jens


On 17/12/13 20:23, Petersen, Robert wrote:
> My use case is basically to do a dump of all contents of the index with no ordering needed.  It's actually to be a product data export for third parties.  Unique key is product sku.  I could take the min sku and range query up to the max sku but the skus are not contiguous because some get turned off and only some are valid for export so each range would return a different number of products (which may or may not be acceptable and I might be able to kind of hide that with some code).
>
> -----Original Message-----
> From: Mikhail Khludnev [mailto:mkhludnev@griddynamics.com]
> Sent: Tuesday, December 17, 2013 10:41 AM
> To: solr-user
> Subject: Re: solr as nosql - pulling all docs vs deep paging limitations
>
> Hoss,
>
> What about SELECT * FROM WHERE ... like misusing Solr? I'm sure you've been asked many times about that.
> What if a client doesn't need to rank results at all, but is just requesting an unordered filtering result like they are used to in an RDBMS?
> Do you feel it will never be considered a reasonable use case for Solr, or is there a well-known approach for dealing with it?
>
>
> On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
> <ho...@fucit.org>wrote:
>
>>
>> : Then I remembered we currently don't allow deep paging in our
>> current
>> : search indexes as performance declines the deeper you go.  Is this
>> still
>> : the case?
>>
>> Coincidentally, I'm working on a new cursor-based API to make this much
>> more feasible as we speak...
>>
>> https://issues.apache.org/jira/browse/SOLR-5463
>>
>> I did some simple perf testing of the strawman approach and posted the
>> results last week...
>>
>>
>> http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>
>> ...current iterations on the patch are to eliminate the strawman code
>> to improve performance even more and beef up the test cases.
>>
>> : If so, is there another approach to make all the data in a
>> collection
>> : easily available for retrieval?  The only thing I can think of is to
>>          ...
>> : Then I was thinking we could have a field with an incrementing
>> numeric
>> : value which could be used to perform range queries as a substitute
>> for
>> : paging through everything.  Ie queries like 'IncrementalField:[1 TO
>> : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
>> : maintain as we update the index unless we reindex the entire
>> collection
>> : every time we update any docs at all.
>>
>> As i mentioned in the blog above, as long as you have a uniqueKey
>> field that supports range queries, bulk exporting of all documents is
>> fairly trivial by sorting on your uniqueKey field and using an fq that
>> also filters on your uniqueKey field modify the fq each time to change
>> the lower bound to match the highest ID you got on the previous "page".
>>
>> This approach works really well in simple cases where you want to
>> "fetch all" documents matching a query and then process/sort them by
>> some other criteria on the client -- but it's not viable if it's
>> important to you that the documents come back from solr in score order
>> before your client gets them because you want to "stop fetching" once
>> some criteria is met in your client.  Example: you have billions of
>> documents matching a query, you want to fetch all sorted by score desc
>> and crunch them on your client to compute some stats, and once your
>> client side stat crunching tells you you have enough results (which
>> might be after the 1000th result, or might be after the millionth result) then you want to stop.
>>
>> SOLR-5463 will help even in that latter case.  The bulk of the patch
>> should be easy to use in the next day or so (having other people try it out
>> and test it in their applications would be *very* helpful) and hopefully
>> show up in Solr 4.7
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
>   <mk...@griddynamics.com>
>
>



RE: solr as nosql - pulling all docs vs deep paging limitations

Posted by "Petersen, Robert" <ro...@mail.rakuten.com>.
My use case is basically a dump of the entire index, with no ordering needed; it's a product data export for third parties, and the unique key is product sku.  I could take the min sku and range-query up to the max sku, but the skus are not contiguous: some get turned off and only some are valid for export, so each range would return a different number of products (which may or may not be acceptable, and I might be able to partly hide that with some code).
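
If ordering truly doesn't matter, one way to sidestep the uneven, gappy sku ranges is to not precompute ranges at all: sort by sku and keep asking for "everything above the last sku I saw" until a page comes back empty. A rough sketch of that loop (the `fetch_page` callback and the `sku` field name are illustrative, not something defined in this thread):

```python
def export_all(fetch_page, page_size=100):
    """Drain an index sorted by sku, tolerating gaps in sku values.

    fetch_page(last_sku, rows) must return documents with sku > last_sku,
    in ascending sku order (e.g. a thin wrapper around a Solr query).
    Looping until an empty page signals the end means non-contiguous
    skus and unevenly populated ranges simply don't matter.
    """
    last_sku = None
    while True:
        page = fetch_page(last_sku, page_size)
        if not page:
            break
        for doc in page:
            yield doc
        # remember the highest sku seen; the next request starts above it
        last_sku = page[-1]["sku"]
```

Invalid/turned-off skus are handled by whatever filter `fetch_page` applies; the driver loop never needs to know which skus exist.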

-----Original Message-----
From: Mikhail Khludnev [mailto:mkhludnev@griddynamics.com] 
Sent: Tuesday, December 17, 2013 10:41 AM
To: solr-user
Subject: Re: solr as nosql - pulling all docs vs deep paging limitations

Hoss,

What about SELECT * FROM WHERE ... style misuse of Solr? I'm sure you've been asked about that many times.
What if a client doesn't need to rank results at all, but just wants an unordered filtered result, as they are used to in an RDBMS?
Do you feel that will never be considered a reasonable use case for Solr, or is there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
<ho...@fucit.org>wrote:

>
> : Then I remembered we currently don't allow deep paging in our 
> current
> : search indexes as performance declines the deeper you go.  Is this 
> still
> : the case?
>
> Coincidentally, i'm working on a new cursor-based API to make this much 
> more feasible as we speak..
>
> https://issues.apache.org/jira/browse/SOLR-5463
>
> I did some simple perf testing of the strawman approach and posted the 
> results last week...
>
>
> http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iterat
> ion-of-large-result-sets/
>
> ...current iterations on the patch are to eliminate the strawman code 
> to improve performance even more and beef up the test cases.
>
> : If so, is there another approach to make all the data in a 
> collection
> : easily available for retrieval?  The only thing I can think of is to
>         ...
> : Then I was thinking we could have a field with an incrementing 
> numeric
> : value which could be used to perform range queries as a substitute 
> for
> : paging through everything.  Ie queries like 'IncrementalField:[1 TO
> : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
> : maintain as we update the index unless we reindex the entire 
> collection
> : every time we update any docs at all.
>
> As i mentioned in the blog above, as long as you have a uniqueKey 
> field that supports range queries, bulk exporting of all documents is 
> fairly trivial by sorting on your uniqueKey field and using an fq that 
> also filters on your uniqueKey field; modify the fq each time to change 
> the lower bound to match the highest ID you got on the previous "page".
>
> This approach works really well in simple cases where you want to 
> "fetch all" documents matching a query and then process/sort them by 
> some other criteria on the client -- but it's not viable if it's 
> important to you that the documents come back from solr in score order 
> before your client gets them because you want to "stop fetching" once 
> some criteria is met in your client.  Example: you have billions of 
> documents matching a query, you want to fetch all sorted by score desc 
> and crunch them on your client to compute some stats, and once your 
> client side stat crunching tells you you have enough results (which 
> might be after the 1000th result, or might be after the millionth result) then you want to stop.
>
> SOLR-5463 will help even in that latter case.  The bulk of the patch 
> should be easy to use in the next day or so (having other people try out 
> and test in their applications would be *very* helpful) and hopefully 
> show up in Solr 4.7
>
> -Hoss
> http://www.lucidworks.com/
>



--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>


Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hoss,

What about SELECT * FROM WHERE ... style misuse of Solr? I'm sure you've
been asked about that many times.
What if a client doesn't need to rank results at all, but just wants an
unordered filtered result, as they are used to in an RDBMS?
Do you feel that will never be considered a reasonable use case for Solr,
or is there a well-known approach for dealing with it?


On Tue, Dec 17, 2013 at 10:16 PM, Chris Hostetter
<ho...@fucit.org>wrote:

>
> : Then I remembered we currently don't allow deep paging in our current
> : search indexes as performance declines the deeper you go.  Is this still
> : the case?
>
> Coincidentally, i'm working on a new cursor-based API to make this much more
> feasible as we speak..
>
> https://issues.apache.org/jira/browse/SOLR-5463
>
> I did some simple perf testing of the strawman approach and posted the
> results last week...
>
>
> http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>
> ...current iterations on the patch are to eliminate the
> strawman code to improve performance even more and beef up the test
> cases.
>
> : If so, is there another approach to make all the data in a collection
> : easily available for retrieval?  The only thing I can think of is to
>         ...
> : Then I was thinking we could have a field with an incrementing numeric
> : value which could be used to perform range queries as a substitute for
> : paging through everything.  Ie queries like 'IncrementalField:[1 TO
> : 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to
> : maintain as we update the index unless we reindex the entire collection
> : every time we update any docs at all.
>
> As i mentioned in the blog above, as long as you have a uniqueKey field
> that supports range queries, bulk exporting of all documents is fairly
> trivial by sorting on your uniqueKey field and using an fq that also
> filters on your uniqueKey field; modify the fq each time to change the
> lower bound to match the highest ID you got on the previous "page".
>
> This approach works really well in simple cases where you want to "fetch
> all" documents matching a query and then process/sort them by some other
> criteria on the client -- but it's not viable if it's important to you
> that the documents come back from solr in score order before your client
> gets them because you want to "stop fetching" once some criteria is met in
> your client.  Example: you have billions of documents matching a query,
> you want to fetch all sorted by score desc and crunch them on your client
> to compute some stats, and once your client side stat crunching tells you
> you have enough results (which might be after the 1000th result, or might
> be after the millionth result) then you want to stop.
>
> SOLR-5463 will help even in that latter case.  The bulk of the patch should
> be easy to use in the next day or so (having other people try out and
> test in their applications would be *very* helpful) and hopefully show up
> in Solr 4.7
>
> -Hoss
> http://www.lucidworks.com/
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>

Re: solr as nosql - pulling all docs vs deep paging limitations

Posted by Chris Hostetter <ho...@fucit.org>.
: Then I remembered we currently don't allow deep paging in our current 
: search indexes as performance declines the deeper you go.  Is this still 
: the case?

Coincidentally, i'm working on a new cursor-based API to make this much more 
feasible as we speak..

https://issues.apache.org/jira/browse/SOLR-5463

I did some simple perf testing of the strawman approach and posted the 
results last week...

http://searchhub.org/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

...current iterations on the patch are to eliminate the 
strawman code to improve performance even more and beef up the test 
cases.

: If so, is there another approach to make all the data in a collection 
: easily available for retrieval?  The only thing I can think of is to 
	...
: Then I was thinking we could have a field with an incrementing numeric 
: value which could be used to perform range queries as a substitute for 
: paging through everything.  Ie queries like 'IncrementalField:[1 TO 
: 100]' 'IncrementalField:[101 TO 200]' but this would be difficult to 
: maintain as we update the index unless we reindex the entire collection 
: every time we update any docs at all.

As i mentioned in the blog above, as long as you have a uniqueKey field 
that supports range queries, bulk exporting of all documents is fairly 
trivial by sorting on your uniqueKey field and using an fq that also 
filters on your uniqueKey field; modify the fq each time to change the 
lower bound to match the highest ID you got on the previous "page".  

This approach works really well in simple cases where you want to "fetch 
all" documents matching a query and then process/sort them by some other 
criteria on the client -- but it's not viable if it's important to you 
that the documents come back from solr in score order before your client 
gets them because you want to "stop fetching" once some criteria is met in 
your client.  Example: you have billions of documents matching a query, 
you want to fetch all sorted by score desc and crunch them on your client 
to compute some stats, and once your client side stat crunching tells you 
you have enough results (which might be after the 1000th result, or might 
be after the millionth result) then you want to stop.

SOLR-5463 will help even in that latter case.  The bulk of the patch should 
be easy to use in the next day or so (having other people try out and 
test in their applications would be *very* helpful) and hopefully show up 
in Solr 4.7
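
(The SOLR-5463 work did ship in Solr 4.7 as the cursorMark parameter. A sketch of one request in that cursor loop, with `id` as an assumed uniqueKey field:)

```python
def cursor_params(cursor_mark="*", rows=100):
    """One step of a cursorMark scan (the form SOLR-5463 took in Solr 4.7).

    The sort must be deterministic, so it has to end with the uniqueKey
    field as a tiebreaker.  Start with cursorMark='*', feed each
    response's nextCursorMark into the following request, and stop when
    it no longer changes between responses.
    """
    return {
        "q": "*:*",
        "sort": "score desc, id asc",  # uniqueKey tiebreaker is required
        "rows": rows,
        "cursorMark": cursor_mark,
    }
```

Unlike the fq-on-uniqueKey trick, this keeps working when the primary sort is score or some other non-unique field.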

-Hoss
http://www.lucidworks.com/