Posted to solr-user@lucene.apache.org by vsriram30 <vs...@gmail.com> on 2015/03/12 23:10:15 UTC

Best way to dump out entire solr content?

Hi All,

I have a SolrCloud cluster of 20 nodes, with each node holding close to
20 million records; the total index size is around 400GB (20GB per node X 20
nodes). I am trying to find the best way to dump out the entire Solr content
in, say, CSV format.

I run successive queries, incrementing the start param by 2000 while
keeping rows at 2000, and I hit each individual server with
distrib=false so that I don't overload the top-level server and cause
timeouts between the top-level and lower-level servers. I get responses
from Solr very quickly while the start param is in the lower millions (< 2
million). As the start param grows towards 16 million, Solr takes almost 2
to 3 minutes to return those 2000 records for a single query. I assume
this is because it has to skip over all the lower index positions to reach
a start index of > 16 million before it can return the results.
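
Roughly, the loop I run per core looks like this sketch (the core URL is
made up, and it assumes a SolrJ 5.x-style HttpSolrClient):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class StartRowsDump {
        public static void main(String[] args) throws Exception {
            // Made-up core URL; I run one such loop per core.
            HttpSolrClient core = new HttpSolrClient(
                "http://node1:8983/solr/collection1_shard1_replica1");
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false");   // don't fan out to other shards
            q.setRows(2000);
            for (int start = 0; ; start += 2000) {
                q.setStart(start);
                QueryResponse rsp = core.query(q);
                if (rsp.getResults().isEmpty()) break;
                // ... append rsp.getResults() to the CSV file ...
            }
            core.close();
        }
    }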

Is there a better way to do this? I saw the cursor feature in the Solr
pagination wiki, but it is mentioned that it requires a sort on a unique
field. Would it make sense for my use case to sort on my Solr unique key
field, with rows at 2000, and keep using the nextCursorMark to dump out
all the documents in CSV format?
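
Something like this sketch is what I have in mind (the uniqueKey field
name "id", the URL, and the CSV writing are placeholders):

    import java.io.PrintWriter;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorDump {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient("http://node1:8983/solr/collection1");
            PrintWriter out = new PrintWriter("dump.csv");
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(2000);
            q.setSort(SolrQuery.SortClause.asc("id"));  // "id" = my uniqueKey
            String mark = CursorMarkParams.CURSOR_MARK_START;  // "*"
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // naive output; real code would write all fields
                    // with proper CSV escaping
                    out.println(doc.getFieldValue("id"));
                }
                String next = rsp.getNextCursorMark();
                if (mark.equals(next)) break;  // cursor stopped moving: done
                mark = next;
            }
            out.close();
            solr.close();
        }
    }

If I go with plain HTTP instead of SolrJ, I believe I could also add
wt=csv and have Solr emit the CSV directly.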

Thanks,
Sriram





Re: Best way to dump out entire solr content?

Posted by vsriram30 <vs...@gmail.com>.
Great! Thanks for providing more info, Toke Eskildsen.

Thanks,
Sriram




Re: Best way to dump out entire solr content?

Posted by Toke Eskildsen <te...@statsbiblioteket.dk>.
On Fri, 2015-03-13 at 00:32 +0100, vsriram30 wrote:
> But as you say, the internal skipping done by the cursor is probably more
> efficient than the skipping done by increasing start, so I will use
> cursors. Kindly correct me if my understanding is not right.

Let's say you want page 5,000 and that the page size is 1,000.

Non-cursor skipping is the same as making a request for the top 5,000,000
hits, then extracting the last 1,000 entries from that. It just happens
under the hood.

Cursor-based skipping is, performance-wise, the same as making a request
for the first top 1,000. There is practically no difference in speed
between page 1 and page 5,000. I say practically because, on paper,
requesting page 5,000 will be a smidgen faster (there are fewer inserts
into the priority queue), but I doubt it can be measured in real-world
setups.
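
Expressed as SolrJ parameters, purely for illustration (the uniqueKey
field name is assumed):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.params.CursorMarkParams;

    public class PagingContrast {
        public static void main(String[] args) {
            // Non-cursor: Solr internally builds a 5,001,000-entry result
            // and throws away everything before entry 5,000,001.
            SolrQuery deep = new SolrQuery("*:*");
            deep.setStart(5000000);
            deep.setRows(1000);

            // Cursor: the priority queue never holds more than 1,000
            // entries; the mark from the previous response tells Solr
            // where to resume.
            SolrQuery cursor = new SolrQuery("*:*");
            cursor.setRows(1000);
            cursor.setSort(SolrQuery.SortClause.asc("id"));  // uniqueKey
            cursor.set(CursorMarkParams.CURSOR_MARK_PARAM,
                       CursorMarkParams.CURSOR_MARK_START);  // "*" at first
        }
    }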

- Toke Eskildsen



Re: Best way to dump out entire solr content?

Posted by vsriram30 <vs...@gmail.com>.
Thanks, Alex, for the explanation. Actually, since I am scraping all the
content from Solr, I am doing a generic query of *:*, so I would think it
should not take so much time, right?

But as you say, the internal skipping done by the cursor is probably more
efficient than the skipping done by increasing start, so I will use
cursors. Kindly correct me if my understanding is not right.

Thanks,
Sriram




Re: Best way to dump out entire solr content?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Without a cursor, you are rerunning a full search every time, so the
slowdown is entirely expected.

With a cursor, you do not. It does an internal skip based on the cursor
value. I think the sort is there to ensure the ordering is stable.

Basically, you need to use the cursor.
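
As I understand it, the cursor also works per core together with the
distrib=false trick from your first mail, so each core can be walked
independently. A rough sketch (core URL and field name are made up):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class PerCoreCursorDump {
        public static void main(String[] args) throws Exception {
            // Made-up core URL; repeat for each core in the cluster.
            HttpSolrClient core = new HttpSolrClient(
                "http://node1:8983/solr/collection1_shard1_replica1");
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false");                  // stay on this core
            q.setRows(2000);
            q.setSort(SolrQuery.SortClause.asc("id"));  // stable uniqueKey order
            String mark = CursorMarkParams.CURSOR_MARK_START;
            while (true) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, mark);
                QueryResponse rsp = core.query(q);
                // ... write rsp.getResults() out ...
                String next = rsp.getNextCursorMark();
                if (mark.equals(next)) break;           // no more documents
                mark = next;
            }
            core.close();
        }
    }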

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/



Re: Best way to dump out entire solr content?

Posted by vsriram30 <vs...@gmail.com>.
Thanks, Alex, for the quick response. I wanted to avoid reading the Lucene
index directly, to avoid the complications of merging in deleted-document
information. Also, I would like to do this on a fairly frequent basis,
like once every two or three days.

I am wondering whether the issues I faced while scraping the index towards
the higher millions will be resolved with a cursor. Do you think using a
cursor to scrape Solr, with a sort on the unique key field, is better than
not using one, or does it do the same skip operations and take as long as
it does without the cursor?

Thanks,
Sriram




Re: Best way to dump out entire solr content?

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Well, it's the cursor or nothing. Well, or some sort of custom code to
manually read the Lucene indexes (good luck with deleted items, etc.).

I think your understanding is correct.

Regards,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/

