Posted to solr-user@lucene.apache.org by Ian Connor <ia...@gmail.com> on 2009/01/25 21:16:06 UTC

fastest way to index/reindex

Hi,

Given the only real way to reindex is to save the document again, what is
the fastest way to extract all the documents from a Solr index to resave
them?

I have tried the id:[* TO *] trick; however, it gets slow once you get a
few thousand documents into the index. Are there any tools that will quickly
export the index to a text file, or is making queries 1000 at a time the best
option, and just dealing with the time it takes to query once you are deep
into the index?

-- 
Regards,

Ian Connor

Re: fastest way to index/reindex

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

*:* will default to sorting by document insertion order (Lucene's  
document id, _not_ your Solr uniqueKey).  And no, you won't miss any  
by paging - order will be maintained.

	Erik
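
As a minimal sketch of the paging described above, the snippet below builds
one select URL per page of a *:* query using the standard q, start and rows
parameters. The host, port and path in SOLR_SELECT_URL are assumptions, as is
the helper name page_urls; it also assumes the index is not being modified
while you page, since adds or deletes during the export could shift the
windows.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: adjust host, port and path for your own install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def page_urls(total_docs, rows=1000, base_url=SOLR_SELECT_URL):
    """Yield one select URL per page of a *:* query, in index order."""
    for start in range(0, total_docs, rows):
        # q, start and rows are the usual Solr paging parameters.
        params = {"q": "*:*", "start": start, "rows": rows}
        yield base_url + "?" + urlencode(params)


# 2500 documents at 1000 per page gives three pages: 0, 1000, 2000.
urls = list(page_urls(total_docs=2500, rows=1000))
```

Each URL can then be fetched in turn with whatever HTTP client you prefer.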

Re: fastest way to index/reindex

Posted by Josiane Gamgo <ga...@googlemail.com>.

Thanks for your answer.
This is what I am trying to do:
I would like to find out how to customize the Lucene indexing process to
obtain a faster search, either with Luke or with some other tool.


Re: fastest way to index/reindex

Posted by Erick Erickson <er...@gmail.com>.

Please don't hijack topic threads; start a new one.

http://en.wikipedia.org/wiki/Thread_hijacking

Best
Erick

MergeFactor isn't closely related to searching, Luke isn't used in
the indexing process, and why do you care how fast Luke is?

When you start a new post on this topic, please give an idea of
the problem you're trying to solve or that you are having; it'll lead
to much better answers.



Re: fastest way to index/reindex

Posted by Josiane Gamgo <ga...@googlemail.com>.

How fast is the search if the MergeFactor of the Lucene index is set to 20 or
more? Has anybody used Luke to optimize the indexing process? I would like
to know how fast Luke is.
Thanks


Re: fastest way to index/reindex

Posted by Ian Connor <ia...@gmail.com>.

When you query by *:*, what order does it use? Is there a chance the documents
will come back in a different order as you page through the results (so that
some are missed or duplicated)? Is it best to sort explicitly by 'id', or is
that implied already?

-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor

Re: fastest way to index/reindex

Posted by Ian Connor <ia...@gmail.com>.

*:* took it up to 45/sec from 28/sec so a nice 60% bump in performance -
thanks!

-- 
Regards,

Ian Connor

Re: fastest way to index/reindex

Posted by Ian Connor <ia...@gmail.com>.

I have about 2.5 million documents per shard and seem to be getting through
28/sec fetching 1000 at a time. It ran all yesterday and part of the night. It
is over the 1.6 million mark now, so I hope it can keep up a similar rate as it
gets deeper into the index.

I need to reindex it all because I changed how some of the fields are
indexed (not stored). So the stored data is fine - I just needed to change
how the index filter was applied.

This is probably going to be an iterative process, so a full reindex will not
be an unusual event as I find optimizations and new ways to do things. I
wanted it to be as painless as possible.

id:[* TO *] seems okay, but I wanted to know if there was a smarter way.
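
To put the rates in this thread into wall-clock terms, here is a small
back-of-the-envelope calculation. It assumes the rate stays constant for the
whole shard, which may not hold as paging gets deeper into the index; the
function name hours_to_export is just for illustration.

```python
def hours_to_export(total_docs, docs_per_sec):
    """Hours needed to walk total_docs at a constant docs_per_sec."""
    return total_docs / docs_per_sec / 3600.0


DOCS_PER_SHARD = 2_500_000

# 28/sec with id:[* TO *] versus the 45/sec reported with *:* elsewhere
# in the thread.
hours_at_28 = hours_to_export(DOCS_PER_SHARD, 28)  # roughly 24.8 hours
hours_at_45 = hours_to_export(DOCS_PER_SHARD, 45)  # roughly 15.4 hours
speedup_pct = (45 / 28 - 1) * 100                  # roughly 61 percent
```

So, under that assumption, a full pass over one shard is about a day at
28/sec and a bit over 15 hours at 45/sec.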

-- 
Regards,

Ian Connor

Re: fastest way to index/reindex

Posted by Julian Davchev <jm...@drun.net>.

I don't quite get why you would reindex all the data at once.
Each document has a unique id, so you can reindex only what's needed. Also,
if there is too much to do in one go, I'd suggest using some
batch processor that adds N tasks with range queries (1:10, 10:20,
etc.) and a cron job executing those. Thousands seems OK, but when you hit
millions you're in trouble.
Cheers.
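
One way the batching idea above could look, as a rough sketch: split an id
span into fixed-size, non-overlapping windows and emit one range query per
window for a job runner to pick up. This assumes the ids are integers stored
in a field type whose range queries compare numerically (a plain string field
would compare lexicographically); the function name and the 1 to 25 span are
made up for the example.

```python
def id_range_queries(min_id, max_id, batch_size):
    """Return range queries covering min_id..max_id inclusive, no overlap."""
    queries = []
    for low in range(min_id, max_id + 1, batch_size):
        # Square brackets are inclusive on both ends, so stop one short
        # of the next window's lower bound.
        high = min(low + batch_size - 1, max_id)
        queries.append("id:[%d TO %d]" % (low, high))
    return queries


tasks = id_range_queries(1, 25, 10)
# tasks == ["id:[1 TO 10]", "id:[11 TO 20]", "id:[21 TO 25]"]
```

Each entry in tasks could then be queued as a separate job.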

Re: fastest way to index/reindex

Posted by Ryan McKinley <ry...@gmail.com>.

I don't know of any standard export/import tool -- I think Luke has
something, but it will be faster if you write your own.

Rather than id:[* TO *], just try *:* -- this should match all
documents without using a range query.
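
For anyone who does write their own, a minimal sketch of such an export loop
might look like the following. It is not a standard tool: fetch_page is a
caller-supplied function (hypothetical here) that returns the documents for
one window of a *:* query, and the output format of one JSON object per line
is just one reasonable choice. A fake in-memory index stands in for Solr so
the sketch runs without a server.

```python
import io
import json


def export_all(fetch_page, out, rows=1000):
    """Write every document as one JSON line; return the count written.

    fetch_page(start, rows) must return a list of documents (dicts) for
    that window, or an empty list once the index is exhausted.
    """
    start = 0
    written = 0
    while True:
        docs = fetch_page(start, rows)
        if not docs:
            break
        for doc in docs:
            out.write(json.dumps(doc, sort_keys=True) + "\n")
            written += 1
        start += rows
    return written


# Stand-in for a real Solr call, so the sketch is self-contained.
FAKE_INDEX = [{"id": str(i), "title": "doc %d" % i} for i in range(2500)]


def fake_fetch(start, rows):
    return FAKE_INDEX[start:start + rows]


buffer = io.StringIO()
count = export_all(fake_fetch, buffer, rows=1000)
```

In real use, out would be an open text file and fetch_page would issue the
HTTP request and parse the response.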
