Posted to solr-user@lucene.apache.org by Bjarke Buur Mortensen <mo...@eluence.com> on 2020/04/27 11:11:08 UTC

Reindexing using dataimporthandler

Hi list,

Let's say I add a copyField to my Solr schema, change the analysis chain
of a field, or make some other change.
It seems tempting to use a very simple dataimporthandler to reindex all
documents, by using a SolrEntityProcessor that points to itself. I have
just done this for a very small collection, but I was wondering what the
caveats are, since this is not the recommended practice. What can go wrong
with this approach?

<document>
  <entity name="all_from_self" processor="SolrEntityProcessor"
          url="http://localhost:8983/solr/mycollection" qt="lucene"
          query="*:*" wt="javabin" rows="1000" cursorMark="true"
          sort="id asc" fl="*,orig_version_l:_version_"/>
</document>

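PS: (It is probably necessary to add a _version_:[* TO
<current_highest_version>] clause to the query to ensure it terminates for
large imports)
PPS: (Obviously the import must not clean the index first. Since clean
defaults to true for full-import, that means passing clean=false.)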

/Bjarke
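
A minimal sketch of the two details from the PS and PPS, under stated assumptions: the core URL is the one from the post, /dataimport is the conventional handler path (the real one depends on solrconfig.xml), the helper names are made up for illustration, and _version_ is assumed to be sortable and range-queryable (docValues or indexed). It only builds the request URLs and the bounded query string; it does not contact Solr.

```python
from urllib.parse import urlencode

# Assumed for illustration: the core URL from the post and the conventional
# /dataimport handler path (the actual path depends on solrconfig.xml).
CORE_URL = "http://localhost:8983/solr/mycollection"


def max_version_url(core_url):
    """URL of a query returning only the document with the highest _version_."""
    params = {"q": "*:*", "sort": "_version_ desc", "rows": 1,
              "fl": "_version_", "wt": "json"}
    return f"{core_url}/select?{urlencode(params)}"


def bounded_query(max_version):
    """Query matching only documents written before the import started."""
    return f"_version_:[* TO {max_version}]"


def full_import_url(core_url):
    """DIH full-import that keeps existing documents (clean=false)."""
    params = {"command": "full-import", "clean": "false", "commit": "true"}
    return f"{core_url}/dataimport?{urlencode(params)}"


print(max_version_url(CORE_URL))
# Hypothetical version value, only to show the shape of the query string.
print(bounded_query(1665123456789012345))
print(full_import_url(CORE_URL))
```

The bounded query would replace query="*:*" in the entity definition. Documents rewritten by the import get a new, higher _version_ and so fall outside the range, which is what should make the import terminate.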

Re: Reindexing using dataimporthandler

Posted by Erick Erickson <er...@gmail.com>.

You’re welcome.

Solr is a huge beast; I don’t think any single individual
knows all the bits and pieces… Or, in my case, can
remember them ;)

> On Apr 27, 2020, at 9:15 AM, Bjarke Buur Mortensen <mo...@eluence.com> wrote:
> 
> Wow, thanks, Erick. That's actually much better :-)
> You live and you learn.
> 
> Cheers,
> Bjarke

Re: Reindexing using dataimporthandler

Posted by Bjarke Buur Mortensen <mo...@eluence.com>.

Wow, thanks, Erick. That's actually much better :-)
You live and you learn.

Cheers,
Bjarke

On Mon, Apr 27, 2020 at 15:00, Erick Erickson <
erickerickson@gmail.com> wrote:

> What about the Collections API REINDEXCOLLECTION? That has the
> advantage of being something officially supported, puts the source
> collection into read-only mode, uses a much more efficient query
> process (streaming actually) etc.
>
> It has the disadvantage of producing a new collection under the
> covers and aliasing to it. But you can always rename the collection
> later.
>
> Best,
> Erick

Re: Reindexing using dataimporthandler

Posted by Erick Erickson <er...@gmail.com>.

What about the Collections API REINDEXCOLLECTION? That has the
advantage of being something officially supported, puts the source
collection into read-only mode, uses a much more efficient query
process (streaming actually) etc. 

It has the disadvantage of producing a new collection under the
covers and aliasing to it. But you can always rename the collection
later.

Best,
Erick
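
A minimal sketch of what such a request could look like, assuming a SolrCloud node at http://localhost:8983/solr and the collection name mycollection from earlier in the thread. The helper name reindex_url is made up for illustration. The sketch only builds the Collections API URL and does not send it.

```python
from urllib.parse import urlencode


def reindex_url(solr_base, collection, cmd="start", **options):
    """Build a Collections API REINDEXCOLLECTION request URL.

    cmd is one of start, status or abort; extra keyword arguments are
    passed through as request parameters (for example target="...").
    """
    params = {"action": "REINDEXCOLLECTION", "name": collection, "cmd": cmd}
    params.update(options)
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# Assumed base URL and collection name, taken from the thread's examples.
SOLR_BASE = "http://localhost:8983/solr"
print(reindex_url(SOLR_BASE, "mycollection"))
print(reindex_url(SOLR_BASE, "mycollection", cmd="status"))
```

Polling with cmd=status is one way to follow progress on a large collection; which optional parameters are worth setting depends on the Solr version, so check the reference guide for the release in use.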

> On Apr 27, 2020, at 8:23 AM, Bjarke Buur Mortensen <mo...@eluence.com> wrote:
> 
> Thanks for the reply,
> I'm on Solr 8.2, so cursorMark is there.
> 
> Doing this from one collection to another and then using a collection
> alias is probably the way to go, but my suggestion was actually a little
> bolder:
> 
> I'm indexing on top of the same core, i.e. from
> http://localhost:8983/solr/mycollection to
> http://localhost:8983/solr/mycollection
> 
> (This is why I suggested adding a _version_:[* TO <current_highest_version>]
> clause to ensure it terminates for large imports.)
> 
> With this in mind, do you still think this is a safe approach?
> 
> Thanks,
> Bjarke

Re: Reindexing using dataimporthandler

Posted by Bjarke Buur Mortensen <mo...@eluence.com>.

Thanks for the reply,
I'm on Solr 8.2, so cursorMark is there.

Doing this from one collection to another and then using a collection
alias is probably the way to go, but my suggestion was actually a little
bolder:

I'm indexing on top of the same core, i.e. from
http://localhost:8983/solr/mycollection to
http://localhost:8983/solr/mycollection

(This is why I suggested adding a _version_:[* TO <current_highest_version>]
clause to ensure it terminates for large imports.)

With this in mind, do you still think this is a safe approach?

Thanks,
Bjarke


On Mon, Apr 27, 2020 at 13:46, Emir Arnautović <
emir.arnautovic@sematext.com> wrote:

> Hi Bjarke,
> I don’t see a problem with that approach if you have enough resources to
> handle both cores at the same time, especially if you are doing it while
> serving production queries. The one requirement is that all fields must be
> stored, since the reindex can only copy values that can be read back from
> the source. Also note that cursorMark support was added to the entity
> processor somewhat later, so on an older version of Solr you might not
> have cursors. I found that out the hard way.
>
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/

Re: Reindexing using dataimporthandler

Posted by Emir Arnautović <em...@sematext.com>.

Hi Bjarke,
I don’t see a problem with that approach if you have enough resources to handle both cores at the same time, especially if you are doing it while serving production queries. The one requirement is that all fields must be stored, since the reindex can only copy values that can be read back from the source. Also note that cursorMark support was added to the entity processor somewhat later, so on an older version of Solr you might not have cursors. I found that out the hard way.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
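
A minimal sketch of how the "all fields stored" requirement could be checked before reindexing from the index itself. It assumes the field list has already been fetched as JSON from the Schema API (for example GET /solr/mycollection/schema/fields?showDefaults=true, so that properties inherited from the field type are included). The helper name and the sample data are hypothetical.

```python
def unstored_fields(schema_response):
    """Names of fields explicitly marked stored=false in a Schema API reply."""
    return [field["name"]
            for field in schema_response.get("fields", [])
            if field.get("stored") is False]


# Hypothetical sample shaped like a /schema/fields response.
sample = {"fields": [
    {"name": "id", "type": "string", "stored": True},
    {"name": "title", "type": "text_general", "stored": True},
    {"name": "text_all", "type": "text_general", "stored": False},
]}

print(unstored_fields(sample))
```

A field flagged here is not necessarily lost: copyField destinations are rebuilt from their sources, and docValues fields may still be returned when useDocValuesAsStored is enabled. Anything else in the list would need another source of truth.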


