Posted to users@solr.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2023/05/17 16:01:06 UTC

Copy-field doesn't seem to be working as expected

All,

I have a Solr 7.7.3 server with a pretty simple index including a
copy-field. I have confirmed that the actually-loaded index contains 
these (among other) fields:

- identifier, type=text_general, multivalued=true ("copied to 'all'")
- all, type=text_general, multivalued=true ("copied from 'identifier' (and
others)")

The "all" field contains copies of the other fields' values for each
record I've studied except for "identifier".

I have re-indexed the whole document set and the "all" field still does 
not contain the values I can see (in the search results) for "identifier".

I'm using Solr's console for all investigations, so there is no other 
software playing games with what is shown in the index, etc.

The other fields being copied into "all" are all multivalued=false which 
is the only thing I can think of that might be a problem, but I can't 
find any documentation which suggests it wouldn't work. In fact, the 
documentation[1] seems to explicitly declare that this should work:

"
Remember to configure your fields as multivalued="true" if they will 
ultimately get multiple values (either from a multivalued source or from 
multiple copyField directives).
"

So multi-valued source should not be a problem.

Any suggestions for where to look for a problem?

-chris

[1] https://solr.apache.org/guide/7_7/copying-fields.html
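For reference, here is a minimal sketch of how the copy rule described above could be declared through Solr's Schema API using the add-copy-field command. The host, port, and core name "example" are assumptions for illustration only; the thread does not state them. The sketch only builds the request, it does not send it.

```python
import json


def copy_field_command(source, dest):
    """Build the JSON body for a Schema API add-copy-field command."""
    return json.dumps({"add-copy-field": {"source": source, "dest": [dest]}})


def schema_url(base_url, core):
    """Schema API endpoint for one core; the command body is POSTed here."""
    return f"{base_url.rstrip('/')}/{core}/schema"


# Hypothetical base URL and core name.
print(schema_url("http://localhost:8983/solr", "example"))
print(copy_field_command("identifier", "all"))
```

Documents indexed before the rule was added are not updated retroactively, so a reindex is still needed after changing copy rules.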

Re: Copy-field doesn't seem to be working as expected

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Is your "all" field set to stored=true? It should not be, but if it is not
stored it will not show up in search results, even though it is still
indexed and searchable.

You could look at tokenized content or temporarily try setting fields to
stored=true and reindexing.
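One way to test searchability without relying on stored values is to query the "all" field directly for a value known to be in "identifier" and look only at numFound. A rough sketch that builds such a query URL follows; the base URL, core name, and the sample value "ABC-123" are hypothetical.

```python
from urllib.parse import urlencode


def probe_url(base_url, core, field, value):
    """Build a /select URL that asks whether `value` is searchable in `field`."""
    # Escape backslashes and quotes so the value is safe inside a phrase query.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    # rows=0: only the numFound count matters, not the stored fields.
    params = {"q": f'{field}:"{escaped}"', "rows": 0}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical core and identifier value.
print(probe_url("http://localhost:8983/solr", "example", "all", "ABC-123"))
```

A numFound greater than zero would suggest the value was copied and indexed even if it is not displayed.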

On Wed., May 17, 2023, 12:01 p.m. Christopher Schultz, <
chris@christopherschultz.net> wrote:

> All,
>
> I have a Solr 7.7.3 server with a pretty simple index including a
> copy-field. I have confirmed that the actually-loaded index contains
> these (among other) fields:
>
> - identifier, type=text_general, multivalued=true ("copied to 'all'")
> - all, type=text_general, multivalued=true ("copied from 'identifier' (and
> others)")
>
> The "all" field contains copies of the other fields' values for each
> record I've studied except for "identifier".
>
> I have re-indexed the whole document set and the "all" field still does
> not contain the values I can see (in the search results) for "identifier".
>
> I'm using Solr's console for all investigations, so there is no other
> software playing games with what is shown in the index, etc.
>
> The other fields being copied into "all" are all multivalued=false which
> is the only thing I can think of that might be a problem, but I can't
> find any documentation which suggests it wouldn't work. In fact, the
> documentation[1] seems to explicitly declare that this should work:
>
> "
> Remember to configure your fields as multivalued="true" if they will
> ultimately get multiple values (either from a multivalued source or from
> multiple copyField directives).
> "
>
> So multi-valued source should not be a problem.
>
> Any suggestions for where to look for a problem?
>
> -chris
>
> [1] https://solr.apache.org/guide/7_7/copying-fields.html
>

Re: Copy-field doesn't seem to be working as expected

Posted by Thomas Corthals <th...@klascement.net>.
On Sat, May 20, 2023 at 21:18, Shawn Heisey <ap...@elyograg.org> wrote:

> Agreed.  There are many situations outside of version upgrades where
> rebuilding the index from scratch is an absolute requirement.  It is
> something all Solr users need to be able to do at ANY time.  I used to
> maintain an index where a full rebuild would quite literally take about
> six or seven days, but I found a way to do it with zero downtime.
>

My rebuild procedure indexes the most recently added/modified documents
first and works its way back through almost 20 years of data. When the most
recent 1/4th of documents are reindexed after about a day, we can already
satisfy 90% of the search requests. This doesn't necessarily mean users
will always filter by date. For most searches they'll get fewer results
until indexing is completed, but they most likely won't page far enough to
notice the difference. It works because our default sort is by date and
more recent results are usually preferred. Just throwing it out there
because a similar approach might be "good enough" for someone else too.
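The ordering described above can be sketched roughly as follows. This is not Thomas's actual procedure, only an illustration of the idea; the "modified" key and the batch size are assumptions.

```python
from datetime import date


def newest_first_batches(docs, batch_size):
    """Yield batches of docs ordered from most to least recently modified."""
    ordered = sorted(docs, key=lambda d: d["modified"], reverse=True)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Hypothetical documents spanning many years.
docs = [
    {"id": "a", "modified": date(2005, 3, 1)},
    {"id": "b", "modified": date(2023, 5, 1)},
    {"id": "c", "modified": date(2014, 7, 9)},
    {"id": "d", "modified": date(2022, 11, 30)},
]
for batch in newest_first_batches(docs, 2):
    print([d["id"] for d in batch])
```

Each batch would be sent to the new index in order, so the documents most likely to be requested become searchable first.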

Thomas

Re: Copy-field doesn't seem to be working as expected

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/20/23 12:14, Dave wrote:
> I never trust a Solr upgrade path that carries an index from one major version to another.  In my opinion the index has to be completely recreated with the updated schema, because there are sometimes major changes, even though it's said you can go two versions up with the same index using the upgrade path.  I've had to rebuild indexes that took weeks to coordinate, but the mechanism was in place and ready to go.  I love the idea of holding one index in a core, building the next one in a secondary core, and switching the names.  It's almost seamless and has been a trusted mechanism in traditional databases for decades.

Version enforcement on upgrades started in 8.x.  Before 8.x, you COULD 
upgrade a Lucene index more than one major version.  It has always been 
discouraged, but it was still possible.  Now it's not possible.  I did 
once see a tool that would perform delicate surgery on an index to make 
such upgrades possible ... but that is a bad idea.

> Best of luck, but you should always have a path to completely destroy and rebuild a Solr index, as it's not to be trusted to be consistent; it's not a database. I mean, if you want speed it's on an SSD, which can fail at any given moment. Just things to consider going forward.

Agreed.  There are many situations outside of version upgrades where 
rebuilding the index from scratch is an absolute requirement.  It is 
something all Solr users need to be able to do at ANY time.  I used to 
maintain an index where a full rebuild would quite literally take about 
six or seven days, but I found a way to do it with zero downtime.

Thanks,
Shawn

Re: Copy-field doesn't seem to be working as expected

Posted by Dave <ha...@gmail.com>.
I never trust a Solr upgrade path that carries an index from one major version to another.  In my opinion the index has to be completely recreated with the updated schema, because there are sometimes major changes, even though it's said you can go two versions up with the same index using the upgrade path.  I've had to rebuild indexes that took weeks to coordinate, but the mechanism was in place and ready to go.  I love the idea of holding one index in a core, building the next one in a secondary core, and switching the names.  It's almost seamless and has been a trusted mechanism in traditional databases for decades.

Best of luck, but you should always have a path to completely destroy and rebuild a Solr index, as it's not to be trusted to be consistent; it's not a database. I mean, if you want speed it's on an SSD, which can fail at any given moment. Just things to consider going forward.

Also, you can index documents asynchronously and fork out the indexing processes to speed things up.  Something that takes four hours can be done in about one if it's forked four times, provided the Solr server has the CPUs and you commit wisely (don't commit until your process is done).
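A minimal sketch of that pattern: split the documents into chunks, send them concurrently, and commit once at the end. It uses threads rather than forked processes to keep the example short, and send_batch and commit are hypothetical callables standing in for whatever client code talks to Solr. The four-to-one speedup is the ideal case, not a guarantee.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def index_in_parallel(docs, workers, send_batch, commit):
    """Send chunks concurrently, then commit exactly once at the end."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every chunk and re-raises any worker exception.
        sent = list(pool.map(send_batch, split_evenly(docs, workers)))
    commit()
    return sum(sent)


# Stand-ins: `len` pretends to send a chunk and reports how many docs it held.
total = index_in_parallel(list(range(10)), 4, len, lambda: print("commit"))
print(total)
```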

Hope it works, look forward to the follow up

Dave

> On May 20, 2023, at 1:53 PM, Shawn Heisey <ap...@elyograg.org> wrote:
> 
> On 5/19/23 15:39, Christopher Schultz wrote:
>> Please confirm the following:
>> 1. Solr index is created with Solr 7.something
>> 2. Solr 8.x is deployed and all is well
>> 3. Index is re-built by replacing 100% of documents in the index
>> 4. Solr 9.x is deployed and all is well
>> Is that correct, especially #4? I'd hate to have to literally delete the index and re-create it, since it's supposed to be online all the time and it takes hours to re-index everything.
> 
> With that sequence, you might have a problem at step 4.  I am not completely sure whether all the version 7 info is gone.  It might work fine.
> 
> Given that you're not in cloud mode, here is how I would arrange things.  I have used this before with good success:
> 
> * Two cores.
>  * Directories named example_0 and example_1
>  * Cores named example and example_build
> 
> Build a new index in the example_build core and swap the cores using CoreAdmin when the full rebuild is done.  Nothing ever goes down.
> 
> Using the _0 and _1 directory names stays true to the principle of least surprise.  Otherwise you will find yourself in a situation where the core named "example" is housed in a directory named "example_build" because the cores have been swapped.
> 
> In cloud mode, I would use the alias feature.  Have collections named "example_2023.05.20" (or whatever naming convention makes sense to you), with an alias named example that points to whichever real collection is online.
> 
> Thanks,
> Shawn

Re: Copy-field doesn't seem to be working as expected

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/19/23 15:39, Christopher Schultz wrote:
> Please confirm the following:
> 
> 1. Solr index is created with Solr 7.something
> 2. Solr 8.x is deployed and all is well
> 3. Index is re-built by replacing 100% of documents in the index
> 4. Solr 9.x is deployed and all is well
> 
> Is that correct, especially #4? I'd hate to have to literally delete the 
> index and re-create it, since it's supposed to be online all the time 
> and it takes hours to re-index everything.

With that sequence, you might have a problem at step 4.  I am not 
completely sure whether all the version 7 info is gone.  It might work fine.

Given that you're not in cloud mode, here is how I would arrange things.
I have used this before with good success:

* Two cores.
   * Directories named example_0 and example_1
   * Cores named example and example_build

Build a new index in the example_build core and swap the cores using 
CoreAdmin when the full rebuild is done.  Nothing ever goes down.

Using the _0 and _1 directory names stays true to the principle of least 
surprise.  Otherwise you will find yourself in a situation where the 
core named "example" is housed in a directory named "example_build" 
because the cores have been swapped.
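As a rough illustration of the swap step, the CoreAdmin API accepts action=SWAP with the "core" and "other" parameters. The sketch below only builds the URL; the host and port are assumptions, and the core names are the example names used above.

```python
from urllib.parse import urlencode


def swap_cores_url(base_url, live_core, build_core):
    """Build the CoreAdmin SWAP request that exchanges two core names."""
    params = {"action": "SWAP", "core": live_core, "other": build_core}
    return f"{base_url.rstrip('/')}/admin/cores?{urlencode(params)}"


# Hypothetical base URL.
print(swap_cores_url("http://localhost:8983/solr", "example", "example_build"))
```

After the swap, queries sent to "example" are served by the freshly built index, and the next rebuild goes into "example_build" again.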

In cloud mode, I would use the alias feature.  Have collections named 
"example_2023.05.20" (or whatever naming convention makes sense to you), 
with an alias named example that points to whichever real collection is 
online.
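For the cloud-mode variant, a sketch of the naming convention and the Collections API CREATEALIAS call might look like this. The base URL is an assumption, and my understanding is that issuing CREATEALIAS for an existing alias name re-points it, which is worth confirming against the reference guide for your version.

```python
from datetime import date
from urllib.parse import urlencode


def dated_collection(alias, day):
    """Name a real collection after the alias plus the build date."""
    return f"{alias}_{day:%Y.%m.%d}"


def create_alias_url(base_url, alias, collection):
    """Build the Collections API request pointing `alias` at `collection`."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical base URL; the date matches the example name above.
target = dated_collection("example", date(2023, 5, 20))
print(create_alias_url("http://localhost:8983/solr", "example", target))
```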

Thanks,
Shawn

Re: Copy-field doesn't seem to be working as expected

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Shawn,

On 5/18/23 14:35, Shawn Heisey wrote:
> On 5/18/23 10:27, Christopher Schultz wrote:
>> I didn't know there were multiple SolrJ implementations. I'm using the 
>> client library directly from the Solr project with a version number of 
>> 7.7.3. It looks like I have been running against an 8.1.1 server in my 
>> development environment while we have 7.7.3 in both staging and 
>> production. My goal was to upgrade to Solr 8.latest in the very near 
>> future, but I wanted to have all this code in-place to allow for 
>> completely automated schema updates and index re-build before doing 
>> that, because I understand that moving between major versions 
>> basically requires a complete index re-build. I'd rather have a 
>> completely point-and-click admin-initiated process for that than a 
>> manual "type these 40 commands" process to make the migration super 
>> duper easy.
> 
> There are basically four client implementations that most end users 
> might use.
> 
> 1) Cloud client based on Apache HttpClient.  Deprecated.  Class name 
> CloudSolrClient.
> 2) Http client based on Apache HttpClient.  Deprecated.  Class name 
> HttpSolrClient.
> 3) Cloud client based on Jetty HttpClient.  Capable of HTTP2.  Class 
> name CloudHttp2SolrClient.
> 4) Http client based on Jetty HttpClient.  Capable of HTTP2.  Class name 
> Http2SolrClient.

I'm using the org.apache.solr.client.solrj.impl.HttpSolrClient class 
from solr-solrj-7.7.3.jar library. It looks like I'm using it with
org.apache.http.client.HttpClient so I guess I'm in bucket #2 above.

> There are some other client implementations, but they are mainly used 
> internally by the four mentioned clients or internally by Solr itself.
> 
> If your version 7 index was built from scratch by Solr 7.x, then you can 
> upgrade it to 8.x with no problem.  If any version before 7.0 has EVER 
> touched the index, then 8.x will not open it.  Version 9 is similar, 
> only opening indexes originally built by 8.0 or later.

Please confirm the following:

1. Solr index is created with Solr 7.something
2. Solr 8.x is deployed and all is well
3. Index is re-built by replacing 100% of documents in the index
4. Solr 9.x is deployed and all is well

Is that correct, especially #4? I'd hate to have to literally delete the 
index and re-create it, since it's supposed to be online all the time 
and it takes hours to re-index everything.

> Even when a version upgrade can use the existing index, a full re-index 
> is still recommended.

Is a full-re-index defined as "replace every single document with a new 
fresh copy of itself"? If so, then I'm all good.

Thanks,
-chris


Re: Copy-field doesn't seem to be working as expected

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/18/23 10:27, Christopher Schultz wrote:
> I didn't know there were multiple SolrJ implementations. I'm using the 
> client library directly from the Solr project with a version number of 
> 7.7.3. It looks like I have been running against an 8.1.1 server in my 
> development environment while we have 7.7.3 in both staging and 
> production. My goal was to upgrade to Solr 8.latest in the very near 
> future, but I wanted to have all this code in-place to allow for 
> completely automated schema updates and index re-build before doing 
> that, because I understand that moving between major versions basically 
> requires a complete index re-build. I'd rather have a completely 
> point-and-click admin-initiated process for that than a manual "type 
> these 40 commands" process to make the migration super duper easy.

There are basically four client implementations that most end users 
might use.

1) Cloud client based on Apache HttpClient.  Deprecated.  Class name 
CloudSolrClient.
2) Http client based on Apache HttpClient.  Deprecated.  Class name 
HttpSolrClient.
3) Cloud client based on Jetty HttpClient.  Capable of HTTP2.  Class 
name CloudHttp2SolrClient.
4) Http client based on Jetty HttpClient.  Capable of HTTP2.  Class name 
Http2SolrClient.

There are some other client implementations, but they are mainly used 
internally by the four mentioned clients or internally by Solr itself.

If your version 7 index was built from scratch by Solr 7.x, then you can 
upgrade it to 8.x with no problem.  If any version before 7.0 has EVER 
touched the index, then 8.x will not open it.  Version 9 is similar, 
only opening indexes originally built by 8.0 or later.

Even when a version upgrade can use the existing index, a full re-index 
is still recommended.
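The compatibility rule stated above can be written as a small check. This is only a model of the rule as described in this thread: it takes the major version that first created the index as an input. Whether replacing every document in place changes that recorded version is exactly the open question, and the sketch does not settle it.

```python
def can_open(solr_major, index_created_major):
    """Model of the rule: Solr N opens indexes first created by N-1 or N."""
    return solr_major - 1 <= index_created_major <= solr_major


# Index created by 7.x, server upgraded step by step.
print(can_open(8, 7))
print(can_open(9, 7))
```

If the recorded creation version stays at 7 after an in-place reindex under 8.x, the second check is the case to worry about, which is why building into a brand-new core or collection is the safer route.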

Thanks,
Shawn

Re: Copy-field doesn't seem to be working as expected

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Shawn,

On 5/17/23 21:45, Shawn Heisey wrote:
> On 5/17/23 11:40, Christopher Schultz wrote:
>> Thanks for your replies and I apologize for the noise. I'll pick this 
>> thread back up if for some reason I am able to reproduce the issue.
> 
> I can't tell you how many times I have done this.  Ask for help, and 
> while working diligently to document the problem beyond my initial 
> description, I either can't reproduce it or the solution becomes 
> extremely obvious.  I consider that to be a learning experience.

:)

I usually discover most problems and fix them during the "asking for 
help" drafting process, and end up never sending the message. In this 
case, I must have missed something.

>> Speaking of the lag-between-insert-and-searchability, is there any 
>> information Solr is able to provide regarding a core's freshness?
> 
> <snip>
> 
>>              // lastModified=Mon Mar 06 14:58:22 EST 2023,sizeInBytes=56606,size=55.28 KB}}}
>>
>> Presumably, lastModified gives me the timestamp the last document was 
>> added. What about when the index was opened for searching?
> 
> Yes, that is exactly what I was going to point you at.  The info comes 
> from Lucene and is the last time ANY change was made to the index ... 
> add, update, delete, etc.

Great. I'm already reporting this to the admin user, so I now just have 
to add...

> As for when the searcher was opened, that is slightly complicated if 
> you're in cloud mode, but it's really easy if you're in standalone mode, 
> because the core name will be known in advance.  In cloud mode you won't 
> always know the core name to look for just based on the collection name.

As it happens, we are using standalone cores, so I get to choose the 
"easy path" for now. ;)

> Here's how you would parse it with jq ... I am pretty sure there is a 
> way to do it with SolrJ too:
> 
> https://www.dropbox.com/s/fs01ogmtqwsj3yd/using_jq_to_parse_metrics_for_searcher_open_time.png?dl=0

So... call /solr/admin/metrics and look through the stuff. I'll see if 
there is a convenient SolrJ mechanism for that.

> I will see if I can hack together some SolrJ code to duplicate that. Are 
> you in cloud mode or standalone?  If cloud mode, which SolrClient 
> implementation are you using?  I will use SolrJ 9.2.1 to work on it ... 
> hopefully it's not horrible to translate to a version 7 SolrJ.

I didn't know there were multiple SolrJ implementations. I'm using the 
client library directly from the Solr project with a version number of 
7.7.3. It looks like I have been running against an 8.1.1 server in my 
development environment while we have 7.7.3 in both staging and 
production. My goal was to upgrade to Solr 8.latest in the very near 
future, but I wanted to have all this code in-place to allow for 
completely automated schema updates and index re-build before doing 
that, because I understand that moving between major versions basically 
requires a complete index re-build. I'd rather have a completely 
point-and-click admin-initiated process for that than a manual "type 
these 40 commands" process to make the migration super duper easy.

So, long story short, if there is a specific reason that upgrading to 
8.latest will make this easier, consider it done.

Thanks,
-chris

Re: Copy-field doesn't seem to be working as expected

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/17/23 11:40, Christopher Schultz wrote:
> Thanks for your replies and I apologize for the noise. I'll pick this 
> thread back up if for some reason I am able to reproduce the issue.

I can't tell you how many times I have done this.  Ask for help, and 
while working diligently to document the problem beyond my initial 
description, I either can't reproduce it or the solution becomes 
extremely obvious.  I consider that to be a learning experience.

> Speaking of the lag-between-insert-and-searchability, is there any 
> information Solr is able to provide regarding a core's freshness?

<snip>

>              // lastModified=Mon Mar 06 14:58:22 EST 2023,sizeInBytes=56606,size=55.28 KB}}}
> 
> Presumably, lastModified gives me the timestamp the last document was 
> added. What about when the index was opened for searching?

Yes, that is exactly what I was going to point you at.  The info comes 
from Lucene and is the last time ANY change was made to the index ... 
add, update, delete, etc.

As for when the searcher was opened, that is slightly complicated if 
you're in cloud mode, but it's really easy if you're in standalone mode, 
because the core name will be known in advance.  In cloud mode you won't 
always know the core name to look for just based on the collection name.

Here's how you would parse it with jq ... I am pretty sure there is a 
way to do it with SolrJ too:

https://www.dropbox.com/s/fs01ogmtqwsj3yd/using_jq_to_parse_metrics_for_searcher_open_time.png?dl=0
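Since the screenshot may not stay available, here is a rough equivalent that extracts the same value from a Metrics API response. The registry name "solr.core.<core>" (standalone mode), the gauge name "SEARCHER.searcher.openedAt", and the response shape are assumptions based on my reading of the Metrics API, so check them against the real output of /solr/admin/metrics on your version. The base URL and core name are hypothetical.

```python
from urllib.parse import urlencode


def metrics_url(base_url):
    """Build a Metrics API request limited to core-level searcher metrics."""
    params = {"group": "core", "prefix": "SEARCHER.searcher"}
    return f"{base_url.rstrip('/')}/admin/metrics?{urlencode(params)}"


def searcher_opened_at(metrics_response, core):
    """Return the searcher open time reported for `core`, or None if absent."""
    registry = metrics_response.get("metrics", {}).get("solr.core." + core, {})
    value = registry.get("SEARCHER.searcher.openedAt")
    # Non-compact responses may wrap each gauge as {"value": ...}.
    if isinstance(value, dict):
        value = value.get("value")
    return value


# Assumed response shape, with a made-up timestamp.
sample = {"metrics": {"solr.core.example": {
    "SEARCHER.searcher.openedAt": "2023-05-17T17:32:00.000Z"}}}
print(metrics_url("http://localhost:8983/solr"))
print(searcher_opened_at(sample, "example"))
```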

I will see if I can hack together some SolrJ code to duplicate that. 
Are you in cloud mode or standalone?  If cloud mode, which SolrClient 
implementation are you using?  I will use SolrJ 9.2.1 to work on it ... 
hopefully it's not horrible to translate to a version 7 SolrJ.

Thanks,
Shawn

Re: Copy-field doesn't seem to be working as expected

Posted by Christopher Schultz <ch...@christopherschultz.net>.
Shawn and Alexandre,

On 5/17/23 13:12, Shawn Heisey wrote:
> On 5/17/23 10:01, Christopher Schultz wrote:
>> The "all" field contains copies of the other fields' values for each
>> record I've studied except for "identifier".
>>
>> I have re-indexed the whole document set and the "all" field still 
>> does not contain the values I can see (in the search results) for 
>> "identifier".
> 
> Can you share your schema?  If you need to redact sensitive info from 
> it, please do it in a way that ensures we can distinguish one bit of 
> redacted data from other redacted bits.
> 
> Part of my intent in asking is to find out the answer to the question 
> that Alexandre asked.  It will also provide data to determine what 
> questions I will ask next.

All of my fields are stored (this is how I knew that other field values 
were in fact available in the "all" field).

I think this might be a false-alarm. As careful as I tried to be to make 
sure to describe the situation as accurately and completely as
possible, I cannot replicate it. I inserted a new document into the 
index (via my own software) and the field values were copied as expected.

I wonder if my problem was a timing issue between inserting the document 
and the server-specified soft-auto-commit value. We know there is a 
delay between when the data are inserted into the index and when they 
can be found successfully via a search. I did not take any screenshots 
at the time of the field values so I can't even be sure I wasn't just 
having selective-vision at the time.
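When checking by hand, one way to rule out the timing explanation is to make the change visible explicitly instead of waiting for the auto-commit. The update handler accepts the softCommit and commitWithin request parameters; the sketch below only builds the URL, and the base URL and core name are assumptions.

```python
from urllib.parse import urlencode


def update_url(base_url, core, soft_commit=False, commit_within_ms=None):
    """Build an /update URL, optionally asking Solr to make changes visible."""
    params = {}
    if soft_commit:
        params["softCommit"] = "true"
    if commit_within_ms is not None:
        params["commitWithin"] = commit_within_ms
    url = f"{base_url.rstrip('/')}/{core}/update"
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical core; open a new searcher right after this one update.
print(update_url("http://localhost:8983/solr", "example", soft_commit=True))
```

Committing on every update is expensive, so this is meant for a one-off test document rather than for routine indexing.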

Thanks for your replies and I apologize for the noise. I'll pick this 
thread back up if for some reason I am able to reproduce the issue.

Speaking of the lag-between-insert-and-searchability, is there any 
information Solr is able to provide regarding a core's freshness? I have 
an administrative interface in my application I've been building which 
is able to provide some basic information about a core, "freshen" a core 
schema, and re-index the core with data from my application. I would 
love to be able to show "last data added to index today 13:34:46" and 
"last soft commit/searcher-open (or whatever the right term is) today 
13:32:00" so the admin can see "okay, we have a blind-spot which extends 
00:02:46 into the past". Does the core metadata give that kind of info? 
I'm currently using SolrJ's CoreAdminRequest.getStatus call to get the 
metadata.

I can see this kind of data in there (this is old data in a 
code-comment; please ignore the actual values):
             // index={numDocs=85,maxDoc=90,deletedDocs=5,
             // indexHeapUsageBytes=-1,
             // version=2093,
             // segmentCount=8,
             // current=true,
             // hasDeletions=true,
             // directory=org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/path lockFactory=org.apache.lucene.store.NativeFSLockFactory@52e9883c; maxCacheMB=48.0 maxMergeSizeMB=4.0),
             // segmentsFile=segments_az,
             // segmentsFileSizeInBytes=650,
             // userData={commitCommandVer=0, commitTimeMSec=1678132702948},
             // lastModified=Mon Mar 06 14:58:22 EST 2023,sizeInBytes=56606,size=55.28 KB}}}

Presumably, lastModified gives me the timestamp the last document was 
added. What about when the index was opened for searching?
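The commitTimeMSec value in that dump looks like milliseconds since the Unix epoch: converting 1678132702948 gives 19:58:22 UTC on 2023-03-06, which is the same instant as the lastModified of 14:58:22 EST. A small sketch of that conversion and of the "blind spot" calculation follows; where the searcher-open time comes from is left open here, and the two clock times are the hypothetical ones from the paragraph above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(ms):
    """Convert epoch milliseconds (e.g. commitTimeMSec) to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def blind_spot(last_added, searcher_opened):
    """How far the newest data is ahead of what searches can see (never negative)."""
    return max(last_added - searcher_opened, timedelta(0))


print(from_epoch_millis(1678132702948))
print(blind_spot(datetime(2023, 5, 17, 13, 34, 46),
                 datetime(2023, 5, 17, 13, 32, 0)))
```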

As always, thank you for your thoughts.

-chris

Re: Copy-field doesn't seem to be working as expected

Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/17/23 10:01, Christopher Schultz wrote:
> The "all" field contains copies of the other fields' values for each
> record I've studied except for "identifier".
> 
> I have re-indexed the whole document set and the "all" field still does 
> not contain the values I can see (in the search results) for "identifier".

Can you share your schema?  If you need to redact sensitive info from 
it, please do it in a way that ensures we can distinguish one bit of 
redacted data from other redacted bits.

Part of my intent in asking is to find out the answer to the question 
that Alexandre asked.  It will also provide data to determine what 
questions I will ask next.

Thanks,
Shawn