Posted to users@solr.apache.org by Christopher Schultz <ch...@christopherschultz.net> on 2023/05/17 16:01:06 UTC
Copy-field doesn't seem to be working as expected
All,
I have a Solr 7.7.3 server with a pretty simple index including a
copy-field. I have confirmed that the actually-loaded index contains
these (among other) fields:
- identifier, type=text_general, multivalued=true ("copied to 'all'")
- all, type=text_general, multivalued=true ("copied from 'identifier' (and
others)")
The "all" field contains copies of the other fields' values for each
record I've studied except for "identifier".
I have re-indexed the whole document set and the "all" field still does
not contain the values I can see (in the search results) for "identifier".
I'm using Solr's console for all investigations, so there is no other
software playing games with what is shown in the index, etc.
The other fields being copied into "all" are all multivalued=false which
is the only thing I can think of that might be a problem, but I can't
find any documentation which suggests it wouldn't work. In fact, the
documentation[1] seems to explicitly declare that this should work:
"
Remember to configure your fields as multivalued="true" if they will
ultimately get multiple values (either from a multivalued source or from
multiple copyField directives).
"
So multi-valued source should not be a problem.
Any suggestions for where to look for a problem?
-chris
[1] https://solr.apache.org/guide/7_7/copying-fields.html
Re: Copy-field doesn't seem to be working as expected
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Is your "all" field set to stored=true? It should not be, but that means it
would not show up in search results even though it is still indexed and
searchable.
You could look at tokenized content or temporarily try setting fields to
stored=true and reindexing.
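One way to check whether a value reached an unstored destination field is to search that field directly and look at numFound in the response. The following is a minimal sketch that only builds the query URL; the base URL, the core name "example", and the sample value are assumptions for illustration and do not come from this thread.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CopyFieldCheck {
    // Builds a /select URL that searches one field for a phrase and returns no rows,
    // so only the numFound count in the response matters.
    static String selectUrl(String baseUrl, String core, String field, String value) {
        // Quote the value as a phrase; escape embedded backslashes and double quotes.
        String phrase = "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        String q = URLEncoder.encode(field + ":" + phrase, StandardCharsets.UTF_8);
        return baseUrl + "/" + core + "/select?q=" + q + "&rows=0";
    }
}
```

If a query for a known "identifier" value against "all" reports numFound greater than zero, the copy worked even though the field is not displayed.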
Re: Copy-field doesn't seem to be working as expected
Posted by Thomas Corthals <th...@klascement.net>.
Op za 20 mei 2023 om 21:18 schreef Shawn Heisey <ap...@elyograg.org>:
> Agreed. There are many situations outside of version upgrades where
> rebuilding the index from scratch is an absolute requirement. It is
> something all Solr users need to be able to do at ANY time. I used to
> maintain an index where a full rebuild would quite literally take about
> six or seven days, but I found a way to do it with zero downtime.
>
My rebuild procedure indexes the most recently added/modified documents
first and works its way back through almost 20 years of data. When the most
recent 1/4th of documents are reindexed after about a day, we can already
satisfy 90% of the search requests. This doesn't necessarily mean users
will always filter by date. For most searches they'll get fewer results
until indexing is completed, but they most likely won't page far enough to
notice the difference. It works because our default sort is by date and
more recent results are usually preferred. Just throwing it out there
because a similar approach might be "good enough" for someone else too.
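The ordering idea can be sketched as follows. This is only an illustration: the Doc class and its fields are hypothetical stand-ins for whatever the source system provides, and a real rebuild over 20 years of data would more likely page through the source in date order than sort everything in memory.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class NewestFirst {
    // Hypothetical source document: an id plus its last-modified time.
    static final class Doc {
        final String id;
        final Instant modified;
        Doc(String id, Instant modified) {
            this.id = id;
            this.modified = modified;
        }
    }

    // Returns a copy of the input ordered from most recently modified to oldest,
    // which is the order in which the documents would be sent for reindexing.
    static List<Doc> rebuildOrder(List<Doc> docs) {
        List<Doc> ordered = new ArrayList<>(docs);
        ordered.sort(Comparator.comparing((Doc d) -> d.modified).reversed());
        return ordered;
    }
}
```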
Thomas
Re: Copy-field doesn't seem to be working as expected
Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/20/23 12:14, Dave wrote:
> I never trust a solr upgrade path with an index from one major version to another. It has to be completely recreated in my opinion with the updated schema as sometimes there may be major changes, even though it’s said you can go two versions up with the same index using the upgrade path. I’ve had to rebuild indexes that take weeks to coordinate but the mechanism was in place and ready to do. I love the idea of holding one index in a core and building the next one in a secondary core and switching the names. It’s almost seamless and has been a trusted mechanism in traditional databases for decades.
Version enforcement on upgrades started in 8.x. Before 8.x, you COULD
upgrade a Lucene index more than one major version. It has always been
discouraged, but it was still possible. Now it's not possible. I did
once see a tool that would perform delicate surgery on an index to make
such upgrades possible ... but that is a bad idea.
> Best of luck, but you should always have a path to completely destroy and rebuild a solr index as it’s not to be trusted to be consistent, it’s not a database. I mean if you want speed it’s on an ssd, which can fail at any given moment but you want the speed, just things to consider going forward.
Agreed. There are many situations outside of version upgrades where
rebuilding the index from scratch is an absolute requirement. It is
something all Solr users need to be able to do at ANY time. I used to
maintain an index where a full rebuild would quite literally take about
six or seven days, but I found a way to do it with zero downtime.
Thanks,
Shawn
Re: Copy-field doesn't seem to be working as expected
Posted by Dave <ha...@gmail.com>.
I never trust a solr upgrade path with an index from one major version to another. It has to be completely recreated in my opinion with the updated schema as sometimes there may be major changes, even though it’s said you can go two versions up with the same index using the upgrade path. I’ve had to rebuild indexes that take weeks to coordinate but the mechanism was in place and ready to do. I love the idea of holding one index in a core and building the next one in a secondary core and switching the names. It’s almost seamless and has been a trusted mechanism in traditional databases for decades.
Best of luck, but you should always have a path to completely destroy and rebuild a solr index as it’s not to be trusted to be consistent, it’s not a database. I mean if you want speed it’s on an ssd, which can fail at any given moment but you want the speed, just things to consider going forward.
Also you can index documents asynchronously and fork out the indexing processes to speed it up. So something that takes four hours can be done in one if it’s forked four times, etc., if the Solr server has the CPUs and you commit wisely (don’t commit until your process is done).
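A minimal sketch of that pattern, using threads in one JVM rather than forked processes. The Indexer interface is hypothetical (it is not a SolrJ type) and stands in for whatever client sends batches to Solr; the point is that batches go out in parallel and the commit happens once, after every batch has succeeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelIndexing {
    // Hypothetical abstraction over the Solr client.
    interface Indexer {
        void add(List<String> batch); // send one batch without committing
        void commit();                // make all sent documents visible
    }

    // Splits docs into batches, sends them from a pool of worker threads,
    // and commits exactly once after all batches have completed.
    static void indexAll(List<String> docs, int batchSize, int workers, Indexer indexer) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<CompletableFuture<Void>> pending = new ArrayList<>();
            for (int start = 0; start < docs.size(); start += batchSize) {
                List<String> batch = docs.subList(start, Math.min(start + batchSize, docs.size()));
                pending.add(CompletableFuture.runAsync(() -> indexer.add(batch), pool));
            }
            for (CompletableFuture<Void> f : pending) {
                f.join(); // rethrows a failed batch, so the commit below is skipped
            }
            indexer.commit();
        } finally {
            pool.shutdown();
        }
    }
}
```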
Hope it works, look forward to the follow up
Dave
Re: Copy-field doesn't seem to be working as expected
Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/19/23 15:39, Christopher Schultz wrote:
> Please confirm the following:
>
> 1. Solr index is created with Solr 7.something
> 2. Solr 8.x is deployed and all is well
> 3. Index is re-built by replacing 100% of documents in the index
> 4. Solr 9.x is deployed and all is well
>
> Is that correct, especially #4? I'd hate to have to literally delete the
> index and re-create it, since it's supposed to be online all the time
> and it takes hours to re-index everything.
With that sequence, you might have a problem at step 4. I am not
completely sure whether all the version 7 info is gone. It might work fine.
Given that you're not in cloud mode, here is how I would arrange things.
I have used this before with good success:
* Two cores.
* Directories named example_0 and example_1
* Cores named example and example_build
Build a new index in the example_build core and swap the cores using
CoreAdmin when the full rebuild is done. Nothing ever goes down.
Using the _0 and _1 directory names stays true to the principle of least
surprise. Otherwise you will find yourself in a situation where the
core named "example" is housed in a directory named "example_build"
because the cores have been swapped.
In cloud mode, I would use the alias feature. Have collections named
"example_2023.05.20" (or whatever naming convention makes sense to you),
with an alias named example that points to whichever real collection is
online.
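Both switches are single HTTP calls. The sketch below only builds the request URLs: CoreAdmin SWAP for standalone mode and Collections API CREATEALIAS for cloud mode. The base URL is an assumption; the core, alias, and collection names are the ones used above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class IndexSwitch {
    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // CoreAdmin SWAP: exchanges the names of two cores on a standalone server.
    static String swapCoresUrl(String baseUrl, String liveCore, String buildCore) {
        return baseUrl + "/admin/cores?action=SWAP&core=" + enc(liveCore)
                + "&other=" + enc(buildCore);
    }

    // Collections API CREATEALIAS: creates an alias or re-points an existing one.
    static String createAliasUrl(String baseUrl, String alias, String collection) {
        return baseUrl + "/admin/collections?action=CREATEALIAS&name=" + enc(alias)
                + "&collections=" + enc(collection);
    }
}
```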
Thanks,
Shawn
Re: Copy-field doesn't seem to be working as expected
Posted by Christopher Schultz <ch...@christopherschultz.net>.
Shawn,
On 5/18/23 14:35, Shawn Heisey wrote:
> On 5/18/23 10:27, Christopher Schultz wrote:
>> I didn't know there were multiple SolrJ implementations. I'm using the
>> client library directly from the Solr project with a version number of
>> 7.7.3. It looks like I have been running against an 8.1.1 server in my
>> development environment while we have 7.7.3 in both staging and
>> production. My goal was to upgrade to Solr 8.latest in the very near
>> future, but I wanted to have all this code in-place to allow for
>> completely automated schema updates and index re-build before doing
>> that, because I understand that moving between major versions
>> basically requires a complete index re-build. I'd rather have a
>> completely point-and-click admin-initiated process for that than a
>> manual "type these 40 commands" process to make the migration super
>> duper easy.
>
> There are basically four client implementations that most end users
> might use.
>
> 1) Cloud client based on Apache HttpClient. Deprecated. Class name
> CloudSolrClient.
> 2) Http client based on Apache HttpClient. Deprecated. Class name
> HttpSolrClient.
> 3) Cloud client based on Jetty HttpClient. Capable of HTTP2. Class
> name CloudHttp2SolrClient.
> 4) Http client based on Jetty HttpClient. Capable of HTTP2. Class name
> Http2SolrClient.
I'm using the org.apache.solr.client.solrj.impl.HttpSolrClient class
from solr-solrj-7.7.3.jar library. It looks like I'm using it with
org.apache.http.client.HttpClient so I guess I'm in bucket #2 above.
> There are some other client implementations, but they are mainly used
> internally by the four mentioned clients or internally by Solr itself.
>
> If your version 7 index was built from scratch by Solr 7.x, then you can
> upgrade it to 8.x with no problem. If any version before 7.0 has EVER
> touched the index, then 8.x will not open it. Version 9 is similar,
> only opening indexes originally built by 8.0 or later.
Please confirm the following:
1. Solr index is created with Solr 7.something
2. Solr 8.x is deployed and all is well
3. Index is re-built by replacing 100% of documents in the index
4. Solr 9.x is deployed and all is well
Is that correct, especially #4? I'd hate to have to literally delete the
index and re-create it, since it's supposed to be online all the time
and it takes hours to re-index everything.
> Even when a version upgrade can use the existing index, a full re-index
> is still recommended.
Is a full-re-index defined as "replace every single document with a new
fresh copy of itself"? If so, then I'm all good.
Thanks,
-chris
Re: Copy-field doesn't seem to be working as expected
Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/18/23 10:27, Christopher Schultz wrote:
> I didn't know there were multiple SolrJ implementations. I'm using the
> client library directly from the Solr project with a version number of
> 7.7.3. It looks like I have been running against an 8.1.1 server in my
> development environment while we have 7.7.3 in both staging and
> production. My goal was to upgrade to Solr 8.latest in the very near
> future, but I wanted to have all this code in-place to allow for
> completely automated schema updates and index re-build before doing
> that, because I understand that moving between major versions basically
> requires a complete index re-build. I'd rather have a completely
> point-and-click admin-initiated process for that than a manual "type
> these 40 commands" process to make the migration super duper easy.
There are basically four client implementations that most end users
might use.
1) Cloud client based on Apache HttpClient. Deprecated. Class name
CloudSolrClient.
2) Http client based on Apache HttpClient. Deprecated. Class name
HttpSolrClient.
3) Cloud client based on Jetty HttpClient. Capable of HTTP2. Class
name CloudHttp2SolrClient.
4) Http client based on Jetty HttpClient. Capable of HTTP2. Class name
Http2SolrClient.
There are some other client implementations, but they are mainly used
internally by the four mentioned clients or internally by Solr itself.
If your version 7 index was built from scratch by Solr 7.x, then you can
upgrade it to 8.x with no problem. If any version before 7.0 has EVER
touched the index, then 8.x will not open it. Version 9 is similar,
only opening indexes originally built by 8.0 or later.
Even when a version upgrade can use the existing index, a full re-index
is still recommended.
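The compatibility rule as stated above can be written as a small predicate. This is a sketch of the rule as described in this message, not of how Lucene implements the check, and it assumes Solr 8 or later, since earlier versions did not enforce it.

```java
class IndexCompatibility {
    // True if a Solr of the given major version (8 or later) would open an index
    // whose oldest writer was the given major version: only the same major
    // version or the one immediately before it is accepted.
    static boolean canOpen(int solrMajor, int oldestMajorThatTouchedIndex) {
        return oldestMajorThatTouchedIndex >= solrMajor - 1
                && oldestMajorThatTouchedIndex <= solrMajor;
    }
}
```

Under this rule, canOpen(8, 7) and canOpen(9, 8) hold, while canOpen(8, 6) and canOpen(9, 7) do not.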
Thanks,
Shawn
Re: Copy-field doesn't seem to be working as expected
Posted by Christopher Schultz <ch...@christopherschultz.net>.
Shawn,
On 5/17/23 21:45, Shawn Heisey wrote:
> On 5/17/23 11:40, Christopher Schultz wrote:
>> Thanks for your replies and I apologize for the noise. I'll pick this
>> thread back up if for some reason I am able to reproduce the issue.
>
> I can't tell you how many times I have done this. Ask for help, and
> while working diligently to document the problem beyond my initial
> description, I either can't reproduce it or the solution becomes
> extremely obvious. I consider that to be a learning experience.
:)
I usually discover most problems and fix them during the "asking for
help" drafting process, and end up never sending the message. In this
case, I must have missed something.
>> Speaking of the lag-between-insert-and-searchability, is there any
>> information Solr is able to provide regarding a core's freshness?
>
> <snip>
>
>> // lastModified=Mon Mar 06 14:58:22 EST
>> 2023,sizeInBytes=56606,size=55.28 KB}}}
>>
>> Presumably, lastModified gives me the timestamp the last document was
>> added. What about when the index was opened for searching?
>
> Yes, that is exactly what I was going to point you at. The info comes
> from Lucene and is the last time ANY change was made to the index ...
> add, update, delete, etc.
Great. I'm already reporting this to the admin user, so I now just have
to add...
> As for when the searcher was opened, that is slightly complicated if
> you're in cloud mode, but it's really easy if you're in standalone mode,
> because the core name will be known in advance. In cloud mode you won't
> always know the core name to look for just based on the collection name.
As it happens, we are using standalone cores, so I get to choose the
"easy path" for now. ;)
> Here's how you would parse it with jq ... I am pretty sure there is a
> way to do it with SolrJ too:
>
> https://www.dropbox.com/s/fs01ogmtqwsj3yd/using_jq_to_parse_metrics_for_searcher_open_time.png?dl=0
So... call /solr/admin/metrics and look through the stuff. I'll see if
there is a convenient SolrJ mechanism for that.
> I will see if I can hack together some SolrJ code to duplicate that. Are
> you in cloud mode or standalone? If cloud mode, which SolrClient
> implementation are you using? I will use SolrJ 9.2.1 to work on it ...
> hopefully it's not horrible to translate to a version 7 SolrJ.
I didn't know there were multiple SolrJ implementations. I'm using the
client library directly from the Solr project with a version number of
7.7.3. It looks like I have been running against an 8.1.1 server in my
development environment while we have 7.7.3 in both staging and
production. My goal was to upgrade to Solr 8.latest in the very near
future, but I wanted to have all this code in-place to allow for
completely automated schema updates and index re-build before doing
that, because I understand that moving between major versions basically
requires a complete index re-build. I'd rather have a completely
point-and-click admin-initiated process for that than a manual "type
these 40 commands" process to make the migration super duper easy.
So, long story short, if there is a specific reason that upgrading to
8.latest will make this easier, consider it done.
Thanks,
-chris
Re: Copy-field doesn't seem to be working as expected
Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/17/23 11:40, Christopher Schultz wrote:
> Thanks for your replies and I apologize for the noise. I'll pick this
> thread back up if for some reason I am able to reproduce the issue.
I can't tell you how many times I have done this. Ask for help, and
while working diligently to document the problem beyond my initial
description, I either can't reproduce it or the solution becomes
extremely obvious. I consider that to be a learning experience.
> Speaking of the lag-between-insert-and-searchability, is there any
> information Solr is able to provide regarding a core's freshness?
<snip>
> // lastModified=Mon Mar 06 14:58:22 EST
> 2023,sizeInBytes=56606,size=55.28 KB}}}
>
> Presumably, lastModified gives me the timestamp the last document was
> added. What about when the index was opened for searching?
Yes, that is exactly what I was going to point you at. The info comes
from Lucene and is the last time ANY change was made to the index ...
add, update, delete, etc.
As for when the searcher was opened, that is slightly complicated if
you're in cloud mode, but it's really easy if you're in standalone mode,
because the core name will be known in advance. In cloud mode you won't
always know the core name to look for just based on the collection name.
Here's how you would parse it with jq ... I am pretty sure there is a
way to do it with SolrJ too:
https://www.dropbox.com/s/fs01ogmtqwsj3yd/using_jq_to_parse_metrics_for_searcher_open_time.png?dl=0
I will see if I can hack together some SolrJ code to duplicate that.
Are you in cloud mode or standalone? If cloud mode, which SolrClient
implementation are you using? I will use SolrJ 9.2.1 to work on it ...
hopefully it's not horrible to translate to a version 7 SolrJ.
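Until SolrJ code is available, the same information can be fetched over plain HTTP. The sketch below uses only the JDK's HTTP client. It assumes standalone mode, where the metrics registry for a core is named "solr.core." followed by the core name, and it assumes the searcher gauges appear under keys such as SEARCHER.searcher.openedAt and SEARCHER.searcher.registeredAt; those key names should be checked against the actual response of the server in use.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SearcherMetrics {
    // Name of the metrics registry for a standalone core.
    static String registry(String core) {
        return "solr.core." + core;
    }

    // Metrics API URL limited to core-level metrics whose names start with SEARCHER.searcher.
    static String metricsUrl(String baseUrl) {
        return baseUrl + "/admin/metrics?group=core&prefix=SEARCHER.searcher&wt=json";
    }

    // Fetches the raw JSON body; parsing is left to whatever JSON library is at hand.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```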
Thanks,
Shawn
Re: Copy-field doesn't seem to be working as expected
Posted by Christopher Schultz <ch...@christopherschultz.net>.
Shawn and Alexandre,
On 5/17/23 13:12, Shawn Heisey wrote:
> On 5/17/23 10:01, Christopher Schultz wrote:
>> The "all" field contains copies of the other fields values for each
>> record I've studied except for "identifier".
>>
>> I have re-indexed the whole document set and the "all" field still
>> does not contain the values I can see (in the search results) for
>> "identifier".
>
> Can you share your schema? If you need to redact sensitive info from
> it, please do it in a way that ensures we can distinguish one bit of
> redacted data from other redacted bits.
>
> Part of my intent in asking is to find out the answer to the question
> that Alexandre asked. It will also provide data to determine what
> questions I will ask next.
All of my fields are stored (this is how I knew that other field values
were in fact available in the "all" field).
I think this might be a false-alarm. As careful as I tried to be to make
sure to describe the situation as accurately and completely as
possible, I cannot replicate it. I inserted a new document into the
index (via my own software) and the field values were copied as expected.
I wonder if my problem was a timing issue between inserting the document
and the server-specified soft-auto-commit value. We know there is a
delay between when the data are inserted into the index and when they
can be found successfully via a search. I did not take any screenshots
at the time of the field values so I can't even be sure I wasn't just
having selective-vision at the time.
Thanks for your replies and I apologize for the noise. I'll pick this
thread back up if for some reason I am able to reproduce the issue.
Speaking of the lag-between-insert-and-searchability, is there any
information Solr is able to provide regarding a core's freshness? I have
an administrative interface in my application I've been building which
is able to provide some basic information about a core, "freshen" a core
schema, and re-index the core with data from my application. I would
love to be able to show "last data added to index today 13:34:46" and
"last soft commit/searcher-open (or whatever the right term is) today
13:32:00" so the admin can see "okay, we have a blind-spot which extends
00:02:46 into the past". Does the core metadata give that kind of info?
I'm currently using SolrJ's CoreAdminRequest.getStatus call to get the
metadata.
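The "blind spot" itself is simple arithmetic once both timestamps are known. A minimal sketch, assuming the two instants have already been obtained from wherever they turn out to live; the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

class BlindSpot {
    // Time between the last searcher open and the last index change.
    // Zero if the searcher was opened after the last change.
    static Duration between(Instant searcherOpened, Instant lastAdded) {
        Duration d = Duration.between(searcherOpened, lastAdded);
        return d.isNegative() ? Duration.ZERO : d;
    }

    // Formats a duration as HH:MM:SS for display in an admin page.
    static String format(Duration d) {
        return String.format("%02d:%02d:%02d", d.toHours(), d.toMinutesPart(), d.toSecondsPart());
    }
}
```

With a searcher opened at 13:32:00 and data last added at 13:34:46, format(between(...)) gives 00:02:46, matching the example above.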
I can see this kind of data in there (this is old data in a
code-comment; please ignore the actual values):
// index={numDocs=85,maxDoc=90,deletedDocs=5,
// indexHeapUsageBytes=-1,
// version=2093,
// segmentCount=8,
// current=true,
// hasDeletions=true,
// directory=org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/path lockFactory=org.apache.lucene.store.NativeFSLockFactory@52e9883c; maxCacheMB=48.0 maxMergeSizeMB=4.0),
// segmentsFile=segments_az,
// segmentsFileSizeInBytes=650,
// userData={commitCommandVer=0, commitTimeMSec=1678132702948},
// lastModified=Mon Mar 06 14:58:22 EST 2023,sizeInBytes=56606,size=55.28 KB}}}
Presumably, lastModified gives me the timestamp the last document was
added. What about when the index was opened for searching?
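The commitTimeMSec value in userData is an epoch timestamp in milliseconds, and in the sample above it corresponds to the same moment as lastModified. A minimal sketch of the conversion; the class and method names are made up for illustration.

```java
import java.time.Instant;

class CommitTime {
    // Converts the commitTimeMSec string from the index userData into an Instant.
    static Instant fromUserData(String commitTimeMSec) {
        return Instant.ofEpochMilli(Long.parseLong(commitTimeMSec));
    }
}
```

For the sample value 1678132702948 this yields 2023-03-06T19:58:22.948Z, which is 14:58:22 EST.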
As always, thank you for your thoughts.
-chris
Re: Copy-field doesn't seem to be working as expected
Posted by Shawn Heisey <ap...@elyograg.org>.
On 5/17/23 10:01, Christopher Schultz wrote:
> The "all" field contains copies of the other fields values for each
> record I've studied except for "identifier".
>
> I have re-indexed the whole document set and the "all" field still does
> not contain the values I can see (in the search results) for "identifier".
Can you share your schema? If you need to redact sensitive info from
it, please do it in a way that ensures we can distinguish one bit of
redacted data from other redacted bits.
Part of my intent in asking is to find out the answer to the question
that Alexandre asked. It will also provide data to determine what
questions I will ask next.
Thanks,
Shawn