Posted to solr-user@lucene.apache.org by Mike Lissner <ml...@michaeljaylissner.com> on 2016/10/08 00:19:33 UTC

Real Time Search and External File Fields

I have an index of about 4M documents with an external file field
configured to do boosting based on pagerank scores of each document. The
pagerank file is about 93MB as of today -- it's pretty big.

Each day, I add about 1,000 new documents to the index, and I need them to
be available as soon as possible so that I can send out alerts to our users
about new content (this is Google Alerts, essentially).

Soft commits seem to be exactly the thing for this, but whenever I open a
new searcher (which soft commits seem to do), the external file is
reloaded, and all queries are halted until it finishes loading. When I just
measured, this took about 30 seconds to complete. Most soft commit
documentation talks about setting up soft commits with <maxtime> of about a
second.

Is there anything I can do to make the external file field not get reloaded
constantly? It only changes about once a month, and I want to use soft
commits to power the alerts feature.

Thanks,

Mike

Re: Real Time Search and External File Fields

Posted by Mike Lissner <ml...@michaeljaylissner.com>.
On Sat, Oct 8, 2016 at 8:46 AM Shawn Heisey <ap...@elyograg.org> wrote:

> Most soft commit
> > documentation talks about setting up soft commits with <maxtime> of
> about a
> > second.
>
> IMHO any documentation that recommends autoSoftCommit with a maxTime of
> one second is bad documentation, and needs to be fixed.  Where have you
> seen such a recommendation?


You know, I must have made that up, sorry. But the documentation you linked
to (on the Lucidworks blog) and the example file say 15 seconds for hard
commits, so I think that got me thinking that soft commits could be more
frequent.

Should soft commits be less frequent than hard commits
(opensearcher=False)? If so, I didn't find that to be at all clear.


> right now Solr/Lucene has no
> way of knowing that your external file has not changed, so it must read
> the file every time it builds a searcher.


Is it crazy to file a feature request asking that Solr/Lucene keep the
modtime of this file and only reload it if it has changed? Seems like an
easy win.


>  I doubt this feature was
> designed to deal well with an extremely large external file like yours.
>

Perhaps not. It's probably worth mentioning that part of the reason the
file is so large is that pagerank produces very small, high-precision
floats. So a typical line is:

1=9.50539603222e-08

Not something smaller like:

1=3.2

Pagerank also provides a value for every item in the index, so that makes
the file long. I'd suspect that anybody with a pagerank boosted index of
moderate size would have a similarly-sized file.
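For context, the feature under discussion is Solr's ExternalFileField. A minimal schema sketch of how such a field is typically declared (the field name mirrors the external_pagerank field mentioned later in the thread; keyField, defVal, and valType are assumptions about this particular configuration):

```xml
<!-- The values live in a file named external_<fieldname> in the index
     data directory, one key=value line per document, exactly like the
     pagerank lines shown above. -->
<fieldType name="externalPagerank" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="external_pagerank" type="externalPagerank"/>
```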


> If the info changes that infrequently, can you just incorporate it
> directly into the index with a standard field, with the info coming in
> as a part of your normal indexing process?


We've considered that, but whenever you re-run pagerank, it updates EVERY
value. So I guess we could try updating every doc in our index whenever we
run pagerank, but that's a nasty solution.


> It seems unlikely that Solr would stop serving queries while setting up
> a new searcher.  The old searcher should continue to serve requests
> until the new searcher is ready.  If this is happening, that definitely
> seems like a bug.
>

I'm positive I've observed this, though you're right, some queries still
seem to come through. Is it possible that queries relying on the field are
stopped while the field is loading? I've observed this two ways:

1. From the front end, things were stalling every time I was doing a hard
commit (opensearcher=true). I had hard commits coming in every ten minutes
via cron job, and sure enough, at ten, twenty, thirty...minutes after every
hour, I'd see stalls.

2. Watching the logs, I saw a flood of queries come through after the line:

Loaded external value source external_pagerank

Some queries were coming through before this line, but I think none of
those queries use the external file field (external_pagerank).

Mike

Re: Real Time Search and External File Fields

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/7/2016 6:19 PM, Mike Lissner wrote:
> Soft commits seem to be exactly the thing for this, but whenever I open a
> new searcher (which soft commits seem to do), the external file is
> reloaded, and all queries are halted until it finishes loading. When I just
> measured, this took about 30 seconds to complete. Most soft commit
> documentation talks about setting up soft commits with <maxtime> of about a
> second.

IMHO any documentation that recommends autoSoftCommit with a maxTime of
one second is bad documentation, and needs to be fixed.  Where have you
seen such a recommendation?  Unless the index is extremely small and has
been thoroughly optimized for NRT (which usually means *no*
autowarming), achieving commit times of less than one second is usually
not possible.  This is the page that usually comes out when people start
talking about commits:

http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

On the topic of one-second commit latency, that page has this to say:
"Set your soft commit interval to as long as you can stand. Don’t listen
to your product manager who says “we need no more than 1 second
latency”. Really. Push back hard and see if the /user/ is best served or
will even notice. Soft commits and NRT are pretty amazing, but they’re
not free."

The intervals for autoCommit and autoSoftCommit that I like to see are
at LEAST one minute, and preferably longer if you can stand it.
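A sketch of what those intervals look like in solrconfig.xml (the one-minute values are illustrative, not prescriptive):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit: durability. With openSearcher=false it flushes to
       disk without opening a new searcher, so it stays fast. -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Soft commit: visibility. This is what opens the new searcher
       (and triggers the external file reload), so make it as long as
       you can stand. -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
</updateHandler>
```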

> Is there anything I can do to make the external file field not get reloaded
> constantly? It only changes about once a month, and I want to use soft
> commits to power the alerts feature.

Anytime you want changes to show up in your index, you need a new
searcher.  When you're using an external file field, part of that info
will come from that external source, and right now Solr/Lucene has no
way of knowing that your external file has not changed, so it must read
the file every time it builds a searcher.  I doubt this feature was
designed to deal well with an extremely large external file like yours. 
The code looks like it goes line by line reading the file, and although
I'm sure that process has been optimized as far as it can be, it still
takes a lot of time when there are millions of lines.

If the info changes that infrequently, can you just incorporate it
directly into the index with a standard field, with the info coming in
as a part of your normal indexing process?  I'm sure the performance
would be MUCH better if Solr didn't have to reference the external file.

It seems unlikely that Solr would stop serving queries while setting up
a new searcher.  The old searcher should continue to serve requests
until the new searcher is ready.  If this is happening, that definitely
seems like a bug.

Thanks,
Shawn


Re: Real Time Search and External File Fields

Posted by Mike Lissner <ml...@michaeljaylissner.com>.
Thanks for the replies. I made the changes so that the external file field
is loaded per:

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
      <!-- See:
https://github.com/freelawproject/courtlistener/issues/581#issuecomment-252443419-->
      <lst>
          <str name="q">*</str>
          <str name="sort">score desc</str>
      </lst>
  </arr>
</listener>


Looking at the logs I now have entries like:

8202710 [searcherExecutor-6-thread-1] INFO  org.apache.solr.core.SolrCore
– Loaded external value source external_pagerank :25 missing keys [4026457,
4026464, 4026468, 4026926, 4029539, 4030007, 4030897, 4030898, 4030899,
4031105]
8202722 [searcherExecutor-6-thread-1] INFO  org.apache.solr.core.SolrCore
– QuerySenderListener sending requests to Searcher@6c77ceaf[collection1]
main{StandardDirectoryReader(segments_380g:441995:nrt
_3liq(4.10.4):C3868116/58950:delGen=2145
_3nfp(4.10.4):C29745/12438:delGen=1033 _3pu5(4.10.4):C22649/7807:delGen=575
_3s9r(4.10.4):C30846/3868:delGen=374 _3r4k(4.10.4):C23730/6740:delGen=478
_3ti5(4.10.4):C6980/461:delGen=151 _3spo(4.10.4):C4447/741:delGen=202
_3u5f(4.10.4):C13863/240:delGen=17 _3tzy(4.10.4):C249/25:delGen=14
_3u0j(4.10.4):C229/17:delGen=4 _3u8n(4.10.4):C4440/45:delGen=5
_3u36(4.10.4):C161/12:delGen=9 _3u84(4.10.4):C4599/1685:delGen=8
_3u8v(4.10.4):C597/12:delGen=4 _3u80(4.10.4):C279/40:delGen=5
_3u9t(4.10.4):C357/4:delGen=2 _3u98(4.10.4):C128/10:delGen=5
_3u8h(4.10.4):C214/13:delGen=3 _3u8y(4.10.4):C119/32:delGen=5
_3ua5(4.10.4):C78/6 _3u94(4.10.4):C96/9:delGen=2
_3u93(4.10.4):C101/14:delGen=5 _3u9n(4.10.4):C214/59:delGen=2
_3u9p(4.10.4):C125/30:delGen=1 _3ua6(4.10.4):C5 _3ua7(4.10.4):C10
_3ua8(4.10.4):C1 _3ua9(4.10.4):C1)}
8202795 [searcherExecutor-6-thread-1] INFO  org.apache.solr.core.SolrCore
– [collection1] webapp=null path=null
params={sort=score+desc&event=newSearcher&q=*&distrib=false} hits=3919121
status=0 QTime=73
8202799 [searcherExecutor-6-thread-1] INFO  org.apache.solr.core.SolrCore
– QuerySenderListener done.

And after these entries, the queries are great. It fixed the issue.

BUT...from time to time, I still have entries like:

10921036 [qtp669611164-42] INFO  org.apache.solr.core.SolrCore  – Loaded
external value source external_pagerank :25 missing keys [4026457, 4026464,
4026468, 4026926, 4029539, 4030007, 4030897, 4030898, 4030899, 4031105]
10921037 [qtp669611164-38] INFO  org.apache.solr.core.SolrCore  –
[collection1] webapp=/solr path=/select/ params={foo=bar&bang=boof} hits=12
status=0 QTime=41337
10921038 [qtp669611164-27] INFO  org.apache.solr.core.SolrCore  –
[collection1] webapp=/solr path=/select/ params={foo=bar&bang=boof}
hits=275 status=0 QTime=22363

So, in those cases the listener didn't seem to fire, and the queries are
very slow (you can see the QTimes of 41 and 22 seconds above).

Any ideas why this would be? I didn't set up a listener for firstSearcher.
Could that be the cause? My understanding is that firstSearcher is only
triggered at Solr startup.

(I also created a new ticket to investigate whether the modtime of the
external file field could be used to avoid EFF reloads:
https://issues.apache.org/jira/browse/LUCENE-7488)

Thanks again. This is already a lot of progress.

Mike

On Sun, Oct 9, 2016 at 7:27 AM Shawn Heisey <ap...@elyograg.org> wrote:

> On 10/8/2016 1:18 PM, Mike Lissner wrote:
> > I want to make sure I understand this properly and document this for
> > future people that may find this thread. Here's what I interpret your
> > advice to be:
> > 0. Slacken my auto soft commit interval to something more like a minute.
>
> Yes, I would do this.  I would also increase autoCommit to something
> between one and five minutes, with openSearcher set to false.  There's
> nothing *wrong* with 15 seconds for autoCommit, but I want my server to
> be doing less work during normal operation.
>
> To answer a question you posed in a later message: Yes, it's common for
> users to have a longer interval on autoSoftCommit than autoCommit.
> Remember the mantra in the URL about understanding commits:  Hard
> commits are about durability, soft commits are about visibility.  Hard
> commits when openSearcher is false are almost always *very* fast, so
> it's typically not much of a burden to have them happen more frequently,
> and thus have a better data durability guarantee.  Like I said above, I
> generally use an autoCommit value between one and five minutes.
>
> > I'm a bit confused about the example autowarmcount for the caches, which
> is
> > 0. Why not set this to something higher? I guess it's a RAM utilization
> vs.
> > speed tradeoff? A low number like 16 seems like it'd have minimal impact
> on
> > RAM?
>
> A low autowarmCount is generally chosen for one reason: commit speed.
> If the example configs have it set to zero, I'm sure this was done so
> commits would proceed as fast as possible.  Large values can turn
> opening a new searcher into a process that can take *minutes*.
>
> On my index shards, the autowarmCount on my filterCache is *four*.
> That's it -- execute only four of the most recent filters in the cache
> when a new searcher opens.  That warming *still* sometimes takes as long
> as 20 seconds on the larger shards.  The filters used in queries on my
> indexes are very large and very complex, and can match millions of
> documents.  Pleading with the dev team to decrease query complexity
> doesn't help.
>
> On the idea of reusing the external file data when it doesn't change:  I
> do not know if this is possible.  I have no idea how Solr and Lucene use
> the data found in the external file, so it might be completely necessary
> to re-load it every time.  You can open an issue in Jira to explore the
> idea, but don't be too surprised if it doesn't go anywhere.
>
> Thanks,
> Shawn
>
>

Re: Real Time Search and External File Fields

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/8/2016 1:18 PM, Mike Lissner wrote:
> I want to make sure I understand this properly and document this for
> future people that may find this thread. Here's what I interpret your
> advice to be:
> 0. Slacken my auto soft commit interval to something more like a minute. 

Yes, I would do this.  I would also increase autoCommit to something
between one and five minutes, with openSearcher set to false.  There's
nothing *wrong* with 15 seconds for autoCommit, but I want my server to
be doing less work during normal operation.

To answer a question you posed in a later message: Yes, it's common for
users to have a longer interval on autoSoftCommit than autoCommit. 
Remember the mantra in the URL about understanding commits:  Hard
commits are about durability, soft commits are about visibility.  Hard
commits when openSearcher is false are almost always *very* fast, so
it's typically not much of a burden to have them happen more frequently,
and thus have a better data durability guarantee.  Like I said above, I
generally use an autoCommit value between one and five minutes.

> I'm a bit confused about the example autowarmcount for the caches, which is
> 0. Why not set this to something higher? I guess it's a RAM utilization vs.
> speed tradeoff? A low number like 16 seems like it'd have minimal impact on
> RAM?

A low autowarmCount is generally chosen for one reason: commit speed. 
If the example configs have it set to zero, I'm sure this was done so
commits would proceed as fast as possible.  Large values can turn
opening a new searcher into a process that can take *minutes*.

On my index shards, the autowarmCount on my filterCache is *four*. 
That's it -- execute only four of the most recent filters in the cache
when a new searcher opens.  That warming *still* sometimes takes as long
as 20 seconds on the larger shards.  The filters used in queries on my
indexes are very large and very complex, and can match millions of
documents.  Pleading with the dev team to decrease query complexity
doesn't help.

On the idea of reusing the external file data when it doesn't change:  I
do not know if this is possible.  I have no idea how Solr and Lucene use
the data found in the external file, so it might be completely necessary
to re-load it every time.  You can open an issue in Jira to explore the
idea, but don't be too surprised if it doesn't go anywhere.

Thanks,
Shawn


Re: Real Time Search and External File Fields

Posted by Erick Erickson <er...@gmail.com>.
I chose 16 as a place to start. You usually reach diminishing returns
pretty quickly; I feel it's a mistake to set your autowarm counts to, say,
256 (and I've seen them in the thousands) unless you have some proof
that bumping them higher is useful.

But certainly if you set them to 16 and see spikes just after a searcher
is opened that aren't tolerable, feel free to make them larger.

You've hit on exactly why newSearcher and firstSearcher are there.
The theory behind autowarm counts is that the last N entries are
likely to be useful in the near future. There's no guarantee at all that
this is true and newSearcher/firstSearcher are certain to exercise
what _you_ think is most important.

As for why autowarm counts are set to 0 in the examples, there's no
overarching reason. Certainly if the soft commit interval is 1 second,
autowarming is largely useless, so having it at 0 makes sense.
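For reference, autowarm counts are plain attributes on the cache declarations in solrconfig.xml; starting at 16, as suggested above, looks something like this (the size values here are just the stock example defaults, not recommendations):

```xml
<filterCache class="solr.FastLRUCache"
             size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache"
                  size="512" initialSize="512" autowarmCount="16"/>
```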

Best,
Erick

On Sat, Oct 8, 2016 at 12:31 PM, Walter Underwood <wu...@wunderwood.org> wrote:
> With time-oriented data, you can use an old trick (goes back to Infoseek in 1995).
>
> Make a “today” collection that is very fresh. Nightly, migrate new documents to
> the “not today” collection. The today collection will be small and can be updated
> quickly. The archive collection will be large and slow to update, but who cares?
>
> You can also send all docs to both collections and de-dupe.
>
> Every night, you start over with the “today” collection.
>
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Oct 8, 2016, at 12:18 PM, Mike Lissner <ml...@michaeljaylissner.com> wrote:
>>
>> On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson <er...@gmail.com>
>> wrote:
>>
>>> What you haven't mentioned is how often you add new docs. Is it once a
>>> day? Steadily
>>> from 8:00 to 17:00?
>>>
>>
>> Alas, it's a steady trickle during business hours. We're ingesting court
>> documents as they're posted on court websites, then sending alerts as soon
>> as possible.
>>
>>
>>> Whatever, your soft commit really should be longer than your autowarm
>>> interval. Configure
>>> autowarming to reference queries (firstSearcher or newSearcher events
>>> or autowarm
>>> counts in queryResultCache and filterCache. Say 16 in each of these
>>> latter for a start) such
>>> that they cause the external file to load. That _should_ prevent any
>>> queries from being
>>> blocked since the autowarming will happen in the background and while
>>> it's happening
>>> incoming queries will be served by the old searcher.
>>>
>>
>> I want to make sure I understand this properly and document this for future
>> people that may find this thread. Here's what I interpret your advice to be:
>>
>> 0. Slacken my auto soft commit interval to something more like a minute.
>>
>> 1. Set up a query in the newSearcher listener that uses my external file
>> field.
>> 1a. Do the same in firstSearcher if I want newly started solr to warm up
>> before getting queries (this doesn't matter to me, so I'm skipping this).
>>
>> and/or
>>
>> 2. Set autowarmcount in queryResultCache and filterCache to 16 so that the
>> top 16 query results from the previous searcher are regenerated in the new
>> searcher.
>>
>> Doing #1 seems like a safe strategy since it's guaranteed to hit the
>> external file field. #2 feels like a bonus.
>>
>> I'm a bit confused about the example autowarmcount for the caches, which is
>> 0. Why not set this to something higher? I guess it's a RAM utilization vs.
>> speed tradeoff? A low number like 16 seems like it'd have minimal impact on
>> RAM?
>>
>> Thanks for all the great replies and for everything you do for Solr. I
>> truly appreciate your efforts.
>>
>> Mike
>

Re: Real Time Search and External File Fields

Posted by Walter Underwood <wu...@wunderwood.org>.
With time-oriented data, you can use an old trick (goes back to Infoseek in 1995).

Make a “today” collection that is very fresh. Nightly, migrate new documents to 
the “not today” collection. The today collection will be small and can be updated
quickly. The archive collection will be large and slow to update, but who cares?

You can also send all docs to both collections and de-dupe.

Every night, you start over with the “today” collection.

Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 8, 2016, at 12:18 PM, Mike Lissner <ml...@michaeljaylissner.com> wrote:
> 
> On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson <er...@gmail.com>
> wrote:
> 
>> What you haven't mentioned is how often you add new docs. Is it once a
>> day? Steadily
>> from 8:00 to 17:00?
>> 
> 
> Alas, it's a steady trickle during business hours. We're ingesting court
> documents as they're posted on court websites, then sending alerts as soon
> as possible.
> 
> 
>> Whatever, your soft commit really should be longer than your autowarm
>> interval. Configure
>> autowarming to reference queries (firstSearcher or newSearcher events
>> or autowarm
>> counts in queryResultCache and filterCache. Say 16 in each of these
>> latter for a start) such
>> that they cause the external file to load. That _should_ prevent any
>> queries from being
>> blocked since the autowarming will happen in the background and while
>> it's happening
>> incoming queries will be served by the old searcher.
>> 
> 
> I want to make sure I understand this properly and document this for future
> people that may find this thread. Here's what I interpret your advice to be:
> 
> 0. Slacken my auto soft commit interval to something more like a minute.
> 
> 1. Set up a query in the newSearcher listener that uses my external file
> field.
> 1a. Do the same in firstSearcher if I want newly started solr to warm up
> before getting queries (this doesn't matter to me, so I'm skipping this).
> 
> and/or
> 
> 2. Set autowarmcount in queryResultCache and filterCache to 16 so that the
> top 16 query results from the previous searcher are regenerated in the new
> searcher.
> 
> Doing #1 seems like a safe strategy since it's guaranteed to hit the
> external file field. #2 feels like a bonus.
> 
> I'm a bit confused about the example autowarmcount for the caches, which is
> 0. Why not set this to something higher? I guess it's a RAM utilization vs.
> speed tradeoff? A low number like 16 seems like it'd have minimal impact on
> RAM?
> 
> Thanks for all the great replies and for everything you do for Solr. I
> truly appreciate your efforts.
> 
> Mike


Re: Real Time Search and External File Fields

Posted by Mike Lissner <ml...@michaeljaylissner.com>.
On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson <er...@gmail.com>
wrote:

> What you haven't mentioned is how often you add new docs. Is it once a
> day? Steadily
> from 8:00 to 17:00?
>

Alas, it's a steady trickle during business hours. We're ingesting court
documents as they're posted on court websites, then sending alerts as soon
as possible.


> Whatever, your soft commit really should be longer than your autowarm
> interval. Configure
> autowarming to reference queries (firstSearcher or newSearcher events
> or autowarm
> counts in queryResultCache and filterCache. Say 16 in each of these
> latter for a start) such
> that they cause the external file to load. That _should_ prevent any
> queries from being
> blocked since the autowarming will happen in the background and while
> it's happening
> incoming queries will be served by the old searcher.
>

I want to make sure I understand this properly and document this for future
people that may find this thread. Here's what I interpret your advice to be:

0. Slacken my auto soft commit interval to something more like a minute.

1. Set up a query in the newSearcher listener that uses my external file
field.
1a. Do the same in firstSearcher if I want newly started solr to warm up
before getting queries (this doesn't matter to me, so I'm skipping this).

and/or

2. Set autowarmcount in queryResultCache and filterCache to 16 so that the
top 16 query results from the previous searcher are regenerated in the new
searcher.

Doing #1 seems like a safe strategy since it's guaranteed to hit the
external file field. #2 feels like a bonus.
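Pulled together, the steps above amount to something like this in solrconfig.xml (a sketch with illustrative values; it assumes, as in this setup, that the relevance score incorporates the external_pagerank boost, so the sorted warming query forces the external file to load):

```xml
<!-- 0. Soft commit about once a minute instead of sub-second -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>

<!-- 1. newSearcher warming query that touches the external file field -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">score desc</str>
    </lst>
  </arr>
</listener>

<!-- 2. Modest autowarm counts on the caches -->
<filterCache class="solr.FastLRUCache" size="512"
             initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache" size="512"
                  initialSize="512" autowarmCount="16"/>
```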

I'm a bit confused about the example autowarmcount for the caches, which is
0. Why not set this to something higher? I guess it's a RAM utilization vs.
speed tradeoff? A low number like 16 seems like it'd have minimal impact on
RAM?

Thanks for all the great replies and for everything you do for Solr. I
truly appreciate your efforts.

Mike

Re: Real Time Search and External File Fields

Posted by Erick Erickson <er...@gmail.com>.
bq: Most soft commit
documentation talks about setting up soft commits with <maxtime> of about a
second.

I think this is really a consequence of its being included in the example
configs for illustrative purposes; personally, I never liked it.

There is no one right answer. I've seen soft commit intervals from -1
(never soft commit) to 1 second. The latter usually means your caches
are totally useless and might as well be turned off.

What you haven't mentioned is how often you add new docs. Is it once a
day? Steadily
from 8:00 to 17:00? All in three hours in the morning?

Whatever, your soft commit really should be longer than your autowarm
interval. Configure autowarming to reference queries (firstSearcher or
newSearcher events, or autowarm counts in queryResultCache and
filterCache; say 16 in each of the latter for a start) such that they
cause the external file to load. That _should_ prevent any queries from
being blocked, since the autowarming happens in the background, and
while it's happening incoming queries will be served by the old
searcher.

Best,
Erick

On Fri, Oct 7, 2016 at 5:19 PM, Mike Lissner
<ml...@michaeljaylissner.com> wrote:
> I have an index of about 4M documents with an external file field
> configured to do boosting based on pagerank scores of each document. The
> pagerank file is about 93MB as of today -- it's pretty big.
>
> Each day, I add about 1,000 new documents to the index, and I need them to
> be available as soon as possible so that I can send out alerts to our users
> about new content (this is Google Alerts, essentially).
>
> Soft commits seem to be exactly the thing for this, but whenever I open a
> new searcher (which soft commits seem to do), the external file is
> reloaded, and all queries are halted until it finishes loading. When I just
> measured, this took about 30 seconds to complete. Most soft commit
> documentation talks about setting up soft commits with <maxtime> of about a
> second.
>
> Is there anything I can do to make the external file field not get reloaded
> constantly? It only changes about once a month, and I want to use soft
> commits to power the alerts feature.
>
> Thanks,
>
> Mike