Posted to solr-user@lucene.apache.org by Erick Erickson <er...@gmail.com> on 2018/10/01 01:15:35 UTC

Re: Realtime get not always returning existing data

57 million queries later, with constant indexing going on and 9 dummy
collections in the mix and the main collection I'm querying having 2
shards, 2 replicas each, I have no errors.

So unless this code doesn't exercise a path similar to yours,
I'm not sure what more I can test. "It works on my machine" ;)

Here's my querying code; does it look like what you're seeing?

      while (Main.allStop.get() == false) {
        // open a fresh client each iteration (deliberate, to rule out connection reuse)
        try (SolrClient client = new HttpSolrClient.Builder()
            // ("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")
            .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {

          String lower = Integer.toString(rand.nextInt(1_000_000));
          SolrDocument rsp = client.getById(lower);
          if (rsp == null) {
            System.out.println("Got a null response!");
            Main.allStop.set(true);
            continue; // avoid dereferencing a null doc below
          }

          // fetch the same id again and verify it round-trips
          rsp = client.getById(lower);

          if (rsp.get("id").equals(lower) == false) {
            System.out.println("Got an invalid response, looking for "
                + lower + " got: " + rsp.get("id"));
            Main.allStop.set(true);
          }
          long queries = Main.eoeCounter.incrementAndGet();
          if ((queries % 100_000) == 0) {
            long seconds = (System.currentTimeMillis() - Main.start) / 1000;
            System.out.println("Query count: " + numFormatter.format(queries)
                + ", rate is " + numFormatter.format(queries / seconds) + " QPS");
          }
        } catch (Exception cle) {
          cle.printStackTrace();
          Main.allStop.set(true);
        }
      }
  }

On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson <er...@gmail.com> wrote:
>
> Steve:
>
> bq.  Basically, one core had data in it that should belong to another
> core. Here's my question about this: Is it possible that two requests to the
> /get API coming in at the same time would get confused and either both get
> the same result or the results get inverted?
>
> Well, that shouldn't be happening, these are all supposed to be thread-safe
> calls.... All things are possible of course ;)
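
For what it's worth, the scenario being asked about, two simultaneous /get calls, is just two threads sharing one client; SolrClient instances are intended to be shared across threads. A minimal illustrative sketch (the base URL and ids are placeholders, not from the thread):

// Illustrative sketch, not from the thread: two threads sharing one client
// and issuing /get at the same time. The base URL and ids are placeholders.
SolrClient shared = new HttpSolrClient.Builder()
    .withBaseSolrUrl("http://localhost:8981/solr/eoe").build();
ExecutorService pool = Executors.newFixedThreadPool(2);
for (String id : new String[]{"1", "2"}) {
  pool.submit(() -> {
    try {
      SolrDocument d = shared.getById(id);
      System.out.println(id + " -> " + (d == null ? "doc:null" : d.get("id")));
    } catch (Exception e) {
      e.printStackTrace();
    }
  });
}
pool.shutdown();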
>
> If two replicas of the same shard have different documents, that could account
> for what you're seeing, meanwhile begging the question of why that is the case,
> since it should never be true for a quiescent index. Technically there _are_
> conditions where this is true on a very temporary basis: commits on the leader
> and follower can trigger at different wall-clock times. Say your soft commit
> (or hard commit with openSearcher=true) interval is 10 seconds. It should never
> be the case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> was sent. This doesn't seem likely from what you've described though...
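
To make the timing concrete: a freshly indexed document is visible to /get right away because it is read from the update log, while /select only sees it after a (soft) commit opens a new searcher, and those commits can fire at different wall-clock times on different replicas. A rough SolrJ sketch of that difference, assuming "client" is a SolrClient bound to the collection as in the test code above:

// Rough sketch: RTG should see the doc immediately (it's read from the update
// log); /select sees it only after a commit opens a searcher. "client" is
// assumed to be a SolrClient bound to the collection, as in the test above.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "42");

UpdateRequest update = new UpdateRequest();
update.add(doc);
update.setCommitWithin(10_000);   // roughly the 10-second window discussed
update.process(client);

SolrDocument viaRtg = client.getById("42");                      // expected: found right away
QueryResponse viaSelect = client.query(new SolrQuery("id:42"));  // may be empty until the commit fires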
>
> Hmmmm. I guess that one other thing I can set up is to have a bunch of dummy
> collections lying around. Currently I have only the active one, and if there's
> some code path whereby the RTG request goes to a replica of a different
> collection, my test setup wouldn't reproduce it.
>
> Currently, I'm running a 2-shard, 1-replica setup, so if there's some way that
> the replicas could get out of sync, that wouldn't show either.
>
> So I'm starting another run with these changes:
> > opening a new connection each query
> > switched so the collection I'm querying is 2x2
> > added some dummy collections that are empty
>
> One nit: while "core" is exactly correct, when we talk about a core that's
> part of a collection, we try to use "replica" to be clear we're talking about
> a core with some added characteristics, i.e. we're in SolrCloud-land. No big
> deal of course....
>
> Best,
> Erick
> On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <ap...@elyograg.org> wrote:
> >
> > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > @Shawn
> > > We're running two instances on one machine for two reasons:
> > > 1. The box has plenty of resources (48 cores / 256GB RAM), and since I was
> > > reading that it's not recommended to use more than 31GB of heap in SOLR, we
> > > figured 96 GB for keeping index data in the OS cache + 31 GB of heap per
> > > instance was a good idea.
> >
> > Do you know that these Solr instances actually DO need 31 GB of heap, or
> > are you following advice from somewhere, saying "use one quarter of your
> > memory as the heap size"?  That advice is not in the Solr documentation,
> > and never will be.  Figuring out the right heap size requires
> > experimentation.
> >
> > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> >
> > How big (on disk) are each of these nine cores, and how many documents
> > are in each one?  Which of them is in each Solr instance?  With that
> > information, we can make a *guess* about how big your heap should be.
> > Figuring out whether the guess is correct generally requires careful
> > analysis of a GC log.
> >
> > > 2. We're in the testing phase, so we wanted a SOLR cloud configuration; we
> > > will most likely have a much bigger deployment once going to production. In
> > > prod right now, we run a six-machine Riak cluster. Riak is a
> > > key/value document store and has SOLR built-in for search, but we are trying
> > > to push the key/value aspect of Riak inside SOLR. That way we would have
> > > one less piece to worry about in our system.
> >
> > Solr is not a database.  It is not intended to be a data repository.
> > All of its optimizations (most of which are actually in Lucene) are
> > geared towards search.  While technically it can be a key-value store,
> > that is not what it was MADE for.  Software actually designed for that
> > role is going to be much better than Solr as a key-value store.
> >
> > > When I say null document, I mean the /get API returns: {doc: null}
> > >
> > > The problem is definitely not always there. We also have large periods of
> > > time (a few hours) where we have no problems. I'm just extremely hesitant
> > > about retrying when I get a null document because in some cases, getting a
> > > null document is a valid outcome. Our caching layer relies heavily on this,
> > > for example. If I were to retry every null, I'd pay a big penalty in
> > > performance.
> >
> > I've just done a little test with the 7.5.0 techproducts example.  It
> > looks like returning doc:null actually is how the RTG handler says it
> > didn't find the document.  This seems very wrong to me, but I didn't
> > design it, and that response needs SOME kind of format.
> >
> > Have you done any testing to see whether the standard searching handler
> > (typically /select, but many other URL paths are possible) returns
> > results when RTG doesn't?  Do you know for these failures whether the
> > document has been committed or not?
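
A simple way to run that cross-check from SolrJ, whenever /get comes back empty, is to immediately issue a /select for the same id and log both answers. A rough sketch, with "client" bound to the collection and the id as a placeholder:

// Sketch of the cross-check: /get reads the update log plus the index, /select
// only sees committed documents. "client" is a SolrClient bound to the
// collection; the id is a placeholder.
String id = "12345";
SolrDocument viaGet = client.getById(id);
QueryResponse viaSelect = client.query(new SolrQuery("id:" + id));

if (viaGet == null && viaSelect.getResults().getNumFound() > 0) {
  // the suspicious case: /select finds a committed doc that RTG says doesn't exist
  System.out.println("RTG returned doc:null but /select found id " + id);
}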
> >
> > > As for your last comment, part of our testing phase is also testing the
> > > limits. Our framework has auto-scaling built-in, so if we have a burst of
> > > requests, the system will automatically spin up more clients. We're pushing
> > > 10% of our production system to that test server to see how it will handle
> > > it.
> >
> > To spin up another replica, Solr must copy all its index data from the
> > leader replica.  Not only can this take a long time if the index is big,
> > but it will put a lot of extra I/O load on the machine(s) with the
> > leader roles.  So performance will actually be WORSE before it gets
> > better when you spin up another replica, and if the index is big, that
> > condition will persist for quite a while.  Copying the index data will
> > be constrained by the speed of your network and by the speed of your
> > disks.  Often the disks are slower than the network, but that is not
> > always the case.
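
For reference, the operation that triggers that full index copy from the leader is ADDREPLICA. A minimal SolrJ sketch, with the collection and shard names assumed, and cloudClient being any existing SolrClient pointed at the cluster:

// Minimal sketch (collection/shard names assumed): the new replica recovers by
// pulling the full index from the shard leader, which is the I/O cost described
// above. "cloudClient" is an assumed, already-built SolrClient.
CollectionAdminRequest.AddReplica addReplica =
    CollectionAdminRequest.addReplicaToShard("eoe", "shard1");
CollectionAdminResponse rsp = addReplica.process(cloudClient);
System.out.println("ADDREPLICA status: " + rsp.getStatus());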
> >
> > Thanks,
> > Shawn
> >

Re: Realtime get not always returning existing data

Posted by da...@gmail.com.
I'm using Solr 7.7.1, 12 shards,
router:{"field":"route", "name":"compositeId"}, and find the realtime get
only returns results if I specify the leader core-url. Most of the time I
see no results.
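
One thing worth probing here: with router.field set, a /get that lands on a non-leader node has to work out which shard owns the document from the route value, not the id itself. A rough SolrJ sketch of that experiment follows; the collection name, route value, ZooKeeper host and leader core name are all made up, and whether /get honors _route_ in this setup is exactly the open question:

// Experiment sketch, all names made up: compare /get through the cluster,
// passing the route value explicitly, against /get on the leader core directly.
ModifiableSolrParams params = new ModifiableSolrParams();
params.set("_route_", "someRouteValue");

try (CloudSolrClient cloud = new CloudSolrClient.Builder(
        Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
  SolrDocument viaCluster = cloud.getById("myCollection", "someId", params);

  try (HttpSolrClient leader = new HttpSolrClient.Builder(
          "http://host:8983/solr/myCollection_shard3_replica_n5").build()) {
    SolrDocument viaLeader = leader.getById("someId");
    System.out.println("cluster: " + viaCluster + ", leader: " + viaLeader);
  }
}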

On Thu, 11 Oct 2018 at 23:41, Chris Ulicny <cu...@iq.media> wrote:

> We are relatively far behind with this one. The collections that we
> experience the problem on are currently running on 6.3.0. If it's easy
> enough for you to upgrade, it might be worth a try, but I didn't see any
> changes to the RealTimeGet in either of the 7.4/5 change logs after a
> cursory glance.
>
> Due to the volume and number of different processes, this cluster
> requires more coordination to reindex and upgrade. So it's currently the
> last one on our plan to get upgraded to 7.X (or 8.X if timing allows).
>
> On Thu, Oct 11, 2018 at 8:22 AM sgaron cse <sg...@gmail.com> wrote:
>
> > Hey Chris,
> >
> > Which version of SOLR are you running? I was thinking of maybe trying
> > another version to see if it fixes the issue.
> >
> > On Thu, Oct 11, 2018 at 8:11 AM Chris Ulicny <cu...@iq.media> wrote:
> >
> > > We've also run into that issue of not being able to reproduce it outside
> > > of running production loads.
> > >
> > > However, we haven't been encountering the problem in live production quite
> > > as much as we used to, and I think that might be from the /get requests
> > > being spread out a little more evenly over the running interval which is
> > > due to other process changes.
> > >
> > > If I get any new information, I'll update as well.
> > >
> > > Thanks for your help.
> > >
> > > On Wed, Oct 10, 2018 at 10:53 AM sgaron cse <sg...@gmail.com> wrote:
> > >
> > > > I haven't found a way to reproduce the problem other than running our
> > > > entire set of code. I've also been trying different things to make sure
> > > > the problem is not from my end, and so far I haven't managed to fix it by
> > > > changing my code. It has to be a race condition somewhere but I just
> > > > can't put my finger on it.
> > > >
> > > > I'll message back if I find a way to reproduce.
> > > >
> > > > On Wed, Oct 10, 2018 at 10:48 AM Erick Erickson <erickerickson@gmail.com> wrote:
> > > >
> > > > > Well, assigning a bogus version that generates a 409 error then
> > > > > immediately doing an RTG on the doc doesn't fail for me either, 18
> > > > > million tries later. So I'm afraid I haven't a clue where to go from
> > > > > here. Unless we can somehow find a way to generate this failure I'm
> > > > > going to drop it for the foreseeable future.
> > > > >
> > > > > Erick
> > > > > On Tue, Oct 9, 2018 at 7:39 AM Erick Erickson <erickerickson@gmail.com> wrote:
> > > > > >
> > > > > > Hmmmm. I wonder if a version conflict or perhaps some other failure can
> > > > > > somehow cause this. It shouldn't be very hard to add that to my test
> > > > > > setup, just randomly add a _version_ field value.
> > > > > >
> > > > > > Erick
> > > > > > On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <erickerickson@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks. I'll be away for the rest of the week, so won't be able to try
> > > > > > > anything more....
> > > > > > > On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <cu...@iq.media> wrote:
> > > > > > > >
> > > > > > > > In our case, we are heavily indexing in the collection while the /get
> > > > > > > > requests are happening, which is what we assumed was causing this very
> > > > > > > > rare behavior. However, we have experienced the problem for a collection
> > > > > > > > where the following happens in sequence with minutes in between them.
> > > > > > > >
> > > > > > > > 1. Document id=1 is indexed
> > > > > > > > 2. Document successfully retrieved with /get?id=1
> > > > > > > > 3. Document failed to be retrieved with /get?id=1
> > > > > > > > 4. Document successfully retrieved with /get?id=1
> > > > > > > >
> > > > > > > > We haven't looked at the issue in a while, so I don't have the exact
> > > > > > > > timing of that sequence on hand right now. I'll try to find an actual
> > > > > > > > example, although I'm relatively certain it was multiple minutes in
> > > > > > > > between each of those requests. However, our autocommit (and soft commit)
> > > > > > > > times are 60s for both collections.
> > > > > > > >
> > > > > > > > I think the following two are probably the biggest differences for our
> > > > > > > > setup, besides the version difference (v6.3.0):
> > > > > > > >
> > > > > > > > > index to this collection, perhaps not at a high rate
> > > > > > > > > separate the machines running solr from the one doing any querying or
> > > > > > > > > indexing
> > > > > > > >
> > > > > > > > The clients are on 3 hosts separate from the solr instances. The total
> > > > > > > > number of threads that are making updates and making /get requests is
> > > > > > > > around 120-150. About 40-50 per host. Each of our two collections gets an
> > > > > > > > average of 500 requests per second constantly for ~5 minutes, and then the
> > > > > > > > number slowly tapers off to essentially 0 after ~15 minutes.
> > > > > > > >
> > > > > > > > Every thread attempts to make the same series of requests.
> > > > > > > >
> > > > > > > > -- Update with "_version_=-1". If successful, no other requests are made.
> > > > > > > > -- On 409 Conflict failure, it makes a /get request for the id
> > > > > > > > -- On doc:null failure, the client handles the error and moves on
> > > > > > > >
> > > > > > > > Combining this with the previous series of /get requests, we end up with
> > > > > > > > situations where an update fails as expected, but the subsequent /get
> > > > > > > > request fails to retrieve the existing document:
> > > > > > > >
> > > > > > > > 1. Thread 1 updates id=1 successfully
> > > > > > > > 2. Thread 2 tries to update id=1, fails (409)
> > > > > > > > 3. Thread 2 tries to get id=1, succeeds.
> > > > > > > >
> > > > > > > > ...Minutes later...
> > > > > > > >
> > > > > > > > 4. Thread 3 tries to update id=1, fails (409)
> > > > > > > > 5. Thread 3 tries to get id=1, fails (doc:null)
> > > > > > > >
> > > > > > > > ...Minutes later...
> > > > > > > >
> > > > > > > > 6. Thread 4 tries to update id=1, fails (409)
> > > > > > > > 7. Thread 4 tries to get id=1, succeeds.
> > > > > > > >
> > > > > > > > As Steven mentioned, it happens very, very rarely. We tried to recreate it
> > > > > > > > in a more controlled environment, but ran into the same issue that you
> > > > > > > > are, Erick. Every simplified situation we ran produced no problems. Since
> > > > > > > > it's not a large issue for us and happens very rarely, we stopped trying
> > > > > > > > to recreate it.
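
Boiled down, the client-side flow described above is a create-if-absent using optimistic concurrency. A rough SolrJ sketch of it, with the collection implied by the client, the id as a placeholder, and the enclosing method assumed to declare the usual SolrServerException/IOException:

// Sketch of the create-if-absent flow: _version_=-1 means "must not exist yet",
// so an existing id makes the update fail with a 409 and we fall back to /get.
// "client" is a SolrClient bound to the collection; the id is a placeholder.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", id);
doc.addField("_version_", -1L);   // optimistic concurrency: reject if the id already exists
try {
  client.add(doc);
} catch (SolrException e) {
  if (e.code() == 409) {          // version conflict: someone else created it first
    SolrDocument existing = client.getById(id);
    if (existing == null) {
      // the puzzling case from this thread: the update says the doc exists,
      // but /get returns doc:null for the same id
    }
  } else {
    throw e;
  }
}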

Re: Realtime get not always returning existing data

Posted by Chris Ulicny <cu...@iq.media>.
We are relatively far behind with this one. The collections that we
experience the problem on are currently running on 6.3.0. If it's easy
enough for you to upgrade, it might be worth a try, but I didn't see any
changes to the RealTimeGet in either of the 7.4/5 change logs after a
cursory glance.

Due to the volume and number of different processes, this cluster
requires more coordination to reindex and upgrade. So it's currently the
last one on our plan to get upgraded to 7.X (or 8.X if timing allows).


Re: Realtime get not always returning existing data

Posted by sgaron cse <sg...@gmail.com>.
Hey Chris,

Which version of SOLR are you running? I was thinking of maybe trying
another version to see if it fixes the issue.

> > > > > > > > > > example. If I was to retry every nulls I'd pay a big
> > penalty
> > > in
> > > > > > > > > > performance.
> > > > > > > > >
> > > > > > > > > I've just done a little test with the 7.5.0 techproducts
> > > example.  It
> > > > > > > > > looks like returning doc:null actually is how the RTG
> handler
> > > says it
> > > > > > > > > didn't find the document.  This seems very wrong to me,
> but I
> > > didn't
> > > > > > > > > design it, and that response needs SOME kind of format.
> > > > > > > > >
> > > > > > > > > Have you done any testing to see whether the standard
> > > searching handler
> > > > > > > > > (typically /select, but many other URL paths are possible)
> > > returns
> > > > > > > > > results when RTG doesn't?  Do you know for these failures
> > > whether the
> > > > > > > > > document has been committed or not?
> > > > > > > > >
> > > > > > > > > > As for your last comment, part of our testing phase is
> also
> > > testing
> > > > > > > the
> > > > > > > > > > limits. Our framework has auto-scaling built-in so if we
> > > have a
> > > > > > > burst of
> > > > > > > > > > request, the system will automatically spin up more
> > clients.
> > > We're
> > > > > > > pushing
> > > > > > > > > > 10% of our production system to that Test server to see
> how
> > > it will
> > > > > > > handle
> > > > > > > > > > it.
> > > > > > > > >
> > > > > > > > > To spin up another replica, Solr must copy all its index
> data
> > > from the
> > > > > > > > > leader replica.  Not only can this take a long time if the
> > > index is
> > > > > > > big,
> > > > > > > > > but it will put a lot of extra I/O load on the machine(s)
> > with
> > > the
> > > > > > > > > leader roles.  So performance will actually be WORSE before
> > it
> > > gets
> > > > > > > > > better when you spin up another replica, and if the index
> is
> > > big, that
> > > > > > > > > condition will persist for quite a while.  Copying the
> index
> > > data will
> > > > > > > > > be constrained by the speed of your network and by the
> speed
> > > of your
> > > > > > > > > disks.  Often the disks are slower than the network, but
> that
> > > is not
> > > > > > > > > always the case.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Shawn
> > > > > > > > >
> > > > > > >
> > >
> >
>

Re: Realtime get not always returning existing data

Posted by Chris Ulicny <cu...@iq.media>.
We've also run into that issue of not being able to reproduce it outside of
running production loads.

However, we haven't been encountering the problem in live production quite
as much as we used to, and I think that may be because the /get requests are
now spread a little more evenly over the running interval as a result of
other process changes.

If I get any new information, I'll update as well.

Thanks for your help.

On Wed, Oct 10, 2018 at 10:53 AM sgaron cse <sg...@gmail.com> wrote:

> I haven't found a way to reproduce the problem other than running our
> entire set of code. I've also been trying different things to make sure the
> problem is not from my end, and so far I haven't managed to fix it by
> changing my code. It has to be a race condition somewhere, but I just can't
> put my finger on it.
>
> I'll message back if I find a way to reproduce.
>
> On Wed, Oct 10, 2018 at 10:48 AM Erick Erickson <er...@gmail.com>
> wrote:
>
> > Well assigning a bogus version that generates a 409 error then
> > immediately doing an RTG on the doc doesn't fail for me either 18
> > million tries later. So I'm afraid I haven't a clue where to go from
> > here. Unless we can somehow find a way to generate this failure I'm
> > going to drop it for the foreseeable future.
> >
> > Erick
> > On Tue, Oct 9, 2018 at 7:39 AM Erick Erickson <er...@gmail.com>
> > wrote:
> > >
> > > Hmmmm. I wonder if a version conflict or perhaps other failure can
> > > somehow cause this. It shouldn't be very hard to add that to my test
> > > setup, just randomly add n _version_ field value.
> > >
> > > Erick
> > > On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> > > >
> > > > Thanks. I'll be away for the rest of the week, so won't be able to
> try
> > > > anything more....
> > > > On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <cu...@iq.media>
> wrote:
> > > > >
> > > > > In our case, we are heavily indexing in the collection while the
> /get
> > > > > requests are happening which is what we assumed was causing this
> > very rare
> > > > > behavior. However, we have experienced the problem for a collection
> > where
> > > > > the following happens in sequence with minutes in between them.
> > > > >
> > > > > 1. Document id=1 is indexed
> > > > > 2. Document successfully retrieved with /get?id=1
> > > > > 3. Document failed to be retrieved with /get?id=1
> > > > > 4. Document successfully retrieved with /get?id=1
> > > > >
> > > > > We've haven't looked at the issue in a while, so I don't have the
> > exact
> > > > > timing of that sequence on hand right now. I'll try to find an
> actual
> > > > > example, although I'm relatively certain it was multiple minutes in
> > between
> > > > > each of those requests. However our autocommit (and soft commit)
> > times are
> > > > > 60s for both collections.
> > > > >
> > > > > I think the following two are probably the biggest differences for
> > our
> > > > > setup, besides the version difference (v6.3.0):
> > > > >
> > > > > > index to this collection, perhaps not at a high rate
> > > > > > separate the machines running solr from the one doing any
> querying
> > or
> > > > > indexing
> > > > >
> > > > > The clients are on 3 hosts separate from the solr instances. The
> > total
> > > > > number of threads that are making updates and making /get requests
> is
> > > > > around 120-150. About 40-50 per host. Each of our two collections
> > gets an
> > > > > average of 500 requests per second constantly for ~5 minutes, and
> > then the
> > > > > number slowly tapers off to essentially 0 after ~15 minutes.
> > > > >
> > > > > Every thread attempts to make the same series of requests.
> > > > >
> > > > > -- Update with "_version_=-1". If successful, no other requests are
> > made.
> > > > > -- On 409 Conflict failure, it makes a /get request for the id
> > > > > -- On doc:null failure, the client handles the error and moves on
> > > > >
> > > > > Combining this with the previous series of /get requests, we end up
> > with
> > > > > situations where an update fails as expected, but the subsequent
> /get
> > > > > request fails to retrieve the existing document:
> > > > >
> > > > > 1. Thread 1 updates id=1 successfully
> > > > > 2. Thread 2 tries to update id=1, fails (409)
> > > > > 3. Thread 2 tries to get id=1 succeeds.
> > > > >
> > > > > ...Minutes later...
> > > > >
> > > > > 4. Thread 3 tries to update id=1, fails (409)
> > > > > 5. Thread 3 tries to get id=1, fails (doc:null)
> > > > >
> > > > > ...Minutes later...
> > > > >
> > > > > 6. Thread 4 tries to update id=1, fails (409)
> > > > > 7. Thread 4 tries to get id=1 succeeds.
> > > > >
> > > > > As Steven mentioned, it happens very, very rarely. We tried to
> > recreate it
> > > > > in a more controlled environment, but ran into the same issue that
> > you are,
> > > > > Erick. Every simplified situation we ran produced no problems.
> Since
> > it's
> > > > > not a large issue for us and happens very rarely, we stopped trying
> > to
> > > > > recreate it.
> > > > >
> > > > >
> > > > > On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <
> > erickerickson@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > 57 million queries later, with constant indexing going on and 9
> > dummy
> > > > > > collections in the mix and the main collection I'm querying
> having
> > 2
> > > > > > shards, 2 replicas each, I have no errors.
> > > > > >
> > > > > > So unless the code doesn't look like it exercises any similar
> path,
> > > > > > I'm not sure what more I can test. "It works on my machine" ;)
> > > > > >
> > > > > > Here's my querying code, does it look like it what you're seeing?
> > > > > >
> > > > > >       while (Main.allStop.get() == false) {
> > > > > >         try (SolrClient client = new HttpSolrClient.Builder()
> > > > > > //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
> > > > > >             .withBaseSolrUrl("http://localhost:8981/solr/eoe
> ").build())
> > {
> > > > > >
> > > > > >           //SolrQuery query = new SolrQuery();
> > > > > >           String lower =
> Integer.toString(rand.nextInt(1_000_000));
> > > > > >           SolrDocument rsp = client.getById(lower);
> > > > > >           if (rsp == null) {
> > > > > >             System.out.println("Got a null response!");
> > > > > >             Main.allStop.set(true);
> > > > > >           }
> > > > > >
> > > > > >           rsp = client.getById(lower);
> > > > > >
> > > > > >           if (rsp.get("id").equals(lower) == false) {
> > > > > >             System.out.println("Got an invalid response, looking
> > for "
> > > > > > + lower + " got: " + rsp.get("id"));
> > > > > >             Main.allStop.set(true);
> > > > > >           }
> > > > > >           long queries = Main.eoeCounter.incrementAndGet();
> > > > > >           if ((queries % 100_000) == 0) {
> > > > > >             long seconds = (System.currentTimeMillis() -
> > Main.start) /
> > > > > > 1000;
> > > > > >             System.out.println("Query count: " +
> > > > > > numFormatter.format(queries) + ", rate is " +
> > > > > > numFormatter.format(queries / seconds) + " QPS");
> > > > > >           }
> > > > > >         } catch (Exception cle) {
> > > > > >           cle.printStackTrace();
> > > > > >           Main.allStop.set(true);
> > > > > >         }
> > > > > >       }
> > > > > >   }On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > > > > > <er...@gmail.com> wrote:
> > > > > > >
> > > > > > > Steve:
> > > > > > >
> > > > > > > bq.  Basically, one core had data in it that should belong to
> > another
> > > > > > > core. Here's my question about this: Is it possible that two
> > request to
> > > > > > the
> > > > > > > /get API coming in at the same time would get confused and
> > either both
> > > > > > get
> > > > > > > the same result or result get inverted?
> > > > > > >
> > > > > > > Well, that shouldn't be happening, these are all supposed to be
> > > > > > thread-safe
> > > > > > > calls.... All things are possible of course ;)
> > > > > > >
> > > > > > > If two replicas of the same shard have different documents,
> that
> > could
> > > > > > account
> > > > > > > for what you're seeing, meanwhile begging the question of why
> > that is
> > > > > > the case
> > > > > > > since it should never be true for a quiescent index.
> Technically
> > there
> > > > > > _are_
> > > > > > > conditions where this is true on a very temporary basis,
> commits
> > on the
> > > > > > leader
> > > > > > > and follower can trigger at different wall-clock times. Say
> your
> > soft
> > > > > > commit
> > > > > > > (or hard-commit-with-opensearcher-true) is 10 seconds. It
> should
> > never
> > > > > > be the
> > > > > > > case that s1r1 and s1r2 are out of sync 10 seconds after the
> > last update
> > > > > > was
> > > > > > > sent. This doesn't seem likely from what you've described
> > though...
> > > > > > >
> > > > > > > Hmmmm. I guess that one other thing I can set up is to have a
> > bunch of
> > > > > > dummy
> > > > > > > collections laying around. Currently I have only the active
> one,
> > and
> > > > > > > if there's some
> > > > > > > code path whereby the RTG request goes to a replica of a
> > different
> > > > > > > collection, my
> > > > > > > test setup wouldn't reproduce it.
> > > > > > >
> > > > > > > Currently, I'm running a 2-shard, 1 replica setup, so if
> there's
> > some
> > > > > > > way that the replicas
> > > > > > > get out of sync that wouldn't show either.
> > > > > > >
> > > > > > > So I'm starting another run with these changes:
> > > > > > > > opening a new connection each query
> > > > > > > > switched so the collection I'm querying is 2x2
> > > > > > > > added some dummy collections that are empty
> > > > > > >
> > > > > > > One nit, while "core" is exactly correct. When we talk about a
> > core
> > > > > > > that's part of a collection, we try to use "replica" to be
> clear
> > we're
> > > > > > > talking about
> > > > > > > a core with some added characteristics, i.e. we're in
> > SolrCloud-land.
> > > > > > > No big deal
> > > > > > > of course....
> > > > > > >
> > > > > > > Best,
> > > > > > > Erick
> > > > > > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <
> > apache@elyograg.org>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > > > > > @Shawn
> > > > > > > > > We're running two instance on one machine for two reason:
> > > > > > > > > 1. The box has plenty of resources (48 cores / 256GB ram)
> > and since
> > > > > > I was
> > > > > > > > > reading that it's not recommended to use more than 31GB of
> > heap in
> > > > > > SOLR we
> > > > > > > > > figured 96 GB for keeping index data in OS cache + 31 GB of
> > heap per
> > > > > > > > > instance was a good idea.
> > > > > > > >
> > > > > > > > Do you know that these Solr instances actually DO need 31 GB
> > of heap,
> > > > > > or
> > > > > > > > are you following advice from somewhere, saying "use one
> > quarter of
> > > > > > your
> > > > > > > > memory as the heap size"?  That advice is not in the Solr
> > > > > > documentation,
> > > > > > > > and never will be.  Figuring out the right heap size requires
> > > > > > > > experimentation.
> > > > > > > >
> > > > > > > >
> > > > > >
> >
> https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > > > > > >
> > > > > > > > How big (on disk) are each of these nine cores, and how many
> > documents
> > > > > > > > are in each one?  Which of them is in each Solr instance?
> > With that
> > > > > > > > information, we can make a *guess* about how big your heap
> > should be.
> > > > > > > > Figuring out whether the guess is correct generally requires
> > careful
> > > > > > > > analysis of a GC log.
> > > > > > > >
> > > > > > > > > 2. We're in testing phase so we wanted a SOLR cloud
> > configuration,
> > > > > > we will
> > > > > > > > > most likely have a much bigger deployment once going to
> > production.
> > > > > > In prod
> > > > > > > > > right now, we currently to run a six machines Riak cluster.
> > Riak is a
> > > > > > > > > key/value document store an has SOLR built-in for search,
> > but we are
> > > > > > trying
> > > > > > > > > to push the key/value aspect of Riak inside SOLR. That way
> > we would
> > > > > > have
> > > > > > > > > one less piece to worry about in our system.
> > > > > > > >
> > > > > > > > Solr is not a database.  It is not intended to be a data
> > repository.
> > > > > > > > All of its optimizations (most of which are actually in
> > Lucene) are
> > > > > > > > geared towards search.  While technically it can be a
> > key-value store,
> > > > > > > > that is not what it was MADE for.  Software actually designed
> > for that
> > > > > > > > role is going to be much better than Solr as a key-value
> store.
> > > > > > > >
> > > > > > > > > When I say null document, I mean the /get API returns:
> {doc:
> > null}
> > > > > > > > >
> > > > > > > > > The problem is definitely not always there. We also have
> > large
> > > > > > period of
> > > > > > > > > time (few hours) were we have no problems. I'm just
> extremely
> > > > > > hesitant on
> > > > > > > > > retrying when I get a null document because in some case,
> > getting a
> > > > > > null
> > > > > > > > > document is a valid outcome. Our caching layer heavily rely
> > on this
> > > > > > for
> > > > > > > > > example. If I was to retry every nulls I'd pay a big
> penalty
> > in
> > > > > > > > > performance.
> > > > > > > >
> > > > > > > > I've just done a little test with the 7.5.0 techproducts
> > example.  It
> > > > > > > > looks like returning doc:null actually is how the RTG handler
> > says it
> > > > > > > > didn't find the document.  This seems very wrong to me, but I
> > didn't
> > > > > > > > design it, and that response needs SOME kind of format.
> > > > > > > >
> > > > > > > > Have you done any testing to see whether the standard
> > searching handler
> > > > > > > > (typically /select, but many other URL paths are possible)
> > returns
> > > > > > > > results when RTG doesn't?  Do you know for these failures
> > whether the
> > > > > > > > document has been committed or not?
> > > > > > > >
> > > > > > > > > As for your last comment, part of our testing phase is also
> > testing
> > > > > > the
> > > > > > > > > limits. Our framework has auto-scaling built-in so if we
> > have a
> > > > > > burst of
> > > > > > > > > request, the system will automatically spin up more
> clients.
> > We're
> > > > > > pushing
> > > > > > > > > 10% of our production system to that Test server to see how
> > it will
> > > > > > handle
> > > > > > > > > it.
> > > > > > > >
> > > > > > > > To spin up another replica, Solr must copy all its index data
> > from the
> > > > > > > > leader replica.  Not only can this take a long time if the
> > index is
> > > > > > big,
> > > > > > > > but it will put a lot of extra I/O load on the machine(s)
> with
> > the
> > > > > > > > leader roles.  So performance will actually be WORSE before
> it
> > gets
> > > > > > > > better when you spin up another replica, and if the index is
> > big, that
> > > > > > > > condition will persist for quite a while.  Copying the index
> > data will
> > > > > > > > be constrained by the speed of your network and by the speed
> > of your
> > > > > > > > disks.  Often the disks are slower than the network, but that
> > is not
> > > > > > > > always the case.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Shawn
> > > > > > > >
> > > > > >
> >
>

Re: Realtime get not always returning existing data

Posted by sgaron cse <sg...@gmail.com>.
I haven't found a way to reproduce the problem other than running our
entire set of code. I've also been trying different things to make sure the
problem is not from my end, and so far I haven't managed to fix it by
changing my code. It has to be a race condition somewhere, but I just can't
put my finger on it.

I'll message back if I find a way to reproduce.

On Wed, Oct 10, 2018 at 10:48 AM Erick Erickson <er...@gmail.com>
wrote:

> Well assigning a bogus version that generates a 409 error then
> immediately doing an RTG on the doc doesn't fail for me either 18
> million tries later. So I'm afraid I haven't a clue where to go from
> here. Unless we can somehow find a way to generate this failure I'm
> going to drop it for the foreseeable future.
>
> Erick
> On Tue, Oct 9, 2018 at 7:39 AM Erick Erickson <er...@gmail.com>
> wrote:
> >
> > Hmmmm. I wonder if a version conflict or perhaps other failure can
> > somehow cause this. It shouldn't be very hard to add that to my test
> > setup, just randomly add n _version_ field value.
> >
> > Erick
> > On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <er...@gmail.com>
> wrote:
> > >
> > > Thanks. I'll be away for the rest of the week, so won't be able to try
> > > anything more....
> > > On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <cu...@iq.media> wrote:
> > > >
> > > > In our case, we are heavily indexing in the collection while the /get
> > > > requests are happening which is what we assumed was causing this
> very rare
> > > > behavior. However, we have experienced the problem for a collection
> where
> > > > the following happens in sequence with minutes in between them.
> > > >
> > > > 1. Document id=1 is indexed
> > > > 2. Document successfully retrieved with /get?id=1
> > > > 3. Document failed to be retrieved with /get?id=1
> > > > 4. Document successfully retrieved with /get?id=1
> > > >
> > > > We've haven't looked at the issue in a while, so I don't have the
> exact
> > > > timing of that sequence on hand right now. I'll try to find an actual
> > > > example, although I'm relatively certain it was multiple minutes in
> between
> > > > each of those requests. However our autocommit (and soft commit)
> times are
> > > > 60s for both collections.
> > > >
> > > > I think the following two are probably the biggest differences for
> our
> > > > setup, besides the version difference (v6.3.0):
> > > >
> > > > > index to this collection, perhaps not at a high rate
> > > > > separate the machines running solr from the one doing any querying
> or
> > > > indexing
> > > >
> > > > The clients are on 3 hosts separate from the solr instances. The
> total
> > > > number of threads that are making updates and making /get requests is
> > > > around 120-150. About 40-50 per host. Each of our two collections
> gets an
> > > > average of 500 requests per second constantly for ~5 minutes, and
> then the
> > > > number slowly tapers off to essentially 0 after ~15 minutes.
> > > >
> > > > Every thread attempts to make the same series of requests.
> > > >
> > > > -- Update with "_version_=-1". If successful, no other requests are
> made.
> > > > -- On 409 Conflict failure, it makes a /get request for the id
> > > > -- On doc:null failure, the client handles the error and moves on
> > > >
> > > > Combining this with the previous series of /get requests, we end up
> with
> > > > situations where an update fails as expected, but the subsequent /get
> > > > request fails to retrieve the existing document:
> > > >
> > > > 1. Thread 1 updates id=1 successfully
> > > > 2. Thread 2 tries to update id=1, fails (409)
> > > > 3. Thread 2 tries to get id=1 succeeds.
> > > >
> > > > ...Minutes later...
> > > >
> > > > 4. Thread 3 tries to update id=1, fails (409)
> > > > 5. Thread 3 tries to get id=1, fails (doc:null)
> > > >
> > > > ...Minutes later...
> > > >
> > > > 6. Thread 4 tries to update id=1, fails (409)
> > > > 7. Thread 4 tries to get id=1 succeeds.
> > > >
> > > > As Steven mentioned, it happens very, very rarely. We tried to
> recreate it
> > > > in a more controlled environment, but ran into the same issue that
> you are,
> > > > Erick. Every simplified situation we ran produced no problems. Since
> it's
> > > > not a large issue for us and happens very rarely, we stopped trying
> to
> > > > recreate it.
> > > >
> > > >
> > > > On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <
> erickerickson@gmail.com>
> > > > wrote:
> > > >
> > > > > 57 million queries later, with constant indexing going on and 9
> dummy
> > > > > collections in the mix and the main collection I'm querying having
> 2
> > > > > shards, 2 replicas each, I have no errors.
> > > > >
> > > > > So unless the code doesn't look like it exercises any similar path,
> > > > > I'm not sure what more I can test. "It works on my machine" ;)
> > > > >
> > > > > Here's my querying code, does it look like it what you're seeing?
> > > > >
> > > > >       while (Main.allStop.get() == false) {
> > > > >         try (SolrClient client = new HttpSolrClient.Builder()
> > > > > //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
> > > > >             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build())
> {
> > > > >
> > > > >           //SolrQuery query = new SolrQuery();
> > > > >           String lower = Integer.toString(rand.nextInt(1_000_000));
> > > > >           SolrDocument rsp = client.getById(lower);
> > > > >           if (rsp == null) {
> > > > >             System.out.println("Got a null response!");
> > > > >             Main.allStop.set(true);
> > > > >           }
> > > > >
> > > > >           rsp = client.getById(lower);
> > > > >
> > > > >           if (rsp.get("id").equals(lower) == false) {
> > > > >             System.out.println("Got an invalid response, looking
> for "
> > > > > + lower + " got: " + rsp.get("id"));
> > > > >             Main.allStop.set(true);
> > > > >           }
> > > > >           long queries = Main.eoeCounter.incrementAndGet();
> > > > >           if ((queries % 100_000) == 0) {
> > > > >             long seconds = (System.currentTimeMillis() -
> Main.start) /
> > > > > 1000;
> > > > >             System.out.println("Query count: " +
> > > > > numFormatter.format(queries) + ", rate is " +
> > > > > numFormatter.format(queries / seconds) + " QPS");
> > > > >           }
> > > > >         } catch (Exception cle) {
> > > > >           cle.printStackTrace();
> > > > >           Main.allStop.set(true);
> > > > >         }
> > > > >       }
> > > > >   }On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > > > > <er...@gmail.com> wrote:
> > > > > >
> > > > > > Steve:
> > > > > >
> > > > > > bq.  Basically, one core had data in it that should belong to
> another
> > > > > > core. Here's my question about this: Is it possible that two
> request to
> > > > > the
> > > > > > /get API coming in at the same time would get confused and
> either both
> > > > > get
> > > > > > the same result or result get inverted?
> > > > > >
> > > > > > Well, that shouldn't be happening, these are all supposed to be
> > > > > thread-safe
> > > > > > calls.... All things are possible of course ;)
> > > > > >
> > > > > > If two replicas of the same shard have different documents, that
> could
> > > > > account
> > > > > > for what you're seeing, meanwhile begging the question of why
> that is
> > > > > the case
> > > > > > since it should never be true for a quiescent index. Technically
> there
> > > > > _are_
> > > > > > conditions where this is true on a very temporary basis, commits
> on the
> > > > > leader
> > > > > > and follower can trigger at different wall-clock times. Say your
> soft
> > > > > commit
> > > > > > (or hard-commit-with-opensearcher-true) is 10 seconds. It should
> never
> > > > > be the
> > > > > > case that s1r1 and s1r2 are out of sync 10 seconds after the
> last update
> > > > > was
> > > > > > sent. This doesn't seem likely from what you've described
> though...
> > > > > >
> > > > > > Hmmmm. I guess that one other thing I can set up is to have a
> bunch of
> > > > > dummy
> > > > > > collections laying around. Currently I have only the active one,
> and
> > > > > > if there's some
> > > > > > code path whereby the RTG request goes to a replica of a
> different
> > > > > > collection, my
> > > > > > test setup wouldn't reproduce it.
> > > > > >
> > > > > > Currently, I'm running a 2-shard, 1 replica setup, so if there's
> some
> > > > > > way that the replicas
> > > > > > get out of sync that wouldn't show either.
> > > > > >
> > > > > > So I'm starting another run with these changes:
> > > > > > > opening a new connection each query
> > > > > > > switched so the collection I'm querying is 2x2
> > > > > > > added some dummy collections that are empty
> > > > > >
> > > > > > One nit, while "core" is exactly correct. When we talk about a
> core
> > > > > > that's part of a collection, we try to use "replica" to be clear
> we're
> > > > > > talking about
> > > > > > a core with some added characteristics, i.e. we're in
> SolrCloud-land.
> > > > > > No big deal
> > > > > > of course....
> > > > > >
> > > > > > Best,
> > > > > > Erick
> > > > > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <
> apache@elyograg.org>
> > > > > wrote:
> > > > > > >
> > > > > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > > > > @Shawn
> > > > > > > > We're running two instance on one machine for two reason:
> > > > > > > > 1. The box has plenty of resources (48 cores / 256GB ram)
> and since
> > > > > I was
> > > > > > > > reading that it's not recommended to use more than 31GB of
> heap in
> > > > > SOLR we
> > > > > > > > figured 96 GB for keeping index data in OS cache + 31 GB of
> heap per
> > > > > > > > instance was a good idea.
> > > > > > >
> > > > > > > Do you know that these Solr instances actually DO need 31 GB
> of heap,
> > > > > or
> > > > > > > are you following advice from somewhere, saying "use one
> quarter of
> > > > > your
> > > > > > > memory as the heap size"?  That advice is not in the Solr
> > > > > documentation,
> > > > > > > and never will be.  Figuring out the right heap size requires
> > > > > > > experimentation.
> > > > > > >
> > > > > > >
> > > > >
> https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > > > > >
> > > > > > > How big (on disk) are each of these nine cores, and how many
> documents
> > > > > > > are in each one?  Which of them is in each Solr instance?
> With that
> > > > > > > information, we can make a *guess* about how big your heap
> should be.
> > > > > > > Figuring out whether the guess is correct generally requires
> careful
> > > > > > > analysis of a GC log.
> > > > > > >
> > > > > > > > 2. We're in testing phase so we wanted a SOLR cloud
> configuration,
> > > > > we will
> > > > > > > > most likely have a much bigger deployment once going to
> production.
> > > > > In prod
> > > > > > > > right now, we currently to run a six machines Riak cluster.
> Riak is a
> > > > > > > > key/value document store an has SOLR built-in for search,
> but we are
> > > > > trying
> > > > > > > > to push the key/value aspect of Riak inside SOLR. That way
> we would
> > > > > have
> > > > > > > > one less piece to worry about in our system.
> > > > > > >
> > > > > > > Solr is not a database.  It is not intended to be a data
> repository.
> > > > > > > All of its optimizations (most of which are actually in
> Lucene) are
> > > > > > > geared towards search.  While technically it can be a
> key-value store,
> > > > > > > that is not what it was MADE for.  Software actually designed
> for that
> > > > > > > role is going to be much better than Solr as a key-value store.
> > > > > > >
> > > > > > > > When I say null document, I mean the /get API returns: {doc:
> null}
> > > > > > > >
> > > > > > > > The problem is definitely not always there. We also have
> large
> > > > > period of
> > > > > > > > time (few hours) were we have no problems. I'm just extremely
> > > > > hesitant on
> > > > > > > > retrying when I get a null document because in some case,
> getting a
> > > > > null
> > > > > > > > document is a valid outcome. Our caching layer heavily rely
> on this
> > > > > for
> > > > > > > > example. If I was to retry every nulls I'd pay a big penalty
> in
> > > > > > > > performance.
> > > > > > >
> > > > > > > I've just done a little test with the 7.5.0 techproducts
> example.  It
> > > > > > > looks like returning doc:null actually is how the RTG handler
> says it
> > > > > > > didn't find the document.  This seems very wrong to me, but I
> didn't
> > > > > > > design it, and that response needs SOME kind of format.
> > > > > > >
> > > > > > > Have you done any testing to see whether the standard
> searching handler
> > > > > > > (typically /select, but many other URL paths are possible)
> returns
> > > > > > > results when RTG doesn't?  Do you know for these failures
> whether the
> > > > > > > document has been committed or not?
> > > > > > >
> > > > > > > > As for your last comment, part of our testing phase is also
> testing
> > > > > the
> > > > > > > > limits. Our framework has auto-scaling built-in so if we
> have a
> > > > > burst of
> > > > > > > > request, the system will automatically spin up more clients.
> We're
> > > > > pushing
> > > > > > > > 10% of our production system to that Test server to see how
> it will
> > > > > handle
> > > > > > > > it.
> > > > > > >
> > > > > > > To spin up another replica, Solr must copy all its index data
> from the
> > > > > > > leader replica.  Not only can this take a long time if the
> index is
> > > > > big,
> > > > > > > but it will put a lot of extra I/O load on the machine(s) with
> the
> > > > > > > leader roles.  So performance will actually be WORSE before it
> gets
> > > > > > > better when you spin up another replica, and if the index is
> big, that
> > > > > > > condition will persist for quite a while.  Copying the index
> data will
> > > > > > > be constrained by the speed of your network and by the speed
> of your
> > > > > > > disks.  Often the disks are slower than the network, but that
> is not
> > > > > > > always the case.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Shawn
> > > > > > >
> > > > >
>

Re: Realtime get not always returning existing data

Posted by Erick Erickson <er...@gmail.com>.
Well, assigning a bogus version that generates a 409 error and then
immediately doing an RTG on the doc doesn't fail for me either, 18
million tries later. So I'm afraid I haven't a clue where to go from
here. Unless we can somehow find a way to generate this failure, I'm
going to drop it for the foreseeable future.
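
In case it helps anyone else trying to reproduce this, a minimal sketch of
that pattern (force a 409 via optimistic concurrency, then immediately RTG
the same id) could look like the code below. The class name and the id are
made up for illustration, and the base URL just follows the earlier test
code; it is not the exact harness used for the run above.

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.SolrException;
  import org.apache.solr.common.SolrInputDocument;

  public class ConflictThenRtg {
    public static void main(String[] args) throws Exception {
      try (SolrClient client = new HttpSolrClient.Builder()
          .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
        // An id assumed to already exist in the collection, so the
        // conditional add below is expected to conflict.
        String id = "42";

        // _version_ = -1 means "only add if the doc does not already
        // exist", so adding an existing id this way should return 409.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("_version_", -1L);

        try {
          client.add(doc);
        } catch (SolrException e) {
          if (e.code() == 409) {
            // The doc exists, so an RTG for it should never be null.
            SolrDocument rsp = client.getById(id);
            if (rsp == null) {
              System.out.println("RTG returned doc:null for id " + id);
            }
          } else {
            throw e;
          }
        }
      }
    }
  }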

Erick
On Tue, Oct 9, 2018 at 7:39 AM Erick Erickson <er...@gmail.com> wrote:
>
> Hmmmm. I wonder if a version conflict or perhaps other failure can
> somehow cause this. It shouldn't be very hard to add that to my test
> setup, just randomly add n _version_ field value.
>
> Erick
> On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <er...@gmail.com> wrote:
> >
> > Thanks. I'll be away for the rest of the week, so won't be able to try
> > anything more....
> > On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <cu...@iq.media> wrote:
> > >
> > > In our case, we are heavily indexing in the collection while the /get
> > > requests are happening which is what we assumed was causing this very rare
> > > behavior. However, we have experienced the problem for a collection where
> > > the following happens in sequence with minutes in between them.
> > >
> > > 1. Document id=1 is indexed
> > > 2. Document successfully retrieved with /get?id=1
> > > 3. Document failed to be retrieved with /get?id=1
> > > 4. Document successfully retrieved with /get?id=1
> > >
> > > We've haven't looked at the issue in a while, so I don't have the exact
> > > timing of that sequence on hand right now. I'll try to find an actual
> > > example, although I'm relatively certain it was multiple minutes in between
> > > each of those requests. However our autocommit (and soft commit) times are
> > > 60s for both collections.
> > >
> > > I think the following two are probably the biggest differences for our
> > > setup, besides the version difference (v6.3.0):
> > >
> > > > index to this collection, perhaps not at a high rate
> > > > separate the machines running solr from the one doing any querying or
> > > indexing
> > >
> > > The clients are on 3 hosts separate from the solr instances. The total
> > > number of threads that are making updates and making /get requests is
> > > around 120-150. About 40-50 per host. Each of our two collections gets an
> > > average of 500 requests per second constantly for ~5 minutes, and then the
> > > number slowly tapers off to essentially 0 after ~15 minutes.
> > >
> > > Every thread attempts to make the same series of requests.
> > >
> > > -- Update with "_version_=-1". If successful, no other requests are made.
> > > -- On 409 Conflict failure, it makes a /get request for the id
> > > -- On doc:null failure, the client handles the error and moves on
> > >
> > > Combining this with the previous series of /get requests, we end up with
> > > situations where an update fails as expected, but the subsequent /get
> > > request fails to retrieve the existing document:
> > >
> > > 1. Thread 1 updates id=1 successfully
> > > 2. Thread 2 tries to update id=1, fails (409)
> > > 3. Thread 2 tries to get id=1 succeeds.
> > >
> > > ...Minutes later...
> > >
> > > 4. Thread 3 tries to update id=1, fails (409)
> > > 5. Thread 3 tries to get id=1, fails (doc:null)
> > >
> > > ...Minutes later...
> > >
> > > 6. Thread 4 tries to update id=1, fails (409)
> > > 7. Thread 4 tries to get id=1 succeeds.
> > >
> > > As Steven mentioned, it happens very, very rarely. We tried to recreate it
> > > in a more controlled environment, but ran into the same issue that you are,
> > > Erick. Every simplified situation we ran produced no problems. Since it's
> > > not a large issue for us and happens very rarely, we stopped trying to
> > > recreate it.
> > >
> > >
> > > On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <er...@gmail.com>
> > > wrote:
> > >
> > > > 57 million queries later, with constant indexing going on and 9 dummy
> > > > collections in the mix and the main collection I'm querying having 2
> > > > shards, 2 replicas each, I have no errors.
> > > >
> > > > So unless the code doesn't look like it exercises any similar path,
> > > > I'm not sure what more I can test. "It works on my machine" ;)
> > > >
> > > > Here's my querying code, does it look like it what you're seeing?
> > > >
> > > >       while (Main.allStop.get() == false) {
> > > >         try (SolrClient client = new HttpSolrClient.Builder()
> > > > //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
> > > >             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
> > > >
> > > >           //SolrQuery query = new SolrQuery();
> > > >           String lower = Integer.toString(rand.nextInt(1_000_000));
> > > >           SolrDocument rsp = client.getById(lower);
> > > >           if (rsp == null) {
> > > >             System.out.println("Got a null response!");
> > > >             Main.allStop.set(true);
> > > >           }
> > > >
> > > >           rsp = client.getById(lower);
> > > >
> > > >           if (rsp.get("id").equals(lower) == false) {
> > > >             System.out.println("Got an invalid response, looking for "
> > > > + lower + " got: " + rsp.get("id"));
> > > >             Main.allStop.set(true);
> > > >           }
> > > >           long queries = Main.eoeCounter.incrementAndGet();
> > > >           if ((queries % 100_000) == 0) {
> > > >             long seconds = (System.currentTimeMillis() - Main.start) /
> > > > 1000;
> > > >             System.out.println("Query count: " +
> > > > numFormatter.format(queries) + ", rate is " +
> > > > numFormatter.format(queries / seconds) + " QPS");
> > > >           }
> > > >         } catch (Exception cle) {
> > > >           cle.printStackTrace();
> > > >           Main.allStop.set(true);
> > > >         }
> > > >       }
> > > >   }On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > > > <er...@gmail.com> wrote:
> > > > >
> > > > > Steve:
> > > > >
> > > > > bq.  Basically, one core had data in it that should belong to another
> > > > > core. Here's my question about this: Is it possible that two request to
> > > > the
> > > > > /get API coming in at the same time would get confused and either both
> > > > get
> > > > > the same result or result get inverted?
> > > > >
> > > > > Well, that shouldn't be happening, these are all supposed to be
> > > > thread-safe
> > > > > calls.... All things are possible of course ;)
> > > > >
> > > > > If two replicas of the same shard have different documents, that could
> > > > account
> > > > > for what you're seeing, meanwhile begging the question of why that is
> > > > the case
> > > > > since it should never be true for a quiescent index. Technically there
> > > > _are_
> > > > > conditions where this is true on a very temporary basis, commits on the
> > > > leader
> > > > > and follower can trigger at different wall-clock times. Say your soft
> > > > commit
> > > > > (or hard-commit-with-opensearcher-true) is 10 seconds. It should never
> > > > be the
> > > > > case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> > > > was
> > > > > sent. This doesn't seem likely from what you've described though...
> > > > >
> > > > > Hmmmm. I guess that one other thing I can set up is to have a bunch of
> > > > dummy
> > > > > collections laying around. Currently I have only the active one, and
> > > > > if there's some
> > > > > code path whereby the RTG request goes to a replica of a different
> > > > > collection, my
> > > > > test setup wouldn't reproduce it.
> > > > >
> > > > > Currently, I'm running a 2-shard, 1 replica setup, so if there's some
> > > > > way that the replicas
> > > > > get out of sync that wouldn't show either.
> > > > >
> > > > > So I'm starting another run with these changes:
> > > > > > opening a new connection each query
> > > > > > switched so the collection I'm querying is 2x2
> > > > > > added some dummy collections that are empty
> > > > >
> > > > > One nit, while "core" is exactly correct. When we talk about a core
> > > > > that's part of a collection, we try to use "replica" to be clear we're
> > > > > talking about
> > > > > a core with some added characteristics, i.e. we're in SolrCloud-land.
> > > > > No big deal
> > > > > of course....
> > > > >
> > > > > Best,
> > > > > Erick
> > > > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <ap...@elyograg.org>
> > > > wrote:
> > > > > >
> > > > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > > > @Shawn
> > > > > > > We're running two instance on one machine for two reason:
> > > > > > > 1. The box has plenty of resources (48 cores / 256GB ram) and since
> > > > I was
> > > > > > > reading that it's not recommended to use more than 31GB of heap in
> > > > SOLR we
> > > > > > > figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > > > > > > instance was a good idea.
> > > > > >
> > > > > > Do you know that these Solr instances actually DO need 31 GB of heap,
> > > > or
> > > > > > are you following advice from somewhere, saying "use one quarter of
> > > > your
> > > > > > memory as the heap size"?  That advice is not in the Solr
> > > > documentation,
> > > > > > and never will be.  Figuring out the right heap size requires
> > > > > > experimentation.
> > > > > >
> > > > > >
> > > > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > > > >
> > > > > > How big (on disk) are each of these nine cores, and how many documents
> > > > > > are in each one?  Which of them is in each Solr instance?  With that
> > > > > > information, we can make a *guess* about how big your heap should be.
> > > > > > Figuring out whether the guess is correct generally requires careful
> > > > > > analysis of a GC log.
> > > > > >
> > > > > > > 2. We're in testing phase so we wanted a SOLR cloud configuration,
> > > > we will
> > > > > > > most likely have a much bigger deployment once going to production.
> > > > In prod
> > > > > > > right now, we currently to run a six machines Riak cluster. Riak is a
> > > > > > > key/value document store an has SOLR built-in for search, but we are
> > > > trying
> > > > > > > to push the key/value aspect of Riak inside SOLR. That way we would
> > > > have
> > > > > > > one less piece to worry about in our system.
> > > > > >
> > > > > > Solr is not a database.  It is not intended to be a data repository.
> > > > > > All of its optimizations (most of which are actually in Lucene) are
> > > > > > geared towards search.  While technically it can be a key-value store,
> > > > > > that is not what it was MADE for.  Software actually designed for that
> > > > > > role is going to be much better than Solr as a key-value store.
> > > > > >
> > > > > > > When I say null document, I mean the /get API returns: {doc: null}
> > > > > > >
> > > > > > > The problem is definitely not always there. We also have large
> > > > period of
> > > > > > > time (few hours) were we have no problems. I'm just extremely
> > > > hesitant on
> > > > > > > retrying when I get a null document because in some case, getting a
> > > > null
> > > > > > > document is a valid outcome. Our caching layer heavily rely on this
> > > > for
> > > > > > > example. If I was to retry every nulls I'd pay a big penalty in
> > > > > > > performance.
> > > > > >
> > > > > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > > > > looks like returning doc:null actually is how the RTG handler says it
> > > > > > didn't find the document.  This seems very wrong to me, but I didn't
> > > > > > design it, and that response needs SOME kind of format.
> > > > > >
> > > > > > Have you done any testing to see whether the standard searching handler
> > > > > > (typically /select, but many other URL paths are possible) returns
> > > > > > results when RTG doesn't?  Do you know for these failures whether the
> > > > > > document has been committed or not?
> > > > > >
> > > > > > > As for your last comment, part of our testing phase is also testing
> > > > the
> > > > > > > limits. Our framework has auto-scaling built-in so if we have a
> > > > burst of
> > > > > > > request, the system will automatically spin up more clients. We're
> > > > pushing
> > > > > > > 10% of our production system to that Test server to see how it will
> > > > handle
> > > > > > > it.
> > > > > >
> > > > > > To spin up another replica, Solr must copy all its index data from the
> > > > > > leader replica.  Not only can this take a long time if the index is
> > > > big,
> > > > > > but it will put a lot of extra I/O load on the machine(s) with the
> > > > > > leader roles.  So performance will actually be WORSE before it gets
> > > > > > better when you spin up another replica, and if the index is big, that
> > > > > > condition will persist for quite a while.  Copying the index data will
> > > > > > be constrained by the speed of your network and by the speed of your
> > > > > > disks.  Often the disks are slower than the network, but that is not
> > > > > > always the case.
> > > > > >
> > > > > > Thanks,
> > > > > > Shawn
> > > > > >
> > > >

Re: Realtime get not always returning existing data

Posted by Erick Erickson <er...@gmail.com>.
Hmmmm. I wonder if a version conflict, or perhaps some other failure, can
somehow cause this. It shouldn't be very hard to add that to my test
setup: just randomly add a _version_ field value.
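
Something like the sketch below would do it (this is illustrative only, not
the actual test harness: the collection URL, the bogus version value, and
the one-in-ten ratio are all placeholders). It attaches a bogus _version_
to a fraction of the adds and treats the resulting 409s as expected:

  import java.util.Random;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrException;
  import org.apache.solr.common.SolrInputDocument;

  public class RandomVersionIndexer {
    public static void main(String[] args) throws Exception {
      Random rand = new Random();
      try (SolrClient client = new HttpSolrClient.Builder()
          .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
        for (int i = 0; i < 1_000_000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Integer.toString(rand.nextInt(1_000_000)));
          if (rand.nextInt(10) == 0) {
            // Bogus positive version: expected to fail with 409 Conflict
            // unless it happens to match the doc's current _version_.
            doc.addField("_version_", 12345L);
          }
          try {
            client.add(doc);
          } catch (SolrException e) {
            if (e.code() != 409) {
              throw e; // only version conflicts are expected here
            }
          }
        }
      }
    }
  }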

Erick
On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <er...@gmail.com> wrote:
>
> Thanks. I'll be away for the rest of the week, so won't be able to try
> anything more....
> On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <cu...@iq.media> wrote:
> >
> > In our case, we are heavily indexing in the collection while the /get
> > requests are happening which is what we assumed was causing this very rare
> > behavior. However, we have experienced the problem for a collection where
> > the following happens in sequence with minutes in between them.
> >
> > 1. Document id=1 is indexed
> > 2. Document successfully retrieved with /get?id=1
> > 3. Document failed to be retrieved with /get?id=1
> > 4. Document successfully retrieved with /get?id=1
> >
> > We've haven't looked at the issue in a while, so I don't have the exact
> > timing of that sequence on hand right now. I'll try to find an actual
> > example, although I'm relatively certain it was multiple minutes in between
> > each of those requests. However our autocommit (and soft commit) times are
> > 60s for both collections.
> >
> > I think the following two are probably the biggest differences for our
> > setup, besides the version difference (v6.3.0):
> >
> > > index to this collection, perhaps not at a high rate
> > > separate the machines running solr from the one doing any querying or
> > indexing
> >
> > The clients are on 3 hosts separate from the solr instances. The total
> > number of threads that are making updates and making /get requests is
> > around 120-150. About 40-50 per host. Each of our two collections gets an
> > average of 500 requests per second constantly for ~5 minutes, and then the
> > number slowly tapers off to essentially 0 after ~15 minutes.
> >
> > Every thread attempts to make the same series of requests.
> >
> > -- Update with "_version_=-1". If successful, no other requests are made.
> > -- On 409 Conflict failure, it makes a /get request for the id
> > -- On doc:null failure, the client handles the error and moves on
> >
> > Combining this with the previous series of /get requests, we end up with
> > situations where an update fails as expected, but the subsequent /get
> > request fails to retrieve the existing document:
> >
> > 1. Thread 1 updates id=1 successfully
> > 2. Thread 2 tries to update id=1, fails (409)
> > 3. Thread 2 tries to get id=1 succeeds.
> >
> > ...Minutes later...
> >
> > 4. Thread 3 tries to update id=1, fails (409)
> > 5. Thread 3 tries to get id=1, fails (doc:null)
> >
> > ...Minutes later...
> >
> > 6. Thread 4 tries to update id=1, fails (409)
> > 7. Thread 4 tries to get id=1 succeeds.
> >
> > As Steven mentioned, it happens very, very rarely. We tried to recreate it
> > in a more controlled environment, but ran into the same issue that you are,
> > Erick. Every simplified situation we ran produced no problems. Since it's
> > not a large issue for us and happens very rarely, we stopped trying to
> > recreate it.
> >
> >
> > On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <er...@gmail.com>
> > wrote:
> >
> > > 57 million queries later, with constant indexing going on and 9 dummy
> > > collections in the mix and the main collection I'm querying having 2
> > > shards, 2 replicas each, I have no errors.
> > >
> > > So unless the code doesn't look like it exercises any similar path,
> > > I'm not sure what more I can test. "It works on my machine" ;)
> > >
> > > Here's my querying code, does it look like it what you're seeing?
> > >
> > >       while (Main.allStop.get() == false) {
> > >         try (SolrClient client = new HttpSolrClient.Builder()
> > > //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
> > >             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
> > >
> > >           //SolrQuery query = new SolrQuery();
> > >           String lower = Integer.toString(rand.nextInt(1_000_000));
> > >           SolrDocument rsp = client.getById(lower);
> > >           if (rsp == null) {
> > >             System.out.println("Got a null response!");
> > >             Main.allStop.set(true);
> > >           }
> > >
> > >           rsp = client.getById(lower);
> > >
> > >           if (rsp.get("id").equals(lower) == false) {
> > >             System.out.println("Got an invalid response, looking for "
> > > + lower + " got: " + rsp.get("id"));
> > >             Main.allStop.set(true);
> > >           }
> > >           long queries = Main.eoeCounter.incrementAndGet();
> > >           if ((queries % 100_000) == 0) {
> > >             long seconds = (System.currentTimeMillis() - Main.start) /
> > > 1000;
> > >             System.out.println("Query count: " +
> > > numFormatter.format(queries) + ", rate is " +
> > > numFormatter.format(queries / seconds) + " QPS");
> > >           }
> > >         } catch (Exception cle) {
> > >           cle.printStackTrace();
> > >           Main.allStop.set(true);
> > >         }
> > >       }
> > >   }On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > > <er...@gmail.com> wrote:
> > > >
> > > > Steve:
> > > >
> > > > bq.  Basically, one core had data in it that should belong to another
> > > > core. Here's my question about this: Is it possible that two request to
> > > the
> > > > /get API coming in at the same time would get confused and either both
> > > get
> > > > the same result or result get inverted?
> > > >
> > > > Well, that shouldn't be happening, these are all supposed to be
> > > thread-safe
> > > > calls.... All things are possible of course ;)
> > > >
> > > > If two replicas of the same shard have different documents, that could
> > > account
> > > > for what you're seeing, meanwhile begging the question of why that is
> > > the case
> > > > since it should never be true for a quiescent index. Technically there
> > > _are_
> > > > conditions where this is true on a very temporary basis, commits on the
> > > leader
> > > > and follower can trigger at different wall-clock times. Say your soft
> > > commit
> > > > (or hard-commit-with-opensearcher-true) is 10 seconds. It should never
> > > be the
> > > > case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> > > was
> > > > sent. This doesn't seem likely from what you've described though...
> > > >
> > > > Hmmmm. I guess that one other thing I can set up is to have a bunch of
> > > dummy
> > > > collections laying around. Currently I have only the active one, and
> > > > if there's some
> > > > code path whereby the RTG request goes to a replica of a different
> > > > collection, my
> > > > test setup wouldn't reproduce it.
> > > >
> > > > Currently, I'm running a 2-shard, 1 replica setup, so if there's some
> > > > way that the replicas
> > > > get out of sync that wouldn't show either.
> > > >
> > > > So I'm starting another run with these changes:
> > > > > opening a new connection each query
> > > > > switched so the collection I'm querying is 2x2
> > > > > added some dummy collections that are empty
> > > >
> > > > One nit, while "core" is exactly correct. When we talk about a core
> > > > that's part of a collection, we try to use "replica" to be clear we're
> > > > talking about
> > > > a core with some added characteristics, i.e. we're in SolrCloud-land.
> > > > No big deal
> > > > of course....
> > > >
> > > > Best,
> > > > Erick
> > > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <ap...@elyograg.org>
> > > wrote:
> > > > >
> > > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > > @Shawn
> > > > > > We're running two instance on one machine for two reason:
> > > > > > 1. The box has plenty of resources (48 cores / 256GB ram) and since
> > > I was
> > > > > > reading that it's not recommended to use more than 31GB of heap in
> > > SOLR we
> > > > > > figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > > > > > instance was a good idea.
> > > > >
> > > > > Do you know that these Solr instances actually DO need 31 GB of heap,
> > > or
> > > > > are you following advice from somewhere, saying "use one quarter of
> > > your
> > > > > memory as the heap size"?  That advice is not in the Solr
> > > documentation,
> > > > > and never will be.  Figuring out the right heap size requires
> > > > > experimentation.
> > > > >
> > > > >
> > > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > > >
> > > > > How big (on disk) are each of these nine cores, and how many documents
> > > > > are in each one?  Which of them is in each Solr instance?  With that
> > > > > information, we can make a *guess* about how big your heap should be.
> > > > > Figuring out whether the guess is correct generally requires careful
> > > > > analysis of a GC log.
> > > > >
> > > > > > 2. We're in testing phase so we wanted a SOLR cloud configuration,
> > > we will
> > > > > > most likely have a much bigger deployment once going to production.
> > > In prod
> > > > > right now, we currently run a six-machine Riak cluster. Riak is a
> > > > > key/value document store and has SOLR built-in for search, but we are
> > > trying
> > > > > > to push the key/value aspect of Riak inside SOLR. That way we would
> > > have
> > > > > > one less piece to worry about in our system.
> > > > >
> > > > > Solr is not a database.  It is not intended to be a data repository.
> > > > > All of its optimizations (most of which are actually in Lucene) are
> > > > > geared towards search.  While technically it can be a key-value store,
> > > > > that is not what it was MADE for.  Software actually designed for that
> > > > > role is going to be much better than Solr as a key-value store.
> > > > >
> > > > > > When I say null document, I mean the /get API returns: {doc: null}
> > > > > >
> > > > > > The problem is definitely not always there. We also have large
> > > periods of
> > > > > time (a few hours) where we have no problems. I'm just extremely
> > > hesitant on
> > > > > retrying when I get a null document because in some cases, getting a
> > > null
> > > > > document is a valid outcome. Our caching layer heavily relies on this
> > > for
> > > > > example. If I were to retry every null I'd pay a big penalty in
> > > > > > performance.
> > > > >
> > > > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > > > looks like returning doc:null actually is how the RTG handler says it
> > > > > didn't find the document.  This seems very wrong to me, but I didn't
> > > > > design it, and that response needs SOME kind of format.
> > > > >
> > > > > Have you done any testing to see whether the standard searching handler
> > > > > (typically /select, but many other URL paths are possible) returns
> > > > > results when RTG doesn't?  Do you know for these failures whether the
> > > > > document has been committed or not?
> > > > >
> > > > > > As for your last comment, part of our testing phase is also testing
> > > the
> > > > > > limits. Our framework has auto-scaling built-in so if we have a
> > > burst of
> > > > > > request, the system will automatically spin up more clients. We're
> > > pushing
> > > > > > 10% of our production system to that Test server to see how it will
> > > handle
> > > > > > it.
> > > > >
> > > > > To spin up another replica, Solr must copy all its index data from the
> > > > > leader replica.  Not only can this take a long time if the index is
> > > big,
> > > > > but it will put a lot of extra I/O load on the machine(s) with the
> > > > > leader roles.  So performance will actually be WORSE before it gets
> > > > > better when you spin up another replica, and if the index is big, that
> > > > > condition will persist for quite a while.  Copying the index data will
> > > > > be constrained by the speed of your network and by the speed of your
> > > > > disks.  Often the disks are slower than the network, but that is not
> > > > > always the case.
> > > > >
> > > > > Thanks,
> > > > > Shawn
> > > > >
> > >

Re: Realtime get not always returning existing data

Posted by Erick Erickson <er...@gmail.com>.
Thanks. I'll be away for the rest of the week, so won't be able to try
anything more....
On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <cu...@iq.media> wrote:
>
> In our case, we are heavily indexing in the collection while the /get
> requests are happening which is what we assumed was causing this very rare
> behavior. However, we have experienced the problem for a collection where
> the following happens in sequence with minutes in between them.
>
> 1. Document id=1 is indexed
> 2. Document successfully retrieved with /get?id=1
> 3. Document failed to be retrieved with /get?id=1
> 4. Document successfully retrieved with /get?id=1
>
> We haven't looked at the issue in a while, so I don't have the exact
> timing of that sequence on hand right now. I'll try to find an actual
> example, although I'm relatively certain it was multiple minutes in between
> each of those requests. However our autocommit (and soft commit) times are
> 60s for both collections.
>
> I think the following two are probably the biggest differences for our
> setup, besides the version difference (v6.3.0):
>
> > index to this collection, perhaps not at a high rate
> > separate the machines running solr from the one doing any querying or
> indexing
>
> The clients are on 3 hosts separate from the solr instances. The total
> number of threads that are making updates and making /get requests is
> around 120-150. About 40-50 per host. Each of our two collections gets an
> average of 500 requests per second constantly for ~5 minutes, and then the
> number slowly tapers off to essentially 0 after ~15 minutes.
>
> Every thread attempts to make the same series of requests.
>
> -- Update with "_version_=-1". If successful, no other requests are made.
> -- On 409 Conflict failure, it makes a /get request for the id
> -- On doc:null failure, the client handles the error and moves on
>
> Combining this with the previous series of /get requests, we end up with
> situations where an update fails as expected, but the subsequent /get
> request fails to retrieve the existing document:
>
> 1. Thread 1 updates id=1 successfully
> 2. Thread 2 tries to update id=1, fails (409)
> 3. Thread 2 tries to get id=1 succeeds.
>
> ...Minutes later...
>
> 4. Thread 3 tries to update id=1, fails (409)
> 5. Thread 3 tries to get id=1, fails (doc:null)
>
> ...Minutes later...
>
> 6. Thread 4 tries to update id=1, fails (409)
> 7. Thread 4 tries to get id=1 succeeds.
>
> As Steven mentioned, it happens very, very rarely. We tried to recreate it
> in a more controlled environment, but ran into the same issue that you are,
> Erick. Every simplified situation we ran produced no problems. Since it's
> not a large issue for us and happens very rarely, we stopped trying to
> recreate it.
>
>
> On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <er...@gmail.com>
> wrote:
>
> > 57 million queries later, with constant indexing going on and 9 dummy
> > collections in the mix and the main collection I'm querying having 2
> > shards, 2 replicas each, I have no errors.
> >
> > So unless the code doesn't look like it exercises any similar path,
> > I'm not sure what more I can test. "It works on my machine" ;)
> >
> > Here's my querying code, does it look like what you're seeing?
> >
> >       while (Main.allStop.get() == false) {
> >         try (SolrClient client = new HttpSolrClient.Builder()
> > //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
> >             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
> >
> >           //SolrQuery query = new SolrQuery();
> >           String lower = Integer.toString(rand.nextInt(1_000_000));
> >           SolrDocument rsp = client.getById(lower);
> >           if (rsp == null) {
> >             System.out.println("Got a null response!");
> >             Main.allStop.set(true);
> >           }
> >
> >           rsp = client.getById(lower);
> >
> >           if (rsp.get("id").equals(lower) == false) {
> >             System.out.println("Got an invalid response, looking for "
> > + lower + " got: " + rsp.get("id"));
> >             Main.allStop.set(true);
> >           }
> >           long queries = Main.eoeCounter.incrementAndGet();
> >           if ((queries % 100_000) == 0) {
> >             long seconds = (System.currentTimeMillis() - Main.start) / 1000;
> >             System.out.println("Query count: " +
> > numFormatter.format(queries) + ", rate is " +
> > numFormatter.format(queries / seconds) + " QPS");
> >           }
> >         } catch (Exception cle) {
> >           cle.printStackTrace();
> >           Main.allStop.set(true);
> >         }
> >       }
> >   }
> > On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> > <er...@gmail.com> wrote:
> > >
> > > Steve:
> > >
> > > bq.  Basically, one core had data in it that should belong to another
> > > core. Here's my question about this: Is it possible that two requests to
> > the
> > > /get API coming in at the same time would get confused and either both
> > get
> > > the same result or results get inverted?
> > >
> > > Well, that shouldn't be happening, these are all supposed to be
> > thread-safe
> > > calls.... All things are possible of course ;)
> > >
> > > If two replicas of the same shard have different documents, that could
> > account
> > > for what you're seeing, meanwhile begging the question of why that is
> > the case
> > > since it should never be true for a quiescent index. Technically there
> > _are_
> > > conditions where this is true on a very temporary basis, commits on the
> > leader
> > > and follower can trigger at different wall-clock times. Say your soft
> > commit
> > > (or hard-commit-with-opensearcher-true) is 10 seconds. It should never
> > be the
> > > case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> > was
> > > sent. This doesn't seem likely from what you've described though...
> > >
> > > Hmmmm. I guess that one other thing I can set up is to have a bunch of
> > dummy
> > > collections laying around. Currently I have only the active one, and
> > > if there's some
> > > code path whereby the RTG request goes to a replica of a different
> > > collection, my
> > > test setup wouldn't reproduce it.
> > >
> > > Currently, I'm running a 2-shard, 1 replica setup, so if there's some
> > > way that the replicas
> > > get out of sync that wouldn't show either.
> > >
> > > So I'm starting another run with these changes:
> > > > opening a new connection each query
> > > > switched so the collection I'm querying is 2x2
> > > > added some dummy collections that are empty
> > >
> > > One nit: while "core" is exactly correct, when we talk about a core
> > > that's part of a collection, we try to use "replica" to be clear we're
> > > talking about
> > > a core with some added characteristics, i.e. we're in SolrCloud-land.
> > > No big deal
> > > of course....
> > >
> > > Best,
> > > Erick
> > > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <ap...@elyograg.org>
> > wrote:
> > > >
> > > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > > @Shawn
> > > > > We're running two instances on one machine for two reasons:
> > > > > 1. The box has plenty of resources (48 cores / 256GB ram) and since
> > I was
> > > > > reading that it's not recommended to use more than 31GB of heap in
> > SOLR we
> > > > > figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > > > > instance was a good idea.
> > > >
> > > > Do you know that these Solr instances actually DO need 31 GB of heap,
> > or
> > > > are you following advice from somewhere, saying "use one quarter of
> > your
> > > > memory as the heap size"?  That advice is not in the Solr
> > documentation,
> > > > and never will be.  Figuring out the right heap size requires
> > > > experimentation.
> > > >
> > > >
> > https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > > >
> > > > How big (on disk) are each of these nine cores, and how many documents
> > > > are in each one?  Which of them is in each Solr instance?  With that
> > > > information, we can make a *guess* about how big your heap should be.
> > > > Figuring out whether the guess is correct generally requires careful
> > > > analysis of a GC log.
> > > >
> > > > > 2. We're in testing phase so we wanted a SOLR cloud configuration,
> > we will
> > > > > most likely have a much bigger deployment once going to production.
> > In prod
> > > > > right now, we currently run a six-machine Riak cluster. Riak is a
> > > > > key/value document store and has SOLR built-in for search, but we are
> > trying
> > > > > to push the key/value aspect of Riak inside SOLR. That way we would
> > have
> > > > > one less piece to worry about in our system.
> > > >
> > > > Solr is not a database.  It is not intended to be a data repository.
> > > > All of its optimizations (most of which are actually in Lucene) are
> > > > geared towards search.  While technically it can be a key-value store,
> > > > that is not what it was MADE for.  Software actually designed for that
> > > > role is going to be much better than Solr as a key-value store.
> > > >
> > > > > When I say null document, I mean the /get API returns: {doc: null}
> > > > >
> > > > > The problem is definitely not always there. We also have large
> > periods of
> > > > > time (a few hours) where we have no problems. I'm just extremely
> > hesitant on
> > > > > retrying when I get a null document because in some cases, getting a
> > null
> > > > > document is a valid outcome. Our caching layer heavily relies on this
> > for
> > > > > example. If I were to retry every null I'd pay a big penalty in
> > > > > performance.
> > > >
> > > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > > looks like returning doc:null actually is how the RTG handler says it
> > > > didn't find the document.  This seems very wrong to me, but I didn't
> > > > design it, and that response needs SOME kind of format.
> > > >
> > > > Have you done any testing to see whether the standard searching handler
> > > > (typically /select, but many other URL paths are possible) returns
> > > > results when RTG doesn't?  Do you know for these failures whether the
> > > > document has been committed or not?
> > > >
> > > > > As for your last comment, part of our testing phase is also testing
> > the
> > > > > limits. Our framework has auto-scaling built-in so if we have a
> > burst of
> > > > > request, the system will automatically spin up more clients. We're
> > pushing
> > > > > 10% of our production system to that Test server to see how it will
> > handle
> > > > > it.
> > > >
> > > > To spin up another replica, Solr must copy all its index data from the
> > > > leader replica.  Not only can this take a long time if the index is
> > big,
> > > > but it will put a lot of extra I/O load on the machine(s) with the
> > > > leader roles.  So performance will actually be WORSE before it gets
> > > > better when you spin up another replica, and if the index is big, that
> > > > condition will persist for quite a while.  Copying the index data will
> > > > be constrained by the speed of your network and by the speed of your
> > > > disks.  Often the disks are slower than the network, but that is not
> > > > always the case.
> > > >
> > > > Thanks,
> > > > Shawn
> > > >
> >

Re: Realtime get not always returning existing data

Posted by Chris Ulicny <cu...@iq.media>.
In our case, we are heavily indexing in the collection while the /get
requests are happening which is what we assumed was causing this very rare
behavior. However, we have experienced the problem for a collection where
the following happens in sequence with minutes in between them.

1. Document id=1 is indexed
2. Document successfully retrieved with /get?id=1
3. Document failed to be retrieved with /get?id=1
4. Document successfully retrieved with /get?id=1

We haven't looked at the issue in a while, so I don't have the exact
timing of that sequence on hand right now. I'll try to find an actual
example, although I'm relatively certain it was multiple minutes in between
each of those requests. However our autocommit (and soft commit) times are
60s for both collections.

I think the following two are probably the biggest differences for our
setup, besides the version difference (v6.3.0):

> index to this collection, perhaps not at a high rate
> separate the machines running solr from the one doing any querying or
indexing

The clients are on 3 hosts separate from the solr instances. The total
number of threads that are making updates and making /get requests is
around 120-150. About 40-50 per host. Each of our two collections gets an
average of 500 requests per second constantly for ~5 minutes, and then the
number slowly tapers off to essentially 0 after ~15 minutes.

Every thread attempts to make the same series of requests (a rough SolrJ
sketch of this pattern follows the list).

-- Update with "_version_=-1". If successful, no other requests are made.
-- On 409 Conflict failure, it makes a /get request for the id
-- On doc:null failure, the client handles the error and moves on
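
For reference, here is a minimal SolrJ sketch of that update-then-get pattern.
It is illustrative only: the base URL, collection, field names, and id are
placeholders, and error handling is trimmed down to just the 409 check.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;

    public class CreateIfAbsent {
      public static void main(String[] args) throws Exception {
        String id = "1";
        try (SolrClient client = new HttpSolrClient.Builder()
            .withBaseSolrUrl("http://localhost:8983/solr/placeholder_collection")
            .build()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.setField("id", id);
          doc.setField("some_field_s", "some value");
          // _version_ = -1 means "only add this doc if the id does not exist yet"
          doc.setField("_version_", -1L);
          try {
            client.add(doc);
          } catch (SolrException e) {
            // with HttpSolrClient a version conflict surfaces as a SolrException
            // subclass carrying the HTTP status code
            if (e.code() == 409) {
              // conflict: the doc already exists, so fetch it with real-time /get
              SolrDocument existing = client.getById(id);
              if (existing == null) {
                // this is the doc:null case described above
                System.out.println("409 on add, but /get returned doc:null for id " + id);
              }
            } else {
              throw e;
            }
          }
        }
      }
    }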

Combining this with the previous series of /get requests, we end up with
situations where an update fails as expected, but the subsequent /get
request fails to retrieve the existing document:

1. Thread 1 updates id=1 successfully
2. Thread 2 tries to update id=1, fails (409)
3. Thread 2 tries to get id=1 succeeds.

...Minutes later...

4. Thread 3 tries to update id=1, fails (409)
5. Thread 3 tries to get id=1, fails (doc:null)

...Minutes later...

6. Thread 4 tries to update id=1, fails (409)
7. Thread 4 tries to get id=1 succeeds.

As Steven mentioned, it happens very, very rarely. We tried to recreate it
in a more controlled environment, but ran into the same issue that you are,
Erick. Every simplified situation we ran produced no problems. Since it's
not a large issue for us and happens very rarely, we stopped trying to
recreate it.
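
For completeness, here's a rough SolrJ sketch of the kind of cross-check Shawn
suggests below: when real-time /get comes back with doc:null, immediately ask
the normal /select handler for the same id. Again, the URL and collection name
are just placeholders.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class GetVsSelect {
      public static void main(String[] args) throws Exception {
        String id = "1";
        try (SolrClient client = new HttpSolrClient.Builder()
            .withBaseSolrUrl("http://localhost:8983/solr/placeholder_collection")
            .build()) {
          SolrDocument rtg = client.getById(id);   // real-time get (/get)
          if (rtg == null) {
            // RTG said doc:null -- ask the searcher via the /select handler.
            // Note: /select only sees committed documents, so a difference
            // right after indexing is expected; a difference long after the
            // last commit is the interesting case.
            QueryResponse rsp = client.query(new SolrQuery("id:" + id));
            long found = rsp.getResults().getNumFound();
            System.out.println("/get returned doc:null, /select found "
                + found + " doc(s) for id " + id);
          }
        }
      }
    }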


On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <er...@gmail.com>
wrote:

> 57 million queries later, with constant indexing going on and 9 dummy
> collections in the mix and the main collection I'm querying having 2
> shards, 2 replicas each, I have no errors.
>
> So unless the code doesn't look like it exercises any similar path,
> I'm not sure what more I can test. "It works on my machine" ;)
>
> Here's my querying code, does it look like what you're seeing?
>
>       while (Main.allStop.get() == false) {
>         try (SolrClient client = new HttpSolrClient.Builder()
> //("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")) {
>             .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {
>
>           //SolrQuery query = new SolrQuery();
>           String lower = Integer.toString(rand.nextInt(1_000_000));
>           SolrDocument rsp = client.getById(lower);
>           if (rsp == null) {
>             System.out.println("Got a null response!");
>             Main.allStop.set(true);
>           }
>
>           rsp = client.getById(lower);
>
>           if (rsp.get("id").equals(lower) == false) {
>             System.out.println("Got an invalid response, looking for "
> + lower + " got: " + rsp.get("id"));
>             Main.allStop.set(true);
>           }
>           long queries = Main.eoeCounter.incrementAndGet();
>           if ((queries % 100_000) == 0) {
>           long seconds = (System.currentTimeMillis() - Main.start) / 1000;
>             System.out.println("Query count: " +
> numFormatter.format(queries) + ", rate is " +
> numFormatter.format(queries / seconds) + " QPS");
>           }
>         } catch (Exception cle) {
>           cle.printStackTrace();
>           Main.allStop.set(true);
>         }
>       }
>   }
> On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson
> <er...@gmail.com> wrote:
> >
> > Steve:
> >
> > bq.  Basically, one core had data in it that should belong to another
> > core. Here's my question about this: Is it possible that two requests to
> the
> > /get API coming in at the same time would get confused and either both
> get
> > the same result or results get inverted?
> >
> > Well, that shouldn't be happening, these are all supposed to be
> thread-safe
> > calls.... All things are possible of course ;)
> >
> > If two replicas of the same shard have different documents, that could
> account
> > for what you're seeing, meanwhile begging the question of why that is
> the case
> > since it should never be true for a quiescent index. Technically there
> _are_
> > conditions where this is true on a very temporary basis, commits on the
> leader
> > and follower can trigger at different wall-clock times. Say your soft
> commit
> > (or hard-commit-with-opensearcher-true) is 10 seconds. It should never
> be the
> > case that s1r1 and s1r2 are out of sync 10 seconds after the last update
> was
> > sent. This doesn't seem likely from what you've described though...
> >
> > Hmmmm. I guess that one other thing I can set up is to have a bunch of
> dummy
> > collections laying around. Currently I have only the active one, and
> > if there's some
> > code path whereby the RTG request goes to a replica of a different
> > collection, my
> > test setup wouldn't reproduce it.
> >
> > Currently, I'm running a 2-shard, 1 replica setup, so if there's some
> > way that the replicas
> > get out of sync that wouldn't show either.
> >
> > So I'm starting another run with these changes:
> > > opening a new connection each query
> > > switched so the collection I'm querying is 2x2
> > > added some dummy collections that are empty
> >
> > One nit: while "core" is exactly correct, when we talk about a core
> > that's part of a collection, we try to use "replica" to be clear we're
> > talking about
> > a core with some added characteristics, i.e. we're in SolrCloud-land.
> > No big deal
> > of course....
> >
> > Best,
> > Erick
> > On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <ap...@elyograg.org>
> wrote:
> > >
> > > On 9/28/2018 8:11 PM, sgaron cse wrote:
> > > > @Shawn
> > > > We're running two instances on one machine for two reasons:
> > > > 1. The box has plenty of resources (48 cores / 256GB ram) and since
> I was
> > > > reading that it's not recommended to use more than 31GB of heap in
> SOLR we
> > > > figured 96 GB for keeping index data in OS cache + 31 GB of heap per
> > > > instance was a good idea.
> > >
> > > Do you know that these Solr instances actually DO need 31 GB of heap,
> or
> > > are you following advice from somewhere, saying "use one quarter of
> your
> > > memory as the heap size"?  That advice is not in the Solr
> documentation,
> > > and never will be.  Figuring out the right heap size requires
> > > experimentation.
> > >
> > >
> https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F
> > >
> > > How big (on disk) are each of these nine cores, and how many documents
> > > are in each one?  Which of them is in each Solr instance?  With that
> > > information, we can make a *guess* about how big your heap should be.
> > > Figuring out whether the guess is correct generally requires careful
> > > analysis of a GC log.
> > >
> > > > 2. We're in testing phase so we wanted a SOLR cloud configuration,
> we will
> > > > most likely have a much bigger deployment once going to production.
> In prod
> > > > right now, we currently run a six-machine Riak cluster. Riak is a
> > > > key/value document store and has SOLR built-in for search, but we are
> trying
> > > > to push the key/value aspect of Riak inside SOLR. That way we would
> have
> > > > one less piece to worry about in our system.
> > >
> > > Solr is not a database.  It is not intended to be a data repository.
> > > All of its optimizations (most of which are actually in Lucene) are
> > > geared towards search.  While technically it can be a key-value store,
> > > that is not what it was MADE for.  Software actually designed for that
> > > role is going to be much better than Solr as a key-value store.
> > >
> > > > When I say null document, I mean the /get API returns: {doc: null}
> > > >
> > > > The problem is definitely not always there. We also have large
> periods of
> > > > time (a few hours) where we have no problems. I'm just extremely
> hesitant on
> > > > retrying when I get a null document because in some cases, getting a
> null
> > > > document is a valid outcome. Our caching layer heavily relies on this
> for
> > > > example. If I were to retry every null I'd pay a big penalty in
> > > > performance.
> > >
> > > I've just done a little test with the 7.5.0 techproducts example.  It
> > > looks like returning doc:null actually is how the RTG handler says it
> > > didn't find the document.  This seems very wrong to me, but I didn't
> > > design it, and that response needs SOME kind of format.
> > >
> > > Have you done any testing to see whether the standard searching handler
> > > (typically /select, but many other URL paths are possible) returns
> > > results when RTG doesn't?  Do you know for these failures whether the
> > > document has been committed or not?
> > >
> > > > As for your last comment, part of our testing phase is also testing
> the
> > > > limits. Our framework has auto-scaling built-in so if we have a
> burst of
> > > > request, the system will automatically spin up more clients. We're
> pushing
> > > > 10% of our production system to that Test server to see how it will
> handle
> > > > it.
> > >
> > > To spin up another replica, Solr must copy all its index data from the
> > > leader replica.  Not only can this take a long time if the index is
> big,
> > > but it will put a lot of extra I/O load on the machine(s) with the
> > > leader roles.  So performance will actually be WORSE before it gets
> > > better when you spin up another replica, and if the index is big, that
> > > condition will persist for quite a while.  Copying the index data will
> > > be constrained by the speed of your network and by the speed of your
> > > disks.  Often the disks are slower than the network, but that is not
> > > always the case.
> > >
> > > Thanks,
> > > Shawn
> > >
>