You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Edwin Lee <ed...@gmail.com> on 2015/06/11 18:10:31 UTC

loadOnStartup & transientCoreSize & BUG Report on CoreContainer.java

*Hi Members of Lucene/Solr Developer Team,*


  I am new to Solr source code. If I am wrong about it, I am sorry. But I
hope it could attract your attention.

  I am 100% sure I found a bug on solr code in the CoreContainer.java

The Bug we encountered is like it:

-------------------------------------------------------------------------------------------------------------------------

   1. transientSize=2
   2. create 6 transient cores, and all of them are loadOnStarup=True
   3.

   Restart solr.

   Then, you will find 2 cores loaded, let say core5, core6. They work fine.
   But you cannot access core1, core2, core3, core4. There is no failure,
   no error message. But you just need to keep waiting… Forever.

   What make thing even funnier, you open the JConsole, you will find 6
   core’s metrics are there — In a normal situation, there should only be two
   cores metrics.

--------------------------------------------------------------------------------------------------------------------------

Reason

Read the log, you will find core1, core2, core3, core4 are closing —- But
actually they are not.
In Solr code, you will find core1-4 are just remove from the list in core
container. But the actual closing operation are not done. Because the
closer thread are hang!
I will point out the bug relative to that bug in the next email.
How to bypass it

There may be plenty ways to bypass this bug. And I am listing the 1 way:


   1. make sure there is at lease one transient core(let call it
   “SaviorCore”) who is loadOnStarup=False
   2. When the Situation described above happen, send luke/mbean/any
   request to load “SaviorCore”. Then you will find core1, core2, core3, core4
   are actually closed in the JConsole.


BUG Description

I am now point out the bug on Solr code.

When the transient core is closing. It will do three things:

   1. Remove the core from the cache, which they do it right.
   2. Put the core in the queue, which they do it right.
   3.

   Send the signal to the CloserThread() to actually close it, which they
   *fail* it.

   It is a classical producer/consumer scenario. The authors are too
   confident to ignore the BlockingQueue in the Java API designed by the Great
   Doug Lea.

Code Relevant to the Solr Bug

There is a CloserThread in CoreContainer.java. The hanging problem will be
the modify lock and the wait().

while (! container.isShutDown()) {
      synchronized (solrCores.getModifyLock()) { // need this so we
can wait and be awoken.
        try {
          solrCores.getModifyLock().wait();
        } catch (InterruptedException e) {
          // Well, if we've been told to stop, we will. Otherwise,
continue on and check to see if there are
          // any cores to close.
        }
      }
      for (SolrCore removeMe = solrCores.getCoreToClose();
           removeMe != null && !container.isShutDown();
           removeMe = solrCores.getCoreToClose()) {
        try {
          removeMe.close();
        } finally {
          solrCores.removeFromPendingOps(removeMe.getName());
        }
      }
    }

This is way how SolrCores.java remove the eldest transient core in the
cache.

  protected void allo
cateLazyCores(final int cacheSize, final SolrResourceLoader loader) {
    if (cacheSize != Integer.MAX_VALUE) {
      CoreContainer.log.info("Allocating transient cache for {}
transient cores", cacheSize);
      transientCores = new LinkedHashMap<String, SolrCore>(cacheSize,
0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String,
SolrCore> eldest) {
          if (size() > cacheSize) {
            synchronized (modifyLock) {
              SolrCore coreToClose = eldest.getValue();
              logger.info("Closing transient core [{}]", coreToClose.getName());
              pendingCloses.add(coreToClose); // Essentially just
queue this core up for closing.
              modifyLock.notifyAll(); // Wakes up closer thread too
            }
            return true;
          }
          return false;
        }
      };
    }
  }

The notifyAll() signal may be loss if the CloseThread busy closing other
thread. So when CloserThread come back to wait(), it doesn’t know there is
another core awaiting closing.
My attempt

I modify the solr code to confirm my conclusion. I give the wait a time out
by

     log.info("Edwin's Log: We are trying to fix the startup problem");
     solrCores.getModifyLock().wait(*1000*);
     log.info("end waiting.");

I build the jar and test it. *Situation gone, problem solved.*


Regards,

Junhao Li

20150612

Re: loadOnStartup & transientCoreSize & BUG Report on CoreContainer.java

Posted by Erick Erickson <er...@gmail.com>.

Good catch!

I raised a JIRA about this, please add a comment with these details.
See: https://issues.apache.org/jira/browse/SOLR-7667. Anyone can raise
a JIRA or comment, you just have to create a user ID. That said,
you've done the correct thing by raising it here first.

If possible, could you include a patch? You're right, this code really
operates on the assumption that loadOnStartup < transient cache size
and the events that trigger the cleanup depend on exceeding the cache
size later....  Besides fixing the code, the hack is to just "don't do
that", i.e. have fewer cores loaded on startup than the cache size.
that said, This needs to be fixed of course, with core discovery it's
not always easy to _know_ how many transient cores there are.

A test case would be pretty cool here, don't have time right now to
sketch one out but there are other tests in TestLazyCores.java that we
could use as a model.

Hmmm, I wonder if another way to approach it would be to simply stop
loading transient cores at startup when the transient cache size was
reached, perhaps with a WARN level message? Anyway, we can hash this
kind of thing out on the JIRA.

http://wiki.apache.org/solr/HowToContribute

Thanks!
Erick

On Thu, Jun 11, 2015 at 9:10 AM, Edwin Lee <ed...@gmail.com> wrote:
> Hi Members of Lucene/Solr Developer Team,
>
>
>   I am new to Solr source code. If I am wrong about it, I am sorry. But I
> hope it could attract your attention.
>
>   I am 100% sure I found a bug on solr code in the CoreContainer.java
>
> The Bug we encountered is like it:
>
> -------------------------------------------------------------------------------------------------------------------------
>
> transientSize=2
> create 6 transient cores, and all of them are loadOnStarup=True
>
> Restart solr.
>
> Then, you will find 2 cores loaded, let say core5, core6. They work fine.
> But you cannot access core1, core2, core3, core4. There is no failure, no
> error message. But you just need to keep waiting… Forever.
>
> What make thing even funnier, you open the JConsole, you will find 6 core’s
> metrics are there — In a normal situation, there should only be two cores
> metrics.
>
> --------------------------------------------------------------------------------------------------------------------------
>
> Reason
>
> Read the log, you will find core1, core2, core3, core4 are closing —- But
> actually they are not.
> In Solr code, you will find core1-4 are just remove from the list in core
> container. But the actual closing operation are not done. Because the closer
> thread are hang!
> I will point out the bug relative to that bug in the next email.
>
> How to bypass it
>
> There may be plenty ways to bypass this bug. And I am listing the 1 way:
>
> make sure there is at lease one transient core(let call it “SaviorCore”) who
> is loadOnStarup=False
> When the Situation described above happen, send luke/mbean/any request to
> load “SaviorCore”. Then you will find core1, core2, core3, core4 are
> actually closed in the JConsole.
>
>
> BUG Description
>
> I am now point out the bug on Solr code.
>
> When the transient core is closing. It will do three things:
>
> Remove the core from the cache, which they do it right.
> Put the core in the queue, which they do it right.
>
> Send the signal to the CloserThread() to actually close it, which they fail
> it.
>
> It is a classical producer/consumer scenario. The authors are too confident
> to ignore the BlockingQueue in the Java API designed by the Great Doug Lea.
>
> Code Relevant to the Solr Bug
>
> There is a CloserThread in CoreContainer.java. The hanging problem will be
> the modify lock and the wait().
>
> while (! container.isShutDown()) {
>       synchronized (solrCores.getModifyLock()) { // need this so we can wait
> and be awoken.
>         try {
>           solrCores.getModifyLock().wait();
>         } catch (InterruptedException e) {
>           // Well, if we've been told to stop, we will. Otherwise, continue
> on and check to see if there are
>           // any cores to close.
>         }
>       }
>       for (SolrCore removeMe = solrCores.getCoreToClose();
>            removeMe != null && !container.isShutDown();
>            removeMe = solrCores.getCoreToClose()) {
>         try {
>           removeMe.close();
>         } finally {
>           solrCores.removeFromPendingOps(removeMe.getName());
>         }
>       }
>     }
>
> This is way how SolrCores.java remove the eldest transient core in the
> cache.
>
>   protected void allo
> cateLazyCores(final int cacheSize, final SolrResourceLoader loader) {
>     if (cacheSize != Integer.MAX_VALUE) {
>       CoreContainer.log.info("Allocating transient cache for {} transient
> cores", cacheSize);
>       transientCores = new LinkedHashMap<String, SolrCore>(cacheSize, 0.75f,
> true) {
>         @Override
>         protected boolean removeEldestEntry(Map.Entry<String, SolrCore>
> eldest) {
>           if (size() > cacheSize) {
>             synchronized (modifyLock) {
>               SolrCore coreToClose = eldest.getValue();
>               logger.info("Closing transient core [{}]",
> coreToClose.getName());
>               pendingCloses.add(coreToClose); // Essentially just queue this
> core up for closing.
>               modifyLock.notifyAll(); // Wakes up closer thread too
>             }
>             return true;
>           }
>           return false;
>         }
>       };
>     }
>   }
>
> The notifyAll() signal may be loss if the CloseThread busy closing other
> thread. So when CloserThread come back to wait(), it doesn’t know there is
> another core awaiting closing.
>
> My attempt
>
> I modify the solr code to confirm my conclusion. I give the wait a time out
> by
>
>      log.info("Edwin's Log: We are trying to fix the startup problem");
>      solrCores.getModifyLock().wait(1000);
>      log.info("end waiting.");
>
> I build the jar and test it. Situation gone, problem solved.
>
>
> Regards,
>
> Junhao Li
>
> 20150612

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org