Posted to solr-user@lucene.apache.org by Joe Obernberger <jo...@gmail.com> on 2017/12/04 17:54:12 UTC

Re: Recovery Issue - Solr 6.6.1 and HDFS

Hi All - this same problem happened again, and I think I partially 
understand what is going on.  What I still don't know is what caused any 
of the replicas to go into full recovery in the first place, but once 
they do, they saturate the network interfaces on the servers in both 
directions.  It appears that when a Solr replica needs to recover, it 
pulls all of the data from the leader.  With HDFS, the data path from 
the leader's point of view is:

HDFS --> Solr Leader Process --> Network --> Replica Solr Process --> HDFS

Do I have this correct?  That poor network in the middle becomes a 
bottleneck and causes other replicas to go into recovery, which causes 
still more network traffic.  Perhaps moving to TLOG replicas in 7.1 would 
work better with HDFS?  Would it be possible for the leader to tell the 
replica to read the data straight from HDFS, instead of copying it from 
one Solr process to another?  HDFS could spread that load across the 
cluster, since each block has 3x replicas.  Perhaps there is a better 
way to handle replicas on a shared file system.

Our current plan to fix the issue is to move to Solr 7.1.0 and use TLOG 
replicas.  Good idea?  Thank you!
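
(A minimal sketch of what that plan would look like, assuming the Solr
7.1 Collections API; the collection name, config name, and counts below
are placeholders rather than our real layout:)

# Create a collection whose replicas are all TLOG type: no NRT
# replicas, two TLOG replicas per shard.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test_tlog&collection.configName=test_conf&numShards=2&nrtReplicas=0&tlogReplicas=2"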

-Joe


On 11/22/2017 8:17 PM, Erick Erickson wrote:
> Hmm. This is quite possible. Any time things take "too long" it can be
> a problem. For instance, if the leader sends docs to a replica and
> the request times out, the leader throws the follower into "Leader
> Initiated Recovery". The smoking gun here is that there are no errors
> on the follower, just the notification that the leader put it into
> recovery.
>
> There are other variations on the theme; it all boils down to this: when
> communications fall apart, replicas go into recovery.
>
> Best,
> Erick
>
> On Wed, Nov 22, 2017 at 11:02 AM, Joe Obernberger
> <jo...@gmail.com> wrote:
>> Hi Shawn - thank you for your reply.  The index is 29.9TBytes as reported
>> by:
>> hadoop fs -du -s -h /solr6.6.0
>> 29.9 T  89.9 T  /solr6.6.0
>>
>> The 89.9TBytes is due to HDFS having 3x replication.  There are about 1.1
>> billion documents indexed, and we index about 2.5 million documents per
>> day.  Assuming an even distribution, each node is handling about 680GBytes
>> of index, so our cache covers only about 1.4% of it.  Perhaps 'relatively
>> small block cache' was an understatement!  This is why we split the largest
>> collection into two: one holds data going back 30 days, and the other holds
>> all the data.  Most of our searches do not go back further than 30 days.
>> The 30 day index is 2.6TBytes total.  I don't know how the HDFS block cache
>> splits between collections, but the 30 day index performs acceptably for
>> our specific application.
>>
>> If we wanted to cache 50% of the index, each of our 45 nodes would need a
>> block cache of about 350GBytes.  I'm accepting offers of DIMMs!
>>
>> What I believe caused our 'recovery, fail, retry' loop was that one of
>> our servers died.  This caused HDFS to start replicating blocks across the
>> cluster and produced a lot of network activity.  When this happened, I
>> believe there was high network contention for specific nodes in the cluster
>> and their network interfaces became pegged and requests for HDFS blocks
>> timed out.  When that happened, SolrCloud went into recovery which caused
>> more network traffic.  Fun stuff.
>>
>> -Joe
>>
>>
>> On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>>> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>>> Right now, we have a relatively small block cache due to the
>>>> requirements that the servers run other software.  We tried to find
>>>> the best balance between block cache size, and RAM for programs, while
>>>> still giving enough for local FS cache.  This came out to be 84 128M
>>>> blocks - or about 10G for the cache per node (45 nodes total).
>>> How much data is being handled on a server with 10GB allocated for
>>> caching HDFS data?
>>>
>>> The first message in this thread says the index size is 31TB, which is
>>> *enormous*.  You have also said that the index takes 93TB of disk
>>> space.  If the data is distributed somewhat evenly, then the answer to
>>> my question would be that each of those 45 Solr servers would be
>>> handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.
>>>
>>> When index data that Solr needs to access for an operation is not in the
>>> cache and Solr must actually wait for disk and/or network I/O, the
>>> resulting performance usually isn't very good.  In most cases you don't
>>> need to have enough memory to fully cache the index data ... but less
>>> than half a percent is not going to be enough.
>>>
>>> Thanks,
>>> Shawn
>>>
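
(For reference on the block cache sizing discussed above: a sketch of the
HdfsDirectoryFactory settings in solrconfig.xml that control it.  The slab
count mirrors the 84 x 128MB figure quoted above; the HDFS path is a
placeholder.  Note that when solr.hdfs.blockcache.global is true, a single
off-heap cache is shared by all cores on a node, which is how the cache
"splits between collections".)

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://nameservice1/solr6.6.0</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <!-- one shared cache per node instead of one cache per core -->
  <bool name="solr.hdfs.blockcache.global">true</bool>
  <!-- 84 slabs x 128MB = ~10.5GB off-heap; the JVM needs a matching
       -XX:MaxDirectMemorySize setting -->
  <int name="solr.hdfs.blockcache.slab.count">84</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
</directoryFactory>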


Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Erick Erickson <er...@gmail.com>.
Right, look at autoAddReplicas, which is designed to do this
automagically (but I confess I don't have much experience with it).

What that doesn't handle is capacity: if you need to increase QPS,
you still need to add replicas.

Depends on your needs of course.

Best,
Erick
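
(A sketch of how autoAddReplicas is enabled, assuming the Collections API
of Solr 6.x/7.x with an HDFS-backed collection; the names below are
placeholders.  With shared storage, Solr can then re-create the replicas
of a lost node elsewhere:)

# Enable autoAddReplicas at collection-creation time.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=test_haa&collection.configName=test_conf&numShards=2&replicationFactor=1&autoAddReplicas=true"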


Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Joe Obernberger <jo...@gmail.com>.
Thank you Erick.  Perhaps it makes more sense to not use any replicas 
when using HDFS for storage (and having a very large index), since the 
data is already replicated.  It seems to me that if there were no 
replicas and a leader went down, another node could take over by just 
going through the regular startup cycle (replaying logs, etc.), similar 
to the autoAddReplicas capability.  Not sure how one would handle a node 
coming back.

I think there could be a lot to be gained by taking advantage of a 
global file system with Solr.  Would be fun!

-Joe




Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Erick Erickson <er...@gmail.com>.
The complications are things like this:

Say an update comes in and gets written to the tlog and indexed but
not committed. Now the leader goes down. How does the replica that
takes over leadership
1> understand the current state of the index, i.e. that there are
uncommitted updates
2> replay the updates from the tlog correctly?

Not to mention that during leader election one of the read-only
replicas must become a read/write replica when it takes over
leadership.

The current mechanism does, indeed, use Zk to elect a new leader, the
devil is in the details of how in-flight updates get handled properly.

There's no a-priori reason all those details couldn't be worked out,
it's just gnarly. Nobody has yet stepped up to commit the
time/resources to work them all out. My guess is that the cost of
having a bunch more disks is cheaper than the engineering time it
would take to change this. The standard answer is "patches welcome"
;).

Best,
Erick


Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Hendrik Haddorp <he...@gmx.net>.
Ok, thanks for the answer.  The leader election and update notification 
sound like they should work using ZooKeeper (the leader election recipe 
and a normal watch), but I guess there are some details that make things 
more complicated.



Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Erick Erickson <er...@gmail.com>.
This has been bandied about on a number of occasions; it boils down to
nobody having stepped up to make it happen. It turns out there are a
number of tricky issues:

- how does leadership change if the leader goes down?
- the raw complexity of getting it right; getting it wrong corrupts indexes
- how do you resolve leadership in the first place, so that only the leader
  writes to the index?
- how would that affect performance if N replicas were autowarming at the
  same time, thus reading from HDFS?
- how do the read-only replicas know to open a new searcher?
- I'm sure there are a bunch more.

So this is one of those things that everyone agrees is interesting,
but nobody is willing to code, and it's not actually clear that it
makes sense in the Solr context. It'd be a pity to put in all the work
and then discover that performance issues prohibit using it.

If you _guarantee_ that the index doesn't change, there's the
NoLockFactory you could specify. That would allow you to share a
common index; woe be unto you if you start updating the index, though.

Best,
Erick
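
(For reference, a sketch of what selecting NoLockFactory looks like: in
solrconfig.xml, a lockType of "none" maps to NoLockFactory.  This assumes
the strictly read-only, shared index described above; anything that writes
under this setting risks index corruption:)

<indexConfig>
  <!-- "none" selects NoLockFactory; safe only if nothing ever writes
       to this shared index -->
  <lockType>none</lockType>
</indexConfig>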


Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Hendrik Haddorp <he...@gmx.net>.
Hi,

for the HDFS case, wouldn't it be nice if there was a mode in which the 
replicas just read the same index files as the leader?  After all, the 
data is already on a shared, readable file system, so why would one even 
need to replicate the transaction log files?

regards,
Hendrik

On 08.12.2017 21:07, Erick Erickson wrote:
> bq: Will TLOG replicas use less network bandwidth?
>
> No, probably more bandwidth. TLOG replicas work like this:
> 1> the raw docs are forwarded
> 2> the old-style master/slave replication is used
>
> So what you do save is CPU processing on the TLOG replica in exchange
> for increased bandwidth.
>
> Since the only thing forwarded in NRT replicas (outside of recovery)
> is the raw documents, I expect that TLOG replicas would _increase_
> network usage. The deal is that TLOG replicas can take over leadership
> if the leader goes down so they must have an
> up-to-date-after-last-index-sync set of tlogs.
>
> At least that's my current understanding...
>
> Best,
> Erick
>
> On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
> <jo...@gmail.com> wrote:
>> Anyone have any thoughts on this?  Will TLOG replicas use less network
>> bandwidth?
>>
>> -Joe
>>
>>>>> On 11/22/2017 11:44 AM, Shawn Heisey wrote:
>>>>>> On 11/22/2017 6:44 AM, Joe Obernberger wrote:
>>>>>>> Right now, we have a relatively small block cache due to the
>>>>>>> requirements that the servers run other software.  We tried to find
>>>>>>> the best balance between block cache size, and RAM for programs, while
>>>>>>> still giving enough for local FS cache.  This came out to be 84 128M
>>>>>>> blocks - or about 10G for the cache per node (45 nodes total).
>>>>>> How much data is being handled on a server with 10GB allocated for
>>>>>> caching HDFS data?
>>>>>>
>>>>>> The first message in this thread says the index size is 31TB, which is
>>>>>> *enormous*.  You have also said that the index takes 93TB of disk
>>>>>> space.  If the data is distributed somewhat evenly, then the answer to
>>>>>> my question would be that each of those 45 Solr servers would be
>>>>>> handling over 2TB of data.  A 10GB cache is *nothing* compared to 2TB.
>>>>>>
>>>>>> When index data that Solr needs to access for an operation is not in
>>>>>> the
>>>>>> cache and Solr must actually wait for disk and/or network I/O, the
>>>>>> resulting performance usually isn't very good.  In most cases you don't
>>>>>> need to have enough memory to fully cache the index data ... but less
>>>>>> than half a percent is not going to be enough.
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn
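
To make that ratio concrete, a quick back-of-the-envelope from the numbers 
above (84 slabs x 128MB of cache against roughly 680GB of index per node):

echo 'scale=2; 84 * 128 / 1024 * 100 / 680' | bc
# -> 1.54, i.e. only about 1.5% of each node's index fits in the block
# cache -- the same ballpark as the ~1.4% Joe computed.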


Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Erick Erickson <er...@gmail.com>.
bq: Will TLOG replicas use less network bandwidth?

No, probably more bandwidth. TLOG replicas work like this:
1> the raw docs are forwarded
2> the old-style master/slave replication is used

So what you do save is CPU processing on the TLOG replica in exchange
for increased bandwidth.

Since the only thing forwarded in NRT replicas (outside of recovery)
is the raw documents, I expect that TLOG replicas would _increase_
network usage. The deal is that TLOG replicas can take over leadership
if the leader goes down so they must have an
up-to-date-after-last-index-sync set of tlogs.

At least that's my current understanding...
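
If you want to experiment, 7.x lets you ask for TLOG replicas when the
collection is created via the Collections API, something like this
(collection name, host, and counts are just placeholders):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=test&numShards=2&tlogReplicas=2'

You can also grow an existing collection with ADDREPLICA and type=TLOG.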

Best,
Erick

On Fri, Dec 8, 2017 at 12:01 PM, Joe Obernberger
<jo...@gmail.com> wrote:
> Anyone have any thoughts on this?  Will TLOG replicas use less network
> bandwidth?
>
> -Joe

Re: Recovery Issue - Solr 6.6.1 and HDFS

Posted by Joe Obernberger <jo...@gmail.com>.
Anyone have any thoughts on this?  Will TLOG replicas use less network 
bandwidth?

-Joe

