Posted to dev@fluo.apache.org by "Meier, Caleb" <Ca...@parsons.com> on 2017/10/26 15:34:37 UTC

fluo accumulo table tablet servers not keeping up with application

Hello Fluo Devs,

We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive -- we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TabletServer is to cut down on the number of scans.

One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in-memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had to create a solution to share state between Observers that doesn’t rely on the Fluo table?
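
To make the static-cache option concrete, here is a rough sketch of the kind of thing we have in mind (the class and method names below are made up, and it assumes Guava's cache plus the String-valued gets on Fluo's SnapshotBase; the expiry is what would let newly registered queries show up without restarting the application):

    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import org.apache.fluo.api.client.SnapshotBase;
    import org.apache.fluo.api.data.Column;

    // Hypothetical per-JVM metadata cache shared by every Observer in a worker.
    public class QueryMetadataCache {

      private static final Cache<String, String> CACHE = CacheBuilder.newBuilder()
          .maximumSize(10_000)
          // Bounds how stale the metadata can get, so new queries are eventually seen.
          .expireAfterWrite(60, TimeUnit.SECONDS)
          .build();

      // snap can be the TransactionBase handed to an Observer.  Assumes the metadata
      // cell always exists, since a Guava cache cannot store null values.
      public static String getMetadata(SnapshotBase snap, String row, Column col) throws Exception {
        return CACHE.get(row + ":" + col, () -> snap.gets(row, col));
      }
    }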

In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.

Sincerely,
Caleb Meier

Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com


Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s), but our scan rate is off the charts - approaching 6 million entries/s!  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.  In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>
> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?

If you compact the table, what impact does this have on the scan rate?
 I ask because Fluo does some garbage collection in compaction
iterators.

Also for write performance, the write ahead log settings can have huge
performance implications.  What version of Accumulo are you running?
For Accumulo 1.7 and 1.8 you can run the following commands to use
flush instead of sync.

config -s table.durability=flush
config -t accumulo.metadata -d table.durability
config -t accumulo.root -d table.durability

These settings are from this article.

http://accumulo.apache.org/blog/2016/11/02/durability-performance.html
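
To double check what actually took effect, this should show the durability property at each scope (if I am remembering the config command's filter flag correctly):

config -f table.durability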

>
> Thanks,
> Caleb
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 1:39 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with application
>
> Caleb
>
> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>
>  * Write ahead log sync settings (this can have huge performance implications)
>  * Files per tablet
>  * Tablet server cache sizes
>  * Accumulo data block sizes
>  * Tablet server client thread pool size
>
> For Fluo the following tune-able parameters are important.
>
>  * Commit memory (this determines how many transactions are held in memory while committing)
>  * Threads running transactions
>
> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is.  For example, if one tserver is at 100% CPU while others are idle, this could be caused by uneven data access patterns.
>
> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>
> Are the 6 lookups in the transaction done sequentially?
>
> Keith
>
> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hello Fluo Devs,
>>
>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive -- we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TabletServer is to cut down on the number of scans.
>>
>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in-memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had to create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>
>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>
>> Sincerely,
>> Caleb Meier
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>

RE: fluo accumulo table tablet servers not keeping up with application

Posted by "Meier, Caleb" <Ca...@parsons.com>.
> What did you see in the list scans output that enabled you to pinpoint a particular observer?

Every time the Accumulo Tablet servers had a heavy workload, I saw that the SnapshotIterator was iterating over the same column.  As that column only gets scanned at one point in our application, that allowed me to pinpoint what appears to be the main source of our problems.  Unfortunately, I don't think the problem can be solved with caching alone, as each cache is confined to a specific worker.  Essentially, the scan was checking to see if new queries had been registered or not before attempting to match new data to queries.  I think that some sort of metadata observer that watches for new queries and sets a flag should help.  Then those scans will only be executed after checking to see if the "new metadata cell" has been set to true.
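
A rough sketch of the check we have in mind (the row and column names are made up, and the metadata observer that actually sets and clears the flag is not shown):

    import org.apache.fluo.api.client.TransactionBase;
    import org.apache.fluo.api.data.Column;

    public class NewQueryFlag {

      // Hypothetical cell that a metadata observer sets to "true" whenever a
      // user registers a new query.
      private static final String FLAG_ROW = "queryMetadata:flag";
      private static final Column FLAG_COL = new Column("meta", "newQueries");

      // Called at the top of an observer's process() method; the expensive
      // query-registration scan only runs when this returns true.
      public static boolean newQueriesRegistered(TransactionBase tx) {
        return "true".equals(tx.gets(FLAG_ROW, FLAG_COL));
      }
    }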


Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Keith Turner [mailto:keith@deenlo.com] 
Sent: Wednesday, November 08, 2017 5:49 PM
To: fluo-dev <de...@fluo.apache.org>
Subject: Re: fluo accumulo table tablet servers not keeping up with application

On Wed, Nov 8, 2017 at 4:49 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> I finally got around to trying out some of your suggestions.  The first thing I tried was applying some variation of your RowHasher recipe, and we noticed an immediate improvement.  However, we are still running into issues with our Tablet servers bogging down.  Doing things like flushing and compacting certainly buys us time, but there appears to be a diminishing margin of return between each compaction/flush cycle.  That is, after every compaction, it seems like the outstanding notifications increase at a slightly faster oscillatory rate than in the previous compaction cycle.  I ran the experiment you suggested -- I counted the number of deleted notifications and outstanding notifications, ran a flush, counted again, ran a compaction, and counted again.  The numbers are as follows:
>
> Unprocessed: 679
> Deleted: 105,745
>
> Flush
>
> Unprocessed: 352
> Deleted: 4395
>
> Compaction
>
> Unprocessed: 81
> Deleted: 2663
>
> Obviously flushing and compacting are extremely beneficial.  Is there an outstanding recipe to check for the number of deleted/unprocessed notifications and then flush/compact based on some sort of threshold on those notifications?  Is there a way to configure the garbage collector to be more active?

You could do two things.  Lower the in-memory map size for Accumulo;
this will cause Accumulo to buffer less and lead to more frequent flushes, which run the Fluo GC iterator.  For compactions, you can adjust the compaction ratio lower, which will cause more frequent compactions.
Lowering the compaction ratio also results in fewer files per tablet, which is good for random seek performance.  Could try setting it to 1.5, 1.75, or 2; lower is better until compactions are too frequent :)
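
For example, something along these lines in the Accumulo shell (the Fluo table name is whatever your application uses, and if I remember right the in-memory map size change may need a tablet server restart to take effect):

config -s tserver.memory.maps.max=256M
config -t <fluo table> -s table.compaction.major.ratio=2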

>
> Finally, the piece of advice that you gave me which may end up providing the biggest breakthrough is profiling the iterators on Accumulo using listscans.  It appears that most of the scanning (at least according to my random sampling) is happening within one of our observers.  I think that we can cache most of those lookups, so I'm currently optimistic that we can put a serious dent in the scan traffic that we're seeing.  Thanks so much for your help in debugging this issue.  I'll let you know if caching solves our problems.

I have found using the watch command to eyeball something to be invaluable over the years.  A technique I learned from Eric Newton.
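
For example, something like (user, password, and interval are just placeholders):

watch -n 10 "accumulo shell -u root -p secret -e 'listscans -np'"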

What did you see in the list scans output that enabled you to pinpoint a particular observer?

>
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Tuesday, October 31, 2017 5:14 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with 
> application
>
> On Tue, Oct 31, 2017 at 2:22 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> Just following up on your last message.  After looking at the worker ScanTask logs, it seems like the workers are conducting scans as frequently as the min sleep time permits.  That is, if the min sleep time is set to 10s, a ScanTask is being executed every 10s.  In addition, running the Fluo wait command indicates that the number of outstanding notifications steadily increases or is held constant (depending on the number of workers).  Based on your comments below, it seems like the workers should be scanning at a lower rate given that the notification work queue is constantly increasing in size.  Another thing that we tried was reducing the number of workers and increasing the min sleep time.  This lowered the scan burden on the tablet server, but unsurprisingly our processing rate plummeted.  We also tried lowering the ingest rate for a fixed number of workers (lowering the notification rate for each worker thread).  While it took longer for the TabletServer to become saturated, it still became overwhelmed.
>>
>> In general, for the queries that we are benchmarking, our notification:data ratio is about 7:1 (i.e. each piece of ingested data generates about 7 notifications on the way to being entirely processed).  I think that this is our primary culprit, but I think that our application specific scans are also part of the problem (I'm still in the process of trying to determine what portion of the scans that we are seeing is specific to our observers and what portion is specific to notification discovery - any suggestions here would be appreciated).  One reason that I think notification discovery is the culprit is that we implemented an in memory cache for the metadata, and that didn't seem to affect the scan rate too much (metadata seeks constitute about 30% of our seeks/scans).
>>
>> Going forward, we're going to shard our data and look into ways to cut down on scans.  Any other suggestions about how to improve performance would be appreciated.
>
> In 1.0.0 each worker scans all tablets for notifications.  In 1.1.0,
> tablets and workers are split into groups; you can adjust the worker group
> size[1], which defaults to 7.  If you are using 1.1.0, I would recommend
> experimenting with this.  If you have 70 workers, then you will have
> 10 groups.  The tablets will also be divided into 10 groups.  Each worker will scan all of the tablets in its group.  Notifications are hash partitioned within a group.  If you lower the group size, then you will have less scanning.  But as you lower the group size, you increase the chance of work being unevenly spread.  For example, with a group size of 7, at most 7 workers will scan a tablet.  It also means the notifications in a tablet can only be processed by 7 workers.  In the worst case, if one tablet has all of the notifications, then only 7 workers will process those notifications.  If the notifications in the table are evenly spread across tablets, then you could probably decrease the group size to 2 or 3.
>
> There are two possible ways to get a sense of what scans are up to via sampling.  One is to sample listscans commands in the accumulo shell and see what iterators are in use.  Transactions and notification scanning will use different iterators.  Could also sample jstacks on some tservers and look at which iterators are used.
>
> Another thing to look into would be to see how many deleted 
> notifications there are.  Using the command
>
>   fluo scan --raw -c ntfy
>
> Should be able to see notifications and deletes for notifications.  I am curious how many deletes there are.  When a table is flushed/minor compacted, some notifications will be GC'd by an iterator.  A full compaction will do more.  These deletes have to be filtered at scan time.  If you have a chance, I would be interested in the following numbers (or ratios for the three numbers).
>
>  * How many deleted notifications are there? How many notifications are there?
>  * Flush table
>  * How many deleted notifications are there? How many notifications are there?
>  * compact table
>  * How many deleted notifications are there? How many notifications are there?
>
> Keith
>
> [1]: https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/impl/FluoConfigurationImpl.java#L30
>
>>
>> Thanks,
>> Caleb
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Friday, October 27, 2017 12:17 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>> application
>>
>> On Fri, Oct 27, 2017 at 11:03 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hey Keith,
>>>
>>> Our benchmark consists of a single query that is a join of two statement patterns (essentially patterns that incoming data matches, where a unit of data is a statement).  We are ingesting 50 pairs of statements a minute (100 total), where each statement in the pair matches one of the statement patterns.  Because the data is being ingested at a constant rate, the statement pattern Observers and Join Observers are constantly working.  One thing that is worth mentioning is that we changed the property fluo.implScanTask.maxSleep from 5 min to 10 seconds.  Based on the constant ingest rate, your comments below, and our low maxSleep, it seems like the workers would constantly be scanning for new notifications.
>>>
>>>> Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>>
>>> How does the maxSleep property work in conjunction with this?  If the max sleep time elapses before a worker processes half of the notifications, will it scan?
>>
>> I don't think it will scan again until the # of queued notifications is cut in half.  I looked in 1.0.0 and 1.1.0 and I think the while loops linked below should hold off on the scan until the queue halves.
>>
>> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L85
>> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L88
>>
>> Were you able to find the ScanTask debug messages in the worker logs?
>> Below are the log messages in the code to give a sense of what to look for.
>>
>> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L130
>> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L146
>>
>> IIRC, if notifications were found in a tablet during the last scan, then it will always scan it during the next scan loop.  As notifications are not found in a tablet, that tablet's next scan time doubles, up to fluo.implScanTask.maxSleep.
>>
>> So it's possible that all notifications found are being processed quickly and then the workers are scanning for more.  The debug messages would show this.
>>
>> There is also a minSleep time.  This property determines the minimum amount of time it will sleep between scan loops, seems to default to 5 secs.  Could try increasing this.
>>
>> Looking at the props, it seems the prop names for min and max sleep changed between 1.0.0 and 1.1.0.
>>
>>
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>
>>> -----Original Message-----
>>> From: Keith Turner [mailto:keith@deenlo.com]
>>> Sent: Thursday, October 26, 2017 6:20 PM
>>> To: fluo-dev <de...@fluo.apache.org>
>>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>>> application
>>>
>>> On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>> Hey Keith,
>>>>
>>>> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.
>>>
>>> Oh, I never saw your attachment; it may not be possible to attach things on the mailing list.
>>>
>>> It's possible that what you are seeing is the workers scanning for notifications.  If you look in the workers' logs, do you see messages about scanning for notifications?  If so, what do they look like?
>>>
>>> In 1.0.0 each worker scans all tablets in random order.  When it scans it has an iterator that uses hash+mod to select a subset of notifications.  The iterator also suppresses deleted notifications.
>>> So the selection and suppression by that iterator could explain the read vs returned.  It does exponential back off on tablets where it does not find data.  Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>>
>>> In the beginning, would you have a lot of notifications?  If so I would expect a lot of scanning and then it should slow down once the workers get a list of notifications to process.
>>>
>>> In 1.1.0 the workers divide up the tablets (so workers no longer scan
>>> all tablets; groups of workers share groups of tablets).  If the
>>> table is split after the workers start, it may take them a bit to execute the distributed algorithm that divvies tablets among workers.
>>>
>>> Anyway the debug messages about scanning for notifications in the workers should provide some insight into this.
>>>
>>> If it's not notification scanning, then it could be that the application is scanning over a lot of data that was deleted or something like that.
>>>
>>>>
>>>> Caleb A. Meier, Ph.D.
>>>> Senior Software Engineer ♦ Analyst
>>>> Parsons Corporation
>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>> Office:  (703)797-3066
>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>
>>>> -----Original Message-----
>>>> From: Keith Turner [mailto:keith@deenlo.com]
>>>> Sent: Thursday, October 26, 2017 5:36 PM
>>>> To: fluo-dev <de...@fluo.apache.org>
>>>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>>>> application
>>>>
>>>> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>>> Hey Keith,
>>>>>
>>>>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s), but our scan rate is off the charts - approaching 6 million entries/s!  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.  In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>>>>
>>>>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>>>>
>>>> Regarding the scan rate, there are a few things I am curious about.
>>>>
>>>> Fluo workers scan for notifications in addition to the scanning
>>>> done by your apps.  I made some changes in 1.1.0 to reduce the
>>>> amount of scanning needed to find notifications, but this should
>>>> not make much of a difference on a small number of nodes.  Details
>>>> about this are in the
>>>> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>>>>
>>>> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Caleb
>>>>>
>>>>> Caleb A. Meier, Ph.D.
>>>>> Senior Software Engineer ♦ Analyst Parsons Corporation
>>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>>> Office:  (703)797-3066
>>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>>
>>>>> -----Original Message-----
>>>>> From: Keith Turner [mailto:keith@deenlo.com]
>>>>> Sent: Thursday, October 26, 2017 1:39 PM
>>>>> To: fluo-dev <de...@fluo.apache.org>
>>>>> Subject: Re: fluo accumulo table tablet servers not keeping up 
>>>>> with application
>>>>>
>>>>> Caleb
>>>>>
>>>>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>>>>
>>>>>  * Write ahead log sync settings (this can have huge performance
>>>>> implications)
>>>>>  * Files per tablet
>>>>>  * Tablet server cache sizes
>>>>>  * Accumulo data block sizes
>>>>>  * Tablet server client thread pool size
>>>>>
>>>>> For Fluo the following tune-able parameters are important.
>>>>>
>>>>>  * Commit memory (this determines how many transactions are held 
>>>>> in memory while committing)
>>>>>  * Threads running transactions
>>>>>
>>>>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is.  For example, if one tserver is at 100% CPU while others are idle, this could be caused by uneven data access patterns.
>>>>>
>>>>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>>>>
>>>>> Are the 6 lookups in the transaction done sequentially?
>>>>>
>>>>> Keith
>>>>>
>>>>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>>>> Hello Fluo Devs,
>>>>>>
>>>>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive -- we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TabletServer is to cut down on the number of scans.
>>>>>>
>>>>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in-memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had to create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>>>>
>>>>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>>>>
>>>>>> Sincerely,
>>>>>> Caleb Meier
>>>>>>
>>>>>> Caleb A. Meier, Ph.D.
>>>>>> Senior Software Engineer ♦ Analyst Parsons Corporation
>>>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>>>> Office:  (703)797-3066
>>>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>>>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Wed, Nov 8, 2017 at 4:49 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> I finally got around to trying out some of your suggestions.  The first thing I tried was applying some variation of your RowHasher recipe, and we noticed an immediate improvement.  However, we are still running into issues with our Tablet servers bogging down.  Doing things like flushing and compacting certainly buys us time, but there appears to be a diminishing margin of return between each compaction/flush cycle.  That is, after every compaction, it seems like the outstanding notifications increase at a slightly faster oscillatory rate than in the previous compaction cycle.  I ran the experiment you suggested -- I counted the number of deleted notifications and outstanding notifications, ran a flush, counted again, ran a compaction, and counted again.  The numbers are as follows:
>
> Unprocessed: 679
> Deleted: 105,745
>
> Flush
>
> Unprocessed: 352
> Deleted: 4395
>
> Compaction
>
> Unprocessed: 81
> Deleted: 2663
>
> Obviously flushing and compacting are extremely beneficial.  Is there an outstanding recipe to check for the number of deleted/unprocessed notifications and then flush/compact based on some sort of threshold on those notifications?  Is there a way to configure the garbage collector to be more active?

You could do two things.  Lower the in-memory map size for Accumulo;
this will cause Accumulo to buffer less and lead to more frequent
flushes, which run the Fluo GC iterator.  For compactions, you can adjust
the compaction ratio lower, which will cause more frequent compactions.
Lowering the compaction ratio also results in fewer files per tablet,
which is good for random seek performance.  Could try setting it to
1.5, 1.75, or 2; lower is better until compactions are too frequent :)

>
> Finally, the piece of advice that you gave me which may end up providing the biggest breakthrough is profiling the iterators on Accumulo using listscans.  It appears that most of the scanning (at least according to my random sampling) is happening within one of our observers.  I think that we can cache most of those lookups, so I'm currently optimistic that we can put a serious dent in the scan traffic that we're seeing.  Thanks so much for your help in debugging this issue.  I'll let you know if caching solves our problems.

I have found using the watch command to eyeball something to be
invaluable over the years.  A technique I learned from Eric Newton.

What did you see in the list scans output that enabled you to pinpoint
a particular observer?

>
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Tuesday, October 31, 2017 5:14 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with application
>
> On Tue, Oct 31, 2017 at 2:22 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> Just following up on your last message.  After looking at the worker ScanTask logs, it seems like the workers are conducting scans as frequently as the min sleep time permits.  That is, if the min sleep time is set to 10s, a ScanTask is being executed every 10s.  In addition, running the Fluo wait command indicates that the number of outstanding notifications steadily increases or is held constant (depending on the number of workers).  Based on your comments below, it seems like the workers should be scanning at a lower rate given that the notification work queue is constantly increasing in size.  Another thing that we tried was reducing the number of workers and increasing the min sleep time.  This lowered the scan burden on the tablet server, but unsurprisingly our processing rate plummeted.  We also tried lowering the ingest rate for a fixed number of workers (lowering the notification rate for each worker thread).  While it took longer for the TabletServer to become saturated, it still became overwhelmed.
>>
>> In general, for the queries that we are benchmarking, our notification:data ratio is about 7:1 (i.e. each piece of ingested data generates about 7 notifications on the way to being entirely processed).  I think that this is our primary culprit, but I think that our application specific scans are also part of the problem (I'm still in the process of trying to determine what portion of the scans that we are seeing is specific to our observers and what portion is specific to notification discovery - any suggestions here would be appreciated).  One reason that I think notification discovery is the culprit is that we implemented an in memory cache for the metadata, and that didn't seem to affect the scan rate too much (metadata seeks constitute about 30% of our seeks/scans).
>>
>> Going forward, we're going to shard our data and look into ways to cut down on scans.  Any other suggestions about how to improve performance would be appreciated.
>
> In 1.0.0 each worker scans all tablets for notifications.  In 1.1.0, tablets and workers are split into groups; you can adjust the worker group size[1], which defaults to 7.  If you are using 1.1.0, I would recommend experimenting with this.  If you have 70 workers, then you will have
> 10 groups.  The tablets will also be divided into 10 groups.  Each worker will scan all of the tablets in its group.  Notifications are hash partitioned within a group.  If you lower the group size, then you will have less scanning.  But as you lower the group size, you increase the chance of work being unevenly spread.  For example, with a group size of 7, at most 7 workers will scan a tablet.  It also means the notifications in a tablet can only be processed by 7 workers.  In the worst case, if one tablet has all of the notifications, then only 7 workers will process those notifications.  If the notifications in the table are evenly spread across tablets, then you could probably decrease the group size to 2 or 3.
>
> There are two possible ways to get a sense of what scans are up to via sampling.  One is to sample listscans commands in the accumulo shell and see what iterators are in use.  Transactions and notification scanning will use different iterators.  Could also sample jstacks on some tservers and look at which iterators are used.
>
> Another thing to look into would be to see how many deleted notifications there are.  Using the command
>
>   fluo scan --raw -c ntfy
>
> Should be able to see notifications and deletes for notifications.  I am curious how many deletes there are.  When a table is flushed/minor compacted, some notifications will be GC'd by an iterator.  A full compaction will do more.  These deletes have to be filtered at scan time.  If you have a chance, I would be interested in the following numbers (or ratios for the three numbers).
>
>  * How many deleted notifications are there? How many notifications are there?
>  * Flush table
>  * How many deleted notifications are there? How many notifications are there?
>  * compact table
>  * How many deleted notifications are there? How many notifications are there?
>
> Keith
>
> [1]: https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/impl/FluoConfigurationImpl.java#L30
>
>>
>> Thanks,
>> Caleb
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Friday, October 27, 2017 12:17 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>> application
>>
>> On Fri, Oct 27, 2017 at 11:03 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hey Keith,
>>>
>>> Our benchmark consists of a single query that is a join of two statement patterns (essentially patterns that incoming data matches, where a unit of data is a statement).  We are ingesting 50 pairs of statements a minute (100 total), where each statement in the pair matches one of the statement patterns.  Because the data is being ingested at a constant rate, the statement pattern Observers and Join Observers are constantly working.  One thing that is worth mentioning is that we changed the property fluo.implScanTask.maxSleep from 5 min to 10 seconds.  Based on the constant ingest rate, your comments below, and our low maxSleep, it seems like the workers would constantly be scanning for new notifications.
>>>
>>>> Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>>
>>> How does the maxSleep property work in conjunction with this?  If the max sleep time elapses before a worker processes half of the notifications, will it scan?
>>
>> I don't think it will scan again until the # of queued notifications is cut in half.  I looked in 1.0.0 and 1.1.0 and I think the while loops linked below should hold off on the scan until the queue halves.
>>
>> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L85
>> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L88
>>
>> Were you able to find the ScanTask debug messages in the worker logs?
>> Below are the log messages in the code to give a sense of what to look for.
>>
>> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L130
>> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L146
>>
>> IIRC, if notifications were found in a tablet during the last scan, then it will always scan it during the next scan loop.  As notifications are not found in a tablet, that tablet's next scan time doubles, up to fluo.implScanTask.maxSleep.
>>
>> So it's possible that all notifications found are being processed quickly and then the workers are scanning for more.  The debug messages would show this.
>>
>> There is also a minSleep time.  This property determines the minimum amount of time it will sleep between scan loops, seems to default to 5 secs.  Could try increasing this.
>>
>> Looking at the props, it seems the prop names for min and max sleep changed between 1.0.0 and 1.1.0.
>>
>>
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>
>>> -----Original Message-----
>>> From: Keith Turner [mailto:keith@deenlo.com]
>>> Sent: Thursday, October 26, 2017 6:20 PM
>>> To: fluo-dev <de...@fluo.apache.org>
>>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>>> application
>>>
>>> On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>> Hey Keith,
>>>>
>>>> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.
>>>
>>> Oh, I never saw your attachment; it may not be possible to attach things on the mailing list.
>>>
>>> It's possible that what you are seeing is the workers scanning for notifications.  If you look in the workers' logs, do you see messages about scanning for notifications?  If so, what do they look like?
>>>
>>> In 1.0.0 each worker scans all tablets in random order.  When it scans it has an iterator that uses hash+mod to select a subset of notifications.  The iterator also suppresses deleted notifications.
>>> So the selection and suppression by that iterator could explain the read vs returned.  It does exponential back off on tablets where it does not find data.  Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>>
>>> In the beginning, would you have a lot of notifications?  If so I would expect a lot of scanning and then it should slow down once the workers get a list of notifications to process.
>>>
>>> In 1.1.0 the workers divide up the tablets (so workers no longer scan
>>> all tablets; groups of workers share groups of tablets).  If the
>>> table is split after the workers start, it may take them a bit to execute the distributed algorithm that divvies tablets among workers.
>>>
>>> Anyway the debug messages about scanning for notifications in the workers should provide some insight into this.
>>>
>>> If it's not notification scanning, then it could be that the application is scanning over a lot of data that was deleted or something like that.
>>>
>>>>
>>>> Caleb A. Meier, Ph.D.
>>>> Senior Software Engineer ♦ Analyst
>>>> Parsons Corporation
>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>> Office:  (703)797-3066
>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>
>>>> -----Original Message-----
>>>> From: Keith Turner [mailto:keith@deenlo.com]
>>>> Sent: Thursday, October 26, 2017 5:36 PM
>>>> To: fluo-dev <de...@fluo.apache.org>
>>>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>>>> application
>>>>
>>>> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>>> Hey Keith,
>>>>>
>>>>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s), but our scan rate is off the charts - approaching 6 million entries/s!  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.  In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>>>>
>>>>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>>>>
>>>> Regarding the scan rate, there are a few things I am curious about.
>>>>
>>>> Fluo workers scan for notifications in addition to the scanning done
>>>> by your apps.  I made some changes in 1.1.0 to reduce the amount of
>>>> scanning needed to find notifications, but this should not make much
>>>> of a difference on a small number of nodes.  Details about this are
>>>> in the
>>>> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>>>>
>>>> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Caleb
>>>>>
>>>>> Caleb A. Meier, Ph.D.
>>>>> Senior Software Engineer ♦ Analyst
>>>>> Parsons Corporation
>>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>>> Office:  (703)797-3066
>>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>>
>>>>> -----Original Message-----
>>>>> From: Keith Turner [mailto:keith@deenlo.com]
>>>>> Sent: Thursday, October 26, 2017 1:39 PM
>>>>> To: fluo-dev <de...@fluo.apache.org>
>>>>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>>>>> application
>>>>>
>>>>> Caleb
>>>>>
>>>>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>>>>
>>>>>  * Write ahead log sync settings (this can have huge performance
>>>>> implications)
>>>>>  * Files per tablet
>>>>>  * Tablet server cache sizes
>>>>>  * Accumulo data block sizes
>>>>>  * Tablet server client thread pool size
>>>>>
>>>>> For Fluo the following tune-able parameters are important.
>>>>>
>>>>>  * Commit memory (this determines how many transactions are held in
>>>>> memory while committing)
>>>>>  * Threads running transactions
>>>>>
>>>>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is.  For example, if one tserver is at 100% CPU while others are idle, this could be caused by uneven data access patterns.
>>>>>
>>>>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>>>>
>>>>> Are the 6 lookups in the transaction done sequentially?
>>>>>
>>>>> Keith
>>>>>
>>>>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>>>> Hello Fluo Devs,
>>>>>>
>>>>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive -- we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TabletServer is to cut down on the number of scans.
>>>>>>
>>>>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in-memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had to create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>>>>
>>>>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>>>>
>>>>>> Sincerely,
>>>>>> Caleb Meier
>>>>>>
>>>>>> Caleb A. Meier, Ph.D.
>>>>>> Senior Software Engineer ♦ Analyst Parsons Corporation
>>>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>>>> Office:  (703)797-3066
>>>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>>>

RE: fluo accumulo table tablet servers not keeping up with application

Posted by "Meier, Caleb" <Ca...@parsons.com>.
Hey Keith,

I finally got around to trying out some of your suggestions.  The first thing I tried was applying some variation of your RowHasher recipe, and we noticed an immediate improvement.  However, we are still running into issues with our Tablet servers bogging down.  Doing things like flushing and compacting certainly buys us time, but there appears to be a diminishing margin of return between each compaction/flush cycle.  That is, after every compaction, it seems like the outstanding notifications increase at a slightly faster oscillatory rate than in the previous compaction cycle.  I ran the experiment you suggested -- I counted the number of deleted notifications and outstanding notifications, ran a flush, counted again, ran a compaction, and counted again.  The numbers are as follows:

Unprocessed: 679
Deleted: 105,745

Flush

Unprocessed: 352
Deleted: 4395

Compaction

Unprocessed: 81
Deleted: 2663

Obviously flushing and compacting are extremely beneficial.  Is there an outstanding recipe to check for the number of deleted/unprocessed notifications and then flush/compact based on some sort of threshold on those notifications?  Is there a way to configure the garbage collector to be more active?

Finally, the piece of advice that you gave me which may end up providing the biggest breakthrough is profiling the iterators on Accumulo using listscans.  It appears that most of the scanning (at least according to my random sampling) is happening within one of our observers.  I think that we can cache most of those lookups, so I'm currently optimistic that we can put a serious dent in the scan traffic that we're seeing.  Thanks so much for your help in debugging this issue.  I'll let you know if caching solves our problems.   


Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Keith Turner [mailto:keith@deenlo.com] 
Sent: Tuesday, October 31, 2017 5:14 PM
To: fluo-dev <de...@fluo.apache.org>
Subject: Re: fluo accumulo table tablet servers not keeping up with application

On Tue, Oct 31, 2017 at 2:22 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Just following up on your last message.  After looking at the worker ScanTask logs, it seems like the workers are conducting scans as frequently as the min sleep time permits.  That is, if the min sleep time is set to 10s, a ScanTask is being executed every 10s.  In addition, running the Fluo wait command indicates that the number of outstanding notifications steadily increases or is held constant (depending on the number of workers).  Based on your comments below, it seems like the workers should be scanning at a lower rate given that the notification work queue is constantly increasing in size.  Another thing that we tried was reducing the number of workers and increasing the min sleep time.  This lowered the scan burden on the tablet server, but unsurprisingly our processing rate plummeted.  We also tried lowering the ingest rate for a fixed number of workers (lowering the notification rate for each worker thread).  While it took longer for the TabletServer to become saturated, it still became overwhelmed.
>
> In general, for the queries that we are benchmarking, our notification:data ratio is about 7:1 (i.e. each piece of ingested data generates about 7 notifications on the way to being entirely processed).  I think that this is our primary culprit, but I think that our application specific scans are also part of the problem (I'm still in the process of trying to determine what portion of the scans that we are seeing is specific to our observers and what portion is specific to notification discovery - any suggestions here would be appreciated).  One reason that I think notification discovery is the culprit is that we implemented an in memory cache for the metadata, and that didn't seem to affect the scan rate too much (metadata seeks constitute about 30% of our seeks/scans).
>
> Going forward, we're going to shard our data and look into ways to cut down on scans.  Any other suggestions about how to improve performance would be appreciated.

In 1.0.0 each worker scans all tablets for notifications.  In 1.1.0 tablets and workers are split into groups; you can adjust the worker group size[1], which defaults to 7.  If you are using 1.1.0, I would recommend experimenting with this.  If you have 70 workers, then you will have 10 groups.  The tablets will also be divided into 10 groups.  Each worker will scan all of the tablets in its group.  Notifications are hash partitioned within a group.  If you lower the group size, then you will have less scanning.  But as you lower the group size you increase the chance of work being unevenly spread.  For example, with a group size of 7, at most 7 workers will scan a tablet.  It also means the notifications in a tablet can only be processed by 7 workers.  In the worst case, if one tablet has all of the notifications, then only 7 workers will process those notifications.  If the notifications in the table are evenly spread across tablets, then you could probably decrease the group size to 2 or 3.

There are two possible ways to get a sense of what the scans are up to via sampling.  One is to sample the listscans command in the accumulo shell and see what iterators are in use.  Transactions and notification scanning will use different iterators.  You could also sample jstacks on some tservers and look at which iterators are used.

Another thing to look into would be to see how many deleted notifications there are.  Using the command

  fluo scan --raw -c ntfy

You should be able to see both notifications and the deletes for notifications.  I am curious how many deletes there are.  When a table is flushed/minor compacted, some notifications will be GC'd by an iterator.  A full compaction will do more.  These deletes have to be filtered at scan time.  If you have a chance I would be interested in the following numbers (or the ratios between the three numbers).

 * How many deleted notifications are there? How many notifications are there?
 * Flush table
 * How many deleted notifications are there? How many notifications are there?
 * compact table
 * How many deleted notifications are there? How many notifications are there?

Keith

[1]: https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/impl/FluoConfigurationImpl.java#L30

>
> Thanks,
> Caleb
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Friday, October 27, 2017 12:17 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with 
> application
>
> On Fri, Oct 27, 2017 at 11:03 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> Our benchmark consists of a single query that is a join of two statement patterns (essentially patterns that incoming data matches, where a unit of data is a statement).  We are ingesting 50 pairs of statements a minute (100 total), where each statement in the pair matches one of the statement patterns.  Because the data is being ingested at a constant rate, the statement pattern Observers and Join Observers are constantly working.  One thing that is worth mentioning is that we changed the property fluo.implScanTask.maxSleep from 5 min to 10 seconds.  Based on the constant ingest rate, your comments below, and our low maxSleep, it seems like the workers would constantly be scanning for new notifications.
>>
>>> Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>
>> How does the maxSleep property work in conjunction with this?  If the max sleep time elapses before a worker processes half of the notifications, will it scan?
>
> I don't think it will scan again until the # of queued notifications is cut in half.  I looked in 1.0.0 and 1.1.0 and I think while loops linked below should hold off on the scan until the queue halves.
>
> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L85
> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L88
>
> Were you able to find the ScanTask debug messages in the worker logs?
> Below are the log messages in the code, to give a sense of what to look for.
>
> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L130
> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L146
>
> IIRC, if notifications were found in a tablet during the last scan, then it will always be scanned during the next scan loop.  As notifications stop being found in a tablet, that tablet's next scan time doubles, up to fluo.implScanTask.maxSleep.
>
> So it's possible that all of the notifications found are being processed quickly and then the workers are scanning for more.  The debug messages would show this.
>
> There is also a minSleep time.  This property determines the minimum amount of time it will sleep between scan loops, seems to default to 5 secs.  Could try increasing this.
>
> Looking at the props, it seems the prop names for min and max sleep changed between 1.0.0 and 1.1.0.
>
>
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Thursday, October 26, 2017 6:20 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>> application
>>
>> On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hey Keith,
>>>
>>> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.
>>
>> Oh, I never saw your attachment; it may not be possible to attach stuff on the mailing list.
>>
>> Its possible that what you are seeing is the workers scanning for notifications.  If you look in the workers logs do you see messages about scanning for notifications?  If so what do they look like?
>>
>> In 1.0.0 each worker scans all tablets in random order.  When it scans it has an iterator that uses hash+mod to select a subset of notifications.  The iterator also suppresses deleted notifications.
>> So the selection and suppression by that iterator could explain the read vs returned.  It does exponential back off on tablets where it does not find data.  Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>
>> In the beginning, would you have a lot of notifications?  If so I would expect a lot of scanning and then it should slow down once the workers get a list of notifications to process.
>>
>> In 1.1.0 the workers divide up the tablets (so workers no longer scan
>> all tablets, groups of workers share groups of tablets).   If the
>> table is split after the workers start, it may take them a bit to execute the distributed algorithm that divvies tablets among workers.
>>
>> Anyway the debug messages about scanning for notifications in the workers should provide some insight into this.
>>
>> If it's not notification scanning, then it could be that the application is scanning over a lot of data that was deleted, or something like that.
>>
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>
>>> -----Original Message-----
>>> From: Keith Turner [mailto:keith@deenlo.com]
>>> Sent: Thursday, October 26, 2017 5:36 PM
>>> To: fluo-dev <de...@fluo.apache.org>
>>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>>> application
>>>
>>> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>> Hey Keith,
>>>>
>>>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s), but our scan rate is off the charts -- approaching 6 million entries/s!  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.  In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>>>
>>>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>>>
>>> Regarding the scan rate, there are a few things I am curious about.
>>>
>>> Fluo workers scan for notifications in addition to the scanning done 
>>> by your apps.  I made some changes in 1.1.0 to reduce the amount of 
>>> scanning needed to find notifications, but this should not make much 
>>> of a difference on a small amount of nodes.  Details about this are 
>>> in
>>> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>>>
>>> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>>>
>>>
>>>>
>>>> Thanks,
>>>> Caleb
>>>>
>>>> Caleb A. Meier, Ph.D.
>>>> Senior Software Engineer ♦ Analyst
>>>> Parsons Corporation
>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>> Office:  (703)797-3066
>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>
>>>> -----Original Message-----
>>>> From: Keith Turner [mailto:keith@deenlo.com]
>>>> Sent: Thursday, October 26, 2017 1:39 PM
>>>> To: fluo-dev <de...@fluo.apache.org>
>>>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>>>> application
>>>>
>>>> Caleb
>>>>
>>>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>>>
>>>>  * Write ahead log sync settings (this can have huge performance
>>>> implications)
>>>>  * Files per tablet
>>>>  * Tablet server cache sizes
>>>>  * Accumulo data block sizes
>>>>  * Tablet server client thread pool size
>>>>
>>>> For Fluo the following tune-able parameters are important.
>>>>
>>>>  * Commit memory (this determines how many transactions are held in 
>>>> memory while committing)
>>>>  * Threads running transactions
>>>>
>>>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is.  For example, if one tserver is at 100% CPU while others are idle, this could be caused by uneven data access patterns.
>>>>
>>>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>>>
>>>> Are the 6 lookups in the transaction done sequentially?
>>>>
>>>> Keith
>>>>
>>>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>>> Hello Fluo Devs,
>>>>>
>>>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>>>>
>>>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in-memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had to create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>>>
>>>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>>>
>>>>> Sincerely,
>>>>> Caleb Meier
>>>>>
>>>>> Caleb A. Meier, Ph.D.
>>>>> Senior Software Engineer ♦ Analyst Parsons Corporation
>>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>>> Office:  (703)797-3066
>>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Tue, Oct 31, 2017 at 2:22 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Just following up on your last message.  After looking at the worker ScanTask logs, it seems like the workers are conducting scans as frequently as the min sleep time permits.  That is, if the min sleep time is set to 10s, a ScanTask is being executed every 10s.  In addition, running the Fluo wait command indicates that the number of outstanding notifications steadily increases or is held constant (depending on the number of workers).  Based on your comments below, it seems like the workers should be scanning at a lower rate given that the notification work queue is constantly increasing in size.  Another thing that we tried was reducing the number of workers and increasing the min sleep time.  This lowered the scan burden on the tablet server, but unsurprisingly our processing rate plummeted.  We also tried lowering the ingest rate for a fixed number of workers (lowering the notification rate for each worker thread).  While it took longer for the TabletServer to become saturated, it still became overwhelmed.
>
> In general, for the queries that we are benchmarking, our notification:data ratio is about 7:1 (i.e. each piece of ingested data generates about 7 notifications on the way to being entirely processed).  I think that this is our primary culprit, but I think that our application specific scans are also part of the problem (I'm still in the process of trying to determine what portion of the scans that we are seeing is specific to our observers and what portion is specific to notification discovery - any suggestions here would be appreciated).  One reason that I think notification discovery is the culprit is that we implemented an in memory cache for the metadata, and that didn't seem to affect the scan rate too much (metadata seeks constitute about 30% of our seeks/scans).
>
> Going forward, we're going to shard our data and look into ways to cut down on scans.  Any other suggestions about how to improve performance would be appreciated.

In 1.0.0 each worker scans all tablets for notifications.  In 1.1.0 tablets and workers are split into groups; you can adjust the worker group size[1], which defaults to 7.  If you are using 1.1.0, I would recommend experimenting with this.  If you have 70 workers, then you will have 10 groups.  The tablets will also be divided into 10 groups.  Each worker will scan all of the tablets in its group.  Notifications are hash partitioned within a group.  If you lower the group size, then you will have less scanning.  But as you lower the group size you increase the chance of work being unevenly spread.  For example, with a group size of 7, at most 7 workers will scan a tablet.  It also means the notifications in a tablet can only be processed by 7 workers.  In the worst case, if one tablet has all of the notifications, then only 7 workers will process those notifications.  If the notifications in the table are evenly spread across tablets, then you could probably decrease the group size to 2 or 3.
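If you want to experiment with it from code rather than the properties file, something like the sketch below should work; note the property key string is from memory, so double check it against the constant in FluoConfigurationImpl ([1] below) for the version you are running:

  import org.apache.fluo.api.config.FluoConfiguration;

  public class WorkerGroupSizeExample {
    public static void main(String[] args) {
      FluoConfiguration conf = new FluoConfiguration();
      conf.setApplicationName("rya-fluo-app");         // placeholder application name
      conf.setInstanceZookeepers("zk1,zk2,zk3/fluo");  // placeholder ZooKeeper connect string

      // Placeholder key -- use the constant defined in FluoConfigurationImpl (see [1]).
      // A smaller group size means fewer workers scan each tablet for notifications.
      conf.setProperty("fluo.impl.worker.finder.partition.groupSize", "3");
    }
  }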

There are two possible ways to get a sense of what the scans are up to via sampling.  One is to sample the listscans command in the accumulo shell and see what iterators are in use.  Transactions and notification scanning will use different iterators.  You could also sample jstacks on some tservers and look at which iterators are used.

Another thing to look into would be to see how many deleted
notifications there are.  Using the command

  fluo scan --raw -c ntfy

You should be able to see both notifications and the deletes for notifications.  I am curious how many deletes there are.  When a table is flushed/minor compacted, some notifications will be GC'd by an iterator.  A full compaction will do more.  These deletes have to be filtered at scan time.  If you have a chance I would be interested in the following numbers (or the ratios between the three numbers).

 * How many deleted notifications are there? How many notifications are there?
 * Flush table
 * How many deleted notifications are there? How many notifications are there?
 * compact table
 * How many deleted notifications are there? How many notifications are there?

Keith

[1]: https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/impl/FluoConfigurationImpl.java#L30

>
> Thanks,
> Caleb
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Friday, October 27, 2017 12:17 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with application
>
> On Fri, Oct 27, 2017 at 11:03 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> Our benchmark consists of a single query that is a join of two statement patterns (essentially patterns that incoming data matches, where a unit of data is a statement).  We are ingesting 50 pairs of statements a minute (100 total), where each statement in the pair matches one of the statement patterns.  Because the data is being ingested at a constant rate, the statement pattern Observers and Join Observers are constantly working.  One thing that is worth mentioning is that we changed the property fluo.implScanTask.maxSleep from 5 min to 10 seconds.  Based on the constant ingest rate, your comments below, and our low maxSleep, it seems like the workers would constantly be scanning for new notifications.
>>
>>> Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>
>> How does the maxSleep property work in conjunction with this?  If the max sleep time elapses before a worker processes half of the notifications, will it scan?
>
> I don't think it will scan again until the # of queued notifications is cut in half.  I looked in 1.0.0 and 1.1.0 and I think while loops linked below should hold off on the scan until the queue halves.
>
> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L85
> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L88
>
> Were you able to find the ScanTask debug messages in the worker logs?
> Below are the log messages in the code, to give a sense of what to look for.
>
> https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L130
> https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L146
>
> IIRC, if notifications were found in a tablet during the last scan, then it will always be scanned during the next scan loop.  As notifications stop being found in a tablet, that tablet's next scan time doubles, up to fluo.implScanTask.maxSleep.
>
> So it's possible that all of the notifications found are being processed quickly and then the workers are scanning for more.  The debug messages would show this.
>
> There is also a minSleep time.  This property determines the minimum amount of time it will sleep between scan loops, seems to default to 5 secs.  Could try increasing this.
>
> Looking at the props, it seems the prop names for min and max sleep changed between 1.0.0 and 1.1.0.
>
>
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Thursday, October 26, 2017 6:20 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>> application
>>
>> On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hey Keith,
>>>
>>> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.
>>
>> Oh, I never saw your attachment; it may not be possible to attach stuff on the mailing list.
>>
>> Its possible that what you are seeing is the workers scanning for notifications.  If you look in the workers logs do you see messages about scanning for notifications?  If so what do they look like?
>>
>> In 1.0.0 each worker scans all tablets in random order.  When it scans it has an iterator that uses hash+mod to select a subset of notifications.  The iterator also suppresses deleted notifications.
>> So the selection and suppression by that iterator could explain the read vs returned.  It does exponential back off on tablets where it does not find data.  Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>>
>> In the beginning, would you have a lot of notifications?  If so I would expect a lot of scanning and then it should slow down once the workers get a list of notifications to process.
>>
>> In 1.1.0 the workers divide up the tablets (so workers no longer scan
>> all tablets, groups of workers share groups of tablets).   If the
>> table is split after the workers start, it may take them a bit to execute the distributed algorithm that divvies tablets among workers.
>>
>> Anyway the debug messages about scanning for notifications in the workers should provide some insight into this.
>>
>> If it's not notification scanning, then it could be that the application is scanning over a lot of data that was deleted, or something like that.
>>
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>
>>> -----Original Message-----
>>> From: Keith Turner [mailto:keith@deenlo.com]
>>> Sent: Thursday, October 26, 2017 5:36 PM
>>> To: fluo-dev <de...@fluo.apache.org>
>>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>>> application
>>>
>>> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>> Hey Keith,
>>>>
>>>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s), but our scan rate is off the charts -- approaching 6 million entries/s!  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.  In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>>>
>>>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>>>
>>> Regarding the scan rate, there are a few things I am curious about.
>>>
>>> Fluo workers scan for notifications in addition to the scanning done
>>> by your apps.  I made some changes in 1.1.0 to reduce the amount of
>>> scanning needed to find notifications, but this should not make much
>>> of a difference on a small amount of nodes.  Details about this are
>>> in
>>> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>>>
>>> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>>>
>>>
>>>>
>>>> Thanks,
>>>> Caleb
>>>>
>>>> Caleb A. Meier, Ph.D.
>>>> Senior Software Engineer ♦ Analyst
>>>> Parsons Corporation
>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>> Office:  (703)797-3066
>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>
>>>> -----Original Message-----
>>>> From: Keith Turner [mailto:keith@deenlo.com]
>>>> Sent: Thursday, October 26, 2017 1:39 PM
>>>> To: fluo-dev <de...@fluo.apache.org>
>>>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>>>> application
>>>>
>>>> Caleb
>>>>
>>>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>>>
>>>>  * Write ahead log sync settings (this can have huge performance
>>>> implications)
>>>>  * Files per tablet
>>>>  * Tablet server cache sizes
>>>>  * Accumulo data block sizes
>>>>  * Tablet server client thread pool size
>>>>
>>>> For Fluo the following tune-able parameters are important.
>>>>
>>>>  * Commit memory (this determines how many transactions are held in
>>>> memory while committing)
>>>>  * Threads running transactions
>>>>
>>>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is.  For example, if one tserver is at 100% CPU while others are idle, this could be caused by uneven data access patterns.
>>>>
>>>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>>>
>>>> Are the 6 lookups in the transaction done sequentially?
>>>>
>>>> Keith
>>>>
>>>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>>> Hello Fluo Devs,
>>>>>
>>>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>>>>
>>>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in-memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had to create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>>>
>>>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>>>
>>>>> Sincerely,
>>>>> Caleb Meier
>>>>>
>>>>> Caleb A. Meier, Ph.D.
>>>>> Senior Software Engineer ♦ Analyst
>>>>> Parsons Corporation
>>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>>> Office:  (703)797-3066
>>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>>

RE: fluo accumulo table tablet servers not keeping up with application

Posted by "Meier, Caleb" <Ca...@parsons.com>.
Hey Keith,

Just following up on your last message.  After looking at the worker ScanTask logs, it seems like the workers are conducting scans as frequently as the min sleep time permits.  That is, if the min sleep time is set to 10s, a ScanTask is being executed every 10s.  In addition, running the Fluo wait command indicates that the number of outstanding notifications steadily increases or is held constant (depending on the number of workers).  Based on your comments below, it seems like the workers should be scanning at a lower rate given that the notification work queue is constantly increasing in size.  Another thing that we tried was reducing the number of workers and increasing the min sleep time.  This lowered the scan burden on the tablet server, but unsurprisingly our processing rate plummeted.  We also tried lowering the ingest rate for a fixed number of workers (lowering the notification rate for each worker thread).  While it took longer for the TabletServer to become saturated, it still became overwhelmed.  

In general, for the queries that we are benchmarking, our notification:data ratio is about 7:1 (i.e. each piece of ingested data generates about 7 notifications on the way to being entirely processed).  I think that this is our primary culprit, but I think that our application specific scans are also part of the problem (I'm still in the process of trying to determine what portion of the scans that we are seeing is specific to our observers and what portion is specific to notification discovery - any suggestions here would be appreciated).  One reason that I think notification discovery is the culprit is that we implemented an in memory cache for the metadata, and that didn't seem to affect the scan rate too much (metadata seeks constitute about 30% of our seeks/scans). 

Going forward, we're going to shard our data and look into ways to cut down on scans.  Any other suggestions about how to improve performance would be appreciated.
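The sharding we have in mind is along the lines of the row hash prefix idea from the Fluo recipes; a hand-rolled sketch of it is below (the prefix width and hash choice are arbitrary here, and the table would also need to be pre-split on the prefixes):

  import java.nio.charset.StandardCharsets;
  import com.google.common.hash.Hashing;

  // Minimal sketch of hash-prefix sharding: prepend a short, stable hash of the
  // row to the row itself so writes and notifications spread across tablets
  // instead of piling onto a few.
  public class ShardedRow {

    private static final int PREFIX_LEN = 4;

    public static String shard(String row) {
      String hash = Hashing.murmur3_32().hashString(row, StandardCharsets.UTF_8).toString();
      return hash.substring(0, PREFIX_LEN) + ":" + row;
    }

    public static String unshard(String shardedRow) {
      // drop the "<hash>:" prefix to recover the original row
      return shardedRow.substring(PREFIX_LEN + 1);
    }

    public static void main(String[] args) {
      String sharded = shard("statementPattern:subject123");  // hypothetical row
      System.out.println(sharded + " -> " + unshard(sharded));
    }
  }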

Thanks,
Caleb

Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Keith Turner [mailto:keith@deenlo.com] 
Sent: Friday, October 27, 2017 12:17 PM
To: fluo-dev <de...@fluo.apache.org>
Subject: Re: fluo accumulo table tablet servers not keeping up with application

On Fri, Oct 27, 2017 at 11:03 AM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Our benchmark consists of a single query that is a join of two statement patterns (essentially patterns that incoming data matches, where a unit of data is a statement).  We are ingesting 50 pairs of statements a minute (100 total), where each statement in the pair matches one of the statement patterns.  Because the data is being ingested at a constant rate, the statement pattern Observers and Join Observers are constantly working.  One thing that is worth mentioning is that we changed the property fluo.implScanTask.maxSleep from 5 min to 10 seconds.  Based on the constant ingest rate, your comments below, and our low maxSleep, it seems like the workers would constantly be scanning for new notifications.
>
>> Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>
> How does the maxSleep property work in conjunction with this?  If the max sleep time elapses before a worker processes half of the notifications, will it scan?

I don't think it will scan again until the # of queued notifications is cut in half.  I looked in 1.0.0 and 1.1.0 and I think while loops linked below should hold off on the scan until the queue halves.

https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L85
https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L88

Were you able to find the ScanTask debug messages in the worker logs?
Below are the log messages in the code, to give a sense of what to look for.

https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L130
https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L146

IIRC, if notifications were found in a tablet during the last scan, then it will always be scanned during the next scan loop.  As notifications stop being found in a tablet, that tablet's next scan time doubles, up to fluo.implScanTask.maxSleep.

So it's possible that all of the notifications found are being processed quickly and then the workers are scanning for more.  The debug messages would show this.

There is also a minSleep time.  This property determines the minimum amount of time it will sleep between scan loops, seems to default to 5 secs.  Could try increasing this.

Looking at the props, it seems the prop names for min and max sleep changed between 1.0.0 and 1.1.0.


>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 6:20 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with 
> application
>
> On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.
>
> Oh, I never saw your attachment; it may not be possible to attach stuff on the mailing list.
>
> Its possible that what you are seeing is the workers scanning for notifications.  If you look in the workers logs do you see messages about scanning for notifications?  If so what do they look like?
>
> In 1.0.0 each worker scans all tablets in random order.  When it scans it has an iterator that uses hash+mod to select a subset of notifications.  The iterator also suppresses deleted notifications.
> So the selection and suppression by that iterator could explain the read vs returned.  It does exponential back off on tablets where it does not find data.  Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>
> In the beginning, would you have a lot of notifications?  If so I would expect a lot of scanning and then it should slow down once the workers get a list of notifications to process.
>
> In 1.1.0 the workers divide up the tablets (so workers no longer scan
> all tablets, groups of workers share groups of tablets).   If the
> table is split after the workers start, it may take them a bit to execute the distributed algorithm that divvies tablets among workers.
>
> Anyway the debug messages about scanning for notifications in the workers should provide some insight into this.
>
> If it's not notification scanning, then it could be that the application is scanning over a lot of data that was deleted, or something like that.
>
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Thursday, October 26, 2017 5:36 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>> application
>>
>> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hey Keith,
>>>
>>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s), but our scan rate is off the charts -- approaching 6 million entries/s!  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.  In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>>
>>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>>
>> Regarding the scan rate, there are a few things I am curious about.
>>
>> Fluo workers scan for notifications in addition to the scanning done 
>> by your apps.  I made some changes in 1.1.0 to reduce the amount of 
>> scanning needed to find notifications, but this should not make much 
>> of a difference on a small amount of nodes.  Details about this are 
>> in
>> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>>
>> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>>
>>
>>>
>>> Thanks,
>>> Caleb
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>
>>> -----Original Message-----
>>> From: Keith Turner [mailto:keith@deenlo.com]
>>> Sent: Thursday, October 26, 2017 1:39 PM
>>> To: fluo-dev <de...@fluo.apache.org>
>>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>>> application
>>>
>>> Caleb
>>>
>>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>>
>>>  * Write ahead log sync settings (this can have huge performance
>>> implications)
>>>  * Files per tablet
>>>  * Tablet server cache sizes
>>>  * Accumulo data block sizes
>>>  * Tablet server client thread pool size
>>>
>>> For Fluo the following tune-able parameters are important.
>>>
>>>  * Commit memory (this determines how many transactions are held in 
>>> memory while committing)
>>>  * Threads running transactions
>>>
>>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is.  For example, if one tserver is at 100% CPU while others are idle, this could be caused by uneven data access patterns.
>>>
>>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>>
>>> Are the 6 lookups in the transaction done sequentially?
>>>
>>> Keith
>>>
>>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>> Hello Fluo Devs,
>>>>
>>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>>>
>>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in-memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had to create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>>
>>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>>
>>>> Sincerely,
>>>> Caleb Meier
>>>>
>>>> Caleb A. Meier, Ph.D.
>>>> Senior Software Engineer ♦ Analyst
>>>> Parsons Corporation
>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>> Office:  (703)797-3066
>>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Fri, Oct 27, 2017 at 11:03 AM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Our benchmark consists of a single query that is a join of two statement patterns (essentially patterns that incoming data matches, where a unit of data is a statement).  We are ingesting 50 pairs of statements a minute (100 total), where each statement in the pair matches one of the statement patterns.  Because the data is being ingested at a constant rate, the statement pattern Observers and Join Observers are constantly working.  One thing that is worth mentioning is that we changed the property fluo.implScanTask.maxSleep from 5 min to 10 seconds.  Based on the constant ingest rate, your comments below, and our low maxSleep, it seems like the workers would constantly be scanning for new notifications.
>
>> Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>
> How does the maxSleep property work in conjunction with this?  If the max sleep time elapses before a worker processes half of the notifications, will it scan?

I don't think it will scan again until the # of queued notifications is cut in half.  I looked in 1.0.0 and 1.1.0, and I think the while loops linked below should hold off on the scan until the queue halves.

https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L85
https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L88

Were you able to find the ScanTask debug messages in the worker logs?
Below are the log messages in the code, to give a sense of what to look for.

https://github.com/apache/fluo/blob/rel/fluo-1.0.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L130
https://github.com/apache/fluo/blob/rel/fluo-1.1.0-incubating/modules/core/src/main/java/org/apache/fluo/core/worker/finder/hash/ScanTask.java#L146

IIRC, if notifications were found in a tablet during the last scan, then it will always be scanned during the next scan loop.  As notifications stop being found in a tablet, that tablet's next scan time doubles, up to fluo.implScanTask.maxSleep.

So it's possible that all of the notifications found are being processed quickly and then the workers are scanning for more.  The debug messages would show this.

There is also a minSleep time.  This property determines the minimum
amount of time it will sleep between scan loops, seems to default to 5
secs.  Could try increasing this.

Looking at the props, it seems the prop names for min and max sleep changed between 1.0.0 and 1.1.0.


>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 6:20 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with application
>
> On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.
>
> Oh I never saw you attachment, may not be able to attach stuff on mailing list.
>
> Its possible that what you are seeing is the workers scanning for notifications.  If you look in the workers logs do you see messages about scanning for notifications?  If so what do they look like?
>
> In 1.0.0 each worker scans all tablets in random order.  When it scans it has an iterator that uses hash+mod to select a subset of notifications.  The iterator also suppresses deleted notifications.
> So the selection and suppression by that iterator could explain the read vs returned.  It does exponential back off on tablets where it does not find data.  Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.
>
> In the beginning, would you have a lot of notifications?  If so I would expect a lot of scanning and then it should slow down once the workers get a list of notifications to process.
>
> In 1.1.0 the workers divide up the tablets (so workers no longer scan
> all tablets, groups of workers share groups of tablets).   If the
> table is splits after the workers start, it may take them a bit to execute the distributed algorithm that divys tablets among workers.
>
> Anyway the debug messages about scanning for notifications in the workers should provide some insight into this.
>
> If its not notification scanning, then it could be that the application is scanning over a lots of data that was deleted or something like that.
>
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Thursday, October 26, 2017 5:36 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>> application
>>
>> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hey Keith,
>>>
>>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s, but our scan rate is off the charts - approaching 6 million entries/s!).  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.   In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>>
>>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>>
>> Regarding the scan rate there are few things I Am curious about.
>>
>> Fluo workers scan for notifications in addition to the scanning done
>> by your apps.  I made some changes in 1.1.0 to reduce the amount of
>> scanning needed to find notifications, but this should not make much
>> of a difference on a small amount of nodes.  Details about this are in
>> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>>
>> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>>
>>
>>>
>>> Thanks,
>>> Caleb
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>>
>>> -----Original Message-----
>>> From: Keith Turner [mailto:keith@deenlo.com]
>>> Sent: Thursday, October 26, 2017 1:39 PM
>>> To: fluo-dev <de...@fluo.apache.org>
>>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>>> application
>>>
>>> Caleb
>>>
>>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>>
>>>  * Write ahead log sync settings (this can have huge performance
>>> implications)
>>>  * Files per tablet
>>>  * Tablet server cache sizes
>>>  * Accumulo data block sizes
>>>  * Tablet server client thread pool size
>>>
>>> For Fluo the following tune-able parameters are important.
>>>
>>>  * Commit memory (this determines how many transactions are held in
>>> memory while committing)
>>>  * Threads running transactions
>>>
>>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is?  For example is one tserver at 100% cpu while others are idle, this could be caused by uneven data access patterns.
>>>
>>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>>
>>> Are the 6 lookups in the transaction done sequentially?
>>>
>>> Keith
>>>
>>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>>> Hello Fluo Devs,
>>>>
>>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>>>
>>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>>
>>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>>
>>>> Sincerely,
>>>> Caleb Meier
>>>>
>>>> Caleb A. Meier, Ph.D.
>>>> Senior Software Engineer ♦ Analyst
>>>> Parsons Corporation
>>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>>> Office:  (703)797-3066
>>>> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦
>>>> www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.
>>>> c
>>>> om+>
>>>>

RE: fluo accumulo table tablet servers not keeping up with application

Posted by "Meier, Caleb" <Ca...@parsons.com>.
Hey Keith,

Our benchmark consists of a single query that is a join of two statement patterns (essentially patterns that incoming data matches, where a unit of data is a statement).  We are ingesting 50 pairs of statements a minute (100 total), where each statement in the pair matches one of the statement patterns.  Because the data is being ingested at a constant rate, the statement pattern Observers and Join Observers are constantly working.  One thing that is worth mentioning is that we changed the property fluo.implScanTask.maxSleep from 5 min to 10 seconds.  Based on the constant ingest rate, your comments below, and our low maxSleep, it seems like the workers would constantly be scanning for new notifications.  

> Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.

How does the maxSleep property work in conjunction with this?  If the max sleep time elapses before a worker processes half of the notifications, will it scan?   

Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Keith Turner [mailto:keith@deenlo.com] 
Sent: Thursday, October 26, 2017 6:20 PM
To: fluo-dev <de...@fluo.apache.org>
Subject: Re: fluo accumulo table tablet servers not keeping up with application

On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.

Oh I never saw you attachment, may not be able to attach stuff on mailing list.

Its possible that what you are seeing is the workers scanning for notifications.  If you look in the workers logs do you see messages about scanning for notifications?  If so what do they look like?

In 1.0.0 each worker scans all tablets in random order.  When it scans it has an iterator that uses hash+mod to select a subset of notifications.  The iterator also suppresses deleted notifications.
So the selection and suppression by that iterator could explain the read vs returned.  It does exponential back off on tablets where it does not find data.  Once a worker scans all tablets and finds a list of notifications, it does not scan again until half of those notifications are processed.

In the beginning, would you have a lot of notifications?  If so I would expect a lot of scanning and then it should slow down once the workers get a list of notifications to process.

In 1.1.0 the workers divide up the tablets (so workers no longer scan
all tablets, groups of workers share groups of tablets).   If the
table is splits after the workers start, it may take them a bit to execute the distributed algorithm that divys tablets among workers.

Anyway the debug messages about scanning for notifications in the workers should provide some insight into this.

If its not notification scanning, then it could be that the application is scanning over a lots of data that was deleted or something like that.

>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 5:36 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with 
> application
>
> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s, but our scan rate is off the charts - approaching 6 million entries/s!).  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.   In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>
>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>
> Regarding the scan rate there are few things I Am curious about.
>
> Fluo workers scan for notifications in addition to the scanning done 
> by your apps.  I made some changes in 1.1.0 to reduce the amount of 
> scanning needed to find notifications, but this should not make much 
> of a difference on a small amount of nodes.  Details about this are in
> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>
> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>
>
>>
>> Thanks,
>> Caleb
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Thursday, October 26, 2017 1:39 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with 
>> application
>>
>> Caleb
>>
>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>
>>  * Write ahead log sync settings (this can have huge performance
>> implications)
>>  * Files per tablet
>>  * Tablet server cache sizes
>>  * Accumulo data block sizes
>>  * Tablet server client thread pool size
>>
>> For Fluo the following tune-able parameters are important.
>>
>>  * Commit memory (this determines how many transactions are held in 
>> memory while committing)
>>  * Threads running transactions
>>
>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is?  For example is one tserver at 100% cpu while others are idle, this could be caused by uneven data access patterns.
>>
>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>
>> Are the 6 lookups in the transaction done sequentially?
>>
>> Keith
>>
>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hello Fluo Devs,
>>>
>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>>
>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>
>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>
>>> Sincerely,
>>> Caleb Meier
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦ 
>>> www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.
>>> c
>>> om+>
>>>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Oct 26, 2017 at 5:47 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.

Oh, I never saw your attachment; you may not be able to attach stuff on
the mailing list.

It's possible that what you are seeing is the workers scanning for
notifications.  If you look in the workers' logs, do you see messages
about scanning for notifications?  If so, what do they look like?

In 1.0.0 each worker scans all tablets in random order.  When it scans,
it uses an iterator that applies hash+mod to select a subset of
notifications.  The iterator also suppresses deleted notifications.
So the selection and suppression by that iterator could explain the
read vs returned numbers.  It does exponential backoff on tablets where
it does not find data.  Once a worker scans all tablets and finds a list
of notifications, it does not scan again until half of those
notifications are processed.
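
To make the hash+mod selection concrete, here is a generic sketch (not the
actual notification iterator): every worker's scan reads all the notification
entries in a tablet, but only the entries whose hash maps to that worker are
kept, which by itself can make "entries read" much larger than "entries
returned" on the monitor.

    class NotificationPartitioner {
      private final int numWorkers;
      private final int myWorkerIndex;

      NotificationPartitioner(int numWorkers, int myWorkerIndex) {
        this.numWorkers = numWorkers;
        this.myWorkerIndex = myWorkerIndex;
      }

      /** True if this worker should process the notification at the given row/column. */
      boolean accept(String row, String column) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod((row + column).hashCode(), numWorkers) == myWorkerIndex;
      }
    }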

In the beginning, would you have a lot of notifications?  If so, I
would expect a lot of scanning, and then it should slow down once the
workers get a list of notifications to process.

In 1.1.0 the workers divide up the tablets (so workers no longer scan
all tablets; groups of workers share groups of tablets).  If the
table splits after the workers start, it may take them a bit to
execute the distributed algorithm that divvies up tablets among workers.

Anyway the debug messages about scanning for notifications in the
workers should provide some insight into this.

If it's not notification scanning, then it could be that the
application is scanning over a lot of data that was deleted, or
something like that.

>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 5:36 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with application
>
> On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hey Keith,
>>
>> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s, but our scan rate is off the charts - approaching 6 million entries/s!).  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.   In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>>
>> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?
>
> Regarding the scan rate there are few things I Am curious about.
>
> Fluo workers scan for notifications in addition to the scanning done by your apps.  I made some changes in 1.1.0 to reduce the amount of scanning needed to find notifications, but this should not make much of a difference on a small amount of nodes.  Details about this are in
> 1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?
>
> Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.
>
>
>>
>> Thanks,
>> Caleb
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com ♦ www.parsons.com
>>
>> -----Original Message-----
>> From: Keith Turner [mailto:keith@deenlo.com]
>> Sent: Thursday, October 26, 2017 1:39 PM
>> To: fluo-dev <de...@fluo.apache.org>
>> Subject: Re: fluo accumulo table tablet servers not keeping up with
>> application
>>
>> Caleb
>>
>> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>>
>>  * Write ahead log sync settings (this can have huge performance
>> implications)
>>  * Files per tablet
>>  * Tablet server cache sizes
>>  * Accumulo data block sizes
>>  * Tablet server client thread pool size
>>
>> For Fluo the following tune-able parameters are important.
>>
>>  * Commit memory (this determines how many transactions are held in
>> memory while committing)
>>  * Threads running transactions
>>
>> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is?  For example is one tserver at 100% cpu while others are idle, this could be caused by uneven data access patterns.
>>
>> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>>
>> Are the 6 lookups in the transaction done sequentially?
>>
>> Keith
>>
>> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>>> Hello Fluo Devs,
>>>
>>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>>
>>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>>
>>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>>
>>> Sincerely,
>>> Caleb Meier
>>>
>>> Caleb A. Meier, Ph.D.
>>> Senior Software Engineer ♦ Analyst
>>> Parsons Corporation
>>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>>> Office:  (703)797-3066
>>> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦
>>> www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.c
>>> om+>
>>>

RE: fluo accumulo table tablet servers not keeping up with application

Posted by "Meier, Caleb" <Ca...@parsons.com>.
Hey Keith,

We'll rerun the benchmarks tomorrow and track the outstanding notifications.  We'll also see if compacting at some point during ingest helps with the scan rate.  Have you observed such high scan rates for such a small amount of data in any of your benchmarking?  What would account for the huge disparity in results read vs. results returned?  It seems like our scans are extremely inefficient for some reason.  Our tablet servers are becoming overwhelmed even before data gets flushed to disk.

Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Keith Turner [mailto:keith@deenlo.com] 
Sent: Thursday, October 26, 2017 5:36 PM
To: fluo-dev <de...@fluo.apache.org>
Subject: Re: fluo accumulo table tablet servers not keeping up with application

On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s, but our scan rate is off the charts - approaching 6 million entries/s!).  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.   In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>
> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?

Regarding the scan rate there are few things I Am curious about.

Fluo workers scan for notifications in addition to the scanning done by your apps.  I made some changes in 1.1.0 to reduce the amount of scanning needed to find notifications, but this should not make much of a difference on a small amount of nodes.  Details about this are in
1.1.0 release notes.  I am not sure what the best way is to determine how much of the scanning you are seeing is app vs notification finding.  Can you run the fluo wait command to see how many outstanding notifications there are?

Transactions leave a paper trail behind and compactions clean this up (Fluo has a garbage collection iterator).  This is why I asked what effect compacting the table had.  Compactions will also clean up deleted notifications.


>
> Thanks,
> Caleb
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 1:39 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with 
> application
>
> Caleb
>
> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>
>  * Write ahead log sync settings (this can have huge performance 
> implications)
>  * Files per tablet
>  * Tablet server cache sizes
>  * Accumulo data block sizes
>  * Tablet server client thread pool size
>
> For Fluo the following tune-able parameters are important.
>
>  * Commit memory (this determines how many transactions are held in 
> memory while committing)
>  * Threads running transactions
>
> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is?  For example is one tserver at 100% cpu while others are idle, this could be caused by uneven data access patterns.
>
> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>
> Are the 6 lookups in the transaction done sequentially?
>
> Keith
>
> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hello Fluo Devs,
>>
>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>
>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>
>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>
>> Sincerely,
>> Caleb Meier
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦ 
>> www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.c
>> om+>
>>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s, but our scan rate is off the charts - approaching 6 million entries/s!).  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.   In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.
>
> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?

Regarding the scan rate, there are a few things I am curious about.

Fluo workers scan for notifications in addition to the scanning done
by your apps.  I made some changes in 1.1.0 to reduce the amount of
scanning needed to find notifications, but this should not make much
of a difference on a small number of nodes.  Details about this are in
the 1.1.0 release notes.  I am not sure what the best way is to determine
how much of the scanning you are seeing is app vs. notification
finding.  Can you run the fluo wait command to see how many
outstanding notifications there are?

Transactions leave a paper trail behind and compactions clean this up
(Fluo has a garbage collection iterator).  This is why I asked what
effect compacting the table had.  Compactions will also clean up
deleted notifications.
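
For reference, a manual compaction of the Fluo table can be kicked off from the
Accumulo shell (the table name below is just a placeholder for whatever your
Fluo application's table is called); -w waits for the compaction to finish:

    compact -t <fluo table name> -w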


>
> Thanks,
> Caleb
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 1:39 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with application
>
> Caleb
>
> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>
>  * Write ahead log sync settings (this can have huge performance implications)
>  * Files per tablet
>  * Tablet server cache sizes
>  * Accumulo data block sizes
>  * Tablet server client thread pool size
>
> For Fluo the following tune-able parameters are important.
>
>  * Commit memory (this determines how many transactions are held in memory while committing)
>  * Threads running transactions
>
> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is?  For example is one tserver at 100% cpu while others are idle, this could be caused by uneven data access patterns.
>
> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>
> Are the 6 lookups in the transaction done sequentially?
>
> Keith
>
> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hello Fluo Devs,
>>
>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>
>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>
>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>
>> Sincerely,
>> Caleb Meier
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦
>> www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.com+>
>>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Oct 26, 2017 at 2:50 PM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hey Keith,
>
> Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s, but our scan rate is off the charts - approaching 6 million entries/s!).  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.   In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers do take turns being maxed out while others are idle.  We're considering possible sharding strategies.

Yeah, you need to do something to evenly spread the load or else the
application will not scale.  I have done two things for this in the
past: a hash prefix, and evenly spreading data sets across tservers.
Both of these approaches are in Fluo Recipes.

http://fluo.apache.org/docs/fluo-recipes/1.1.0-incubating/row-hasher/

The following is the documentation for evenly spreading data sets
across tservers; looking at it now, I realize it's not very good at
showing how to actually use the functionality.

http://fluo.apache.org/docs/fluo-recipes/1.1.0-incubating/table-optimization/
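
The hash prefix idea itself is simple; the row-hasher recipe linked above
packages it up properly, so treat the helper below as a generic illustration
only (not the recipe's API).

    class HashPrefix {
      // Prepend a short, fixed-width hash of the row so that lexicographically close rows
      // (for example URIs sharing a long common prefix) spread across tablets instead of
      // hammering one tablet server.
      static String prefixed(String row) {
        int bucket = Math.floorMod(row.hashCode(), 0x1000);  // 3 hex digits -> 4096 buckets
        return String.format("%03x:%s", bucket, row);
      }

      public static void main(String[] args) {
        System.out.println(prefixed("http://example.org/subject#1"));
      }
    }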


>
> Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?

Basically I was wondering what methods you are using to do the 6 gets.
The following step of the tour talks about what I was thinking about
when I asked that question.

http://fluo.apache.org/tour/multi-get/
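
In other words, something along the lines of the sketch below rather than six
separate get calls in a row.  This is only a sketch: the row and column names
are made up, and the exact batch get/gets signatures should be double-checked
against SnapshotBase for the Fluo version you are on.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import org.apache.fluo.api.client.TransactionBase;
    import org.apache.fluo.api.data.Column;
    import org.apache.fluo.api.data.RowColumn;

    class MetadataLookup {
      // Hypothetical metadata column; Rya's actual metadata layout will differ.
      static final Column META_COL = new Column("queryMetadata", "nodeId");

      static Map<RowColumn, String> fetch(TransactionBase tx, Collection<String> metadataRows) {
        List<RowColumn> rowCols = new ArrayList<>();
        for (String row : metadataRows) {
          rowCols.add(new RowColumn(row, META_COL));
        }
        // One batch call fetches everything together instead of paying the
        // round-trip latency once per lookup.
        return tx.gets(rowCols);
      }
    }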

>
> Thanks,
> Caleb
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com ♦ www.parsons.com
>
> -----Original Message-----
> From: Keith Turner [mailto:keith@deenlo.com]
> Sent: Thursday, October 26, 2017 1:39 PM
> To: fluo-dev <de...@fluo.apache.org>
> Subject: Re: fluo accumulo table tablet servers not keeping up with application
>
> Caleb
>
> What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.
>
>  * Write ahead log sync settings (this can have huge performance implications)
>  * Files per tablet
>  * Tablet server cache sizes
>  * Accumulo data block sizes
>  * Tablet server client thread pool size
>
> For Fluo the following tune-able parameters are important.
>
>  * Commit memory (this determines how many transactions are held in memory while committing)
>  * Threads running transactions
>
> What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is?  For example is one tserver at 100% cpu while others are idle, this could be caused by uneven data access patterns.
>
> Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.
>
> Are the 6 lookups in the transaction done sequentially?
>
> Keith
>
> On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
>> Hello Fluo Devs,
>>
>> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>>
>> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>>
>> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>>
>> Sincerely,
>> Caleb Meier
>>
>> Caleb A. Meier, Ph.D.
>> Senior Software Engineer ♦ Analyst
>> Parsons Corporation
>> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
>> Office:  (703)797-3066
>> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦
>> www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.com+>
>>

RE: fluo accumulo table tablet servers not keeping up with application

Posted by "Meier, Caleb" <Ca...@parsons.com>.
Hey Keith,

Thanks for the reply.  Regarding our benchmark, I've attached some screenshots of our Accumulo UI that were taken while the benchmark was running.  Basically, our ingest rate is pretty low (about 150 entries/s), but our scan rate is off the charts - approaching 6 million entries/s!  Also, notice the disparity between reads and returned in the Scan chart.  That disparity would suggest that we're possibly doing full table scans somewhere, which is strange given that all of our scans are RowColumn constrained.  Perhaps we are building our Scanner incorrectly.  In an effort to maximize the number of TabletServers, we split the Fluo table into 5MB tablets.  Also, the data is not well balanced -- the tablet servers take turns being maxed out while others are idle.  We're considering possible sharding strategies.

Given that our TabletServers are getting saturated so quickly for such a low ingest rate, it seems like we definitely need to cut down on the number of scans as a first line of attack to see what that buys us.  Then we'll look into tuning Accumulo and Fluo.  Does this seem like a reasonable approach to you?  Does the scan rate of our application strike you as extremely high?  When you look at the Rya Observers, can you pay attention to how we are building our scans to make sure that we're not inadvertently doing full table scans?  Also, what exactly do you mean by "are the 6 lookups in the transaction done sequentially"?

Thanks,
Caleb

Caleb A. Meier, Ph.D.
Senior Software Engineer ♦ Analyst
Parsons Corporation
1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
Office:  (703)797-3066
Caleb.Meier@Parsons.com ♦ www.parsons.com

-----Original Message-----
From: Keith Turner [mailto:keith@deenlo.com] 
Sent: Thursday, October 26, 2017 1:39 PM
To: fluo-dev <de...@fluo.apache.org>
Subject: Re: fluo accumulo table tablet servers not keeping up with application

Caleb

What if any tuning have you done?  The following tune-able Accumulo parameters impact performance.

 * Write ahead log sync settings (this can have huge performance implications)
 * Files per tablet
 * Tablet server cache sizes
 * Accumulo data block sizes
 * Tablet server client thread pool size

For Fluo the following tune-able parameters are important.

 * Commit memory (this determines how many transactions are held in memory while committing)
 * Threads running transactions

What does the load (CPU and memory) on the cluster look like?  I'm curious how even it is?  For example is one tserver at 100% cpu while others are idle, this could be caused by uneven data access patterns.

Would it be possible for me to see or run the benchmark?  I am going to take a look at the Rya observers, let me know if there is anything in particular I should look at.

Are the 6 lookups in the transaction done sequentially?

Keith

On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hello Fluo Devs,
>
> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>
> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>
> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>
> Sincerely,
> Caleb Meier
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦ 
> www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.com+>
>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
Caleb

What tuning, if any, have you done?  The following tunable Accumulo
parameters impact performance.

 * Write ahead log sync settings (this can have huge performance implications)
 * Files per tablet
 * Tablet server cache sizes
 * Accumulo data block sizes
 * Tablet server client thread pool size

For Fluo, the following tunable parameters are important.

 * Commit memory (this determines how many transactions are held in
memory while committing)
 * Threads running transactions
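
For the Accumulo knobs above, the usual properties are (roughly, and worth
double-checking against the docs for your Accumulo version): table.durability
for write-ahead log sync behavior, table.file.max for files per tablet,
tserver.cache.data.size and tserver.cache.index.size for the caches,
table.file.compress.blocksize for data block size, and
tserver.server.threads.minimum for the client thread pool.  They can be set
from the Accumulo shell; the values below are placeholders only, e.g.:

    config -t <fluo table> -s table.durability=flush
    config -t <fluo table> -s table.file.max=15
    config -s tserver.cache.data.size=1G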

What does the load (CPU and memory) on the cluster look like?  I'm
curious how even it is.  For example, if one tserver is at 100% CPU while
others are idle, this could be caused by uneven data access patterns.

Would it be possible for me to see or run the benchmark?  I am going
to take a look at the Rya observers; let me know if there is anything
in particular I should look at.

Are the 6 lookups in the transaction done sequentially?

Keith

On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hello Fluo Devs,
>
> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>
> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?
>
> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>
> Sincerely,
> Caleb Meier
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦ www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.com+>
>

Re: fluo accumulo table tablet servers not keeping up with application

Posted by Keith Turner <ke...@deenlo.com>.
On Thu, Oct 26, 2017 at 11:34 AM, Meier, Caleb <Ca...@parsons.com> wrote:
> Hello Fluo Devs,
>
> We have implemented an incremental query evaluation service for Apache Rya that leverages Apache Fluo.  We’ve been doing some benchmarking and we’ve found that the Accumulo Tablet servers for the Fluo table are falling behind pretty quickly for our application.  We’ve tried splitting the Accumulo Table so that we have more Tablet Servers, but that doesn’t really buy us too much.  Our application is fairly scan intensive—we have a metadata framework in place that allows us to pass query results through the query tree, and each observer needs to look up metadata to determine which observer to route its data to after processing.  To give you some indication of our scan rates, our Join Observer does about 6 lookups, builds a scanner to do one RowColumn restricted scan, and then does many writes.  So an obvious way to alleviate the burden on the TableServer is to cut down on the number of scans.
>
> One approach that we are considering is to import all of our metadata into memory.  Essentially, each Observer would need access to an in memory metadata cache.  We’re considering using the Observer context, but this cache needs to be mutable because a user needs to be able to register new queries.  Is it possible to update the context, or would we need to restart the application to do that?  I guess other options would be to create a static cache for each Observer that stores the metadata, or to store it in Zookeeper.  Have any of you devs ever had create a solution to share state between Observers that doesn’t rely on the Fluo table?


If you did want to cache something between observers, that would
require statics in 1.0.  In 1.1.0 Fluo introduced a new API for
creating observers called the ObserverProvider.  Using this new API,
statics are not required: the cache can be created in the
ObserverProvider and passed to the Observers it registers.  The 1.1.0
release notes give an overview of the new API, and a rough sketch
follows the link below.

http://fluo.apache.org/release/fluo-1.1.0-incubating/#new-api-for-configuring-observers
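
A rough sketch of what that could look like (not Rya's actual code;
the cache type and column names are placeholders, and the
registration calls follow the release notes example):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  import org.apache.fluo.api.data.Column;
  import org.apache.fluo.api.observer.Observer.NotificationType;
  import org.apache.fluo.api.observer.ObserverProvider;

  public class QueryObserverProvider implements ObserverProvider {

    @Override
    public void provide(Registry obsRegistry, Context ctx) {
      // Built once per worker process and captured by every observer
      // registered below; no static fields needed.
      Map<String, String> metadataCache = new ConcurrentHashMap<>();

      obsRegistry.forColumn(new Column("join", "notify"), NotificationType.STRONG)
          .useObserver((tx, row, col) -> {
            // consult the shared, mutable cache instead of re-reading
            // the metadata rows from the Fluo table on every notification
            String metadata = metadataCache.get(row.toString());
            // ... use metadata to decide how to process and route the
            // join results ...
          });
    }
  }

When a user registers a new query, whatever does the registration
could also update or invalidate the relevant cache entries, or the
cache could use a short expiration so workers pick up new metadata
without a restart.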

>
> In addition to cutting down on the scan rate, are there any other approaches that you would consider?  I assume that the problem lies primarily with how we’ve implemented our application, but I’m also wondering if there is anything we can do from a configuration point of view to reduce the burden on the Tablet servers.  Would reducing the number of workers/worker threads to cut down on the number of times a single observation is processed be helpful?  It seems like this approach would cut out some redundant scans as well, but it might be more of a second order optimization. In general, any insight that you might have on this problem would be greatly appreciated.
>
> Sincerely,
> Caleb Meier
>
> Caleb A. Meier, Ph.D.
> Senior Software Engineer ♦ Analyst
> Parsons Corporation
> 1911 N. Fort Myer Drive, Suite 800 ♦ Arlington, VA 22209
> Office:  (703)797-3066
> Caleb.Meier@Parsons.com<ma...@Parsons.com> ♦ www.parsons.com<https://webportal.parsons.com/,DanaInfo=www.parsons.com+>
>