Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2019/04/21 19:25:36 UTC

[GitHub] [incubator-superset] betodealmeida opened a new issue #7340: Caching charts in Superset

URL: https://github.com/apache/incubator-superset/issues/7340
 
 
   ## Summary
   
   This document describes the strategies for caching responses and invalidating them in Superset. They’re currently implemented in the following PRs:
   
   - https://github.com/apache/incubator-superset/pull/7032
   - https://github.com/apache-superset/superset-ui/pull/119
   - https://github.com/apache/incubator-superset/pull/7255
   - https://github.com/apache/incubator-superset/pull/7319 (not merged)
   - https://github.com/apache-superset/superset-ui/pull/137 (not merged)
   
   I’m sharing this since it seems we didn’t have enough time to discuss the changes in the first PR.
   
   ## Terminology
   
   - **Native cache** refers to the cache handled automatically by the browser. It cannot be accessed programmatically, so https://github.com/apache/incubator-superset/pull/7319 bypasses it using the “no-cache” control, delegating cache management to `SupersetClient` in https://github.com/apache-superset/superset-ui/pull/137.
   - **Cache API** is a new API that allows programmatic access to the browser cache. It allows responses to be inspected and invalidated from JavaScript. We also call this the **client cache**.
   - **Server cache** refers to a new server-side cache introduced in https://github.com/apache/incubator-superset/pull/7032 which saves response objects, keyed by request URL. It’s used to cache GET requests.
   - **Dataframe cache** refers to a server-side cache which stores dataframes, keyed by the query object. It’s used for both POST and GET requests.
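
To make the difference between the two server-side caches concrete, here is a minimal sketch in Python. The function names (`server_cache_key`, `dataframe_cache_key`), the hashing scheme and the example field names are assumptions for illustration, not the code from the PRs. The only point taken from the definitions above is that one cache is keyed by request URL and the other by the query object.

```python
import hashlib
import json


def server_cache_key(url: str) -> str:
    """Key for the server cache: derived from the request URL only."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def dataframe_cache_key(query_obj: dict) -> str:
    """Key for the dataframe cache: derived from the query object.

    Sorting the keys makes the hash independent of dict ordering.
    """
    payload = json.dumps(query_obj, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical URLs: the same chart with and without dashboard filters.
url_plain = 'explore_json?form_data={"slice_id":1}'
url_filtered = 'explore_json?form_data={"slice_id":1,"extra_filters":[{"col":"a"}]}'

# Hypothetical query objects with the same content in a different order.
query_a = {"datasource": "1__table", "metrics": ["count"], "filters": []}
query_b = {"filters": [], "metrics": ["count"], "datasource": "1__table"}

# Different URLs get separate server cache entries...
print(server_cache_key(url_plain) == server_cache_key(url_filtered))  # False
# ...while identical query objects share one dataframe cache entry.
print(dataframe_cache_key(query_a) == dataframe_cache_key(query_b))  # True
```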
   
   ## GET vs. POST
   
   1. When a user visits a chart or a dashboard, a GET request is made (technically this happens when a `Chart` component is mounted). The GET request has the slice id, the visualization type and, in the case of dashboards, any additional filters that the dashboard might apply.
     - This allows the charts to be cached by the browser, **with the same duration as the server cache and the dataframe cache**.
       - If the server cache hasn’t been invalidated, this saves a request. For dashboards, this should happen most of the time.
       - If the server cache has been invalidated, this shows stale data (stored in the client cache), forcing the user to refresh the chart/dash in order to get new data.
     - This also allows the browser to perform conditional requests, in case the client cache has expired but the data hasn’t changed. This still performs a request, but if the data hasn’t changed only headers are returned.
   2. When a user clicks “run query” in a chart, or “force refresh” in a dashboard, the chart payload is requested with a POST request.
     - This will invalidate the client cache: **all stored responses that reference the chart are invalidated**, including charts with extra filters from dashboards. This is an aggressive strategy, **erring on the side of caution**, since in theory we would only need to do this when a chart is saved or when a dashboard is force refreshed.
     - This will also invalidate the server cache, **but only for that specific URL**. This means that when a user visits a dashboard **with extra filters** after modifying a chart, they will see the old chart, requiring a manual refresh.
   3. When a user changes a filter box in a dashboard, GET requests are performed. This makes it almost instantaneous to change between values back and forth in the filter box.
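
The asymmetry between the two invalidation strategies in (2) can be modelled in a few lines of Python. This is a sketch under assumptions: both caches are plain dicts keyed by URL, and the helper names (`chart_url`, `slice_id_of`, `invalidate_client_cache`, `invalidate_server_cache`) and the `extra_filters` field are hypothetical. The real client cache lives in the browser and is managed from JavaScript.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def chart_url(form_data: dict) -> str:
    """Build a GET URL for a chart payload (hypothetical URL layout)."""
    return "explore_json?" + urlencode({"form_data": json.dumps(form_data)})


def slice_id_of(url: str):
    """Extract slice_id from the form_data query parameter, if present."""
    params = parse_qs(urlsplit(url).query)
    form_data = json.loads(params.get("form_data", ["{}"])[0])
    return form_data.get("slice_id")


def invalidate_client_cache(client_cache: dict, slice_id: int) -> None:
    """Aggressive: drop every cached response that references the chart."""
    for url in [u for u in client_cache if slice_id_of(u) == slice_id]:
        del client_cache[url]


def invalidate_server_cache(server_cache: dict, url: str) -> None:
    """Narrow: drop only the entry stored under this exact URL."""
    server_cache.pop(url, None)


plain = chart_url({"slice_id": 1})
filtered = chart_url(
    {"slice_id": 1, "extra_filters": [{"col": "region", "val": "US"}]}
)
other = chart_url({"slice_id": 2})

client_cache = {plain: "r1", filtered: "r2", other: "r3"}
server_cache = {plain: "r1", filtered: "r2", other: "r3"}

# A POST for chart 1 (e.g. "run query") triggers both invalidations.
invalidate_client_cache(client_cache, 1)
invalidate_server_cache(server_cache, plain)

print(sorted(client_cache.values()))  # ['r3']
print(sorted(server_cache.values()))  # ['r2', 'r3']
```

The filtered response `r2` surviving in the server cache is the stale entry that a dashboard with extra filters would still be served.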
   
   ## Scenarios
   
   ### Single user in Explore view
   1. User creates a new chart
     - Browser does POST request to `explore_json` with form data
     - Server caches the dataframe
   2. User clicks “run query” a few times, modifying the form data until they’re happy
     - Browser does POST requests to `explore_json` for each click
     - Server checks dataframe cache
       - If the query object hasn’t changed the cache is reused
       - Otherwise the query is run, and the dataframe is cached
   3. User saves chart (id=1)
     - Browser reloads page
     - Browser does GET request to `explore_json?form_data={"slice_id":1}`
     - Response has ETag and Expires headers, browser caches it using the Cache API
     - Server caches the response
   4. User visits the chart before the cache expiration
     - Browser uses Cache API and determines cached response is valid
     - Browser reuses cached response
     - **No HTTP request is made**
   5. User visits the chart after the cache expiration
     - Browser uses Cache API and **finds an expired cached response**
     - Browser does conditional request, sending the hash of the cached response
       - If the server returns a 304, the cached response is used, and **no body is transferred, only headers**
       - If the server returns a 200, the response is cached and used
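
Steps 3 to 5 can be simulated end to end in Python. This is only a model of the flow described above, not the `SupersetClient` implementation: the names (`CachedResponse`, `server_get`, `client_get`), the use of a SHA-256 hash as the ETag and the 60-second lifetime are all assumptions, and `current_body` stands in for whatever the server would compute at that moment.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CachedResponse:
    body: bytes
    etag: str
    expires: float  # time after which the entry must be revalidated


def server_get(current_body: bytes, if_none_match, now: float, ttl: float = 60.0):
    """Return (status, body, etag, expires); the body is empty on a 304."""
    etag = hashlib.sha256(current_body).hexdigest()
    if if_none_match == etag:
        return 304, b"", etag, now + ttl
    return 200, current_body, etag, now + ttl


def client_get(cache: dict, url: str, current_body: bytes, now: float):
    """Return (body, status); status is None when no request was made."""
    cached = cache.get(url)
    if cached is not None and now < cached.expires:
        return cached.body, None  # still fresh: no HTTP request
    # Missing or expired: conditional request carrying the cached hash.
    status, body, etag, expires = server_get(
        current_body, cached.etag if cached else None, now
    )
    if status == 304:
        body = cached.body  # only headers came back, reuse the stored body
    cache[url] = CachedResponse(body, etag, expires)
    return body, status


cache = {}
url = 'explore_json?form_data={"slice_id":1}'
print(client_get(cache, url, b"v1", now=0))    # (b'v1', 200)   first load
print(client_get(cache, url, b"v1", now=30))   # (b'v1', None)  before expiration
print(client_get(cache, url, b"v1", now=90))   # (b'v1', 304)   expired, unchanged
print(client_get(cache, url, b"v2", now=200))  # (b'v2', 200)   expired, changed
```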
   
   ### Single user Explore/Dashboard interaction
   1. User creates a new chart, saves it
   2. User creates dashboard with chart
   3. User visits the dashboard before cache expiration
     - Browser uses Cache API and determines cached response is valid
     - Browser reuses cached response
     - **No HTTP request is made**
   4. User clicks “force refresh”, either in chart or dashboard
     - Browser does POST request
     - Browser invalidates all cached responses that reference the chart
     - Server invalidates cached responses associated with the dashboard (key is based on slice id and extra filters)
     - Fresh data is served
   5. User visits dashboard again
     - Browser cache is empty from previous step
     - Browser does GET requests
     - Server side cache is empty from previous step
     - Dataframe cache is empty from previous step
     - Server computes results and stores them in the dataframe cache and server cache
     - Browser caches responses
   6. User visits dashboard once more before the cache expiration
     - Responses in the client cache are reused
     - **No HTTP requests are made**
   7. User visits dashboard after cache expiration
     - Conditional requests are done, trying to reuse client cache
   
   ### Multiple user Explore/Dashboard interaction
   
   1. User **A** visits dashboard **with extra filters** every day
     - If client cache is not expired, cached response is reused
     - Otherwise browser does conditional GET requests
       - Requests might hit server cache or dataframe cache, depending on the timing
   2. User **B** modifies a chart in the dashboard, using the explore view
     - The client cache is invalidated only in **B**’s browser
     - The server cache is invalidated only for the slice id (but not cached responses that have the extra filters)
     - The dataframe cache is invalidated
   3. User **A** visits the dashboard again, before cache expiration
     - The client cache wasn’t invalidated, so the response is reused, **showing stale data**
     - The user has to click “force refresh” in order to see the new data
   
   Note that we can overcome the staleness problem described in (3) by making the dashboard component force refresh slices that were changed after the response was cached in the browser. The payload received by the dashboard has a `changed_on` attribute for each slice, and the responses in the client cache have the timestamp when the requests were made. @graceguo-supercat, I think this addresses your concern?
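
The comparison proposed above could look roughly like the following Python sketch. The function name `slices_to_force_refresh` and the shape of its inputs are assumptions made for illustration. Only the idea of comparing each slice's `changed_on` with the time its response was cached comes from the proposal, and the sketch considers the client cache only.

```python
from datetime import datetime, timezone


def slices_to_force_refresh(slices, cached_at):
    """Return ids of slices that changed after their response was cached.

    slices:    iterable of dicts with "slice_id" and "changed_on" (datetime)
    cached_at: mapping of slice_id -> datetime the cached request was made
    """
    stale = []
    for s in slices:
        fetched = cached_at.get(s["slice_id"])
        # Slices with no cached response are fetched normally anyway.
        if fetched is not None and s["changed_on"] > fetched:
            stale.append(s["slice_id"])
    return stale


utc = timezone.utc
dashboard_slices = [
    {"slice_id": 1, "changed_on": datetime(2019, 4, 21, 12, 0, tzinfo=utc)},
    {"slice_id": 2, "changed_on": datetime(2019, 4, 19, 9, 0, tzinfo=utc)},
    {"slice_id": 3, "changed_on": datetime(2019, 4, 21, 12, 0, tzinfo=utc)},
]
client_cached_at = {
    1: datetime(2019, 4, 20, 8, 0, tzinfo=utc),  # cached before B's change
    2: datetime(2019, 4, 20, 8, 0, tzinfo=utc),  # cached after last change
}

print(slices_to_force_refresh(dashboard_slices, client_cached_at))  # [1]
```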
   
   ## Performance
   
   I measured dashboard loading times before and after the GET requests were introduced, using the top 10 dashboards at Lyft. The average improvement was about 21%, and the biggest improvement observed was about 62% (Dashboard 3). Two dashboards (5 and 9) were slightly slower.
   
   ### Dashboard 1
   
   - Loading times (seconds):
     - Before: 2.32, 2.3, 2.1, 2.23, 2.52, 2.33, 3.09, 2.31, 2.24, 2.3
     - After:  2.06, 1.97, 2.76, 2.48, 2.31, 2.55, 1.93, 2.17, 2.17, 2.27
   - Improvement: 3.07%
   
   ### Dashboard 2
   
   - Loading times (seconds):
     - Before: 2.47, 2.56, 2.68, 2.48, 2.64, 2.71, 2.55, 2.49, 2.79, 2.62
     - After:  1.68, 1.57, 1.45, 1.42, 1.44, 1.52, 1.52, 1.51, 1.67, 1.68
   - Improvement: 41.12%
   
   ### Dashboard 3
   
   - Loading times (seconds):
     - Before: 40.19, 23.41, 1.96, 1.91, 1.96, 2.44, 2.5, 2.01, 2.18, 1.85
     - After:  2.41, 1.65, 1.72, 1.99, 1.66, 1.92, 1.84, 1.73, 1.69, 1.89
   - Improvement: 62.37%
   
   ### Dashboard 4
   
   - Loading times (seconds):
     - Before: 8.97, 1.72, 1.69, 1.61, 1.47, 1.75, 1.54, 1.71, 1.42, 5.83
     - After:  1.79, 1.46, 1.62, 1.46, 1.62, 1.75, 1.55, 1.45, 1.74, 1.61
   - Improvement: 26.04%
   
   ### Dashboard 5
   
   - Loading times (seconds):
     - Before: 1.84, 1.99, 2.17, 1.89, 1.88, 1.87, 1.86, 2.07, 2.08, 1.77
     - After:  2.02, 2.07, 1.9, 1.95, 1.71, 2.09, 1.94, 2.03, 1.83, 1.91
   - Improvement: **-1.10%**
   
   ### Dashboard 6
   
   - Loading times (seconds):
     - Before: 45.31, 6.21, 6.11, 6.57, 5.49, 5.66, 5.94, 6.86, 6.67, 6.79
     - After:  7.7, 6.51, 5.12, 5.03, 5.24, 5.8, 3.81, 4.0, 5.17, 5.1
   - Improvement: 17.40%
   
   ### Dashboard 7
   
   - Loading times (seconds):
     - Before: 5.1, 4.71, 5.02, 5.07, 4.82, 5.17, 4.36, 5.15, 4.79, 4.83
     - After:  1.86, 1.98, 2.03, 2.05, 2.16, 2.05, 2.14, 2.01, 2.01, 2.21
   - Improvement: 58.39%
   
   ### Dashboard 8
   
   - Loading times (seconds):
     - Before: 1.76, 1.83, 1.95, 1.72, 1.63, 1.72, 2.09, 1.84, 1.78, 1.77
     - After:  1.93, 1.57, 1.55, 1.59, 1.91, 1.59, 1.7, 1.57, 1.61, 1.79
   - Improvement: 7.24%
   
   ### Dashboard 9
   
   - Loading times (seconds):
     - Before: 4.41, 2.8, 1.89, 2.1, 1.84, 2.01, 1.85, 1.89, 1.85, 1.99
     - After:  1.77, 2.78, 1.77, 2.3, 2.01, 1.88, 2.18, 2.36, 2.0, 2.22
   - Improvement: **-4.31%**
   
   ### Dashboard 10
   
   - Loading times (seconds):
     - Before: 1.74, 1.49, 1.93, 1.53, 1.46, 1.47, 1.6, 1.53, 1.47, 1.39
     - After:  2.73, 1.56, 1.41, 1.52, 1.5, 1.33, 1.38, 1.35, 1.51, 1.59
   - Improvement: 3.82%
   
   Note: the improvement for each dashboard was computed by dropping the highest and lowest loading times from each series and comparing the averages of the remaining values.
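
For reference, here is one way to read that calculation in Python, assuming a single value is dropped at each end and the improvement is the relative reduction of the trimmed average. The function names are mine. With the Dashboard 7 measurements this reading gives the figure reported above.

```python
def trimmed_mean(values):
    """Average after dropping one highest and one lowest value."""
    trimmed = sorted(values)[1:-1]
    return sum(trimmed) / len(trimmed)


def improvement(before, after):
    """Relative reduction in trimmed-mean loading time, in percent."""
    b, a = trimmed_mean(before), trimmed_mean(after)
    return (b - a) / b * 100


# Dashboard 7 loading times (seconds), copied from the list above.
before = [5.1, 4.71, 5.02, 5.07, 4.82, 5.17, 4.36, 5.15, 4.79, 4.83]
after = [1.86, 1.98, 2.03, 2.05, 2.16, 2.05, 2.14, 2.01, 2.01, 2.21]

print(f"{improvement(before, after):.2f}%")  # 58.39%
```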
