Posted to dev@phoenix.apache.org by "Marcell Ortutay (JIRA)" <ji...@apache.org> on 2018/03/21 22:13:13 UTC
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

    [ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408666#comment-16408666 ] 

Marcell Ortutay commented on PHOENIX-4666:
------------------------------------------

As I mentioned above, we’re working on a design proposal for this internally at 23andMe, and there’s one big decision that I wanted to get feedback on.

There is currently a “server cache” that is used by the hash join process in Phoenix. Hash join tables are broadcast to all region servers that need them, and the hash joining happens via a coprocessor. This cache is deleted after the query ends.

My first thought for a persistent cache was to re-use the server cache, extend the TTL, and change the key (“cacheId”) generation. I implemented this as a hacky proof-of-concept and it worked quite well: performance was much improved.
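To illustrate what changing the key generation could mean, here is a minimal sketch that derives the key deterministically from the subquery text, so that two queries containing the same subquery map to the same cache entry. The class and method names are hypothetical and are not part of Phoenix, and hashing the raw SQL text is an assumption made for brevity. A real implementation would more likely hash a canonical form of the compiled subquery plus anything else that affects its result.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Phoenix: derives a stable cache key from
// the subquery text so that repeated queries can find the same cache entry.
final class SubqueryCacheKey {
    private SubqueryCacheKey() {}

    static String forSubquery(String subquerySql) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            // Assumption: the exact (trimmed) text identifies the subquery.
            byte[] hash = digest.digest(subquerySql.trim().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

With a key like this, the same subquery text always yields the same key, which is what allows a later query to hit an entry populated by an earlier one.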

However, I’m wondering if a separate cache makes more sense. The current server cache has a different use case than a persistent cache, and as such it may be a good idea to separate the two.

Some ways in which they are different:

- A persistent cache performs eviction when there is no space left. The server cache raises an exception, and the user must do a merge sort join instead.

- Users may want to configure the two differently, e.g. allocate more space for a persistent cache than for the server cache, and set a higher TTL

- The server cache data must be available on all region servers doing the hash join. In contrast, the persistent cache only needs one copy of the data across the system (i.e. across all region servers), which can be fetched by a region server when it is needed. This would be more space-efficient, but would result in more network transfer.

- You could in theory have a pluggable system for the persistent cache, e.g. use memcached or something similar
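As a rough sketch of the eviction and pluggability points above, the following shows a small cache interface with an in-memory implementation that evicts least-recently-used entries once a byte budget is exceeded, instead of raising an exception. The interface, class names, and the LRU policy are all assumptions for illustration; none of this is an existing Phoenix API.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface, not a Phoenix API: a pluggable backend could be
// in-process memory (as below) or an external store.
interface PersistentSubqueryCache {
    byte[] get(String key);
    void put(String key, byte[] value);
}

// In-memory backend that evicts least-recently-used entries when full.
final class LruSubqueryCache implements PersistentSubqueryCache {
    private final long maxBytes;
    private long usedBytes = 0;
    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, byte[]> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    LruSubqueryCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    @Override
    public synchronized byte[] get(String key) {
        return entries.get(key); // also marks the entry as recently used
    }

    @Override
    public synchronized void put(String key, byte[] value) {
        if (value.length > maxBytes) {
            return; // can never fit; skip caching rather than fail the query
        }
        byte[] old = entries.put(key, value);
        if (old != null) {
            usedBytes -= old.length;
        }
        usedBytes += value.length;
        // Evict from the least recently used end until within budget.
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (usedBytes > maxBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;
            it.remove();
        }
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

Keeping callers behind an interface is what would let a memcached-style backend be swapped in later without touching the query path.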

That said, there are advantages to keeping it all in the server cache:

- Simpler implementation, does not add a new system to Phoenix

- Faster in the case that you get a cache hit, since there is no network transfer involved

Would love to get some feedback / opinions on this, thanks!

> Add a subquery cache that persists beyond the life of a query
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4666
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4666
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Marcell Ortutay
>            Priority: Major
>
> The user list thread for additional context is here: [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> ----
> A Phoenix query may contain expensive subqueries, and moreover those expensive subqueries may be used across multiple different queries. While whole result caching is possible at the application level, it is not possible to cache subresults in the application. This can cause bad performance for queries in which the subquery is the most expensive part of the query, and the application is powerless to do anything at the query level. It would be good if Phoenix provided a way to cache subquery results, as it would provide a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result ON table1.id_1 = expensive_result.id_1 AND table1.id_1 = {id}
> In this case, the subquery "expensive_result" is expensive to compute, but it doesn't change between queries. The rest of the query does because of the {id} parameter. This means the application can't cache the query as a whole, but it would be good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor-based "server cache", but the data in this "cache" is not persisted across queries. It is deleted after a TTL expires (30 seconds by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to provide a patch with some guidance from Phoenix maintainers. We are currently putting together a design document for a solution, and we'll post it to this Jira ticket for review in a few days.


