Posted to user@drill.apache.org by Michael Shtelma <ms...@gmail.com> on 2017/05/10 16:16:30 UTC

In-memory cache in Drill

Hi all,

Is there any way to cache the data that was loaded from the actual
storage plugin in Drill?
As far as I understand, when a query is executed, the data is first
loaded from the storage plugin and handled by the format plugin. After
that, the data is stored in Drill's internal vectorized representation and
the query is executed. Is that correct? I am wondering whether there is a
way to store these data vectors somewhere, so that they do not have to
be loaded from the actual storage for each query. Spark does something
like that by storing data frames in off-heap storage.

Regards,
Michael

Re: In-memory cache in Drill

Posted by Kunal Khatua <kk...@mapr.com>.

If the connection pool is managed by your app, then yes. The application needs to do the cleanup explicitly when a connection is returned to the pool.
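
A minimal sketch of that explicit cleanup, assuming a DB-API style connection object that offers cursor() and execute(). The names PooledDrillSession, RecordingConnection, recent_orders, and orders_source are hypothetical. RecordingConnection only records the SQL so that the sketch runs without a Drill cluster, and DROP TABLE IF EXISTS is used on the assumption that your Drill version accepts it for temporary tables.

```python
class RecordingConnection:
    """Stand-in for a real connection: records SQL instead of sending it to Drill."""

    def __init__(self):
        self.executed = []

    def cursor(self):
        return self

    def execute(self, sql):
        self.executed.append(sql)


class PooledDrillSession:
    """Wraps a connection and remembers the temporary tables created through it."""

    def __init__(self, connection):
        self._connection = connection
        self._temp_tables = []

    def create_temp_table(self, name, select_sql):
        cursor = self._connection.cursor()
        cursor.execute(f"CREATE TEMPORARY TABLE {name} AS {select_sql}")
        self._temp_tables.append(name)

    def release(self):
        """Call this just before handing the connection back to the pool."""
        cursor = self._connection.cursor()
        # Drop the most recently created table first until none are left.
        while self._temp_tables:
            name = self._temp_tables.pop()
            cursor.execute(f"DROP TABLE IF EXISTS {name}")


conn = RecordingConnection()
session = PooledDrillSession(conn)
session.create_temp_table("recent_orders", "SELECT * FROM orders_source")
session.release()
```

The point of the wrapper is that the pooled session stays open, so the tables would otherwise stay alive for the next borrower of the same connection.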

Re: In-memory cache in Drill

Posted by Rahul Raj <ra...@option3consulting.com>.

The documentation says a temporary table does not outlive its session.
What happens when Drill connections are wrapped in a connection pool?
Should we drop the tables after each query in this case?

Regards,
Rahul

Re: In-memory cache in Drill

Posted by Padma Penumarthy <pp...@mapr.com>.

I am not sure about data. But caching metadata, which is much smaller than the actual data,
in memory would help cut down planning time.
For example, for Parquet, the metadata cache files are read from disk every time a query is run.
Also, checking modification times (in itself an expensive file system operation)
for every query, to make sure the metadata cache is not stale, adds overhead.
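
To make the trade-off concrete, here is a small sketch of an in-memory metadata cache that re-reads a file only when its modification time changes. It is not Drill's implementation; MetadataCache and its loader argument are hypothetical names. Note that it still pays for one stat call per lookup, which is the overhead mentioned above; it only saves the repeated read and parse.

```python
import os


class MetadataCache:
    """Keeps parsed metadata in memory and reloads it only when the file changes."""

    def __init__(self, loader):
        self._loader = loader    # function: path -> parsed metadata
        self._entries = {}       # path -> (mtime_ns, metadata)
        self.loads = 0           # how many times the loader actually ran

    def get(self, path):
        # The staleness check: one file system call per lookup.
        mtime = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]
        metadata = self._loader(path)
        self.loads += 1
        self._entries[path] = (mtime, metadata)
        return metadata
```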

Thanks,
Padma


Re: In-memory cache in Drill

Posted by Paul Rogers <pr...@mapr.com>.

Hi Michael,

Caching can help, depending on what is cached. The Hive plugin caches schema information to avoid hitting Hive for each query that needs the schema.

If your data is small, then you can cache. Maybe you have a file that maps country codes to countries: you’d have 200+ entries that are used over and over. But if your data is small and comes from a file, then it is likely that your OS or file system already caches the data for you. It still needs to be copied from the file into vectors, but that is a low cost for a small table.

Caching data in heap memory is fairly safe (if the data is small): the memory will be reclaimed eventually via the normal Java mechanisms. You’d have to work out a cache invalidation strategy and come up with a modified storage plugin that is cache-aware, which is not a trivial task.
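
As an illustration of one possible invalidation strategy for a small lookup table, the sketch below reloads the data once a time-to-live has expired. Time-based expiry is an assumption chosen for simplicity, not something Drill provides, and TtlLookupCache, loader, and ttl_seconds are hypothetical names.

```python
import time


class TtlLookupCache:
    """Holds one small lookup table in memory and reloads it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader        # function with no arguments returning the table
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without waiting
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            self._value = self._loader()
            self._loaded_at = now
        return self._value

    def invalidate(self):
        """Force the next get() to reload, for example after the source is updated."""
        self._loaded_at = None
```

A time-to-live accepts bounded staleness in exchange for never checking the source on the hot path; whether that is acceptable depends on how often the source changes.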

Caching data in direct memory, across queries, is a whole new area of exploration. You would have to learn how Drill’s reference counting works, a bit of a project that would need a huge benefit to justify the cost.

If the data is large, then it is doubtful that caching will help. If the caching simply prevents rerunning, say, a JDBC query, then the suggested temp table route would be fine. Or, move the work out of Drill: have a periodic batch job that does an ETL from the original system to a file in your file system, then let the file system caching work for you.
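
The batch-job route could look roughly like the sketch below, which writes a snapshot to a CSV file and swaps it into place so that a reader never sees a half-written file. It assumes a local file system and a single target file; write_snapshot and its parameters are hypothetical, and the extraction from the original system is left out.

```python
import csv
import os
import tempfile


def write_snapshot(rows, header, target_path):
    """Write rows to a temporary CSV file, then rename it over target_path."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temporary file in the same directory so the rename stays
    # on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        # Replace the old snapshot in one step.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Clean up the partial file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A scheduler such as cron would call this periodically with freshly extracted rows; the query engine then reads the snapshot file instead of the original system.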

Further, Drill tries to optimize this case: Drill will determine whether it is faster to read the file once, on the node with the data, and ship the data to all the nodes that need it, or to do a remote read on all nodes (which makes more sense for small files). Your caching strategy would have to be aware of what the planner is doing to avoid working at cross-purposes.

What is the use case you are trying to optimize?

Thanks,

- Paul

Re: In-memory cache in Drill

Posted by Kunal Khatua <kk...@mapr.com>.

Not really :)

You get into the problem of having to deal with cache management. Once you start using memory as a cache to hold a table, you are sacrificing memory that would otherwise be used for the actual computation. Also, Drill works with direct memory rather than heap. To work around this, you would then have to introduce a swapping policy to reclaim the memory.

If you were to use heap for storing the table in memory, then Drill would need to copy the data into direct memory to do useful work. So now you have about 2x the memory being used for the data!
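
As a rough worked example of the 2x figure, the sketch below assumes the table takes the same number of bytes in its heap copy and in its direct-memory copy, and that a single query is running. heap_cache_footprint and the 4 GiB table size are hypothetical.

```python
GIB = 1024 ** 3


def heap_cache_footprint(table_bytes):
    """Rough memory held while one query runs against a heap-cached table."""
    heap_copy = table_bytes      # the cached table kept on the Java heap
    direct_copy = table_bytes    # the working copy in direct memory
    return {"heap": heap_copy, "direct": direct_copy, "total": heap_copy + direct_copy}


estimate = heap_cache_footprint(4 * GIB)
print(estimate["total"] / GIB)   # prints 8.0
```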

If you are using HDFS (or MapR-FS), these file systems implement their own cache management, so we are already getting (to a limited extent) the benefits of an in-memory cache.

Re: In-memory cache in Drill

Posted by Michael Shtelma <ms...@gmail.com>.

Yes, for sure, this is also a viable approach, but it would be far
better to be able to keep the data in memory as well.
Does it make sense to have something like an in-memory storage plugin?
In that case, it could also be used as storage for the temporary
tables.
Sincerely,
Michael Shtelma


Re: In-memory cache in Drill

Posted by Kunal Khatua <kk...@mapr.com>.

Drill does not cache data in memory because that introduces the risk of dealing with stale data when working with data at large scale.

If you want to avoid hitting the actual storage repeatedly, one option is to use the CREATE TEMPORARY TABLE AS (CTTAS) feature (https://drill.apache.org/docs/create-temporary-table-as-cttas/). This allows you to land the data on a local (or distributed) file system and query that copy instead. These tables live only for the lifetime of the session (the connection your client or SQLLine makes to the Drill cluster).


There is a second benefit to this approach. You can translate the original data source into a format that is well suited to what you are doing with the data. For example, you could pull in data from an RDBMS or a JSON store and write the temp table in Parquet for analytics.
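
A sketch of what the two steps might look like, written as a small helper that builds the SQL to run on one session. The helper name cttas_statements, the table name orders_cache, and the source mysql.sales.orders are hypothetical, and the list of accepted store.format values is an assumption to check against the documentation for your Drill version.

```python
def cttas_statements(temp_table, source_query, store_format="parquet"):
    """Return the SQL statements to run, in order, on a single Drill session."""
    if store_format not in {"parquet", "json", "csv", "psv", "tsv"}:
        raise ValueError(f"unsupported store.format: {store_format}")
    return [
        # Choose the file format that the temporary table is written in.
        f"ALTER SESSION SET `store.format` = '{store_format}'",
        # Materialize the query result; the table disappears with the session.
        f"CREATE TEMPORARY TABLE {temp_table} AS {source_query}",
    ]


statements = cttas_statements(
    "orders_cache",
    "SELECT order_id, amount FROM mysql.sales.orders",
)
for sql in statements:
    print(sql)
```

Later queries in the same session can then select from orders_cache instead of going back to the RDBMS.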


~ Kunal
