You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@kudu.apache.org by Mauricio Aristizabal <ma...@impactradius.com> on 2017/11/28 01:02:44 UTC

Time-travel reads via SQL query

Hi all, has there been any talk of supporting this any time soon?

Time travel reads are such a cool feature, but even more than in ETL jobs
(via Java/Scala), they would be most useful via SQL to ensure consistency
when reading.

Specifically, for example our spark streaming job updates dozens of
aggregation tables every 30 seconds.  To make the data fully consistent we
would love to have views over these aggs tagged with the exact timestamp we
want to expose.  When each batch is done and all tables updated, we would
update all the views forward, effectively hiding the updates we're doing
until they're all ready.

-m



-- 
*MAURICIO ARISTIZABAL*
Architect - Business Intelligence + Data Science
mauricio@impactradius.com(m)+1 323 309 4260
223 E. De La Guerra St. | Santa Barbara, CA 93101

Overview <http://www.impactradius.com/?src=slsap> | Twitter
<https://twitter.com/impactradius> | Facebook
<https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
<https://www.linkedin.com/company/impact-radius-inc->

Re: Time-travel reads via SQL query

Posted by Mauricio Aristizabal <ma...@impactradius.com>.
Thanks Andrew.  However if I'm understanding the low level details in that
ticket I don't think it's what I'm asking either, at least not
entirely: KUDU-1702 is for exposing point-in-time reads to Impala
DEVELOPERS, i'm hoping it can be exposed to Impala USERS via a SQL
extension, so we can leverage it from user queries/reports. -m

On Tue, Nov 28, 2017 at 2:09 PM, Andrew Wong <aw...@cloudera.com> wrote:

> Ah I see, looks like this is an open, tracked issue. See KUDU-1702
> <https://issues.apache.org/jira/browse/KUDU-1702?jql=project%20%3D%20KUDU%20AND%20resolution%20%3D%20Unresolved%20AND%20text%20~%20%22read_at_snapshot%22%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC>,
> although not sure what the status on it is.
>
> On Tue, Nov 28, 2017 at 12:17 PM, Mauricio Aristizabal <
> mauricio@impactradius.com> wrote:
>
>> Thanks very much David and Andrew.  Yes I'm aware this functionality is
>> available via the java and C++ clients, but actually what I'm asking is if
>> it could be made available via SQL/impala.  Something like "select X from Y
>> where snapshot_micros = 2343242423 <(234)%20324-2423>" (where
>> snapshot_micros is a virtual column that would need a better name), or
>> perhaps as part of the table name like "select X from Y@2343242423
>> <(234)%20324-2423>".  -m
>>
>> On Tue, Nov 28, 2017 at 12:05 PM, David Alves <da...@gmail.com>
>> wrote:
>>
>>> Hi Mauricio
>>>
>>>   Andrew is right. That feature already exists in some form. With
>>> READ_AT_SNAPSHOT you can provide a timestamp which will be the timepoint
>>> under which all the scans are performed.
>>>   Note that, while generally supported and functionally tested, we
>>> haven't focused a lot of resources into testing this, so your performance
>>> mileage may vary.
>>>   In order to enable this for time points more than 5 mins in the past
>>> you need to increase the "--tablet_history_max_age_sec" flag so that
>>> the history won't get garbage collected.
>>>
>>> HTH
>>> -david
>>>
>>> On Mon, Nov 27, 2017 at 9:42 PM, Andrew Wong <aw...@cloudera.com> wrote:
>>>
>>>> Hi Mauricio,
>>>>
>>>> If you haven't already, take a look at the READ_AT_SNAPSHOT read mode
>>>> (more info here
>>>> <https://kudu.apache.org/docs/transaction_semantics.html#_read_operations_scans>).
>>>> IIUC, it seems similar to, if not exactly what you're looking for!
>>>>
>>>>
>>>> Andrew
>>>>
>>>> On Mon, Nov 27, 2017 at 5:02 PM, Mauricio Aristizabal <
>>>> mauricio@impactradius.com> wrote:
>>>>
>>>>> Hi all, has there been any talk of supporting this any time soon?
>>>>>
>>>>> Time travel reads are such a cool feature, but even more than in ETL
>>>>> jobs (via Java/Scala), they would be most useful via SQL to ensure
>>>>> consistency when reading.
>>>>>
>>>>> Specifically, for example our spark streaming job updates dozens of
>>>>> aggregation tables every 30 seconds.  To make the data fully consistent we
>>>>> would love to have views over these aggs tagged with the exact timestamp we
>>>>> want to expose.  When each batch is done and all tables updated, we would
>>>>> update all the views forward, effectively hiding the updates we're doing
>>>>> until they're all ready.
>>>>>
>>>>> -m
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *MAURICIO ARISTIZABAL*
>>>>> Architect - Business Intelligence + Data Science
>>>>> mauricio@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
>>>>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>>>>> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>>>>>
>>>>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>>>>> <https://twitter.com/impactradius> | Facebook
>>>>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> |
>>>>> LinkedIn <https://www.linkedin.com/company/impact-radius-inc->
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Andrew Wong
>>>>
>>>
>>>
>>
>>
>> --
>> *MAURICIO ARISTIZABAL*
>> Architect - Business Intelligence + Data Science
>> mauricio@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>>
>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>> <https://twitter.com/impactradius> | Facebook
>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
>> <https://www.linkedin.com/company/impact-radius-inc->
>>
>
>
>
> --
> Andrew Wong
>



-- 
*MAURICIO ARISTIZABAL*
Architect - Business Intelligence + Data Science
mauricio@impactradius.com(m)+1 323 309 4260
223 E. De La Guerra St. | Santa Barbara, CA 93101

Overview <http://www.impactradius.com/?src=slsap> | Twitter
<https://twitter.com/impactradius> | Facebook
<https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
<https://www.linkedin.com/company/impact-radius-inc->

Re: Time-travel reads via SQL query

Posted by Andrew Wong <aw...@cloudera.com>.
Ah I see, looks like this is an open, tracked issue. See KUDU-1702
<https://issues.apache.org/jira/browse/KUDU-1702?jql=project%20%3D%20KUDU%20AND%20resolution%20%3D%20Unresolved%20AND%20text%20~%20%22read_at_snapshot%22%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC>,
although not sure what the status on it is.

On Tue, Nov 28, 2017 at 12:17 PM, Mauricio Aristizabal <
mauricio@impactradius.com> wrote:

> Thanks very much David and Andrew.  Yes I'm aware this functionality is
> available via the java and C++ clients, but actually what I'm asking is if
> it could be made available via SQL/impala.  Something like "select X from Y
> where snapshot_micros = 2343242423 <(234)%20324-2423>" (where
> snapshot_micros is a virtual column that would need a better name), or
> perhaps as part of the table name like "select X from Y@2343242423
> <(234)%20324-2423>".  -m
>
> On Tue, Nov 28, 2017 at 12:05 PM, David Alves <da...@gmail.com>
> wrote:
>
>> Hi Mauricio
>>
>>   Andrew is right. That feature already exists in some form. With
>> READ_AT_SNAPSHOT you can provide a timestamp which will be the timepoint
>> under which all the scans are performed.
>>   Note that, while generally supported and functionally tested, we
>> haven't focused a lot of resources into testing this, so your performance
>> mileage may vary.
>>   In order to enable this for time points more than 5 mins in the past
>> you need to increase the "--tablet_history_max_age_sec" flag so that the
>> history won't get garbage collected.
>>
>> HTH
>> -david
>>
>> On Mon, Nov 27, 2017 at 9:42 PM, Andrew Wong <aw...@cloudera.com> wrote:
>>
>>> Hi Mauricio,
>>>
>>> If you haven't already, take a look at the READ_AT_SNAPSHOT read mode
>>> (more info here
>>> <https://kudu.apache.org/docs/transaction_semantics.html#_read_operations_scans>).
>>> IIUC, it seems similar to, if not exactly what you're looking for!
>>>
>>>
>>> Andrew
>>>
>>> On Mon, Nov 27, 2017 at 5:02 PM, Mauricio Aristizabal <
>>> mauricio@impactradius.com> wrote:
>>>
>>>> Hi all, has there been any talk of supporting this any time soon?
>>>>
>>>> Time travel reads are such a cool feature, but even more than in ETL
>>>> jobs (via Java/Scala), they would be most useful via SQL to ensure
>>>> consistency when reading.
>>>>
>>>> Specifically, for example our spark streaming job updates dozens of
>>>> aggregation tables every 30 seconds.  To make the data fully consistent we
>>>> would love to have views over these aggs tagged with the exact timestamp we
>>>> want to expose.  When each batch is done and all tables updated, we would
>>>> update all the views forward, effectively hiding the updates we're doing
>>>> until they're all ready.
>>>>
>>>> -m
>>>>
>>>>
>>>>
>>>> --
>>>> *MAURICIO ARISTIZABAL*
>>>> Architect - Business Intelligence + Data Science
>>>> mauricio@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
>>>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>>>> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>>>>
>>>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>>>> <https://twitter.com/impactradius> | Facebook
>>>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> |
>>>> LinkedIn <https://www.linkedin.com/company/impact-radius-inc->
>>>>
>>>
>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>
>>
>
>
> --
> *MAURICIO ARISTIZABAL*
> Architect - Business Intelligence + Data Science
> mauricio@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
> 223 E. De La Guerra St. | Santa Barbara, CA 93101
> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>
> Overview <http://www.impactradius.com/?src=slsap> | Twitter
> <https://twitter.com/impactradius> | Facebook
> <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
> <https://www.linkedin.com/company/impact-radius-inc->
>



-- 
Andrew Wong

Re: Time-travel reads via SQL query

Posted by Mauricio Aristizabal <ma...@impactradius.com>.
Thanks very much David and Andrew.  Yes I'm aware this functionality is
available via the java and C++ clients, but actually what I'm asking is if
it could be made available via SQL/impala.  Something like "select X from Y
where snapshot_micros = 2343242423" (where snapshot_micros is a virtual
column that would need a better name), or perhaps as part of the table name
like "select X from Y@2343242423".  -m

On Tue, Nov 28, 2017 at 12:05 PM, David Alves <da...@gmail.com> wrote:

> Hi Mauricio
>
>   Andrew is right. That feature already exists in some form. With
> READ_AT_SNAPSHOT you can provide a timestamp which will be the timepoint
> under which all the scans are performed.
>   Note that, while generally supported and functionally tested, we haven't
> focused a lot of resources into testing this, so your performance mileage
> may vary.
>   In order to enable this for time points more than 5 mins in the past you
> need to increase the "--tablet_history_max_age_sec" flag so that the
> history won't get garbage collected.
>
> HTH
> -david
>
> On Mon, Nov 27, 2017 at 9:42 PM, Andrew Wong <aw...@cloudera.com> wrote:
>
>> Hi Mauricio,
>>
>> If you haven't already, take a look at the READ_AT_SNAPSHOT read mode
>> (more info here
>> <https://kudu.apache.org/docs/transaction_semantics.html#_read_operations_scans>).
>> IIUC, it seems similar to, if not exactly what you're looking for!
>>
>>
>> Andrew
>>
>> On Mon, Nov 27, 2017 at 5:02 PM, Mauricio Aristizabal <
>> mauricio@impactradius.com> wrote:
>>
>>> Hi all, has there been any talk of supporting this any time soon?
>>>
>>> Time travel reads are such a cool feature, but even more than in ETL
>>> jobs (via Java/Scala), they would be most useful via SQL to ensure
>>> consistency when reading.
>>>
>>> Specifically, for example our spark streaming job updates dozens of
>>> aggregation tables every 30 seconds.  To make the data fully consistent we
>>> would love to have views over these aggs tagged with the exact timestamp we
>>> want to expose.  When each batch is done and all tables updated, we would
>>> update all the views forward, effectively hiding the updates we're doing
>>> until they're all ready.
>>>
>>> -m
>>>
>>>
>>>
>>> --
>>> *MAURICIO ARISTIZABAL*
>>> Architect - Business Intelligence + Data Science
>>> mauricio@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
>>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>>> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>>>
>>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>>> <https://twitter.com/impactradius> | Facebook
>>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> |
>>> LinkedIn <https://www.linkedin.com/company/impact-radius-inc->
>>>
>>
>>
>>
>> --
>> Andrew Wong
>>
>
>


-- 
*MAURICIO ARISTIZABAL*
Architect - Business Intelligence + Data Science
mauricio@impactradius.com(m)+1 323 309 4260
223 E. De La Guerra St. | Santa Barbara, CA 93101

Overview <http://www.impactradius.com/?src=slsap> | Twitter
<https://twitter.com/impactradius> | Facebook
<https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
<https://www.linkedin.com/company/impact-radius-inc->

Re: Time-travel reads via SQL query

Posted by David Alves <da...@gmail.com>.
Hi Mauricio

  Andrew is right. That feature already exists in some form. With
READ_AT_SNAPSHOT you can provide a timestamp which will be the timepoint
under which all the scans are performed.
  Note that, while generally supported and functionally tested, we haven't
focused a lot of resources into testing this, so your performance mileage
may vary.
  In order to enable this for time points more than 5 mins in the past you
need to increase the "--tablet_history_max_age_sec" flag so that the
history won't get garbage collected.

HTH
-david

On Mon, Nov 27, 2017 at 9:42 PM, Andrew Wong <aw...@cloudera.com> wrote:

> Hi Mauricio,
>
> If you haven't already, take a look at the READ_AT_SNAPSHOT read mode
> (more info here
> <https://kudu.apache.org/docs/transaction_semantics.html#_read_operations_scans>).
> IIUC, it seems similar to, if not exactly what you're looking for!
>
>
> Andrew
>
> On Mon, Nov 27, 2017 at 5:02 PM, Mauricio Aristizabal <
> mauricio@impactradius.com> wrote:
>
>> Hi all, has there been any talk of supporting this any time soon?
>>
>> Time travel reads are such a cool feature, but even more than in ETL jobs
>> (via Java/Scala), they would be most useful via SQL to ensure consistency
>> when reading.
>>
>> Specifically, for example our spark streaming job updates dozens of
>> aggregation tables every 30 seconds.  To make the data fully consistent we
>> would love to have views over these aggs tagged with the exact timestamp we
>> want to expose.  When each batch is done and all tables updated, we would
>> update all the views forward, effectively hiding the updates we're doing
>> until they're all ready.
>>
>> -m
>>
>>
>>
>> --
>> *MAURICIO ARISTIZABAL*
>> Architect - Business Intelligence + Data Science
>> mauricio@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>>
>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>> <https://twitter.com/impactradius> | Facebook
>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
>> <https://www.linkedin.com/company/impact-radius-inc->
>>
>
>
>
> --
> Andrew Wong
>

Re: Time-travel reads via SQL query

Posted by Andrew Wong <aw...@cloudera.com>.
Hi Mauricio,

If you haven't already, take a look at the READ_AT_SNAPSHOT read mode (more
info here
<https://kudu.apache.org/docs/transaction_semantics.html#_read_operations_scans>).
IIUC, it seems similar to, if not exactly what you're looking for!


Andrew

On Mon, Nov 27, 2017 at 5:02 PM, Mauricio Aristizabal <
mauricio@impactradius.com> wrote:

> Hi all, has there been any talk of supporting this any time soon?
>
> Time travel reads are such a cool feature, but even more than in ETL jobs
> (via Java/Scala), they would be most useful via SQL to ensure consistency
> when reading.
>
> Specifically, for example our spark streaming job updates dozens of
> aggregation tables every 30 seconds.  To make the data fully consistent we
> would love to have views over these aggs tagged with the exact timestamp we
> want to expose.  When each batch is done and all tables updated, we would
> update all the views forward, effectively hiding the updates we're doing
> until they're all ready.
>
> -m
>
>
>
> --
> *MAURICIO ARISTIZABAL*
> Architect - Business Intelligence + Data Science
> mauricio@impactradius.com(m)+1 323 309 4260 <(323)%20309-4260>
> 223 E. De La Guerra St. | Santa Barbara, CA 93101
> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>
> Overview <http://www.impactradius.com/?src=slsap> | Twitter
> <https://twitter.com/impactradius> | Facebook
> <https://www.facebook.com/pages/Impact-Radius/153376411365183> | LinkedIn
> <https://www.linkedin.com/company/impact-radius-inc->
>



-- 
Andrew Wong